▲ 12 r/DuckDB+1 crossposts

Replacing an Impala cluster with DuckDB pods for a legacy analytics application - looking for architecture feedback

Looking for feedback from people who've worked with analytical databases (Impala, DuckDB, ClickHouse, Trino, etc.).

We have a legacy reporting application where users generate presentations. Opening a presentation triggers 50-100 SQL queries. The application is in maintenance mode with only one major paying customer, so our goal is to simplify the architecture, remove Cloudera licensing for Impala, and significantly reduce infrastructure costs.

Current Architecture

Presentation
      |
20 Dataset Worker Pods
      |
   Impala Cluster
(10 x EC2 r5.4xlarge, 128 GB RAM each)

The dataset worker pods simply receive tasks from the application and submit SQL to Impala.

The Impala cluster consists of 10 x r5.4xlarge EC2 instances (16 vCPUs, 128 GB RAM each) managed through Cloudera.

Workload

The workload isn't typical OLAP.

Each presentation fires 50-100 queries.

Roughly:

~80% are tiny queries:
- schema lookups
- small dimension table filters
- simple joins

These usually return in 5-10 ms on Impala.

Around 5-10% are heavier joins that take around 10 seconds.

A presentation typically loads in 1-3 minutes, depending on type and filters.

The total warehouse size is only around 300-350 GB.

Only 3-4 large tables account for roughly 200 GB. The remaining ~200 tables are tiny (KBs to MBs).

We want to migrate away from Impala without making a big commitment such as a dedicated EMR cluster. We are OK with a little extra latency, but we don't want a heavy maintenance burden, so we started by migrating from Impala to Athena.

Why Athena didn't work

Our first migration idea was Athena.

Large queries were acceptable, but the application performance became much worse because of the large number of tiny queries.

Queries that took 5-10 ms on Impala often became 200-800 ms on Athena.

Since every presentation executes 50-100 queries, that startup overhead adds up quickly.
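A rough back-of-envelope for how that overhead accumulates, using the figures from this post (50-100 queries per presentation, ~80% of them tiny, 5-10 ms on Impala versus 200-800 ms on Athena). It assumes the tiny queries run strictly one after another, which is a worst-case simplification rather than something measured; the function name is made up for illustration.

```python
# Back-of-envelope: time spent in tiny queries per presentation.
# Assumption: tiny queries run one after another (worst case, not measured).

def tiny_query_seconds(total_queries: int, tiny_share: float, per_query_ms: float) -> float:
    """Total time spent in tiny queries, in seconds."""
    return total_queries * tiny_share * per_query_ms / 1000.0

# (fastest, slowest) per-query latency in milliseconds, taken from the post.
latency_ms = {"Impala": (5, 10), "Athena": (200, 800)}

for engine, (lo_ms, hi_ms) in latency_ms.items():
    best = tiny_query_seconds(50, 0.8, lo_ms)    # 50 queries, fastest latency
    worst = tiny_query_seconds(100, 0.8, hi_ms)  # 100 queries, slowest latency
    print(f"{engine}: {best:.1f}s to {worst:.1f}s spent in tiny queries")
```

Under these assumptions the tiny queries cost well under a second on Impala but anywhere from several seconds to about a minute on Athena, before any of the heavier joins run.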

Unfortunately, changing the application isn't really an option. The query generation is deeply embedded in legacy code, so batching or combining queries would require a major rewrite. Also, many of the queries run sequentially, so their latencies add up.

DuckDB Prototype

Instead of introducing another distributed SQL engine, I built a proof of concept using DuckDB.

Current architecture:

Presentation
       |
20 Dataset Worker Pods
       |
      HTTP
       |
---------------------------------
| DuckDB Pod 1                  |
| DuckDB Pod 2                  |
| DuckDB Pod 3                  |
| DuckDB Pod 4                  |
| DuckDB Pod 5                  |
---------------------------------

Each DuckDB pod:

has its own DuckDB .db file
has its own dedicated EBS volume
serves requests over HTTP
operates completely independently (no distributed execution)

The dataset worker pods simply load balance requests across the DuckDB pods.
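To make the load balancing concrete, here is a minimal sketch of what the worker side could look like: round-robin over the pods, moving on to the next pod if one refuses the connection. This is an illustration, not the prototype's actual code. The pod URLs, the `RoundRobinPool` class, and the `send` callback (standing in for whatever HTTP client the workers use) are all hypothetical.

```python
import itertools
from typing import Callable, Iterable


class RoundRobinPool:
    """Hands out DuckDB pod endpoints in round-robin order (hypothetical helper)."""

    def __init__(self, endpoints: Iterable[str]) -> None:
        self._endpoints = list(endpoints)
        if not self._endpoints:
            raise ValueError("at least one endpoint is required")
        self._cycle = itertools.cycle(self._endpoints)

    def next_endpoint(self) -> str:
        return next(self._cycle)

    def run(self, sql: str, send: Callable[[str, str], str]) -> str:
        """Try each pod at most once, starting from the next one in rotation."""
        last_error = None
        for _ in range(len(self._endpoints)):
            endpoint = self.next_endpoint()
            try:
                return send(endpoint, sql)
            except ConnectionError as exc:
                last_error = exc  # pod unreachable, try the next replica
        raise RuntimeError("all DuckDB pods failed") from last_error


# Stand-in for a real HTTP call; pretends that pod 1 is down.
def fake_send(endpoint: str, sql: str) -> str:
    if endpoint.endswith("pod-1:8080"):
        raise ConnectionError("pod-1 is down")
    return f"{endpoint} ran: {sql}"


pool = RoundRobinPool(f"http://duckdb-pod-{i}:8080" for i in range(1, 6))
print(pool.run("SELECT 1", fake_send))
```

Because every replica holds the same read-only data, any pod can answer any query, which is what makes this kind of stateless rotation workable. A shared instance would need a lock if several threads call it at once.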

The workload is almost entirely read-only.

For the few workflows that create temporary tables, I'm considering running a separate DuckDB write service with its own EBS volume since those temp tables only exist for the lifetime of a request.

Results

So far the prototype performs better than Athena for presentation loading, but still not as fast as Impala.

That isn't too surprising since the existing Impala deployment is heavily provisioned (10 × 128 GB RAM nodes) for only ~300-350 GB of data.

For this application, we're willing to accept somewhat slower presentation loads if it significantly reduces operational complexity, infrastructure cost, and removes the Cloudera dependency.

One thing I'm also thinking about

Right now every DuckDB pod has its own copy of the .db file on its own EBS volume.

Would you keep this design, or would you use something like a high-throughput EFS shared across all DuckDB pods?

I ruled out reading directly from S3 because this workload is dominated by lots of tiny, latency-sensitive queries rather than long analytical scans, and the additional object storage latency seemed noticeable during testing.

Questions

Has anyone replaced Impala with DuckDB for a similar workload?
Am I overlooking any major architectural issues with multiple independent DuckDB replicas?
Would you keep one .db file per pod on dedicated EBS, or use shared storage like EFS?
Would you choose a different engine entirely (ClickHouse, Trino, StarRocks, etc.) for this workload?
Any concurrency or operational issues you've run into serving DuckDB over HTTP in production?

I'm less interested in benchmark numbers and more interested in hearing from people who've operated similar systems in production.

reddit.com

u/bhavay22 — 1 day ago

▲ 24 r/DuckDB+1 crossposts

DuckDB Basics: Reading and Importing Data

https://thefulldatastack.substack.com/p/duckdb-basics-importing-data

u/empty_cities — 2 days ago

▲ 4 r/DuckDB+1 crossposts

Rosetta DBT Studio v1.5.7: SQL, lineage, AI, and Git in one desktop app

Rosetta DBT Studio is the local-first, open-source desktop workspace for dbt teams - build models, explore SQL, trace lineage, manage Git, preview cloud data, and work with AI agents from one app.

v1.5.7 is all about giving analytics engineers one place to build, understand, and ship dbt work without bouncing between terminals, browsers, and disconnected tools.

AI Agent for dbt - ask the agent to explain models, generate SQL, update YAML, run dbt commands, inspect project context, and help reason through lineage and grain before you make a change.

Project-aware SQL editor - write queries with Monaco autocomplete, schema browsing, saved queries, result previews, and export workflows across DuckDB and supported warehouses.

End-to-end dbt lineage - trace models, sources, and downstream impact so you can understand what a change touches before it reaches production.

SQL Notebooks with AI - work in a cell-based notebook that combines SQL, markdown, results, and AI-assisted iteration in one analysis flow.

Cloud Explorer with DuckDB-powered preview - browse and preview files from S3, Azure Blob, GCS, and compatible object storage without leaving the Studio.

DuckLake and lakehouse tooling - explore schemas, tables, snapshots, and metadata from one workspace, then query them directly in the SQL editor and notebooks.

Built-in Git workflows - review local changes, manage branches, and keep project version control close to the editor instead of split across tools.

Multi-provider AI - use OpenAI, Anthropic, Gemini, Ollama, LM Studio, and OpenAI-compatible providers with project context.

Plus secure credential storage, multi-database connections, Rosetta CLI integration, cloud profile sync, and a full desktop workflow for local dbt development.

100% open source, local-first, and yours.

GitHub - https://github.com/rosettadb/dbt-studio

u/Wide_Importance_8559 — 5 days ago

▲ 52 r/DuckDB+2 crossposts

Serious Data Engineering on a seriously tight budget

Glad to join this community and that I am allowed 1 self promotion post 😀
In my spare time I developed this project, using Open Source tooling. This ‘modern data stack’ uses DuckDB, DuckLake, Dagster, dlt and Metabase with a relatively advanced SCD2 handling (including deletes) in the ‘Silver’ layer. Is this unique? Surely not, but I learned a lot building it. Maybe someone can use it, or help me improve it.

github.com

u/EdwinWeber_Data — 9 days ago

▲ 24 r/DuckDB

Tips and Tricks you wish you knew when you started with DuckDB

Hey guys, I'm working on a project with the eventual goal of having a CLI command that ingests messy JSON/JSONL files and turns them into Parquet tables, and makes those tables easy to query with DuckDB. I was hoping people more experienced in DuckDB and maybe databases in general could offer me some advice as someone getting started with a project like this.

I really appreciate anyone that takes time to respond, and if you don't and just read it, thank you anyways 🙏

reddit.com

u/WhereTheStankWindBlo — 9 days ago

▲ 60 r/DuckDB+5 crossposts

You don't know XPT files

https://kolistat.com/blog/xpt-files/

u/caerbannogwhite — 14 days ago

▲ 17 r/DuckDB

Understanding DuckLake's Sorted Tables Feature

https://thefulldatastack.substack.com/p/understanding-ducklakes-sorted-tables

A sponsored post about DuckLake's Sorted Tables feature. It allows you to specify a sort config for a table so that unsorted data will automatically get sorted in a certain way with inserts, flushing and compaction.

For queries run regularly on high-cardinality columns like id or timestamp, this can optimize reads. When data is physically sorted in the Parquet files, it allows for both file skipping and row group skipping, so you only read the data you need for the query (a.k.a. predicate pushdown).

I made a high-level mental model image that I thought came out well to explain file skipping and predicate pushdown (row group skipping).

https://preview.redd.it/o9lydy5tpb9h1.png?width=2188&format=png&auto=webp&s=3a3443f4c1e8a7e2059bd40da0c2fd76486e668f
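The skipping idea can also be shown as a toy model. The sketch below is not how DuckLake or Parquet readers are implemented; it only assumes that each row group stores min/max statistics for a column and that a reader can skip any group whose range cannot contain the filter value. `RowGroup`, `build_row_groups`, and `groups_to_scan` are made-up names for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroup:
    """Toy row group: the values plus the min/max statistics a reader would check."""
    min_id: int
    max_id: int
    rows: List[int]


def build_row_groups(ids: List[int], group_size: int) -> List[RowGroup]:
    """Cut the ids into fixed-size groups, in the order they are stored."""
    groups = []
    for start in range(0, len(ids), group_size):
        chunk = ids[start:start + group_size]
        groups.append(RowGroup(min(chunk), max(chunk), chunk))
    return groups


def groups_to_scan(groups: List[RowGroup], target: int) -> List[RowGroup]:
    """Keep only groups whose [min, max] range could contain the target id."""
    return [g for g in groups if g.min_id <= target <= g.max_id]


unsorted_ids = [7, 1, 12, 4, 9, 2, 11, 5, 8, 3, 10, 6]
sorted_ids = sorted(unsorted_ids)

# Same data, same filter (id = 5); only the physical order differs.
print("unsorted:", len(groups_to_scan(build_row_groups(unsorted_ids, 4), 5)), "groups scanned")
print("sorted:  ", len(groups_to_scan(build_row_groups(sorted_ids, 4), 5)), "groups scanned")
```

With unsorted data every group's range overlaps the filter value, so nothing can be skipped; once the data is sorted the ranges no longer overlap and only one group has to be read.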

reddit.com

u/empty_cities — 11 days ago

▲ 14 r/DuckDB

We tested the same text-to-SQL model with and without business definitions

One thing that surprised me while working on text-to-SQL systems is that schema awareness and business awareness are very different problems.

A model can usually see tables, columns, and join keys just fine. What it often doesn’t know is what counts as a customer, when an order becomes revenue, or which business rules were never written into the schema.

We ran the same model against the same data in three setups:

Raw data only: ~20% accuracy
Canonical model only: ~75%
Canonical model + meaning layer: 95%+

The failures weren't SQL failures; the model generated valid SQL most of the time. It was answering the wrong business question because it didn't understand the meaning behind the tables.

Anthropic recently described a similar internal analytics pattern, which suggests the same architectural pressure is pushing different teams toward the same mapping-first approach.

Curious if others building text-to-SQL on DuckDB have seen the same thing: has schema context been enough, or did you eventually need a semantic layer / ontology too?

reddit.com

u/Thinker_Assignment — 14 days ago

▲ 20 r/DuckDB+1 crossposts

Event Notes: DuckCon #7 - Amsterdam

ssp.sh

u/sspaeti — 12 days ago

▲ 8 r/DuckDB

Is it a good idea that I use DuckDB on top of both Postgres and ClickHouse together with dbt and then write to a separate read-only db for BI & LLM to query from?

Hey there! A bit of background on myself first: I have been working in the field of data and analytics for over 7 years, but started as an analyst, and then gradually transitioned into an analytics engineer and now a data engineer. I don't have much hands-on knowledge and experience in building the data infra. I have mostly worked with BigQuery, dbt, and data ingestion tools like Fivetran.

I just started in a new company and there is a need for me to rebuild the data infra. I am the only data engineer in a medium-sized company. The company is self-hosting, and they are very determined on that. We use Postgres for operational transactional data, and then we have a replica of that, also a Postgres dwh, for analytics usage. We also have ClickHouse which currently only stores events data and is not being consumed.

After some research and reading, I wonder if the architecture below will be a solid setup moving forward. Any better ideas? I really appreciate any help and advice! We do not want to move to any cloud-based managed data warehouse. We also do not want any third parties reading directly from our Postgres or our replica Postgres.

Thank you very much!

Postgres ────┐
             ├──→ DuckDB server (dedicated machine, NVMe)
ClickHouse ──┘        │
                 dbt runs here
                 Quack serves here
                      │
                 BI tool (read-only, modelled data only)

reddit.com

u/Novel-Information776 — 13 days ago
