Hi r/analytics,
If you evaluated ClickHouse a few years ago, you were likely told to avoid JOINs at all costs. The standard "tribal knowledge" was to denormalize everything into massive, flat tables.
In 2020, that criticism was fair. ClickHouse had a single hash join algorithm. If your right-side table exceeded memory, the query just crashed.
That advice is now officially outdated. I recently dug into the commit history. Between 2022 and 2026, the engineering team merged over 50 pull requests that dismantled almost every limitation of the join engine. ClickHouse now has the planning sophistication of a mature RDBMS operating inside a vectorized model.
I mapped out the most impactful changes shipping by default today:
- Grace Hash Join: Inactive buckets now spill to disk. OOM crashes for memory-bound joins are completely solved.
- Equivalence-set Filter Pushdown: Filters now propagate across join sides automatically to prune data before it reaches the join.
- Global Join Reordering: It finally uses column statistics and dynamic programming to find the optimal join order automatically.
- Runtime Bloom Filters: Dynamically prunes probe-side data at the storage scan level.
- Parallel Hash Join: This scales linearly across CPU cores and is now the default execution mode.
The Verdict:
Denormalization still makes sense if you have hyper-strict sub-10ms p99 latency targets. But for standard star schema workloads, you no longer need to flatten your data.
Are you still building wide flat tables out of habit/fear, or have you started moving back to normalized joins in ClickHouse?