u/Individual-Show7812

ok this has been bugging me for a while.

we run a busy clickhouse cluster, 100k+ queries every few hours. when something slows down, the usual first move is to check query_log sorted by memory_usage desc - find the biggest queries, see what they did.

except our UI was lying. you'd sort by memory and the top row would say "38GB", looking like the obvious culprit. except 38GB wasn't the biggest query in the window at all. the biggest was 60GB, 4 hours ago. the UI had just fetched the latest 1000 rows by event_time and sorted THOSE client-side. so sort by memory = "the most memory-heavy of whatever arbitrary slice we happened to grab". never told you that.

took a while to even realize. then a longer while to stop wanting to throw the laptop.

anyway, we already had an internal clickhouse UI (originally built by Muhammad Rizal - he wrote the RBAC architecture, encrypted credentials, SQL workspace, all the actual production-ready bones), so i extended it. spent the last few weeks adding the monitoring layer i wished existed.

stuff that's in v2.14:

sort actually goes to the SQL. pick memory desc, ORDER BY changes, CH returns the genuine top-N for the time window. obvious in hindsight.
patterns view. normalizeQuery() rolls up SELECT ... WHERE id=42 and WHERE id=43 etc into one row with the cumulative cost. on ETL/redash workloads where the same template runs 5000 times, this is what actually tells you where to optimize. avg duration of one query lies, total wall-clock across all repetitions doesn't.
by-table view. arrayJoin(tables), grouped by hot table. first version was slow as hell - 30s+ on the busy cluster - because it was exploding every system.* table touch into rows then post-filtering. moved the filter into arrayFilter() before the join, ~1s now. classic clickhouse "push the predicate down or die" thing.
histogram. distribution of duration/memory/rows/bytes. found out our workload is super long-tail (99.97% of queries finish under 50ms, the rest is the entire tail) which i suspected but never visualized.
schema doctor. lints parts_columns for Nullable() that's never actually null + Int64 columns that fit in Int32. ranked by on-disk bytes so you see the biggest wins first. found like 200GB of pointless Nullable wrappers on tables nobody touches.
memory pressure flag. every row gets compared to OSMemoryTotal from asynchronous_metrics. >25% = red, >10% = amber. originally i set the threshold at 40% but then realized 4 queries at 20% each + background merges already gets you to OOM territory, so pulled it down. still tuning honestly.
cluster activity tab. system.mutations + system.replication_queue with status chips. sorted by num_tries desc so the sick replicas bubble up.
side-by-side query compare (pick 2 rows, get a diff with emerald/red tints), profile events drill-down (lazy-fetches the ProfileEvents map), views-triggered drill-down (query_views_log by initial_query_id - auto-hides if your cluster doesn't have it enabled at server level, which fwiw most don't).

monitoring layout is heavily inspired by QueryDog by Benjamin Wootton (Elastic 2.0). chart-plus-table model, several of the SQL aggregation patterns. read his source, liked the ideas, rewrote everything from scratch on React 19 / shadcn-ui / recharts on top of our existing editorial design system. no QueryDog code bundled but credit where it's due.

vibe-coded a lot of it with claude code. that's how the monitoring layer went from "would be nice" to merged in days instead of months. every SQL and every threshold was reviewed and tested against the real busy cluster though, not shipped blind. take that as you will.

stack: react 19 + bun/hono backend, clickhouse-js client, aes-256-gcm for stored connection passwords, argon2id for user passwords, sqlite or postgres for RBAC. apache 2.0.

demo: https://chouse-ui.com
source: https://github.com/daun-gatal/chouse-ui

PRs / issues / "this would be useful if it did X" all welcome. small team, community input is what moves it.

Sort by "memory desc" in ClickHouse query_log was lying to us for months. Open-sourced the rebuild.