$ lexprog.com

// notes from an old coder -- php, databases, and the occasional rant

[December 06, 2025] ClickHouse

ClickHouse MergeTree: Optimization Tips

ClickHouse MergeTree: Optimization Tips

────────────────────────────────────────────────────────

ClickHouse MergeTree: Optimization Tips

Tip: ORDER BY Defines Data Layout

ORDER BY (created_at, event_type)

Data is physically sorted by these columns. Queries filtering on the first column are fastest.

Gotcha: Primary Key = Sorting Key

In ClickHouse, ORDER BY defines both. Choose the most selective column first.

Tip: Partition by Month

PARTITION BY toYYYYMM(created_at)

Makes it easy to drop old data: ALTER TABLE events DROP PARTITION 202401.

Gotcha: Too Many Partitions Slow Down Queries

Each partition adds overhead. Aim for dozens, not thousands, of partitions.

Tip: SAMPLE BY for Approximate Queries

ORDER BY (user_id, created_at)
SAMPLE BY user_id

Enables SAMPLE 0.1 for 10% sampling.

Gotcha: TTL for Auto-Deletion

TTL created_at + INTERVAL 90 DAY DELETE

Automatically removes old data during merges.

Tip: Order of Columns in ORDER BY Matters Massively

ClickHouse's primary key is defined by ORDER BY. Put high-cardinality columns first for better data skipping. ORDER BY (timestamp, user_id) is very different from ORDER BY (user_id, timestamp) in query performance.

Tip: Use LowCardinality for Enum-Like Strings

Strings like status, country, browser benefit from LowCardinality(String) — it's stored as a dictionary internally, reducing storage 10x and speeding up scans.

Gotcha: Mutations Are Heavy

ALTER TABLE ... UPDATE and DELETE in ClickHouse create new parts instead of modifying in place. A single mutation on a large table can take hours and block merges. Design for append-only from day one.

Senior Insight

I've adopted a strict separation between ClickHouse clusters for different data temperatures. Hot data (7 days) goes on fast NVMe storage with replication. Warm data (90 days) goes on SSD without replication. Cold data (beyond 90 days) goes to S3 with the S3 table engine. This tiered approach optimizes cost without sacrificing query performance — the hot cluster handles real-time dashboards, while historical analytics queries gracefully tolerate the slower S3 access.

Source: ClickHouse Blog (https://clickhouse.com/blog), Altinity Blog (https://altinity.com/blog), Altinity Knowledge Base (https://kb.altinity.com/)

────────────────────────────────────────────────────────
<-- back to posts