$ lexprog.com

// notes from an old coder -- php, databases, and the occasional rant

[July 23, 2024] ClickHouse

ClickHouse Distributed Tables: Sharding

ClickHouse Distributed Tables: Sharding

────────────────────────────────────────────────────────

ClickHouse Distributed Tables: Sharding

Tip: Distributed Table Engine

CREATE TABLE distributed_events AS events
ENGINE = Distributed(cluster_name, default, events, rand());

Shards data across cluster nodes using rand().

Gotcha: Sharding Key Matters

rand() distributes evenly but doesn't co-locate related data. Use cityHash64(user_id) for user-centric sharding.

Tip: Global Query

SELECT count() FROM distributed_events;

Queries all shards automatically and merges results.

Gotcha: Distributed Joins Are Expensive

Joins across distributed tables require data transfer between nodes. Denormalize when possible.

Tip: Local vs Distributed

SELECT count() FROM events;
SELECT count() FROM distributed_events;

Gotcha: ReplicatedMergeTree for HA

Use ReplicatedMergeTree on each shard for high availability.

Tip: Order of Columns in ORDER BY Matters Massively

ClickHouse's primary key is defined by ORDER BY. Put high-cardinality columns first for better data skipping. ORDER BY (timestamp, user_id) is very different from ORDER BY (user_id, timestamp) in query performance.

Tip: Use LowCardinality for Enum-Like Strings

Strings like status, country, browser benefit from LowCardinality(String) — it's stored as a dictionary internally, reducing storage 10x and speeding up scans.

Gotcha: Mutations Are Heavy

ALTER TABLE ... UPDATE and DELETE in ClickHouse create new parts instead of modifying in place. A single mutation on a large table can take hours and block merges. Design for append-only from day one.

Senior Insight

Distributed tables in ClickHouse enable querying across multiple servers as if they were one. I've set up sharded ClickHouse clusters for applications processing billions of events daily. The key insight: distributed queries are slower than local queries because of the network overhead. I design data locality so most queries hit a single shard. Also, prefer_localhost_replica = 1 ensures one shard is local to the querying server, avoiding unnecessary network hops.

Source: ClickHouse Blog (https://clickhouse.com/blog), Altinity Blog (https://altinity.com/blog), Altinity Knowledge Base (https://kb.altinity.com/)

────────────────────────────────────────────────────────
<-- back to posts