$ lexprog.com

// notes from an old coder -- php, databases, and the occasional rant

[June 27, 2024] ClickHouse

ClickHouse Data Compression

ClickHouse Data Compression

────────────────────────────────────────────────────────

ClickHouse Data Compression

Tip: LZ4 Codec

CREATE TABLE events (
    id UInt64,
    data String CODEC(LZ4)
) ENGINE = MergeTree ORDER BY id;

Default codec. Fast compression.

Gotcha: ZSTD for Better Ratio

CODEC(ZSTD(3))

Better compression than LZ4 but slower.

Tip: Delta Codec for Numbers

CODEC(Delta, ZSTD)

Delta encodes differences between consecutive values.

Gotcha: None Codec

CODEC(NONE)

No compression. Useful for already-compressed data.

Tip: Compression Ratio

ClickHouse typically achieves 3-10x compression. Text data compresses best.

Gotcha: CPU Trade-off

Better compression = more CPU usage during insert and query.

Tip: Order of Columns in ORDER BY Matters Massively

ClickHouse's primary key is defined by ORDER BY. Put high-cardinality columns first for better data skipping. ORDER BY (timestamp, user_id) is very different from ORDER BY (user_id, timestamp) in query performance.

Tip: Use LowCardinality for Enum-Like Strings

Strings like status, country, browser benefit from LowCardinality(String) — it's stored as a dictionary internally, reducing storage 10x and speeding up scans.

Gotcha: Mutations Are Heavy

ALTER TABLE ... UPDATE and DELETE in ClickHouse create new parts instead of modifying in place. A single mutation on a large table can take hours and block merges. Design for append-only from day one.

Senior Insight

ClickHouse's compression is where its columnar storage truly shines. I've seen compression ratios of 5:1 to 15:1 compared to the same data in PostgreSQL. The default LZ4 compression balances speed and ratio. For archival data, I use ZSTD with compression level 10+ for maximum compression. The columnar storage means each column is compressed independently — string columns compress well, UUID columns don't (they're essentially random). Choosing efficient data types directly impacts storage costs.

Source: ClickHouse Blog (https://clickhouse.com/blog), Altinity Blog (https://altinity.com/blog), Altinity Knowledge Base (https://kb.altinity.com/)

────────────────────────────────────────────────────────
<-- back to posts