$ lexprog.com

// notes from an old coder -- php, databases, and the occasional rant

[June 25, 2024] ClickHouse

ClickHouse FINAL Rows: Deduplication

ClickHouse FINAL Rows: Deduplication

────────────────────────────────────────────────────────

ClickHouse FINAL Rows: Deduplication

Tip: FINAL Modifier

SELECT * FROM users FINAL WHERE id = 1;

Forces deduplication across all parts. Returns the latest version.

Gotcha: FINAL is Slow

FINAL reads and merges all parts. Avoid on large tables — use ReplacingMergeTree with background merges instead.

Tip: ReplacingMergeTree

CREATE TABLE users (
    id UInt64,
    name String,
    version UInt64
) ENGINE = ReplacingMergeTree(version)
ORDER BY id;

Keeps only the row with the highest version per key.

Gotcha: Deduplication Happens During Merge

Not immediately after insert. Querying right after insert may return duplicates.

Tip: OPTIMIZE TABLE FINAL

OPTIMIZE TABLE users FINAL;

Forces an immediate merge. Expensive — use sparingly.

Gotcha: CollapsingMergeTree

For counters that can go negative. Pairs of (+row, -row) collapse during merge.

Tip: Order of Columns in ORDER BY Matters Massively

ClickHouse's primary key is defined by ORDER BY. Put high-cardinality columns first for better data skipping. ORDER BY (timestamp, user_id) is very different from ORDER BY (user_id, timestamp) in query performance.

Tip: Use LowCardinality for Enum-Like Strings

Strings like status, country, browser benefit from LowCardinality(String) — it's stored as a dictionary internally, reducing storage 10x and speeding up scans.

Gotcha: Mutations Are Heavy

ALTER TABLE ... UPDATE and DELETE in ClickHouse create new parts instead of modifying in place. A single mutation on a large table can take hours and block merges. Design for append-only from day one.

Senior Insight

The FINAL modifier in SELECT queries returns deduplicated results for ReplacingMergeTree tables. I use ReplacingMergeTree for data that may have duplicate rows (from retries or upserts). The catch: FINAL adds a significant performance cost because ClickHouse must collapse rows during the query. For tables under 10 million rows, FINAL is acceptable. Beyond that, I design queries to avoid FINAL by filtering on the version column explicitly.

Source: ClickHouse Blog (https://clickhouse.com/blog), Altinity Blog (https://altinity.com/blog), Altinity Knowledge Base (https://kb.altinity.com/)

────────────────────────────────────────────────────────
<-- back to posts