$ lexprog.com

// notes from an old coder -- php, databases, and the occasional rant

[July 07, 2025] ClickHouse

ClickHouse ReplacingMergeTree: Deduplication

ClickHouse ReplacingMergeTree: Deduplication

────────────────────────────────────────────────────────

ClickHouse ReplacingMergeTree: Deduplication

Tip: Version Column

CREATE TABLE users (
    id UInt64,
    name String,
    version UInt64
) ENGINE = ReplacingMergeTree(version)
ORDER BY id;

Gotcha: Dedup During Merge

Deduplication happens during background merges, not on insert.

Tip: Without Version

ENGINE = ReplacingMergeTree()

Keeps the last inserted row per key.

Gotcha: FINAL for Consistent Reads

SELECT * FROM users FINAL WHERE id = 1;

Forces deduplication at query time.

Tip: OPTIMIZE TABLE

OPTIMIZE TABLE users FINAL;

Forces immediate merge and dedup.

Gotcha: Not for Real-Time

Don't rely on ReplacingMergeTree for real-time dedup. Use it for eventual consistency.

Tip: Order of Columns in ORDER BY Matters Massively

ClickHouse's primary key is defined by ORDER BY. Put high-cardinality columns first for better data skipping. ORDER BY (timestamp, user_id) is very different from ORDER BY (user_id, timestamp) in query performance.

Tip: Use LowCardinality for Enum-Like Strings

Strings like status, country, browser benefit from LowCardinality(String) — it's stored as a dictionary internally, reducing storage 10x and speeding up scans.

Gotcha: Mutations Are Heavy

ALTER TABLE ... UPDATE and DELETE in ClickHouse create new parts instead of modifying in place. A single mutation on a large table can take hours and block merges. Design for append-only from day one.

Senior Insight

ReplacingMergeTree is ClickHouse's answer to upsert operations. It deduplicates rows with the same sorting key during merges, keeping only the last row (by version column or insertion time). I use it for maintaining the latest state of frequently updated entities. The critical note: deduplication only happens during merges, which are asynchronous. Immediate SELECT queries may return duplicates. Use FINAL or aggregate with argMax() for consistent results.

Source: ClickHouse Blog (https://clickhouse.com/blog), Altinity Blog (https://altinity.com/blog), Altinity Knowledge Base (https://kb.altinity.com/)

────────────────────────────────────────────────────────
<-- back to posts