$ lexprog.com

// notes from an old coder -- php, databases, and the occasional rant

[December 24, 2025] ClickHouse

ClickHouse + Kafka: Streaming Data

ClickHouse + Kafka: Streaming Data

────────────────────────────────────────────────────────

ClickHouse + Kafka: Streaming Data

Tip: Kafka Engine Table

CREATE TABLE kafka_views ENGINE = Kafka()
SETTINGS
    kafka_broker_list = 'localhost:9092',
    kafka_topic_list = 'page_views',
    kafka_group_name = 'clickhouse',
    kafka_format = 'JSONEachRow';

Gotcha: Kafka Table is Read-Only

You can't INSERT into a Kafka engine table. It only consumes messages.

Tip: Materialized View from Kafka

CREATE MATERIALIZED VIEW page_views_mv
TO page_views
AS SELECT * FROM kafka_views;

Moves data from Kafka to a persistent table.

Gotcha: Message Format Must Match

JSONEachRow expects one JSON object per line. Mismatched format causes silent failures.

Tip: Multiple Topics

kafka_topic_list = 'views,clicks,conversions'

Consume from multiple topics with one table.

Gotcha: Offset Management

ClickHouse tracks offsets internally. Restarting ClickHouse resumes from the last committed offset.

Tip: Order of Columns in ORDER BY Matters Massively

ClickHouse's primary key is defined by ORDER BY. Put high-cardinality columns first for better data skipping. ORDER BY (timestamp, user_id) is very different from ORDER BY (user_id, timestamp) in query performance.

Tip: Use LowCardinality for Enum-Like Strings

Strings like status, country, browser benefit from LowCardinality(String) — it's stored as a dictionary internally, reducing storage 10x and speeding up scans.

Gotcha: Mutations Are Heavy

ALTER TABLE ... UPDATE and DELETE in ClickHouse create new parts instead of modifying in place. A single mutation on a large table can take hours and block merges. Design for append-only from day one.

Senior Insight

The Kafka engine table in ClickHouse consumes Kafka messages directly. I've set up streaming analytics where Kafka → ClickHouse processing happens in real-time with sub-second latency. The configuration requires matching the Kafka message format to the ClickHouse table schema. The gotcha: if ClickHouse can't parse a Kafka message (schema mismatch), the entire consumer can stall. I always use a raw Kafka table that stores messages as strings, with a materialized view that parses and transforms them.

Source: ClickHouse Blog (https://clickhouse.com/blog), Altinity Blog (https://altinity.com/blog), Altinity Knowledge Base (https://kb.altinity.com/)

────────────────────────────────────────────────────────
<-- back to posts