$ lexprog.com

// notes from an old coder -- php, databases, and the occasional rant

[September 07, 2025] ClickHouse

ClickHouse Data Insertion at Scale

ClickHouse Data Insertion: Tips & Tricks

────────────────────────────────────────────────────────

ClickHouse Data Insertion: Tips & Tricks

Tip: Batch Inserts of 10K-100K Rows

$data = [];
for ($i = 0; $i < 10000; $i++) {
    $data[] = [
        'id' => $i,
        'url' => '/page/' . $i,
        'user_id' => rand(1, 1000),
        'created_at' => now(),
    ];
}
DB::connection('clickhouse')->table('page_views')->insert($data);

Optimal batch size: 10K-100K rows.

Gotcha: Don't Insert One Row at a Time

Single-row inserts are extremely inefficient in ClickHouse. Always batch.

Tip: Async Inserts

$client->addSendAsyncInterval(1000);

Buffers inserts and sends them periodically.

Gotcha: Insert Format Matters

CSV is faster than JSON for bulk inserts:

INSERT INTO page_views FROM INFILE '/data/views.csv' FORMAT CSV;

Tip: Kafka Integration

CREATE TABLE kafka_views ENGINE = Kafka()
SETTINGS
    kafka_broker_list = 'localhost:9092',
    kafka_topic_list = 'page_views',
    kafka_group_name = 'clickhouse',
    kafka_format = 'JSONEachRow';

Real-time ingestion from Kafka.

Gotcha: Insert Deduplication

ClickHouse may deduplicate identical inserts within a short time window. Add a unique column if needed.

Tip: Materialized Views for Auto-Insert

CREATE MATERIALIZED VIEW daily_views
ENGINE = SummingMergeTree()
ORDER BY (date, url)
AS SELECT toDate(created_at) as date, url, count() as views
FROM page_views GROUP BY date, url;

Automatically populates from the source table.

Gotcha: Insert Blocks

ClickHouse groups inserts into "parts." Too many small inserts create too many parts, slowing down merges.

Tip: Order of Columns in ORDER BY Matters Massively

ClickHouse's primary key is defined by ORDER BY. Put high-cardinality columns first for better data skipping. ORDER BY (timestamp, user_id) is very different from ORDER BY (user_id, timestamp) in query performance.

Tip: Use LowCardinality for Enum-Like Strings

Strings like status, country, browser benefit from LowCardinality(String) — it's stored as a dictionary internally, reducing storage 10x and speeding up scans.

Gotcha: Mutations Are Heavy

ALTER TABLE ... UPDATE and DELETE in ClickHouse create new parts instead of modifying in place. A single mutation on a large table can take hours and block merges. Design for append-only from day one.

Senior Insight

ClickHouse is optimized for bulk inserts, not single-row operations. Inserting one row at a time creates tiny parts that overwhelm the merge process. I always buffer to at least 100K rows or insert every 5 seconds (whichever comes first). The Async Insert mode (ClickHouse 22.4+) buffers inserts automatically — I enable it for data ingestion pipelines. The throughput difference is dramatic: 100 single inserts/second vs 100K rows/second in bulk mode.

Source: ClickHouse Blog (https://clickhouse.com/blog), Altinity Blog (https://altinity.com/blog), Altinity Knowledge Base (https://kb.altinity.com/)

────────────────────────────────────────────────────────
<-- back to posts