$ lexprog.com

// notes from an old coder -- php, databases, and the occasional rant

[May 05, 2025] ClickHouse

ClickHouse Monitoring and Maintenance

ClickHouse Monitoring: Tips & Tricks

────────────────────────────────────────────────────────

ClickHouse Monitoring: Tips & Tricks

Tip: System Tables

SELECT * FROM system.query_log ORDER BY event_time DESC LIMIT 10;

Shows recent queries with execution time.

Gotcha: system.parts for Storage

SELECT table, partition, bytes_on_disk FROM system.parts WHERE active;

Shows how much space each partition uses.

Tip: Monitor Merges

SELECT * FROM system.merges;

Shows ongoing merge operations. Too many merges = too many small inserts.

Gotcha: Mutations

SELECT * FROM system.mutations WHERE is_done = 0;

Tracks ongoing ALTER/DELETE operations. ClickHouse mutations are asynchronous.

Tip: Backups

BACKUP TABLE page_views TO Disk('backups', 'backup_2024.zip');
RESTORE TABLE page_views FROM Disk('backups', 'backup_2024.zip');

Native backup and restore.

Gotcha: Disk Space Monitoring

ClickHouse doesn't auto-clean. Set up alerts for disk usage.

Tip: Query Log Retention

SET query_log_retention_time = 7 * 24 * 3600;

Keep query logs for 7 days.

Gotcha: system.errors Table

SELECT * FROM system.errors ORDER BY event_time DESC LIMIT 10;

Shows recent errors across the cluster.

Tip: Grafana + Prometheus

Use the built-in Prometheus metrics endpoint:

http://clickhouse:9363/metrics

For comprehensive monitoring dashboards.

Tip: Order of Columns in ORDER BY Matters Massively

ClickHouse's primary key is defined by ORDER BY. Put high-cardinality columns first for better data skipping. ORDER BY (timestamp, user_id) is very different from ORDER BY (user_id, timestamp) in query performance.

Tip: Use LowCardinality for Enum-Like Strings

Strings like status, country, browser benefit from LowCardinality(String) — it's stored as a dictionary internally, reducing storage 10x and speeding up scans.

Gotcha: Mutations Are Heavy

ALTER TABLE ... UPDATE and DELETE in ClickHouse create new parts instead of modifying in place. A single mutation on a large table can take hours and block merges. Design for append-only from day one.

Senior Insight

ClickHouse system tables are my first stop for debugging. system.query_log shows every query executed — I use it to find slow queries and high-frequency patterns. system.parts shows table part counts — a table with thousands of tiny parts needs optimizing (OPTIMIZE TABLE ... FINAL). system.merges shows active merge operations. I've automated ClickHouse monitoring with Grafana dashboards that track merge queue depth, query latency, and disk usage.

Source: ClickHouse Blog (https://clickhouse.com/blog), Altinity Blog (https://altinity.com/blog), Altinity Knowledge Base (https://kb.altinity.com/)

────────────────────────────────────────────────────────
<-- back to posts