ClickHouse Monitoring and Maintenance
ClickHouse Monitoring: Tips & Tricks
ClickHouse Monitoring: Tips & Tricks
Tip: System Tables
SELECT * FROM system.query_log ORDER BY event_time DESC LIMIT 10;
Shows recent queries with execution time.
Gotcha: system.parts for Storage
SELECT table, partition, bytes_on_disk FROM system.parts WHERE active;
Shows how much space each partition uses.
Tip: Monitor Merges
SELECT * FROM system.merges;
Shows ongoing merge operations. Too many merges = too many small inserts.
Gotcha: Mutations
SELECT * FROM system.mutations WHERE is_done = 0;
Tracks ongoing ALTER/DELETE operations. ClickHouse mutations are asynchronous.
Tip: Backups
BACKUP TABLE page_views TO Disk('backups', 'backup_2024.zip');
RESTORE TABLE page_views FROM Disk('backups', 'backup_2024.zip');
Native backup and restore.
Gotcha: Disk Space Monitoring
ClickHouse doesn't auto-clean. Set up alerts for disk usage.
Tip: Query Log Retention
SET query_log_retention_time = 7 * 24 * 3600;
Keep query logs for 7 days.
Gotcha: system.errors Table
SELECT * FROM system.errors ORDER BY event_time DESC LIMIT 10;
Shows recent errors across the cluster.
Tip: Grafana + Prometheus
Use the built-in Prometheus metrics endpoint:
http://clickhouse:9363/metrics
For comprehensive monitoring dashboards.
Tip: Order of Columns in ORDER BY Matters Massively
ClickHouse's primary key is defined by ORDER BY. Put high-cardinality columns first for better data skipping. ORDER BY (timestamp, user_id) is very different from ORDER BY (user_id, timestamp) in query performance.
Tip: Use LowCardinality for Enum-Like Strings
Strings like status, country, browser benefit from LowCardinality(String) — it's stored as a dictionary internally, reducing storage 10x and speeding up scans.
Gotcha: Mutations Are Heavy
ALTER TABLE ... UPDATE and DELETE in ClickHouse create new parts instead of modifying in place. A single mutation on a large table can take hours and block merges. Design for append-only from day one.
Senior Insight
ClickHouse system tables are my first stop for debugging. system.query_log shows every query executed — I use it to find slow queries and high-frequency patterns. system.parts shows table part counts — a table with thousands of tiny parts needs optimizing (OPTIMIZE TABLE ... FINAL). system.merges shows active merge operations. I've automated ClickHouse monitoring with Grafana dashboards that track merge queue depth, query latency, and disk usage.
Source: ClickHouse Blog (https://clickhouse.com/blog), Altinity Blog (https://altinity.com/blog), Altinity Knowledge Base (https://kb.altinity.com/)