PMM-14661 Fix QAN crashes, make inserts asynchronous#5499
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## v3 #5499 +/- ##
==========================================
- Coverage 43.54% 43.53% -0.01%
==========================================
Files 413 413
Lines 42928 42943 +15
==========================================
+ Hits 18692 18697 +5
- Misses 22363 22373 +10
Partials 1873 1873
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Bring over the net changes from PR #5443 (branch PMM-15113-improve-qan-read-queries): parallel sparklines and the single-scan sumIf filter merge in the QAN reporter/profile paths, plus the accompanying reporter test and Makefile/AGENTS updates. The add-then-revert 1-hour metrics rollup churn from that PR's history is excluded; only its final net diff is applied.
| // run concurrently. It is kept well below the connection pool size (see maxOpenConns | ||
| // in main.go) so one request cannot monopolize the pool — which is shared with the | ||
| // data ingestion writer — or flood ClickHouse with concurrent scans. | ||
| const MaxParallelQueries = 4 |
There was a problem hiding this comment.
With this is maxOpenConns enough? It is 10 and shared with the ingest writer, report Select(), filter queries, and up to 4 parallel sparklines per report. I believe in this case few concurrent QAN users could exhaust the pool and time out.
There was a problem hiding this comment.
Good catch — this is moot now. I've removed the sparkline parallelization (and MaxParallelQueries) in 1434e43.
Running several sparkline aggregations concurrently multiplies peak ClickHouse memory per report, which works against the OOM/crash fix this PR targets — most acutely on the low-memory profile — and it coupled read concurrency to the shared pool (a burst of reports could starve the single ingest writer). Sparklines are serial again and maxOpenConns/maxIdleConns stay at baseline (10/5), so the pool-exhaustion concern no longer applies.
The report-latency win will be revisited separately, likely via a pre-aggregated rollup rather than more concurrency.
…o fetch sparklines sequentially
Ticket number: PMM-14661
Feature build: SUBMODULES-4406
Summary
Reduce QAN read-path scan cost and stabilize ingestion without changing the schema or result semantics. A larger structural change (a pre-aggregated rollup) seems necessary, but will be reconsidered separately.
Changes
queryDimensionusessumIf(<metric>, <active-dimension predicate>)in one scan, replacing the filtered + "enumerate-with-0"UNION ALL(two scans → one per dimension), roughly halving the filters-panel scans.async_inserton the default profile so metric inserts are buffered server-side and flushed in batches, cutting per-insert overhead and write contention on the QAN tables. We rely on the defaultwait_for_async_insert=1, per ClickHouse's strong recommendation: theINSERTstill blocks until the server flushes, so commit errors are surfaced synchronously (keeping the existing rollback/retry path in data_ingestion.go meaningful) and the server can apply backpressure instead of being overrun.wait_for_async_insert=0would silence those errors and defeat backpressure, so it is deliberately not used.requestsCapin data_ingestion.go) is intentionally unchanged. It is the backpressure/decoupling mechanism between the gRPCSavehandler and the single DB-writer goroutine, not the batching layer, soasync_insertdoes not make it redundant. Withwait_for_async_insert=1, slower commits correctly chain into agent-side backpressure; enlarging the queue would only mask that and increase in-memory data loss on restart. Tune via ClickHouse-side settings (e.g.async_insert_busy_timeout_ms) if theqan_api2_data_ingestion_requests_lengauge rides near the cap under load.Considered and dropped
errgroup) to cut report latency. Reverted: running several sparkline aggregations at once multiplies peak ClickHouse memory during a single report, which works against the OOM/crash fix this PR targets — most acutely on the low-memory profile. It also forced a larger shared connection pool (read load could starve the ingest writer). Sparklines remain serial (one query per row), andmaxOpenConns/maxIdleConnsstay at their baseline.