fix(agents): resolve rds_exporter port drift on pmm-agent reconnect#5489
fix(agents): resolve rds_exporter port drift on pmm-agent reconnect#5489MuraliMon wants to merge 8 commits into
Conversation
There was a problem hiding this comment.
Thanks for the fix — the root-cause analysis is correct: the cold-cache fallback filtered on the encrypted aws_options->>'aws_access_key' column while FindAgents decrypts only after the WHERE clause, so it matched zero rows. Matching in Go on the decrypted value is the right call and mirrors the grouping in state.go. A few changes before merge:
1. Trim the comment (roster.go:88-95)
The 8-line comment is more verbose than needed. The essential point is the encryption ordering. Suggest something like:
// aws_access_key is encrypted at rest, and FindAgents decrypts rows only
// after the WHERE clause, so a SQL filter on the access key never matches.
// Match it in Go on the decrypted value instead.2. Add a regression test (roster_test.go)
The current Get/Clear subtests hit the fallback branch but assert an empty result, so they pass identically with or without this fix — they don't cover the bug. Please add a test that:
- seeds a pmm-agent + an rds_exporter row with a non-empty
AWSOptions.AWSAccessKeyin the DB, - calls
r.get("<pmmAgentID>:rds/<accessKey>")on a cold roster (no prioradd), - asserts the exporter's
AgentIDis returned.
This fails before the fix and passes after, which is what guards against regression.
3. Clean up the now-dead filter
This was the only caller of AgentFilters.AWSAccessKey (verified repo-wide). With it gone, the struct field (agent_helpers.go:226) and the filter branch (agent_helpers.go:270-274) are orphaned — and that branch carries the same plaintext-vs-ciphertext defect, so leaving it is a footgun for the next person who reuses it. Please remove both in this PR.
4. Reference the real ticket
The PR body and commit use placeholders (PMM-0 / SUBMODULES-0). Please point them at the actual Jira ticket per team convention before merge.
a360f85 to
0dd7e98
Compare
…econnect roster.get's cold-cache fallback filtered FindAgents by AWSAccessKey, which builds a SQL predicate on aws_options->>'aws_access_key'. That column is encrypted at rest (EncryptAWSOptionsHandler), and FindAgents decrypts rows only after the WHERE clause is applied, so the plaintext access key is compared against ciphertext and matches zero rows. This fallback runs when the in-memory roster has been cleared (e.g. after a pmm-agent reconnect). With an empty result the StateChanged handler updates no agents, leaving the grouped rds_exporter listen ports stale in the DB. VictoriaMetrics' scrape configuration is built from those ports, so it keeps scraping the old (now dead) ports and RDS metrics disappear until the roster cache is warm again. Fix: drop the encrypted-column SQL filter and match the access key in Go on the decrypted AWSOptions returned by FindAgents. This mirrors how the SetState builder groups rds_exporters and is correct regardless of column encryption. The cache-hit path is unchanged.
0dd7e98 to
4adf08b
Compare
|
@ademidoff Changes are done |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #5489 +/- ##
==========================================
+ Coverage 43.47% 43.74% +0.27%
==========================================
Files 413 413
Lines 42953 42481 -472
==========================================
- Hits 18675 18585 -90
+ Misses 22400 22053 -347
+ Partials 1878 1843 -35
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
ademidoff
left a comment
There was a problem hiding this comment.
Thanks for the contribution!
|
@4nte Can you please review and approve this? |
|
@ademidoff What would be the next steps? |
roster.get's cold-cache fallback filtered FindAgents by AWSAccessKey, which builds a SQL predicate on aws_options->>'aws_access_key'. That column is encrypted at rest (EncryptAWSOptionsHandler), and FindAgents decrypts rows only after the WHERE clause is applied, so the plaintext access key is compared against ciphertext and matches zero rows.
This fallback runs when the in-memory roster has been cleared (e.g. after a pmm-agent reconnect). With an empty result the StateChanged handler updates no agents, leaving the grouped rds_exporter listen ports stale in the DB. VictoriaMetrics' scrape configuration is built from those ports, so it keeps scraping the old (now dead) ports and RDS metrics disappear until the roster cache is warm again.
Fix: drop the encrypted-column SQL filter and match the access key in Go on the decrypted AWSOptions returned by FindAgents. This mirrors how the SetState builder groups rds_exporters and is correct regardless of column encryption. The cache-hit path is unchanged.
Ticket number: https://perconadev.atlassian.net/browse/PMM-14267
Feature build: SUBMODULES-0
If this PR adds, removes or alters one or more API endpoints, please review and update the relevant API documentation as well:
If this PR is related to other PRs, contributions, or ongoing work in this or other repositories, please reference them here: