Skip to content

fix(agents): resolve rds_exporter port drift on pmm-agent reconnect#5489

Open
MuraliMon wants to merge 8 commits into
percona:mainfrom
MuraliMon:fix/rds-exporter-roster-encrypted-access-key
Open

fix(agents): resolve rds_exporter port drift on pmm-agent reconnect#5489
MuraliMon wants to merge 8 commits into
percona:mainfrom
MuraliMon:fix/rds-exporter-roster-encrypted-access-key

Conversation

@MuraliMon

@MuraliMon MuraliMon commented Jun 11, 2026

Copy link
Copy Markdown

roster.get's cold-cache fallback filtered FindAgents by AWSAccessKey, which builds a SQL predicate on aws_options->>'aws_access_key'. That column is encrypted at rest (EncryptAWSOptionsHandler), and FindAgents decrypts rows only after the WHERE clause is applied, so the plaintext access key is compared against ciphertext and matches zero rows.

This fallback runs when the in-memory roster has been cleared (e.g. after a pmm-agent reconnect). With an empty result the StateChanged handler updates no agents, leaving the grouped rds_exporter listen ports stale in the DB. VictoriaMetrics' scrape configuration is built from those ports, so it keeps scraping the old (now dead) ports and RDS metrics disappear until the roster cache is warm again.

Fix: drop the encrypted-column SQL filter and match the access key in Go on the decrypted AWSOptions returned by FindAgents. This mirrors how the SetState builder groups rds_exporters and is correct regardless of column encryption. The cache-hit path is unchanged.

Ticket number: https://perconadev.atlassian.net/browse/PMM-14267

Feature build: SUBMODULES-0

If this PR adds, removes or alters one or more API endpoints, please review and update the relevant API documentation as well:

  • API Docs updated

If this PR is related to other PRs, contributions, or ongoing work in this or other repositories, please reference them here:

  • Links to related work items (optional).

@MuraliMon MuraliMon requested a review from a team as a code owner June 11, 2026 10:10
@MuraliMon MuraliMon requested review from 4nte and maxkondr and removed request for a team June 11, 2026 10:10

@ademidoff ademidoff left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix — the root-cause analysis is correct: the cold-cache fallback filtered on the encrypted aws_options->>'aws_access_key' column while FindAgents decrypts only after the WHERE clause, so it matched zero rows. Matching in Go on the decrypted value is the right call and mirrors the grouping in state.go. A few changes before merge:

1. Trim the comment (roster.go:88-95)

The 8-line comment is more verbose than needed. The essential point is the encryption ordering. Suggest something like:

// aws_access_key is encrypted at rest, and FindAgents decrypts rows only
// after the WHERE clause, so a SQL filter on the access key never matches.
// Match it in Go on the decrypted value instead.

2. Add a regression test (roster_test.go)

The current Get/Clear subtests hit the fallback branch but assert an empty result, so they pass identically with or without this fix — they don't cover the bug. Please add a test that:

  • seeds a pmm-agent + an rds_exporter row with a non-empty AWSOptions.AWSAccessKey in the DB,
  • calls r.get("<pmmAgentID>:rds/<accessKey>") on a cold roster (no prior add),
  • asserts the exporter's AgentID is returned.

This fails before the fix and passes after, which is what guards against regression.

3. Clean up the now-dead filter

This was the only caller of AgentFilters.AWSAccessKey (verified repo-wide). With it gone, the struct field (agent_helpers.go:226) and the filter branch (agent_helpers.go:270-274) are orphaned — and that branch carries the same plaintext-vs-ciphertext defect, so leaving it is a footgun for the next person who reuses it. Please remove both in this PR.

4. Reference the real ticket

The PR body and commit use placeholders (PMM-0 / SUBMODULES-0). Please point them at the actual Jira ticket per team convention before merge.

@MuraliMon MuraliMon force-pushed the fix/rds-exporter-roster-encrypted-access-key branch from a360f85 to 0dd7e98 Compare June 12, 2026 07:45
…econnect

roster.get's cold-cache fallback filtered FindAgents by AWSAccessKey, which
builds a SQL predicate on aws_options->>'aws_access_key'. That column is
encrypted at rest (EncryptAWSOptionsHandler), and FindAgents decrypts rows
only after the WHERE clause is applied, so the plaintext access key is
compared against ciphertext and matches zero rows.

This fallback runs when the in-memory roster has been cleared (e.g. after a
pmm-agent reconnect). With an empty result the StateChanged handler updates no
agents, leaving the grouped rds_exporter listen ports stale in the DB.
VictoriaMetrics' scrape configuration is built from those ports, so it keeps
scraping the old (now dead) ports and RDS metrics disappear until the roster
cache is warm again.

Fix: drop the encrypted-column SQL filter and match the access key in Go on
the decrypted AWSOptions returned by FindAgents. This mirrors how the SetState
builder groups rds_exporters and is correct regardless of column encryption.
The cache-hit path is unchanged.
@MuraliMon MuraliMon force-pushed the fix/rds-exporter-roster-encrypted-access-key branch from 0dd7e98 to 4adf08b Compare June 12, 2026 07:50
@MuraliMon

Copy link
Copy Markdown
Author

@ademidoff Changes are done

@codecov

codecov Bot commented Jun 12, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 43.74%. Comparing base (21bd8d4) to head (ffc15c4).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5489      +/-   ##
==========================================
+ Coverage   43.47%   43.74%   +0.27%     
==========================================
  Files         413      413              
  Lines       42953    42481     -472     
==========================================
- Hits        18675    18585      -90     
+ Misses      22400    22053     -347     
+ Partials     1878     1843      -35     
Flag Coverage Δ
managed 42.86% <100.00%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@MuraliMon MuraliMon requested a review from ademidoff June 15, 2026 06:38

@ademidoff ademidoff left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution!

@MuraliMon

Copy link
Copy Markdown
Author

@4nte Can you please review and approve this?

@MuraliMon

Copy link
Copy Markdown
Author

@ademidoff What would be the next steps?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants