Skip to content

DAOS-19019 spdk: daos_server_helper SPDK bindings leak memory#18427

Open
knard38 wants to merge 3 commits into
masterfrom
ckochhof/fix/master/daos-19019/patch-001
Open

DAOS-19019 spdk: daos_server_helper SPDK bindings leak memory#18427
knard38 wants to merge 3 commits into
masterfrom
ckochhof/fix/master/daos-19019/patch-001

Conversation

@knard38
Copy link
Copy Markdown
Contributor

@knard38 knard38 commented Jun 3, 2026

Description

Fix two memory leaks in the SPDK C bindings used by daos_server_helper, detected by running the NVMe control unit tests under AddressSanitizer (ASAN_OPTIONS=detect_leaks=1).

Leak 1 — opts.pci_allowed (8 bytes, nvme_control.c)

daos_spdk_init() builds a PCI allowlist by calling opts_add_pci_addr(), which allocates opts.pci_allowed via realloc(). This buffer was never freed after the call to spdk_env_init(). The fix adds free(opts.pci_allowed) at the out: label, which is reached by both the success and error paths.

Leak 2 — ctrlr->pci_addr (257 bytes per controller, nvme_control_common.c)

_collect() allocates a PCI address string per controller via calloc(1, SPDK_NVMF_TRADDR_MAX_LEN + 1) and stores it in ctrlr->pci_addr. free_ctrlr_fields() freed all other heap fields (model, serial, fw_rev, vendor_id, pci_type) but omitted pci_addr. The fix adds the missing free().

Tests

Two new unit tests are added to nvme_control_ut.c, both intended to be run with ASan:

  • test_daos_spdk_init_pci_freed: calls daos_spdk_init() with one PCI address and lets ASan verify that opts.pci_allowed is freed on return. spdk_env_init() is expected to fail in a unit-test context (no hugepages); this is acceptable since the leak check is ASan-driven.

  • test_collect_pci_addr_freed: drives _collect() via mock function pointers (no real SPDK or hardware required), then lets teardown call clean_ret()free_ctrlr_fields() so ASan can verify that pci_addr is freed for each controller.

The mock_spdk_nvme_ctrlr_get_pci_device() mock is also fixed: it previously leaked a heap-allocated spdk_pci_device per call, which was invisible without ASan. It now returns entries from a static local pool (sized by MAX_MOCK_PCI_DEVS), with a module-level counter reset in teardown.

The fix was locally validated with using the following command:

ASAN_OPTIONS=detect_leaks=1 ./nvme_control_ctests

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

kanard38 added 2 commits June 3, 2026 15:57
opts_add_pci_addr() allocates opts.pci_allowed via realloc() but
daos_spdk_init() never frees it after calling spdk_env_init().
Add free(opts.pci_allowed) at the out: label so both the success
and error paths release the buffer.

Add test_daos_spdk_init_pci_freed to nvme_control_ut.c to verify
under ASan that opts.pci_allowed is freed on return. Also fix the
mock_spdk_nvme_ctrlr_get_pci_device() mock to use a static local
pool instead of calloc(), which was itself leaking under ASan.

Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@hpe.com>
_collect() allocates ctrlr->pci_addr via calloc() for each
controller, but free_ctrlr_fields() freed all other heap fields
(model, serial, fw_rev, vendor_id, pci_type) while omitting
pci_addr. Add the missing free().

Add test_collect_pci_addr_freed to nvme_control_ut.c to verify
under ASan that pci_addr is freed by teardown via
clean_ret() -> free_ctrlr_fields().

Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@hpe.com>
@knard38 knard38 marked this pull request as ready for review June 3, 2026 16:06
@knard38 knard38 requested a review from tanabarr June 3, 2026 16:06
@knard38 knard38 self-assigned this Jun 3, 2026
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 3, 2026

Ticket title is 'daos_server_helper SPDK bindings leak memory during NVMe discovery and initialisation'
Status is 'In Review'
Labels: 'asan'
Errors are Unknown component
https://daosio.atlassian.net/browse/DAOS-19019

@knard38 knard38 requested review from Nasf-Fan and NiuYawei June 3, 2026 16:09
@knard38 knard38 changed the title Ckochhof/fix/master/daos 19019/patch 001 daos_server_helper SPDK bindings leak memory during NVMe discovery and initialisation Jun 3, 2026
@daosbuild3
Copy link
Copy Markdown
Collaborator

@knard38 knard38 changed the title daos_server_helper SPDK bindings leak memory during NVMe discovery and initialisation DAOS-19019 spdk: daos_server_helper SPDK bindings leak memory Jun 4, 2026
@daosbuild3
Copy link
Copy Markdown
Collaborator

Copy link
Copy Markdown
Contributor

@tanabarr tanabarr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great catch, thank you

Copy link
Copy Markdown
Contributor

@Nasf-Fan Nasf-Fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch it!

@daosbuild3
Copy link
Copy Markdown
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18427/2/testReport/

@knard38
Copy link
Copy Markdown
Contributor Author

knard38 commented Jun 5, 2026

All CI failures in build #2 are pre-existing infrastructure or unrelated issues — none are introduced by this PR (which only touches nvme_control.c, nvme_control_common.c, and nvme_control_ut.c):

@knard38 knard38 requested a review from a team June 5, 2026 14:36
@knard38
Copy link
Copy Markdown
Contributor Author

knard38 commented Jun 5, 2026

@daos-stack/daos-gatekeeper according to my previous comment, could you land this PR with the following commit:

  • title: DAOS-19019 spdk: fix two memory leaks in NVMe control bindings
  • body:
opts_add_pci_addr() allocates opts.pci_allowed via realloc() but
daos_spdk_init() never frees it after calling spdk_env_init().
Add free(opts.pci_allowed) at the out: label so both the success
and error paths release the buffer.

_collect() allocates ctrlr->pci_addr via calloc() for each
controller, but free_ctrlr_fields() freed all other heap fields
(model, serial, fw_rev, vendor_id, pci_type) while omitting
pci_addr. Add the missing free().

Two new unit tests are added to nvme_control_ut.c to verify under
ASan that opts.pci_allowed and pci_addr are both freed on return.
Also fix mock_spdk_nvme_ctrlr_get_pci_device() to use a static
local pool instead of calloc(), which was itself leaking under ASan.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

5 participants