DAOS-17321 ddb: Add checksum dump function to ddb vos API#18293
Conversation
Fix resource leak in io_sgl_fetch(): move d_sgl_fini() before the error check so the SGL is freed regardless of vos_obj_update() return code. Fix missing newline in two D_PRINT() calls in vos_tests.c. Apply clang-format alignment to VOS_OF_* enum in vos_types.h to fix pre-existing clang-format non-conformance. Reformat io_test_init_csummer() in vts_io.c to match clang-format style. Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@hpe.com>
Add VOS_OF_FETCH_CSUM (1 << 21) flag to vos_fetch_begin() allowing callers to retrieve per-extent checksum metadata without fetching the actual data. This is intended for the ddb (DAOS Debugger) tool which needs to inspect and manage checksums stored in VOS. When VOS_OF_FETCH_CSUM is set: - SGL/bio-buffer allocation is skipped (like VOS_OF_FETCH_SIZE_ONLY). - For single-value records: save_csum() is called, then the data fetch is skipped. iod_size is not updated. - For array records: save_csum() is called as normal, and the full stored extent bounds (en_ext) plus epoch are saved via save_recx() so that callers get the raw per-version extent list paired with their checksums via vos_ioh2recx_list() and vos_ioh2ci(). VOS_OF_FETCH_CSUM and VOS_OF_FETCH_RECX_LIST are mutually exclusive since both write ic_recx_lists with incompatible semantics. An explicit -DER_INVAL check is added in vos_ioc_create() (in addition to the existing D_ASSERT in the hot path). Add unit tests for the new flag: - VOS400.1: Fetch checksum of a single-value record (CRC64). - VOS401.1: Fetch checksums of overlapping array extents (CRC16). Features: recovery Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@hpe.com>
|
Ticket title is 'Checksum management with ddb' |
|
Test stage Functional on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18293/1/execution/node/1116/log |
| ioc->ic_remove = ((vos_flags & VOS_OF_REMOVE) != 0); | ||
| ioc->ic_ec = ((vos_flags & VOS_OF_EC) != 0); | ||
| ioc->ic_rebuild = ((vos_flags & VOS_OF_REBUILD) != 0); | ||
| ioc->ic_csum_fetch = ((vos_flags & VOS_OF_FETCH_CSUM) != 0); |
There was a problem hiding this comment.
Is there a specific use case for fetching only the csum without data? Since current fetch API returns both, can we just use the existing API and extract the csum from the response?
There was a problem hiding this comment.
From my investigation on this ticket, the standard fetch API is not suitable for ddb's csum_dump use case for the following reasons:
-
A normal fetch fails on corrupted records — exactly when ddb needs to inspect checksums. When the VOS scrubber detects a checksum mismatch, it marks the record address with
BIO_ADDR_IS_CORRUPTED. On subsequent normal fetches, VOS returns-DER_CSUMimmediately (vos_io.c:972-974for SV,:1099-1103for array), and the client retries on an alternative shard. Since ddb operates on a single local VOS pool (no redundancy), a normal fetch would simply fail with-DER_CSUM, making it impossible to retrieve the stored checksum metadata.VOS_OF_FETCH_CSUMskips the corruption check and data I/O entirely, returning just the stored checksums. -
For array values, a normal fetch trims checksums to the selected sub-range. In
save_csum()(line 868), a normal fetch callsevt_entry_csum_update(&entry->en_ext, &entry->en_sel_ext, ...)which adjusts the checksum info to cover onlyen_sel_ext(the selected sub-range). But ddb needs to dump the full stored extent checksums as they exist on disk — usingen_ext, noten_sel_ext— so that it can verify or display the raw stored metadata without interpretation. This was explicitly raised in the JIRA ticket discussion: when multiple raw EV ranges overlap (e.g.{0-31}and{8-1047}), we need each raw extent's checksum independently, not a trimmed version. -
ddb needs the per-extent epoch paired with each checksum.
VOS_OF_FETCH_CSUMusessave_recx()withent->en_extandent->en_epoch(lines 1202-1204), giving ddb the exact stored extent boundaries + write epoch needed to correlate each checksum with its on-disk record. As noted in the JIRA ticket, it is not possible to fetch the checksum of a hidden EV (e.g.ext~16-1039~, ep~0~) with the standard API since it usesEPOCH_MAX— the new flag allows specifying any epoch for targeted checksum retrieval. -
No data I/O or SGL allocation is needed. ddb only inspects metadata. Allocating SGLs and transferring data would be wasteful and, as noted in point 1, would fail on corrupted records.
About modifying the existing fetch API instead: the new flag approach was chosen because it doesn't change any existing fetch semantics or code paths, reuses the existing vos_ioh2ci() / vos_ioh2recx_list() interface for returning results, and follows the same pattern as VOS_OF_FETCH_SIZE_ONLY and VOS_OF_FETCH_RECX_LIST.
The follow-up PR (#17365) consumes both vos_ioh2ci() and vos_ioh2recx_list() from a VOS_OF_FETCH_CSUM fetch in dv_dump_csum() to provide the csum_dump functionality described in the ticket.
Does it makes sense ?
There was a problem hiding this comment.
If only fetch checksum without original data, how can we verify whether the checksum can match related data or not?
There was a problem hiding this comment.
From my investigation on this ticket, the standard fetch API is not suitable for ddb's
csum_dumpuse case for the following reasons:1. **A normal fetch fails on corrupted records — exactly when ddb needs to inspect checksums.** When the VOS scrubber detects a checksum mismatch, it marks the record address with `BIO_ADDR_IS_CORRUPTED`. On subsequent normal fetches, VOS returns `-DER_CSUM` immediately (`vos_io.c:972-974` for SV, `:1099-1103` for array), and the client retries on an alternative shard. Since ddb operates on a single local VOS pool (no redundancy), a normal fetch would simply fail with `-DER_CSUM`, making it impossible to retrieve the stored checksum metadata. `VOS_OF_FETCH_CSUM` skips the corruption check and data I/O entirely, returning just the stored checksums. 2. **For array values, a normal fetch trims checksums to the selected sub-range.** In `save_csum()` (line 868), a normal fetch calls `evt_entry_csum_update(&entry->en_ext, &entry->en_sel_ext, ...)` which adjusts the checksum info to cover only `en_sel_ext` (the selected sub-range). But ddb needs to dump the **full stored extent checksums** as they exist on disk — using `en_ext`, not `en_sel_ext` — so that it can verify or display the raw stored metadata without interpretation. This was explicitly raised in the [JIRA ticket discussion](https://daosio.atlassian.net/browse/DAOS-17321): when multiple raw EV ranges overlap (e.g. `{0-31}` and `{8-1047}`), we need each raw extent's checksum independently, not a trimmed version. 3. **ddb needs the per-extent epoch paired with each checksum.** `VOS_OF_FETCH_CSUM` uses `save_recx()` with `ent->en_ext` and `ent->en_epoch` (lines 1202-1204), giving ddb the exact stored extent boundaries + write epoch needed to correlate each checksum with its on-disk record. As noted in the JIRA ticket, it is not possible to fetch the checksum of a hidden EV (e.g. `ext~16-1039~, ep~0~`) with the standard API since it uses `EPOCH_MAX` — the new flag allows specifying any epoch for targeted checksum retrieval. 4. **No data I/O or SGL allocation is needed.** ddb only inspects metadata. Allocating SGLs and transferring data would be wasteful and, as noted in point 1, would fail on corrupted records.About modifying the existing fetch API instead: the new flag approach was chosen because it doesn't change any existing fetch semantics or code paths, reuses the existing
vos_ioh2ci()/vos_ioh2recx_list()interface for returning results, and follows the same pattern asVOS_OF_FETCH_SIZE_ONLYandVOS_OF_FETCH_RECX_LIST.The follow-up PR (#17365) consumes both
vos_ioh2ci()andvos_ioh2recx_list()from aVOS_OF_FETCH_CSUMfetch indv_dump_csum()to provide thecsum_dumpfunctionality described in the ticket.Does it makes sense ?
Yes. Thanks a lot for the explanation.
There was a problem hiding this comment.
If only fetch checksum without original data, how can we verify whether the checksum can match related data or not?
This PR is a preliminary step to support the ddb csum_dump command. For that command, we only need to display the checksums — we don't need to verify them against the actual data, so fetching the data is not required.
For the follow-up commands csum_check and csum_resync, the data will definitely be needed to perform the verification, and I will handle that in the follow-up PRs.
Does that make sense?
There was a problem hiding this comment.
It is clear now, thanks for the explanation.
…/daos-17321/patch-001
|
Test stage Functional on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18293/2/execution/node/1116/log |
| d_sgl_fini(&sgl, false); | ||
| out: | ||
| D_FREE(update_buf); | ||
| assert_rc_equal(rc, 0); |
There was a problem hiding this comment.
How to guarantee this? for example from L#3093, L#3099, and so on.
There was a problem hiding this comment.
You're right — if one of the intermediate steps fails (e.g. d_sgl_init), goto out leaves rc != 0 and the assert_rc_equal(rc, 0) at out: will fire, failing the test. The intent was to have the test fail if any step doesn't succeed. However, I can see how mixing error-cleanup and success-assertion in the same label is confusing and could mask the actual failure point.
I am going to restructure the labels to separate the error paths from the final success assertion.
- Improve unexpected error failure management of
io_csum_fetch_single()
There was a problem hiding this comment.
After more investigation, the goto solution turns out to be the best option, as it allows proper cleanup of memory allocations before the test exits.
Using assert_rc_equal directly at intermediate steps (e.g. after d_sgl_init) would trigger longjmp on failure, skipping any subsequent cleanup. Memory leak detectors such as Valgrind would then report spurious leaks, which could be confusing.
To make unexpected failures easier to diagnose, I added a print_message() call before each goto so the failing step is clearly identified in the test output. More details in commit 940b859.
There was a problem hiding this comment.
After further analysis, the goto+print_message approach turns out to be inconsistent: the assertions in the check phase (after the setup steps succeed) — assert_null(rel), assert_non_null(cil), assert_int_equal(cil->dcl_csum_infos_nr, ...), etc. — can also trigger longjmp via cmocka and bypass the cleanup labels. So the goto structure only protects cleanup for setup-phase failures, not for check-phase failures. This creates a false sense of security while adding complexity.
Since cleanup is never fully guaranteed in cmocka without a dedicated teardown function, using assert_rc_equal() directly throughout is simpler and equally correct: on failure, cmocka already reports the file and line of the failing assertion, which is as informative as a preceding print_message().
Commit b52bafc simplifies the four affected functions (io_csum_fetch_single, io_csum_fetch_recx, io_csum_write_no_csum, io_csum_fetch_recx_missing_csum) to use assert_rc_equal() throughout with a flat cleanup at the end.
There was a problem hiding this comment.
The same simplification was also applied to io_csum_update_recx() in commit 75d6570: the function is converted from int to void, the akey parameter gains a const qualifier, and all goto-based cleanup is replaced with assert_rc_equal() / assert_non_null().
| assert_true(daos_csummer_compare_csum_info(csummer, ic[0]->ic_data, ci)); | ||
| ci = dcs_csum_info_get(cil, 1); | ||
| assert_true(ci_is_valid(ci)); | ||
| assert_int_equal(ci->cs_nr, 3); |
There was a problem hiding this comment.
The second write uses recx_idx = csum_chunk_size / 2 = 32 and recx_nr = 2 * csum_chunk_size = 128. With csum_chunk_size = 64, the chunk boundaries fall at 0, 64, 128, 192…. Because the write starts at offset 32 (not aligned to a chunk boundary), it touches three chunks and therefore requires three checksums (cs_nr = 3):
byte offset: 0 32 64 128 160 192
│ | | | | | |
chunk 0: [-----------------]
chunk 1: [-----------------]
chunk 2: [-----------------]
│ |
write 2: [=========================>]
(ep=2) recx_idx=32 recx_idx+recx_nr=160
The misalignment of the start index is deliberate to exercise the non-aligned case.
That said, I agree the test case is not immediately obvious. I'll add inline comments and named constants to make the intent clearer.
- Use named constant and add comments
| } | ||
| daos_csummer_destroy(&csummer); | ||
| out: | ||
| assert_rc_equal(rc, 0); |
There was a problem hiding this comment.
Can not be guaranteed if out from L#3230 and L#3236.
There was a problem hiding this comment.
Same issue as the io_csum_fetch_single case. I'll fix both test functions together by restructuring the cleanup labels.
- Improve unexpected error failure management of
io_csum_fetch_recx()
| ioc->ic_remove = ((vos_flags & VOS_OF_REMOVE) != 0); | ||
| ioc->ic_ec = ((vos_flags & VOS_OF_EC) != 0); | ||
| ioc->ic_rebuild = ((vos_flags & VOS_OF_REBUILD) != 0); | ||
| ioc->ic_csum_fetch = ((vos_flags & VOS_OF_FETCH_CSUM) != 0); |
There was a problem hiding this comment.
If only fetch checksum without original data, how can we verify whether the checksum can match related data or not?
| @@ -1179,9 +1191,24 @@ akey_fetch_recx(daos_handle_t toh, const daos_epoch_range_t *epr, | |||
There was a problem hiding this comment.
If firstly N entries have no checksum, but subsequent entries have, then current logic cannot detect. That is some kind of short-comings, needs to be improved.
On the other hand, for ddb, if we lost checksum for some entries, we need to make the caller to know that via the returned <csum, epoch> lists. Otherwise, the caller maybe not aware of related corruption.
There was a problem hiding this comment.
You're right on both points.
For the csum_enabled blind spot: I'll add a csum_missing flag alongside csum_enabled so that both transition directions ([no-csum → has-csum] and [has-csum → no-csum]) are detected.
For the ddb case: I'll remove the csum_enabled gate on save_recx so it is always called when ic_csum_fetch is set, and save a null/invalid placeholder in the csum list for entries with no checksum. This way the caller can see all entries and identify which ones are missing a checksum.
I'll also add a unit test that writes one extent without a checksum (NULL iod_csums) and one with, then verifies both appear in the returned <csum, epoch> list with the no-csum entry properly flagged.
- [ ] Add csum_missing flag in akey_fetch_recx() to detect both inconsistency directions
- [ ] Save a null placeholder in the csum list for entries with no checksum
- Add
io_csum_fetch_recx_missing_csumunit test
There was a problem hiding this comment.
After further investigation, the csum_missing flag approach turns out to be unworkable due to an EVT limitation: evt_entry_csum_fill() propagates tr_csum_len — a root-level field — into the csum metadata of every non-hole entry at fetch time. As a result, ci_is_valid() always returns true for non-hole entries whenever tr_csum_len > 0, even for extents that were written without a checksum. Detecting the mixed case (some extents with, some without a checksum) at fetch time would require per-evt_desc metadata, which would change the VOS persistent format — clearly out of scope for this PR.
Since the mixed case is indistinguishable, null placeholders are also unnecessary: the csum list will either be fully populated (all entries have a checksum) or empty (none do). Commit 424d306 removes the csum_enabled gate so that save_recx() is always called when ic_csum_fetch is set, and leaves the csum list empty when no checksums are present rather than filling it with null placeholders. The io_csum_fetch_recx_missing_csum unit test validates this behaviour.
…/daos-17321/patch-001
Add print_message() before each goto so the failing operation is named in the test output. Keep goto-based cleanup labels to avoid memory leaks on longjmp. Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@hpe.com>
Unroll the two-iteration loop into explicit writes with named constants (recx_size, recx2_idx) to eliminate magic numbers. Add a write-layout comment before the function declaration describing the chunk-boundary arithmetic and the expected cs_nr values for each write. Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@hpe.com>
save_recx() was gated on a local csum_enabled flag, so extents without a stored checksum were absent from the VOS_OF_FETCH_CSUM output used by ddb. Remove the gate so save_recx() is always called when ic_csum_fetch is set. When no entry in the akey stored a checksum (EVT root tr_csum_len==0), ci_is_valid() reliably returns false for all entries. In that case the csum list is left empty (dcl_csum_infos_nr==0), which is a cleaner signal than filling it with null placeholders that the caller would have to iterate through to reach the same conclusion. Mixed-checksum detection (some entries with, some without) is not added: ci_is_valid() is unreliable for that purpose because evt_entry_csum_fill() propagates the root-level tr_csum_len to all non-hole entries, making ci_is_valid() return true even for entries stored without a checksum. Add VOS401.2 (io_csum_fetch_recx_missing_csum) verifying that no-csum extents appear in the recx list and the csum list is empty. Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@hpe.com>
…/daos-17321/patch-001
Replace the goto-based error handling in the csum fetch test functions with direct assert_rc_equal() calls throughout. The goto+cleanup-label structure gives only a partial guarantee: the assertions in the check phase (after setup) already bypass the cleanup labels via longjmp on failure. Since cleanup is never fully guaranteed in cmocka without a teardown function, the goto structure in the setup phase provides a false sense of security while adding complexity. Using assert_rc_equal() directly is simpler, consistent throughout each function, and equally correct: cmocka reports the file and line of the failing assertion, which is as informative as a preceding print_message(). Functions simplified: - io_csum_fetch_single - io_csum_fetch_recx - io_csum_write_no_csum - io_csum_fetch_recx_missing_csum Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@hpe.com>
…e_recx Apply the same assert_rc_equal() simplification to io_csum_update_recx() that was applied to the other csum test helpers in commit b52bafc. The function is converted from returning int to void, the akey parameter gains a const qualifier, and the goto-based cleanup is replaced with assert_rc_equal() and assert_non_null() throughout. Callers in io_csum_fetch_recx no longer capture the return value. Note: unlike the test functions with a check phase, io_csum_update_recx is a pure write helper where the original goto approach did protect cleanup. The simplification is still appropriate for consistency and the practical risk is negligible (small test allocations, operations that virtually never fail in a test environment). Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@hpe.com>
|
Test stage Functional on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18293/5/execution/node/1048/log |
…/daos-17321/patch-001
|
Test stage Functional on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18293/6/execution/node/1038/log |
…/daos-17321/patch-001
|
Test stage Functional on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-18293/7/display/redirect |
|
Test stage Test RPMs on EL 9.6 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18293/8/execution/node/431/log |
|
Test stage Functional on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18293/8/execution/node/491/log |
…/daos-17321/patch-001
|
Test stage NLT completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18293/9/testReport/ |
|
Test stage Functional on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18293/9/execution/node/1101/log |
|
Test stage NLT completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18293/10/testReport/ |
|
Test stage Functional on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18293/10/execution/node/1101/log |
…/daos-17321/patch-001
|
Test stage NLT completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18293/11/testReport/ |
|
Test stage Functional on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18293/11/execution/node/1086/log |
…/daos-17321/patch-001
|
Test stage NLT completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18293/14/testReport/ |
|
Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18293/14/testReport/ |
|
All CI failures in build #14 are pre-existing infrastructure or unrelated issues — none are introduced by this PR (which only touches
|
|
@daos-stack/daos-gatekeeper according to my previous comment, could you lend this PR with the following commit:
|
Description
This PR is the first in a series of four addressing DAOS-17321, which adds checksum management commands (
csum_dump,csum_check,csum_resync) to the DAOS Debugger (ddb).It adds a new VOS fetch flag
VOS_OF_FETCH_CSUMtovos_fetch_begin(). When set, the flag allows a caller to retrieve per-extent checksum metadata stored in VOS without reading or transferring the actual data payload. Thisis the prerequisite for the subsequent ddb patches that implement
csum_dump.Behavior of
VOS_OF_FETCH_CSUMVOS_OF_FETCH_SIZE_ONLY).save_csum()then returns early;iod_sizeis intentionally not updated on a csum-only fetch.en_ext) paired with its epoch viasave_recx(), then saves the corresponding checksum viasave_csum(). The full stored extent (not the selected sub-range) is used to preserve the checksum chunk-boundary alignment.VOS_OF_FETCH_RECX_LIST: both flags writeic_recx_listswith incompatible semantics; combining them returns-DER_INVAL.Commits
VOS_OF_*enum, missing\nin twoD_PRINT()calls,d_sgl_fini()resource-leak fix after a failed update, and copyright bumps.VOS_OF_FETCH_CSUMfeature — new flag, VOS I/O skip paths, flag validation guard, and unit tests.New tests
io_csum_fetch_single: writes a single value with a CRC64 checksum, fetches it withVOS_OF_FETCH_CSUM, and verifies the retrieveddcs_csum_infomatches the stored one.io_csum_fetch_recx: writes two overlapping array extents at different epochs with CRC16 checksums, then verifies that both extents and their checksums (with correct chunk counts) are returned.Related PRs (in order)
VOS_OF_FETCH_CSUMflagddb_vos_csum_dump()csum_dumpcommandcsum_dumpCLI bindingSteps for the author:
After all prior steps are complete: