SRE-3703 ci: Fault injection testing stage on VM (#17953) 2.6#18349
grom72 wants to merge 9 commits into
unitTestPost() already processes nlt-junit.xml via the testResults parameter it receives. The bare 'junit testResults: nlt-junit.xml' call that follows is redundant and has no failure protection: it uses the default healthScaleFactor, so when fault injection tests intentionally produce failures in nlt-junit.xml it marks the build FAILURE immediately, overriding the controlled result handling done by unitTestPost().

When node_local_test.py runs with --no-root, DAOS logs are written to /localhome/jenkins/build/nlt_logs/ instead of /tmp/. The existing rsync only fetches from /tmp/, leaving nlt_logs/ empty and causing:

No artifacts found that match the file pattern "nlt_logs/". Configuration error?

Add a second rsync from build/nlt_logs/ to collect logs from the --no-root code path. The '|| true' ensures non-fatal behavior when the path does not exist (plain NLT runs without --no-root).

Jenkinsfile: simplify NLT fault injection recordIssues call

The vm_test/nlt-errors.json issue scanning for the 'NLT Fault injection testing' stage is now handled by unitTestPost() in pipeline-lib, so remove it from the explicit recordIssues call here.

fault_status fallback only based on PATH

- Add fallback `fault_status` detection: if the primary detection via `$PREFIX/bin` fails, try resolving `fault_status` via `$PATH`, improving robustness when the binary is installed via RPM rather than built in-tree.

Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com>
Priority: 2 Cancel-prev-build: false Skip-python-bandit: true Skip-unit-test: true Skip-unit-test-memcheck: true Skip-func-vm-all: true Skip-test-el-9-rpms: true Skip-test-leap-15-rpms: true Skip-func-hw-test: true Skip-build-el8-gcc: true Skip-build-leap15-gcc: true Skip-func-test-el9: true

nlt: remove ABT_STACK_OVERFLOW_CHECK=mprotect from nlt_server.yaml

mprotect-based Argobots ULT stack overflow checking causes a TLB shootdown IPI on every stack allocation/deallocation. On KVM hosts running multiple VMs in parallel this results in VM exits across all vCPUs, significantly increasing latency under concurrent load.

Remove the setting to use the default (no overflow check), which is acceptable for a CI/test environment where crashes are already caught by the test harness.

Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com>
Priority: 2 Cancel-prev-build: false Skip-python-bandit: true Skip-unit-test: true Skip-unit-test-memcheck: true Skip-func-vm-all: true Skip-test-el-9-rpms: true Skip-test-leap-15-rpms: true Skip-func-hw-test: true Skip-build-el8-gcc: true Skip-build-leap15-gcc: true Skip-func-test-el9: true

ci: explicitly pass NLT/FI parameters to unitTest and unitTestPost

pipeline-lib now supports overriding NLT/FI defaults (always_script, testResults, valgrind_pattern, with_valgrind, NLT, FI) via the config map, taking priority over the values auto-detected from the stage name by parseStageInfo. Make the Jenkinsfile stages explicit to take advantage of this and to make the stage configuration self-documenting.

NLT stage (unitTest call):
- Add with_valgrind: 'memcheck', valgrind_pattern: '*memcheck.xml', always_script: 'ci/unit/test_nlt_post.sh', testResults: 'nlt-junit.xml'

NLT stage (unitTestPost call):
- Remove always_script (now passed to unitTest above)
- Add NLT: true to explicitly activate the NLT post-processing block (recordIssues, discoverGitReferenceBuild) instead of relying on stage name detection
- Add valgrind_pattern: '*memcheck.xml' for the valgrind_stash

NLT Fault injection testing stage (unitTest call):
- Add always_script: 'ci/unit/test_nlt_post.sh', testResults: 'nlt-junit.xml'
- Add with_valgrind: '' to explicitly suppress valgrind for FI

NLT Fault injection testing stage (unitTestPost call):
- Replace always_script with FI: true to explicitly activate fault injection post-processing (nlt-client-leaks.json, 'Fault injection' naming, discoverGitReferenceBuild) instead of relying on the now-removed stage name auto-detection of FI in parseStageInfo

Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com>
Priority: 2 Cancel-prev-build: false Skip-unit-test: true Skip-unit-test-memcheck: true Skip-func-vm-all: true Skip-test-el-9-rpms: true Skip-test-leap-15-rpms: true Skip-func-hw-test: true Skip-build-el8-gcc: true Skip-build-leap15-gcc: true Skip-func-test-el9: true Skip-func-test-leap15: true
Errors are Unable to load ticket data
Test stage NLT completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18349/1/testReport/
Test stage NLT completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18349/2/testReport/
Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com> Priority: 2 Cancel-prev-build: false Skip-python-bandit: true Skip-unit-test: true Skip-unit-test-memcheck: true Skip-func-vm-all: true Skip-test-el-9-rpms: true Skip-test-leap-15-rpms: true Skip-func-hw-test: true Skip-build-el8-gcc: false Skip-build-leap15-gcc: true Skip-func-test-el9: true Skip-func-test-leap15: true
Test stage Build RPM on EL 8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18349/3/execution/node/279/log
Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18349/3/execution/node/253/log
Test stage Build RPM on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18349/3/execution/node/266/log
Test stage NLT completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-18349/3/display/redirect
Test stage NLT completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18349/4/testReport/
Test stage NLT completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18349/5/testReport/
Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com> Priority: 2 Cancel-prev-build: false Skip-python-bandit: true Skip-unit-test: true Skip-unit-test-memcheck: true Skip-func-vm-all: true Skip-test-el-9-rpms: true Skip-test-leap-15-rpms: true Skip-func-hw-test: true Skip-build-leap15-gcc: true Skip-func-test-el9: true Skip-func-test-leap15: true Skip-func-test-el8: true Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com>
f62657e to 18e4d32
Test stage NLT completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18349/7/testReport/
additionally increase log size Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com> Priority: 2 Cancel-prev-build: false Skip-python-bandit: true Skip-unit-test: true Skip-unit-test-memcheck: true Skip-func-vm-all: true Skip-test-el-9-rpms: true Skip-test-leap-15-rpms: true Skip-func-hw-test: true Skip-build-leap15-gcc: true Skip-func-test-el9: true Skip-func-test-leap15: true Skip-func-test-el8: true Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com>
Use dfuse_dir rather than tempfile default to avoid landing on a tmpfs (e.g. nlt_logs) which does not support user xattrs on older kernels (RHEL 8 / kernel < 5.15), causing duns_create_path() to fail with DER_NOTSUPPORTED. Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com> Priority: 2 Cancel-prev-build: false Skip-python-bandit: true Skip-unit-test: true Skip-unit-test-memcheck: true Skip-func-vm-all: true Skip-test-el-9-rpms: true Skip-test-leap-15-rpms: true Skip-func-hw-test: true Skip-build-el8-gcc: true Skip-build-leap15-gcc: true Skip-func-test-el9: true Skip-func-test-leap15: true Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com>
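The dfuse_dir change above can be illustrated with a minimal sketch. It assumes that `dfuse_dir` is a plain directory path and that the test creates its scratch directory with Python's `tempfile` module; the helper name `make_uns_dir` and the `uns_` prefix are hypothetical and not taken from node_local_test.py.

```python
import os
import tempfile


def make_uns_dir(dfuse_dir):
    """Create a scratch directory under dfuse_dir (hypothetical helper).

    Without dir=, tempfile picks its default location, which may be a
    tmpfs without user xattr support on older kernels.
    """
    return tempfile.TemporaryDirectory(prefix="uns_", dir=dfuse_dir)


if __name__ == "__main__":
    # Stand-in for a real dfuse mount point, for demonstration only.
    with tempfile.TemporaryDirectory() as fake_dfuse_dir:
        with make_uns_dir(fake_dfuse_dir) as path:
            print(os.path.dirname(path) == fake_dfuse_dir)
```

Passing `dir=` explicitly keeps the location independent of `TMPDIR` and of wherever the harness happens to place its log directory.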
Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com> Priority: 2 Cancel-prev-build: false Skip-unit-test: true Skip-unit-test-memcheck: true Skip-func-vm-all: true Skip-test-el-9-rpms: true Skip-test-leap-15-rpms: true Skip-func-hw-test: true Skip-func-test-el9: true Skip-func-test-leap15: true Skip-test-el-8-rpms: true Skip-func-hw-test: true Skip-func-test-el8: true
Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18349/11/execution/node/324/log
Test stage Build RPM on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18349/11/execution/node/337/log
Test stage Build RPM on EL 8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18349/11/execution/node/350/log
Test stage NLT completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-18349/11/display/redirect
…3-2.6 Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com> Priority: 2 Cancel-prev-build: false Skip-unit-test: true Skip-unit-test-memcheck: true Skip-func-vm-all: true Skip-test-el-9-rpms: true Skip-test-leap-15-rpms: true Skip-func-hw-test: true Skip-func-test-el9: true Skip-func-test-leap15: true Skip-test-el-8-rpms: true Skip-func-hw-test: true Skip-func-test-el8: true
Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18349/12/execution/node/319/log
Test stage Build RPM on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18349/12/execution/node/316/log
Test stage Build RPM on EL 8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18349/12/execution/node/393/log
Test stage NLT completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-18349/12/display/redirect
Test stage Build RPM on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18349/13/execution/node/316/log
Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18349/13/execution/node/346/log
Test stage Build RPM on EL 8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18349/13/execution/node/400/log
Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com> Priority: 2 Cancel-prev-build: false Skip-unit-test: true Skip-unit-test-memcheck: true Skip-func-vm-all: true Skip-test-el-9-rpms: true Skip-test-leap-15-rpms: true Skip-func-hw-test: true Skip-func-test-el9: true Skip-func-test-leap15: true Skip-test-el-8-rpms: true Skip-func-hw-test: true Skip-func-test-el8: true
76b5000 to 045b9f7
Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18349/14/execution/node/307/log
Test stage Build RPM on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18349/14/execution/node/310/log
Test stage Build RPM on EL 8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18349/14/execution/node/365/log
Test stage NLT completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-18349/14/display/redirect
Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com> Priority: 2 Cancel-prev-build: false Skip-unit-test: true Skip-unit-test-memcheck: true Skip-func-vm-all: true Skip-test-el-9-rpms: true Skip-test-leap-15-rpms: true Skip-func-hw-test: true Skip-func-test-el9: true Skip-func-test-leap15: true Skip-test-el-8-rpms: true Skip-func-hw-test: true Skip-func-test-el8: true
Backport of: #17953
unitTestPost() already processes nlt-junit.xml via the testResults parameter it receives. The bare 'junit testResults: nlt-junit.xml' call that follows is redundant and has no failure protection: it uses the default healthScaleFactor so when fault injection tests intentionally produce failures in nlt-junit.xml it marks the build FAILURE immediately, overriding the controlled result handling done by unitTestPost().
When node_local_test.py runs with --no-root, DAOS logs are written to /localhome/jenkins/build/nlt_logs/ instead of /tmp/. The existing rsync only fetches from /tmp/, leaving nlt_logs/ empty and causing:
No artifacts found that match the file pattern "nlt_logs/". Configuration error?
Add a second rsync from build/nlt_logs/ to collect logs from the --no-root code path. The '|| true' ensures non-fatal behavior when the path does not exist (plain NLT runs without --no-root).
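The two-source log collection described above could be sketched as follows. This is only an illustration in Python of what the CI shell script does with rsync and '|| true'; the function name `fetch_nlt_logs`, the `host` argument, the `-a` flag, and the choice to treat the /tmp/ fetch as mandatory are assumptions, not the actual script.

```python
import subprocess

# (remote source, required?) pairs. The paths come from the description
# above; marking the /tmp/ fetch as required is an assumption.
LOG_SOURCES = [
    ("/tmp/", True),
    ("/localhome/jenkins/build/nlt_logs/", False),  # --no-root code path
]


def fetch_nlt_logs(host, dest="nlt_logs/", run=subprocess.run):
    """Fetch logs from each source; return the sources that succeeded.

    `run` is injectable so the logic can be exercised without rsync.
    """
    fetched = []
    for src, required in LOG_SOURCES:
        cmd = ["rsync", "-a", f"{host}:{src}", dest]
        # check=False plays the role of the shell's '|| true': a missing
        # optional path yields a non-zero exit code instead of an exception.
        result = run(cmd, check=required)
        if result.returncode == 0:
            fetched.append(src)
    return fetched
```

With this shape, a plain NLT run (no --no-root) simply skips the second source rather than failing the stage.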
Jenkinsfile: simplify NLT fault injection recordIssues call
The vm_test/nlt-errors.json issue scanning for the 'NLT Fault injection testing' stage is now handled by unitTestPost() in pipeline-lib, so remove it from the explicit recordIssues call here.
fault_status fallback only based on PATH
- Add fallback `fault_status` detection: if the primary detection via `$PREFIX/bin` fails, try resolving `fault_status` via `$PATH`, improving robustness when the binary is installed via RPM rather than built in-tree.
nlt: remove ABT_STACK_OVERFLOW_CHECK=mprotect from nlt_server.yaml
mprotect-based Argobots ULT stack overflow checking causes a TLB shootdown IPI on every stack allocation/deallocation. On KVM hosts running multiple VMs in parallel this results in VM exits across all vCPUs, significantly increasing latency under concurrent load.
Remove the setting to use the default (no overflow check), which is acceptable for a CI/test environment where crashes are already caught by the test harness.
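Returning to the `fault_status` fallback listed earlier in this description, the lookup order (first `$PREFIX/bin`, then `$PATH`) might look roughly like the sketch below. The function name `find_fault_status` is hypothetical, and whether the real check lives in Python or in a shell script is not stated in this PR.

```python
import os
import shutil


def find_fault_status(prefix):
    """Return a path to fault_status, or None if it cannot be found."""
    # Primary detection: an in-tree build installs under $PREFIX/bin.
    candidate = os.path.join(prefix, "bin", "fault_status")
    if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
        return candidate
    # Fallback: an RPM install puts the binary somewhere on $PATH.
    return shutil.which("fault_status")
```

`shutil.which` returns `None` when nothing is found, so the caller can decide whether a missing binary should disable fault injection or fail the run.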
ci: explicitly pass NLT/FI parameters to unitTest and unitTestPost
pipeline-lib now supports overriding NLT/FI defaults (always_script, testResults, valgrind_pattern, with_valgrind, NLT, FI) via the config map, taking priority over the values auto-detected from the stage name by parseStageInfo. Make the Jenkinsfile stages explicit to take advantage of this and to make the stage configuration self-documenting.
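NLT stage (unitTest call):
- Add with_valgrind: 'memcheck', valgrind_pattern: '*memcheck.xml', always_script: 'ci/unit/test_nlt_post.sh', testResults: 'nlt-junit.xml'
NLT stage (unitTestPost call):
- Remove always_script (now passed to unitTest above)
- Add NLT: true to explicitly activate the NLT post-processing block (recordIssues, discoverGitReferenceBuild) instead of relying on stage name detection
- Add valgrind_pattern: '*memcheck.xml' for the valgrind_stash
NLT Fault injection testing stage (unitTest call):
- Add always_script: 'ci/unit/test_nlt_post.sh', testResults: 'nlt-junit.xml'
- Add with_valgrind: '' to explicitly suppress valgrind for FI
NLT Fault injection testing stage (unitTestPost call):
- Replace always_script with FI: true to explicitly activate fault injection post-processing (nlt-client-leaks.json, 'Fault injection' naming, discoverGitReferenceBuild) instead of relying on the now-removed stage name auto-detection of FI in parseStageInfo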
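The precedence rule behind these stage changes, explicit config map values winning over values derived from the stage name, can be modelled in a few lines. This is a Python stand-in for behaviour that actually lives in pipeline-lib (Groovy); `detect_from_stage_name`, its defaults, and `effective_config` are hypothetical and do not reproduce parseStageInfo.

```python
def detect_from_stage_name(stage_name):
    """Hypothetical stand-in for stage-name auto-detection."""
    detected = {"NLT": False, "FI": False, "with_valgrind": ""}
    if "NLT" in stage_name:
        detected["NLT"] = True
    return detected


def effective_config(stage_name, config):
    """Merge detected defaults with the explicit config map.

    Keys in `config` are unpacked last, so they take priority.
    """
    return {**detect_from_stage_name(stage_name), **config}


if __name__ == "__main__":
    fi_stage = effective_config(
        "NLT Fault injection testing",
        {
            "FI": True,
            "with_valgrind": "",
            "always_script": "ci/unit/test_nlt_post.sh",
            "testResults": "nlt-junit.xml",
        },
    )
    print(fi_stage["FI"], repr(fi_stage["with_valgrind"]))
```

Because every relevant key is spelled out in the stage itself, renaming a stage no longer silently changes how its results are post-processed.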