Minor GHP updates #476
Pull request overview
Minor improvements to the GHP (geothermal heat pump) pipeline around document collection/parsing and pre-validation filtering, including tighter “substantive text” gating and optional limits on how many documents get parsed per jurisdiction.
Changes:
- Add a `max_num_docs_to_parse_per_jurisdiction` request setting and plumb it through extraction to cap the number of docs passed into plugin parsing.
- Improve document-type validation prompts/graph to require substantive legal text (not just ToC/headings/citations) before treating content as legally binding.
- Normalize Docling “missing” confidence values to `None` (instead of `NaN`/`pd.NA`) and add a unit test for that behavior.
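
The confidence normalization in the last item can be illustrated with a minimal sketch. This is not the PR's `_none_if_missing` implementation, which is not shown here; the function name `none_if_missing` and the optional pandas import are assumptions made for illustration only.

```python
import math

try:
    # pandas is optional in this sketch; pd.NA is only checked when it is installed.
    import pandas as pd
except ImportError:
    pd = None


def none_if_missing(value):
    """Return None for missing-like values, otherwise return the value unchanged.

    Hypothetical helper: treats None, float NaN, and pandas.NA as "missing".
    """
    if value is None:
        return None
    # pd.NA is a singleton, so an identity check is enough.
    if pd is not None and value is pd.NA:
        return None
    # NaN floats are detected explicitly; a legitimate 0.0 is kept.
    if isinstance(value, float) and math.isnan(value):
        return None
    return value
```

Note that a real confidence of `0.0` passes through unchanged, so a plain truthiness check would not be a safe substitute.
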
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/python/unit/services/test_services_cpu.py | Adds coverage for _read_docling confidence normalization behavior. |
| tests/python/unit/pipeline/test_pipeline_orchestration.py | Updates pipeline orchestration tests to align with the new filter_docs signature. |
| tests/python/integration/test_integrated_pipeline_orchestration.py | Adjusts integration test plugin to match updated plugin interface (filter_docs). |
| docs/source/dev/advanced_plugin_development.rst | Updates plugin contract documentation for filter_docs(..., max_num_docs=None). |
| compass/validation/graphs.py | Adds a “substantive legal text” gate to the legal-text/document-type decision tree. |
| compass/services/cpu.py | Normalizes Docling confidence values via _none_if_missing helper. |
| compass/plugin/one_shot/components.py | Tightens collection prompt to avoid treating headings/ToC/citations as relevant by themselves. |
| compass/plugin/interface.py | Extends filter_docs to accept max_num_docs and implements slicing before parsing. |
| compass/plugin/base.py | Updates the abstract filter_docs contract to include max_num_docs. |
| compass/pipeline/jurisdiction.py | Adds progress-bar status update while loading pre-parsed documents. |
| compass/pipeline/extraction.py | Passes the per-jurisdiction parse cap into extractor.filter_docs(...). |
| compass/pipeline/data_classes.py | Introduces DocParsingParams and adds max_num_docs_to_parse_per_jurisdiction to requests. |
| compass/pipeline/coordinator.py | Adds manifest-loading logs for extraction mode. |
| compass/pipeline/collection/steps.py | Minor formatting change around local file loader kwargs. |
| compass/extraction/water/plugin.py | Updates filter_docs signature for the new interface (but currently uses __). |
| compass/extraction/ghp/plugin_config.yaml | Tweaks GHP query templates and heuristic keywords toward “private heat exchange wells”. |
| compass/extraction/ghp/geothermal_heat_pump_schema.json5 | Refines GHP schema scope/evidence rules to reject non-substantive headings/ToC-only evidence. |
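
The per-jurisdiction cap described for `filter_docs(..., max_num_docs=None)` amounts to slicing the document list before parsing. A rough sketch of that idea follows; `cap_docs` is a hypothetical standalone helper, not the plugin method itself, and the real `filter_docs` may be asynchronous and apply other filtering that is not shown in this PR summary.

```python
def cap_docs(docs, max_num_docs=None):
    """Return at most `max_num_docs` documents, preserving their order.

    Hypothetical helper: `None` means "no cap", mirroring the documented
    default of max_num_docs=None.
    """
    docs = list(docs)
    if max_num_docs is None:
        return docs
    # Guard against negative caps, which would otherwise slice from the end.
    return docs[: max(0, max_num_docs)]
```

With this sketch, `cap_docs(["a", "b", "c"], 2)` keeps only the first two entries, while omitting the cap leaves the list untouched.
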

Codecov Report

❌ Patch coverage is 74.80%, which is below the target coverage of 80.00%, so the patch status has failed. You can increase the patch coverage or adjust the target coverage.

Coverage diff:

| Metric | main | #476 | +/- |
|---|---|---|---|
| Coverage | 61.72% | 62.59% | +0.86% |
| Files | 77 | 77 | |
| Lines | 6937 | 7042 | +105 |
| Branches | 690 | 704 | +14 |
| Hits | 4282 | 4408 | +126 |
| Misses | 2535 | 2501 | -34 |
| Partials | 120 | 133 | +13 |

Minor updates to GHP collection and parsing before validation