fix(pampa): treat parse as clean when tree has no ERROR nodes #218
rundel wants to merge 1 commit
GLR speculative parsing can hit `detect_error` in dead branches while another branch reaches accept cleanly. Previously the qmd reader reported diagnostics whenever `log_observer.had_errors()` was true, including the speculative errors from those dead branches, even when the resulting tree was free of ERROR nodes.

Use the presence of `ERROR` nodes in the final tree as the ground truth for whether the parse actually failed. When `had_errors()` is true but `collect_error_node_ranges(&tree)` is empty, fall through to the success path.

Visible fix: `*a" b."*` now parses cleanly as `<em>a"b."</em>` (the two `"` form a paired double-quote span); previously it emitted a spurious "unclosed `*`" diagnostic for the outer emphasis.

Isolated from bugfix/quoted-emphasis-word (commit a01e3bc, where it was folded in alongside the Merr key extension as a side fix); the main inline-scope misclassification work is unaffected and stays on that branch.

Tests
-----

New regression file `crates/pampa/tests/test_glr_dead_branch_speculation.rs` with `nested_double_quote_inside_emphasis_parses_cleanly` verifying `*a" b."*` produces no diagnostics. Verified failing on main HEAD prior to the fix, passing after.

`cargo nextest run -p pampa` → 3760 passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
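For readers skimming the thread, the decision rule proposed in the description can be sketched roughly as follows. This is an illustration only, not the code in the PR: `ParseOutcome`, `ErrorRange`, and `classify_parse` are hypothetical names, and the two arguments merely stand in for the results of `log_observer.had_errors()` and `collect_error_node_ranges(&tree)`.

```rust
/// Hypothetical stand-in for the byte range of an ERROR node in the final tree.
type ErrorRange = std::ops::Range<usize>;

/// Hypothetical result type; the real reader's types are not shown in this thread.
#[derive(Debug, PartialEq)]
enum ParseOutcome {
    /// Take the success path and emit no diagnostics.
    Clean,
    /// Report diagnostics for these ranges.
    Failed(Vec<ErrorRange>),
}

/// `observer_had_errors` stands in for `log_observer.had_errors()`,
/// `error_ranges` for `collect_error_node_ranges(&tree)`.
fn classify_parse(observer_had_errors: bool, error_ranges: Vec<ErrorRange>) -> ParseOutcome {
    // Proposed rule: the observer flag alone is not enough; the final tree
    // must also contain at least one ERROR node for the parse to count as failed.
    if observer_had_errors && !error_ranges.is_empty() {
        ParseOutcome::Failed(error_ranges)
    } else {
        // Includes the new case: errors were logged, but the tree has no ERROR nodes.
        ParseOutcome::Clean
    }
}
```

Under this sketch, `classify_parse(true, vec![])` returns `ParseOutcome::Clean`, which is exactly the case the PR changes.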
We need to be a little careful here, and I want to think through this thoroughly. I think this isn't a valid parse; I think what you're seeing is error correction. Starting quotes cannot be followed by spaces (and ending quotes cannot be trailed by spaces):

Valid:

My understanding was that tree-sitter only does GLR-speculative execution when the grammar has a non-empty

tree-sitter also uses error correction, which can hide errors. But it can also produce bad tree-sitter nodes, and I want to avoid that (because it makes downstream processing much harder, since we can't assume much about the shape of the tree). I think that's what you're seeing.
In contrast, the grammatically-valid input doesn't suffer from this:

So what you're seeing isn't speculative GLR-execution. It's that tree-sitter borrows the same speculation mechanism to perform error correction. IIRC, it has a collection of heuristics like "skip next token" or "change token to one that would be parsed correctly", etc. In those cases, it "invents" a speculative execution path with a different token stream instead of forking on ambiguous rules.

We definitely don't want to support that, and the error message was originally correct. I'm actually going to go ahead and close this, but we can continue a discussion!
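To make the concern concrete: tree-sitter's Rust bindings distinguish `Node::is_error()` from `Node::is_missing()`, and a node the parser inserted during recovery is flagged as missing rather than as an ERROR node. Whether that is what happens for this particular input is an assumption, not something verified in this thread. The sketch below uses a hypothetical, simplified `Node` type (not the tree-sitter crate) to show how a check that looks only for ERROR nodes could miss a recovered parse.

```rust
/// Hypothetical, simplified node kinds; not the tree-sitter API.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind {
    Regular,
    /// Region the parser could not make sense of.
    Error,
    /// Token the parser inserted to recover from a syntax error.
    Missing,
}

/// Hypothetical tree node used only for this illustration.
struct Node {
    kind: Kind,
    children: Vec<Node>,
}

impl Node {
    fn leaf(kind: Kind) -> Self {
        Node { kind, children: Vec::new() }
    }

    fn branch(children: Vec<Node>) -> Self {
        Node { kind: Kind::Regular, children }
    }
}

/// Count nodes of the given kind in the subtree rooted at `node`.
fn count_kind(node: &Node, kind: Kind) -> usize {
    let own = usize::from(node.kind == kind);
    own + node.children.iter().map(|c| count_kind(c, kind)).sum::<usize>()
}

/// The check the PR relies on: only ERROR nodes count as failure.
fn has_error_nodes(root: &Node) -> bool {
    count_kind(root, Kind::Error) > 0
}

/// A stricter check: ERROR nodes or recovery-inserted nodes both count.
fn shows_recovery(root: &Node) -> bool {
    has_error_nodes(root) || count_kind(root, Kind::Missing) > 0
}
```

With a tree built as `Node::branch(vec![Node::leaf(Kind::Regular), Node::leaf(Kind::Missing)])`, `has_error_nodes` reports no failure while `shows_recovery` does, which is the gap being described.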
Interesting, I had missed the bit about

pampa behavior / pandoc behavior

Areas where I am seeing inconsistency:

Are these specific design choices or side effects of the parser(s)?
FYI, Pandoc never errors in parsing. It considers this a feature (including enshrining this into the CommonMark spec). I consider it a bad decision.
That's interesting. I hadn't noticed it and I might be persuaded to change the parser. I think the current behavior is there so that punctuation works well in languages that put punctuation after quotes, like
Fair question, and the answer is annoyingly in the middle. So, let's say they are "general design choices". I intentionally want typos to be syntax errors. Is
Clearly we're assuming option 3; Pandoc assumes 2, and maybe we should be thinking it's 1?
I should also warn (again?) that the harder you look at Markdown, the more you'll start losing your mind...
Yeah - I had purposely avoided dealing with as much markdown syntax as possible in my previous sojourns, and I had similar wtf moments when working on the md4c / md4r stuff and trying to reconcile with the spec documents. My naive two cents - I think I prefer the pampa approach of " being syntactically meaningful all the time; failing over to the literal seems like a bad idea. I would also argue for modifying the parser so that a space before the opening " is necessary - I'm not sure how to cleanly deal with the closing quote and possible punctuation, that seems fraught. I would also say that the closing " consuming the preceding spaces is not good, and that should also be an error rather than getting silently resolved. Most of the above then also needs more specific error codes to make it clearer what exactly went wrong and how to fix things.
This fix was buried in #217 - surfacing it as its own PR.