Skip to content

fix(pampa): treat parse as clean when tree has no ERROR nodes#218

Closed
rundel wants to merge 1 commit into
quarto-dev:mainfrom
rundel:fix/glr-speculation-dead-branch
Closed

fix(pampa): treat parse as clean when tree has no ERROR nodes#218
rundel wants to merge 1 commit into
quarto-dev:mainfrom
rundel:fix/glr-speculation-dead-branch

Conversation

@rundel
Copy link
Copy Markdown
Contributor

@rundel rundel commented May 20, 2026

This fix was buried in #217 - surfacing it as its own PR.

GLR speculative parsing can hit detect_error in dead branches while another branch reaches accept cleanly. Previously the qmd reader reported diagnostics whenever log_observer.had_errors() was true, including the speculative errors from those dead branches, even when the resulting tree was free of ERROR nodes.

Use the presence of ERROR nodes in the final tree as the ground truth for whether the parse actually failed. When had_errors() is true but collect_error_node_ranges(&tree) is empty, fall through to the success path.

Visible fix: *a" b."* now parses cleanly as <em>a"b."</em> (the two " form a paired double-quote span); previously it emitted a spurious "unclosed *" diagnostic for the outer emphasis.

Tests

New regression file crates/pampa/tests/test_glr_dead_branch_speculation.rs with nested_double_quote_inside_emphasis_parses_cleanly verifying *a" b."* produces no diagnostics. Verified failing on main HEAD prior to the fix, passing after.

cargo nextest run -p pampa → 3760 passed.

GLR speculative parsing can hit `detect_error` in dead branches while
another branch reaches accept cleanly. Previously the qmd reader
reported diagnostics whenever `log_observer.had_errors()` was true,
including the speculative errors from those dead branches, even when
the resulting tree was free of ERROR nodes.

Use the presence of `ERROR` nodes in the final tree as the ground
truth for whether the parse actually failed. When `had_errors()` is
true but `collect_error_node_ranges(&tree)` is empty, fall through to
the success path.

Visible fix: `*a" b."*` now parses cleanly as `<em>a"b."</em>`
(the two `"` form a paired double-quote span); previously it emitted
a spurious "unclosed `*`" diagnostic for the outer emphasis.

Isolated from bugfix/quoted-emphasis-word (commit a01e3bc, where it
was folded in alongside the Merr key extension as a side fix); the
main inline-scope misclassification work is unaffected and stays on
that branch.

Tests
-----
New regression file `crates/pampa/tests/test_glr_dead_branch_speculation.rs`
with `nested_double_quote_inside_emphasis_parses_cleanly` verifying
`*a" b."*` produces no diagnostics. Verified failing on main HEAD
prior to the fix, passing after.

`cargo nextest run -p pampa` → 3760 passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cscheid
Copy link
Copy Markdown
Member

cscheid commented May 20, 2026

We need to be a little careful here, and I want to think through this thoroughly.

I think this isn't a valid parse; I think what you're seeing is error correction. Starting quotes cannot be followed by spaces (and ending quotes cannot be trailed by spaces):

Valid: "hello"
Invalid: " hello", " hello ", "hello ", etc.

My understanding was that tree-sitter only does GLR-speculative execution when the grammar has a non-empty conflicts field, which I have been careful to avoid. grammar.js doesn't have a conflicts field, and so I don't know where this GLR speculation is coming from.

tree-sitter also uses error correction, which can hide errors. But it can also produce bad tree-sitter nodes, and I want to avoid that (because it makes downstream processing much harder, since we can't assume much about the shape of the tree). I think that's what you're seeing.

@cscheid
Copy link
Copy Markdown
Member

cscheid commented May 20, 2026

tree-sitter parse -d is our friend here:

$ echo '*a" b."*' | tree-sitter parse -d                                                                                                                                                                                       ...
new_parse
...
detect_error lookahead:_whitespace
resume version:0
recover_with_missing symbol:pandoc_str_token1, state:684
...
recover_eof
select_smaller_error symbol:document, over_symbol:ERROR
done
(document [0, 0] - [1, 0]
  (section [0, 0] - [1, 0]
    (pandoc_paragraph [0, 0] - [1, 0]
      (pandoc_emph [0, 0] - [0, 8]
        (emphasis_delimiter [0, 0] - [0, 1])
        (pandoc_str [0, 1] - [0, 2])
        (pandoc_double_quote [0, 2] - [0, 7]
          (double_quote [0, 2] - [0, 3])
          (content [0, 3] - [0, 6]
            (pandoc_str [0, 3] - [0, 3])
            (pandoc_space [0, 3] - [0, 4])
            (pandoc_str [0, 4] - [0, 6]))
          (double_quote [0, 6] - [0, 7]))
        (emphasis_delimiter [0, 7] - [0, 8])))))

In contrast, the grammatically-valid input doesn't suffer from this:

$ echo '*a "b."*' | tree-sitter parse -d                                                                                                                                                                                       Warning: You have not configured any parser directories!
Please run `tree-sitter init-config` and edit the resulting
configuration file to indicate where we should look for
language grammars.

new_parse
...
reduce sym:_newline, child_count:1
reduce sym:pandoc_paragraph, child_count:2
reduce sym:document_repeat1, child_count:1
reduce sym:document, child_count:1
accept
done
(document [0, 0] - [1, 0]
  (section [0, 0] - [1, 0]
    (pandoc_paragraph [0, 0] - [1, 0]
      (pandoc_emph [0, 0] - [0, 8]
        (emphasis_delimiter [0, 0] - [0, 1])
        (pandoc_str [0, 1] - [0, 2])
        (pandoc_double_quote [0, 2] - [0, 7]
          (double_quote [0, 2] - [0, 4])
          (content [0, 4] - [0, 6]
            (pandoc_str [0, 4] - [0, 6]))
          (double_quote [0, 6] - [0, 7]))
        (emphasis_delimiter [0, 7] - [0, 8])))))

So what you're seeing isn't speculative GLR-execution. It's that tree-sitter borrows the same speculation mechanism to perform error correction. IIRC, It has a collection of heuristics like "skip next token" or "change token to one that would be parsed correctly", etc. In those cases, it "invents" a speculative execution path with a different token stream instead of forking on ambiguous rules.

We definitely don't want to support that, and the error message was originally correct.

I'm actually going to go ahead and close this, but we can continue a discussion!

@cscheid cscheid closed this May 20, 2026
@rundel
Copy link
Copy Markdown
Contributor Author

rundel commented May 20, 2026

Interesting I had missed the bit about " being invalid syntax - based on your examples above it seems like only the leading space is an issue, the trailing space(s) get absorbed by the closing ". This all seems a bit weird but it is mostly consistent with pandoc.

pampa behavior
$ printf '"hello"\n'   | cargo run --bin pampa --quiet --

[ Para [Quoted DoubleQuote [Str "hello"]] ]%    
$ printf '" hello"\n'  | cargo run --bin pampa --quiet --

Error: [Q-2-11] Unclosed Double Quote
   ╭─[ <stdin>:1:2 ]
   │
 1 │ " hello"
   │ ┬┬  
   │ ╰─── This is the opening quote mark.
   │  │  
   │  ╰── I reached the end of the block before finding a closing '"' for the quote.
───╯

Error: [Q-2-11] Unclosed Double Quote
   ╭─[ <stdin>:1:9 ]
   │
 1 │ " hello"
   │ ┬       ┬  
   │ ╰────────── This is the opening quote mark.
   │         │  
   │         ╰── I reached the end of the block before finding a closing '"' for the quote.
───╯
$ printf '"hello "\n'  | cargo run --bin pampa --quiet --

[ Para [Quoted DoubleQuote [Str "hello"]] ]%   
$ printf '" hello "\n' | cargo run --bin pampa --quiet --

Error: [Q-2-11] Unclosed Double Quote
   ╭─[ <stdin>:1:2 ]
   │
 1 │ " hello "
   │ ┬┬  
   │ ╰─── This is the opening quote mark.
   │  │  
   │  ╰── I reached the end of the block before finding a closing '"' for the quote.
───╯

Error: [Q-2-11] Unclosed Double Quote
   ╭─[ <stdin>:1:10 ]
   │
 1 │ " hello "
   │ ┬        ┬  
   │ ╰─────────── This is the opening quote mark.
   │          │  
   │          ╰── I reached the end of the block before finding a closing '"' for the quote.
───╯
printf '"hello  "\n' | cargo run --bin pampa --quiet -- -v

document: {Node document (0, 0) - (1, 0)}
  section: {Node section (0, 0) - (1, 0)}
    pandoc_paragraph: {Node pandoc_paragraph (0, 0) - (1, 0)}
      pandoc_double_quote: {Node pandoc_double_quote (0, 0) - (0, 9)}
        double_quote: {Node double_quote (0, 0) - (0, 1)}
        content: {Node content (0, 1) - (0, 6)}
          pandoc_str: {Node pandoc_str (0, 1) - (0, 6)}
        double_quote: {Node double_quote (0, 6) - (0, 9)}
[ Para [Quoted DoubleQuote [Str "hello"]] ]%  
pandoc behavior
echo -e '"hello"' | pandoc -t native
[ Para [ Quoted DoubleQuote [ Str "hello" ] ] ]
echo -e '" hello"' | pandoc -t native
[ Para [ Str "\8221" , Space , Str "hello\8221" ] ]
echo -e '"hello "' | pandoc -t native
[ Para [ Quoted DoubleQuote [ Str "hello" ] ] ]
echo -e '" hello "' | pandoc -t native
[ Para
    [ Str "\8221" , Space , Str "hello" , Space , Str "\8221" ]
]

Areas where I am seeing inconsistency:

  • Pandoc doesn't throw an error it just interprets the double quote as a " literal
  • Pandoc expects a Quoted DoubleQuote to be preceeded by a space (or line start) which does not appear to be the case with pampa (ab"c"de parses differently between the two)
  • Pampa seems to only allow a literal " via escaping

Are these specific design choices or side effects of the parser(s)?

@cscheid
Copy link
Copy Markdown
Member

cscheid commented May 20, 2026

Pandoc doesn't throw an error it just interprets the double quote as a " literal

FYI, Pandoc never errors in parsing. It considers this a feature (including enshrining this into the CommonMark spec). I consider it a bad decision.

(ab"c"de parses differently between the two)

That's interesting. I hadn't noticed it and I might be persuaded to change the parser. I think the current behavior is there so that punctuation works well in languages that put punctuation after quotes, like "so".

Pampa seems to only allow a literal " via escaping

Are these specific design choices or side effects of the parser(s)?

Fair question, and the answer is annoyingly in the middle. So, let's say they are "general design choices". I intentionally want typos to be syntax errors. Is ab"c"de:

  1. a typo?
  2. an intentional introduction of double quote characters in a single "word"?
  3. an intentional introduction of double quotations without the intervening space?

Clearly we're assuming option 3; Pandoc assumes 2, and maybe we should be thinking it's 1?

@cscheid
Copy link
Copy Markdown
Member

cscheid commented May 20, 2026

I should also warn (again?) that the harder you look at Markdown, the more you'll start losing your mind...

@rundel
Copy link
Copy Markdown
Contributor Author

rundel commented May 20, 2026

Yeah - I had purposed avoided dealing with as much markdown syntax as possible in my previous sojourns and I had similar wtf moments when working on the md4c / md4r stuff and trying to reconcile with the spec documents.

My naive two cents - I think I prefer the pampa approach of " being syntactically meaningful all the time and the failing over to the literal seems like a bad idea. I would also argue for modifying the parser so that a space before the opening " is necessary - I'm not sure how to cleanly deal with the closing quote and possible punctuation, that seems fraught. I would also say that the closing " consuming the proceeding spaces is not good and that should also be a error rather than getting silently resolved. Most of the above then also need more specific error codes to make it clearer what exactly went wrong and how to fix things.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants