Resolve semantic identifiers in wiki reverse-link parser#23201
Conversation
c0c279a to
9cf5e72
Compare
|
Previous 👍🏾 #22976 (comment) |
Deploying openproject with ⚡ PullPreview
|
The wiki module maintains a separate work-package macro parser that produces formal database links from wiki pages back to referenced work packages. It still spoke numeric-only and silently dropped semantic identifiers, leaving reverse-link coverage uneven once a project is in semantic mode. The matcher now accepts the same shape as the inline-text macro (`#NNN`, `#PROJ-1`, plus the `##`/`###` widget variants) and resolves each capture through `WorkPackage.where_display_id_in`, which already mixes primary keys, current identifiers and historical aliases in one query. A rename history continues to produce a reverse link via the alias table.
The wiki reverse-link parser reads identifiers straight out of saved wiki page text. Without a bound, a multi-megabyte pasted body could push thousands of values into the alias-aware WP lookup in one query. `MAX_PRELOAD_IDENTIFIERS = 500` caps the per-save lookup; references past the cap simply don't get a reverse link recorded.
WP_REF_RE applies `(?!\w)` only to the semantic alternation branch — the numeric branch is intentionally unbounded so historic shapes like `#13-blubb` keep matching. There was no spec pinning the historic behaviour, so a future tightening of the boundary could silently strip reverse-links from existing wiki content.
The semantic-shape branch of WP_REF_RE inlined the literal pattern `[A-Z][A-Z0-9_]*-\d+` instead of composing from `WorkPackage::SemanticIdentifier::SEMANTIC_ID_PATTERN.source`. A future tightening of the upstream pattern would silently drift away from the wiki parser's shape unless someone hand-edited this regex too. The multi-line /x form also restores the readable structure: prefix class on its own line, branches separated on the alternation, and the `(?!\w)` boundary visible next to the branch it constrains.
0d6c7e2 to
2650b65
Compare
The 500-identifier cap was added defensively but silently truncated references past the boundary with no signal to the author. The prior numeric-only baseline had no cap at all and never showed a problem in practice; PostgreSQL handles large IN-lists comfortably at the scales realistically seen in a wiki body. If extreme volumes ever do surface, the per-row INSERT loop is the true bottleneck, not the SELECT -- a problem worth solving on its own terms rather than masking here. Constants now sit at the top of the module, above the method body, keeping the Ruby style convention.
thykel
left a comment
There was a problem hiding this comment.
🚀 Very nice! Leaving you with some fairly low-priority comments.
| # Mirrors the prefix character class of the inline-text macro matcher. | ||
| # The trailing `(?!\w)` on the semantic branch keeps `#PROJ-1abc` from | ||
| # matching `#PROJ-1`; the numeric branch deliberately has no trailing | ||
| # boundary to preserve historic behaviour for inputs like `#13-blubb`. |
There was a problem hiding this comment.
The lack of boundary for numeric IDs is interesting and worth a separate non-blocking discussion. Do we actually want to retain that behaviour?
There was a problem hiding this comment.
I'm not sure whether we should retain it, but I'm hesitant to disrupt it in this swing. This test shows how it can be used in classic mode i.e. "Trailing: #1234abc"
| # matching `#PROJ-1`; the numeric branch deliberately has no trailing | ||
| # boundary to preserve historic behaviour for inputs like `#13-blubb`. | ||
| # rubocop:disable Style/RedundantRegexpEscape | ||
| WP_REF_RE = / |
| find_wp_links(wiki_page.text).uniq.each do |wp_id| | ||
| wp = WorkPackage.find_by(id: wp_id) | ||
| next if wp.nil? | ||
| identifiers = find_wp_links(wiki_page.text).uniq |
There was a problem hiding this comment.
Nit: Should we also somehow uniq distinct references pointing to the same WP? e.g. scenarios where 123, OLDPROJ-10 and NEWPROJ-5 all resolve to the same record.
There was a problem hiding this comment.
Nice catch, I suppose we can take care of that at the where_display_id_in scope- as 123, OLDPROJ-10 and NEWPROJ-5 should resolve to the same work package record
There was a problem hiding this comment.
🤖 TDD'd this — the failing test went green on first run, so no production change.
Added 22c70f9 as regression coverage: a body with #<id>, #<current-identifier>, and #OLD-<seq> (alias) all pointing at the same WP produces exactly one reverse-link row.
The reason it already works: WorkPackage.where_display_id_in composes the three lookups as WHERE id IN (…) OR identifier IN (…) OR EXISTS(alias …). Single SELECT over work_packages, set semantics, so find_each yields the WP once and create! fires once. Adding .distinct on the scope would force a redundant SELECT DISTINCT pass on every call without changing the result.
Adds a single example exercising both reference shapes in one wiki body, guarding against any future regex divergence that drops one branch when the other matches first on the same line.
A wiki body referencing the same work package via its primary key, current semantic identifier, and a historical alias must still produce one reverse-link row. Where_display_id_in's OR-composed relation already guarantees this; the spec keeps it that way.
1e8cec1 to
22c70f9
Compare
Ticket
https://community.openproject.org/wp/74947 Split out of #22976.
What are you trying to accomplish?
The wiki module maintains a parser that scans saved wiki pages for work-package references and creates formal database links from each wiki page back to the referenced WPs (so the WP's "referenced from" tab populates). The parser was numeric-only — semantic-shape references like
#PROJ-1were silently dropped, leaving reverse-link coverage uneven once a project is in semantic mode.The matcher now accepts the same shape as the inline-text macro —
#NNN,#PROJ-1, plus the##/###widget variants — and resolves each capture throughWorkPackage.where_display_id_in, which mixes primary keys, current identifiers and historical aliases in one query. A rename history continues to produce a reverse link via the alias table.Screenshots
Merge checklist