Optimize trie builder #15977

Open
gf2121 wants to merge 8 commits into apache:main from gf2121:speed_up_TrieBuilder

gf2121 commented Apr 22, 2026

This PR speeds up TrieBuilder and reduces its memory footprint by replacing the in-memory object tree with a compact prefix-coded byte buffer during the building phase, and using a frontier-based approach during the saving phase.

Previously, TrieBuilder built a large in-memory tree of Node objects. That tree cost roughly 120 bytes per node (O(total nodes) memory overall), and the sheer number of object allocations made building slow, particularly for long terms.

Main Changes

  • Compact Byte Buffer (Building Phase): Replaced the object-heavy Node and SaveFrame classes with a sequential ByteBuffersDataOutput buffer. Entries are now prefix-encoded (storing prefix length, suffix length, and suffix bytes) and appended sequentially.
  • Zero-Overhead Appends: The first non-empty key (minKey) is stored separately. This allows the append() method to re-encode only the first entry of the appended builder and bulk-copy the remaining bytes, with no per-entry work.
  • Frontier-Based Save (Saving Phase): Reconstructs the trie structure on-the-fly during saveNodes() using a FrontierNode array bounded by maxKeyDepth, rather than requiring the whole tree to exist in memory.
  • Removed Status Management: Status tracking is removed because neither append nor save destroys any builder state anymore.
  • No File Format Changes: The on-disk file format remains completely unchanged, ensuring backward compatibility.
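
The prefix encoding and the cheap append described in the list above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the class name PrefixCodedBuffer and its methods are hypothetical, a plain ByteArrayOutputStream stands in for ByteBuffersDataOutput, and it assumes each entry's prefix length is measured against the key added just before it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of a prefix-coded key buffer; not the PR's TrieBuilder. */
final class PrefixCodedBuffer {
  private byte[] minKey; // first key, kept outside the encoded buffer
  private byte[] maxKey; // most recently added key
  private final ByteArrayOutputStream rest = new ByteArrayOutputStream();

  static byte[] utf8(String s) {
    return s.getBytes(StandardCharsets.UTF_8);
  }

  /** Adds a key; keys must arrive in strictly ascending unsigned byte order. */
  void add(byte[] key) {
    if (minKey == null) {
      minKey = key.clone();
      maxKey = minKey;
      return;
    }
    if (Arrays.compareUnsigned(maxKey, key) >= 0) {
      throw new IllegalArgumentException("keys must be strictly ascending");
    }
    int prefixLen = Arrays.mismatch(maxKey, key); // bytes shared with the previous key
    int suffixLen = key.length - prefixLen;
    writeVInt(prefixLen);
    writeVInt(suffixLen);
    rest.write(key, prefixLen, suffixLen);
    maxKey = key.clone();
  }

  /** Appends another buffer whose keys all sort after this buffer's keys. */
  void append(PrefixCodedBuffer other) {
    if (other.minKey == null) return;
    add(other.minKey); // the only entry that has to be re-encoded
    byte[] tail = other.rest.toByteArray();
    rest.write(tail, 0, tail.length); // later entries stay valid: their predecessors are unchanged
    maxKey = other.maxKey;
  }

  /** Decodes every key, mainly to show that the encoding is lossless. */
  List<byte[]> keys() {
    List<byte[]> keys = new ArrayList<>();
    if (minKey == null) return keys;
    keys.add(minKey);
    byte[] prev = minKey;
    ByteArrayInputStream in = new ByteArrayInputStream(rest.toByteArray());
    while (in.available() > 0) {
      int prefixLen = readVInt(in);
      int suffixLen = readVInt(in);
      byte[] key = Arrays.copyOf(prev, prefixLen + suffixLen); // reuse the shared prefix
      in.read(key, prefixLen, suffixLen); // overwrite the tail with the stored suffix
      keys.add(key);
      prev = key;
    }
    return keys;
  }

  /** Size of the encoded entries, excluding the separately stored first key. */
  int encodedSize() {
    return rest.size();
  }

  private void writeVInt(int v) {
    while ((v & ~0x7F) != 0) {
      rest.write((v & 0x7F) | 0x80);
      v >>>= 7;
    }
    rest.write(v);
  }

  private static int readVInt(ByteArrayInputStream in) {
    int b = in.read();
    int v = b & 0x7F;
    for (int shift = 7; (b & 0x80) != 0; shift += 7) {
      b = in.read();
      v |= (b & 0x7F) << shift;
    }
    return v;
  }
}
```

For example, adding "lucene" and "lucene-codecs" to one buffer, then appending a second buffer holding "lucene-core" and "solr", leaves 19 encoded bytes plus the 6-byte first key, against 34 raw key bytes. Only "lucene-core" is re-encoded during the append.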

Memory & Performance Impact

  • Memory Usage: Drops from O(total nodes) to O(total encoded bytes) during building, and down to O(max key depth) during the save operation.
  • Speed: Far fewer object allocations and much lower GC pressure, which speeds up building, especially for very long terms.
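
The O(max key depth) bound during saving follows from keeping only the nodes on the path of the most recent key open. A minimal sketch of that idea, assuming sorted keys are streamed one at a time: the FrontierSketch class, its save() signature, and the string output format are hypothetical stand-ins, and the returned list merely plays the role of the output file.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a frontier-based trie walk; not the PR's saveNodes(). */
final class FrontierSketch {
  /** One open node per depth along the path of the most recent key. */
  static final class FrontierNode {
    int childCount;    // children already finalized under this node
    boolean hasOutput; // true if some key ends exactly here
  }

  /** Returns finalized nodes as "path:childCount" ("*" marks a key end), children first. */
  static List<String> save(List<String> sortedKeys) {
    int maxKeyDepth = 0;
    for (String key : sortedKeys) {
      maxKeyDepth = Math.max(maxKeyDepth, key.length());
    }
    // The only per-node state held in memory; index 0 is the root.
    FrontierNode[] frontier = new FrontierNode[maxKeyDepth + 1];
    for (int d = 0; d < frontier.length; d++) {
      frontier[d] = new FrontierNode();
    }

    List<String> emitted = new ArrayList<>(); // stands in for the output file
    String prev = "";
    for (String key : sortedKeys) {
      int common = commonPrefixLength(prev, key);
      // Nodes below the shared prefix can never gain children again: finalize them.
      closeDownTo(frontier, prev, common, emitted);
      // Reuse the freed slots for the new key's suffix.
      for (int d = common + 1; d <= key.length(); d++) {
        frontier[d].childCount = 0;
        frontier[d].hasOutput = false;
      }
      frontier[key.length()].hasOutput = true;
      prev = key;
    }
    closeDownTo(frontier, prev, 0, emitted);
    emitted.add(describe(frontier[0], "<root>"));
    return emitted;
  }

  private static void closeDownTo(
      FrontierNode[] frontier, String prev, int depth, List<String> emitted) {
    for (int d = prev.length(); d > depth; d--) {
      emitted.add(describe(frontier[d], prev.substring(0, d)));
      frontier[d - 1].childCount++;
    }
  }

  private static String describe(FrontierNode node, String path) {
    return path + ":" + node.childCount + (node.hasOutput ? "*" : "");
  }

  private static int commonPrefixLength(String a, String b) {
    int n = Math.min(a.length(), b.length());
    int i = 0;
    while (i < n && a.charAt(i) == b.charAt(i)) {
      i++;
    }
    return i;
  }
}
```

Calling save on the keys "ab", "ac", "b" emits ab:0*, ac:0*, a:2, b:0*, <root>:2. Each node is written as soon as the next key diverges from it, so the frontier array never holds more than maxKeyDepth + 1 nodes.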

gf2121 requested review from mikemccand and romseygeek April 22, 2026 16:50
github-actions bot modified the milestones: 11.0.0, 10.5.0 Apr 22, 2026

gf2121 commented Apr 23, 2026

Here are some numbers from building an index of 1M UUIDs.

FINAL STATS (1 Million 16-byte UUIDs)

| Metric | Old Trie | New Trie | FST | FST (No Suffix Sharing) |
| --- | --- | --- | --- | --- |
| Memory Used (Before Saving) | 673.06 MB | 20.26 MB | 57.57 MB | 32.06 MB |
| Append Time | 298.82 ms | 166.27 ms | 4587.17 ms | 741.30 ms |
| Save Time | 217.39 ms | 218.97 ms | 49.28 ms | 46.27 ms |
| Total Time | 516.21 ms | 385.23 ms | 4636.45 ms | 787.57 ms |
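
As a reading aid, the old-to-new trie ratios implied by the table are worked out below. They are plain arithmetic on the reported figures for memory, append time, and total time, in that order, not additional measurements.

```latex
\frac{673.06\ \text{MB}}{20.26\ \text{MB}} \approx 33.2, \qquad
\frac{298.82\ \text{ms}}{166.27\ \text{ms}} \approx 1.80, \qquad
\frac{516.21\ \text{ms}}{385.23\ \text{ms}} \approx 1.34
```

Save time is essentially unchanged (217.39 ms vs 218.97 ms), so the total-time gain comes from the append phase.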
