fix!: make Document.id deterministic regardless of meta key order#11446
fix!: make Document.id deterministic regardless of meta key order#11446Aarkin7 wants to merge 3 commits into
Conversation
|
@Aarkin7 is attempting to deploy a commit to the deepset Team on Vercel. A member of the Team first needs to authorize it. |
Coverage reportClick to see where and how coverage changed
This report was generated by python-coverage-comment-action |
||||||||||||||||||||||||
biswajeetdev
left a comment
There was a problem hiding this comment.
The core fix is right — str(dict) includes Python insertion order in its output, so two Document objects with identical content and logically identical meta dicts could get different IDs just because the dicts were constructed differently. json.dumps(..., sort_keys=True) produces a canonical string that is order-independent.
One thing worth calling out in the release note: the default=str argument also affects documents with non-JSON-serialisable meta values (e.g. datetime objects, custom classes). Previously str({"date": datetime(2024,1,1)}) produced "datetime.datetime(2024, 1, 1, 0, 0)" in the hash; now json.dumps(..., default=str) produces "2024-01-01 00:00:00". These are different strings, so any documents with non-serialisable meta fields will also get new IDs — not just documents where key order varied. The release note currently says only non-empty-meta documents are affected, which is true but understates the impact for users with datetime or custom-type meta.
The if self.meta else "{}" guard correctly matches str({}) == "{}" for empty-dict meta, so the "empty meta is unaffected" claim holds.
|
Thanks for catching this. You're right that the impact wording was too narrow. Updated the upgrade note to spell out both cases explicitly:
|
|
Hi @Aarkin7 Thank you for opening this pull request. I agree that a document id should be independent of meta key order. However, it's a big breaking change for production users who have indexed documents and rely on document ids to check for duplicates when re-indexing documents. We'll discuss if that change could be postponed until the next major release of Haystack (version 3.0) and be grouped with any other breaking changes that are necessary. Right now, I believe users benefit more from the stability than from making the ID meta key order-independent. |
The hash was built from dict's repr, which reflects insertion order, so two Documents with equal meta could get different IDs. Serialize meta with sort_keys=True before hashing. Empty-meta IDs are unchanged.
Two BDD scenarios pinned IDs that were computed from documents with non-empty meta, so the deterministic-id fix changes them. Recompute and update the expected values; no behavior change.
496d262 to
ff2fe99
Compare
|
We decided to postpone that fix until the next major Haystack release, which is version 3.0. I changed the target of your PR therefore to our v3 branch. We'll need to document in more detail what users need to be aware of when they migrate from v2 to v3 before merging the PR. We'll take it from here. Thanks again! |
Related Issues
Proposed Changes:
Fixes a bug where Document.id depended on the insertion order of keys in meta. The hash was built from dict's repr, which reflects insertion order, so two Documents with the same content and the same metadata could end up with different IDs, silently breaking DuplicatePolicy.SKIP/FAIL and any cache keyed on the document ID.
Serializing meta with json.dumps(..., sort_keys=True) before hashing makes the ID order-independent. Empty-meta IDs are
preserved by keeping the legacy "{}" string, so only documents with non-empty meta get new IDs.
How did you test it?
Notes for the reviewer
Checklist
fix:,feat:,build:,chore:,ci:,docs:,style:,refactor:,perf:,test:and added!in case the PR includes breaking changes.