After the #505 migration, entries imported before the file_id column was added have file_id = NULL. A backfill would populate these by looking up each entry's filename against commonswiki_p.file on the wikireplica.
It is not clear this is the right thing to do. Key concerns:
- The only available lookup key is the filename, but file_id exists precisely because filenames are not stable. A file renamed since import would produce no match (safe, since the entry stays NULL) or potentially a wrong match if the name was reused by a different file.
- A sanity check (uploader, upload date) would reduce but not eliminate the risk of wrong matches.
- NULL file_id on old entries is a known and documented state, not a bug. The question is whether the operational benefit of having file_id on historical entries outweighs the risk of silently assigning wrong IDs.
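To make the sanity-check concern concrete, here is a minimal sketch of a guard that accepts a candidate only when filename, uploader, and upload date all agree. The `Entry` and `ReplicaFile` shapes and their field names are hypothetical, not the project's schema or the replica's column names, and the sketch assumes both sides store the uploader and timestamp in a directly comparable form.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    """A locally stored entry imported before file_id existed (hypothetical shape)."""
    filename: str
    uploader: str
    uploaded_at: str  # assumed to use the same timestamp format as the replica


@dataclass(frozen=True)
class ReplicaFile:
    """A candidate row found by filename on the replica (hypothetical shape)."""
    file_id: int
    filename: str
    uploader: str
    uploaded_at: str


def checked_file_id(entry: Entry, candidate: Optional[ReplicaFile]) -> Optional[int]:
    """Return the candidate's file_id only if every check passes, else None."""
    if candidate is None:
        return None  # no row under this name: renamed or deleted, leave NULL
    if candidate.filename != entry.filename:
        return None
    if candidate.uploader != entry.uploader:
        return None  # same name, different uploader: the name may have been reused
    if candidate.uploaded_at != entry.uploaded_at:
        return None  # same name, different upload date: treat as a different file
    return candidate.file_id
```

Any disagreement resolves to None, so a doubtful entry keeps its documented NULL state. As noted above, this reduces the risk of a wrong match but cannot eliminate it.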
If backfilling is pursued, the script must:
- Only update entries where file_id IS NULL
- Process entries in batches to avoid overloading the wikireplica
- Match by filename against commonswiki_p.file
- Cross-check uploader and upload date as a sanity guard
- Log all non-matches for manual review
- Be idempotent
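The requirements above could be combined roughly as follows. This is a sketch under stated assumptions, not a proposed implementation: the local table `entries` and its `id` and `filename` columns are hypothetical, `sqlite3` stands in for whatever database the project uses, and `lookup_batch` is a hypothetical callable that runs one replica query per batch against commonswiki_p.file and returns only matches that already passed the uploader and upload date cross-check.

```python
import logging
import sqlite3
from typing import Callable, Dict, List

log = logging.getLogger("backfill")

# Hypothetical: maps each filename in the batch to a verified file_id,
# omitting any filename with no match or a failed cross-check.
LookupBatch = Callable[[List[str]], Dict[str, int]]


def backfill(db: sqlite3.Connection, lookup_batch: LookupBatch,
             batch_size: int = 500) -> int:
    """Fill file_id on rows where it is NULL; return the number of rows updated."""
    updated = 0
    last_id = 0
    while True:
        # Keyset pagination: unmatched rows stay NULL, so advance by id
        # rather than re-selecting the same NULL rows on every pass.
        rows = db.execute(
            "SELECT id, filename FROM entries "
            "WHERE file_id IS NULL AND id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return updated
        matches = lookup_batch([filename for _, filename in rows])
        for entry_id, filename in rows:
            file_id = matches.get(filename)
            if file_id is None:
                # Logged for manual review; the row keeps file_id = NULL.
                log.warning("no verified match: entry=%s filename=%s",
                            entry_id, filename)
                continue
            # The IS NULL guard means a re-run never overwrites a stored value.
            cur = db.execute(
                "UPDATE entries SET file_id = ? WHERE id = ? AND file_id IS NULL",
                (file_id, entry_id),
            )
            updated += cur.rowcount
        db.commit()
        last_id = rows[-1][0]
```

Running the function a second time only revisits rows that are still NULL, so it logs the same non-matches again and changes nothing else. A real script would likely also want a pause between batches and a dry-run mode.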