Backfill file_id for entries imported before #505 migration #513

After the #505 migration, entries imported before the `file_id` column was added have `file_id = NULL`. A backfill would populate these by looking up each entry's filename against `commonswiki_p.file` on the wikireplica.

**It is not clear this is the right thing to do.** Key concerns:

- The only available lookup key is the filename, but `file_id` exists precisely because filenames are not stable. A file renamed since import would produce either no match (safe: the entry stays NULL) or, worse, a wrong match if the old name has since been reused by a different file.
- A sanity check (uploader, upload date) would reduce but not eliminate the risk of wrong matches.
- NULL `file_id` on old entries is a known and documented state, not a bug. The question is whether the operational benefit of having `file_id` on historical entries outweighs the risk of silently assigning wrong IDs.
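
To make the trade-off concrete, here is a minimal sketch of the decision a backfill would have to make for each entry. The `Entry` and `Candidate` shapes and the `decide_file_id` name are hypothetical, and requiring exact equality on uploader and upload date is an assumption; the real schema and a sensible tolerance would need to be confirmed.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Entry:
    """A locally stored entry with file_id = NULL (hypothetical shape)."""
    filename: str
    uploader: str
    uploaded_at: datetime


@dataclass(frozen=True)
class Candidate:
    """The replica row found by filename lookup (hypothetical shape)."""
    file_id: int
    uploader: str
    uploaded_at: datetime


def decide_file_id(entry: Entry, candidate: Optional[Candidate]) -> Optional[int]:
    """Return a file_id only if the filename match also passes the sanity check."""
    if candidate is None:
        # No file currently has this name (renamed or deleted): leave NULL.
        return None
    if candidate.uploader != entry.uploader:
        # The name may have been reused by a different file: leave NULL.
        return None
    if candidate.uploaded_at != entry.uploaded_at:
        return None
    return candidate.file_id
```

Anything that fails the check stays NULL, so the sketch errs toward the safe outcome. It still only reduces the risk: the check is only as reliable as the uploader and date stored at import time.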

If backfilling is pursued, the script must:
1. Only update entries where `file_id IS NULL`
2. Process entries in batches to avoid overloading the wikireplica
3. Match by filename against `commonswiki_p.file`
4. Cross-check uploader and upload date as a sanity guard
5. Log all non-matches for manual review
6. Be idempotent: re-running it must never change an entry that already has a `file_id`
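
The requirements above could be sketched roughly as follows. This is an illustration under stated assumptions, not a proposed implementation: the `entries` table, its `id` and `filename` columns, and the `lookup` callable are all hypothetical, and `sqlite3` stands in for whatever database the tool actually uses. `lookup` is assumed to wrap the query against `commonswiki_p.file` together with the uploader and upload date cross-check, returning `None` when there is no trustworthy match.

```python
import logging
import sqlite3
from typing import Callable, List, Optional, Tuple

log = logging.getLogger("backfill_file_id")


def backfill(
    conn: sqlite3.Connection,
    lookup: Callable[[str], Optional[int]],
    batch_size: int = 100,
) -> Tuple[int, List[str]]:
    """Fill file_id for entries where it is NULL; return (updated, unmatched)."""
    updated = 0
    unmatched: List[str] = []
    last_id = 0  # keyset pagination, so unmatched rows are not fetched again
    while True:
        rows = conn.execute(
            "SELECT id, filename FROM entries "
            "WHERE file_id IS NULL AND id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            break
        for entry_id, filename in rows:
            last_id = entry_id
            file_id = lookup(filename)
            if file_id is None:
                # Requirement 5: record every non-match for manual review.
                log.warning("no match for entry %s (%s)", entry_id, filename)
                unmatched.append(filename)
                continue
            # The IS NULL guard keeps the update idempotent even if another
            # process filled the value in the meantime.
            cur = conn.execute(
                "UPDATE entries SET file_id = ? WHERE id = ? AND file_id IS NULL",
                (file_id, entry_id),
            )
            updated += cur.rowcount
        conn.commit()  # one commit per batch
    return updated, unmatched
```

Paging by `id` rather than re-running the same `file_id IS NULL` query means entries that cannot be matched do not cause an endless loop. A second run touches only entries that are still NULL, so it reports the same non-matches and updates nothing. A real script would probably also pause between batches to limit load on the wikireplica.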
