Backfill file_id for entries imported before #505 migration #513

@lgelauff

After the #505 migration, entries imported before the file_id column was added have file_id = NULL. A backfill would populate these by looking up each entry's filename against commonswiki_p.file on the wikireplica.

It is not clear this is the right thing to do. Key concerns:

  • The only available lookup key is the filename, but file_id exists precisely because filenames are not stable. A file renamed since import would produce either no match (safe: file_id stays NULL) or, if the old name has since been reused by a different file, a wrong match.
  • A sanity check (uploader, upload date) would reduce but not eliminate the risk of wrong matches.
  • NULL file_id on old entries is a known and documented state, not a bug. The question is whether the operational benefit of having file_id on historical entries outweighs the risk of silently assigning wrong IDs.
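To make the three outcomes above concrete, here is a minimal sketch of how a single filename lookup could be classified. The `FileInfo` type, its field names (`uploader`, `upload_ts`), and the `tolerance` parameter are hypothetical; the actual columns on the entries table and on commonswiki_p.file are not specified in this issue.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class FileInfo:
    """Hypothetical view of the fields available on either side of the lookup."""
    uploader: str
    upload_ts: datetime
    file_id: Optional[int] = None


def classify_match(entry: FileInfo, candidate: Optional[FileInfo],
                   tolerance: timedelta = timedelta(0)) -> str:
    """Return 'no_match', 'mismatch', or 'accept' for one filename lookup."""
    if candidate is None:
        # Name no longer exists (renamed or deleted): leave file_id NULL.
        return "no_match"
    same_uploader = entry.uploader == candidate.uploader
    same_time = abs(entry.upload_ts - candidate.upload_ts) <= tolerance
    if same_uploader and same_time:
        return "accept"
    # The name exists but the upload details differ: possibly a reused name.
    return "mismatch"
```

Only `accept` would lead to a write. As noted above, this check narrows the risk of a wrong ID but cannot rule it out.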

If backfilling is pursued, the script must:

  1. Only update entries where file_id IS NULL
  2. Process entries in batches to avoid overloading the wikireplica
  3. Match by filename against commonswiki_p.file
  4. Cross-check uploader and upload date as a sanity guard
  5. Log all non-matches for manual review
  6. Be idempotent
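The requirements above could be sketched roughly as follows. This is an illustration under stated assumptions, not the project's implementation: the table name `entries`, its columns (`id`, `name`, `uploader`, `upload_ts`), and the `resolve` callback are hypothetical, and `sqlite3` stands in for the real database driver. `resolve` is where the lookup against commonswiki_p.file and the uploader/date cross-check would live; it returns a file_id only for a verified match.

```python
import logging
import sqlite3
from typing import Callable, Optional

log = logging.getLogger("backfill_file_id")


def backfill_file_ids(conn: sqlite3.Connection,
                      resolve: Callable[[sqlite3.Row], Optional[int]],
                      batch_size: int = 500) -> dict:
    """Fill entries.file_id where it is NULL; safe to re-run."""
    conn.row_factory = sqlite3.Row
    stats = {"updated": 0, "unmatched": 0}
    last_id = 0
    while True:
        # Page by id: unmatched rows stay NULL, so re-querying "the first N
        # NULL rows" would loop forever on them.
        rows = conn.execute(
            "SELECT id, name, uploader, upload_ts FROM entries "
            "WHERE file_id IS NULL AND id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return stats
        for row in rows:
            file_id = resolve(row)
            if file_id is None:
                # Logged for manual review; the row is left untouched.
                log.warning("no verified match for entry %s (%r)",
                            row["id"], row["name"])
                stats["unmatched"] += 1
                continue
            # Repeat the IS NULL guard so an existing value is never overwritten.
            cur = conn.execute(
                "UPDATE entries SET file_id = ? WHERE id = ? AND file_id IS NULL",
                (file_id, row["id"]),
            )
            stats["updated"] += cur.rowcount
        conn.commit()
        last_id = rows[-1]["id"]
```

Committing once per batch keeps transactions short, and a second run only revisits rows that are still NULL, so re-running after a partial failure changes nothing that was already filled.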
