
Migrate loaders to support file_id and enforce wikireplica enrichment (#505, #509, #510, #513, #516)#520

Open
ayushshukla1807 wants to merge 36 commits into hatnote:master from ayushshukla1807:migrate-image-to-filerevision

Conversation

@ayushshukla1807

🔍 Context & Problem Statement

The ongoing file & filerevision dataset migration surfaced several limitations in the CSV import tooling and environment behavior:

  1. The VITE_API_ENDPOINT fallback failed during production builds (CORS and base-URL resolution failures).
  2. Full CSV imports bypassed labs.py, leaving imported entries with stale metadata and no file_ids at all.
  3. The CSV loaders indexed entries by filename string only, with no file_id lookup path.

🛠️ Solution

  1. Added a strict frontend/.env.production so production builds resolve the API base path correctly.
  2. Built chunked ID loaders (load_id_list and /utils/file_id) keyed on stable schema fields.
  3. Refactored load_full_csv so full imports run through the same wikireplica enrichment as the partial CSV pipeline, propagating file_id throughout.
  4. Added tools/backfill_file_id.py, a batch runner that backfills file_id on legacy entries missing identifiers.
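As a rough sketch of what a chunked ID loader can look like (the real load_id_list signature and schema keys may differ; `id_field` and `chunk_size` here are illustrative, not the PR's actual parameters):

```python
import csv
import io

def load_id_list(fobj, id_field="file_id", chunk_size=500):
    """Yield lists of at most chunk_size integer IDs from a CSV stream.

    Hypothetical sketch of the chunked loader described above; rows
    without a usable ID are skipped rather than failing the batch.
    """
    reader = csv.DictReader(fobj)
    chunk = []
    for row in reader:
        value = row.get(id_field)
        if not value:  # missing or empty ID: skip this row
            continue
        chunk.append(int(value))
        if len(chunk) >= chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

# usage sketch
sample = io.StringIO("file_id,filename\n1,A.jpg\n2,B.jpg\n,C.jpg\n3,D.jpg\n")
chunks = list(load_id_list(sample, chunk_size=2))
# chunks == [[1, 2], [3]] — the ID-less C.jpg row is skipped
```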

Closes #516
Closes #513
Closes #510
Closes #509
Closes #505

lgelauff and others added 30 commits April 17, 2026 19:28
The `image` and `oldimage` tables are being removed from Wikimedia
wikireplicas on 28 May 2026 (T28741). Rewrites `get_files()` and
`get_file_info()` in `labs.py` to use the new `file`, `filerevision`,
and `filetypes` tables. All output key aliases (`img_*`, `oi_archive_name`,
`rec_img_*`) are preserved so downstream code is unaffected.

Also adds `file_id` (the Commons file identity) as a nullable column on
the `Entry` model, threaded through `make_entry()` for all new imports.

Key implementation notes:
- GROUP BY non-determinism fixed: replaced with a correlated MIN(fr_id)
  subquery that deterministically selects the earliest non-deleted revision
  per file (original uploader / original timestamp).
- `get_files_legacy()` is a temporary verbatim copy of the old query,
  kept alive solely to power `test_get_files_parity` (xfail) while both
  table sets are live. Remove together with that test after 28 May 2026.
- The new `filetypes` LEFT JOIN can produce NULL MIME values (the old
  `image` table never did). Files with unknown MIME are kept (stored as
  NULL) rather than silently dropped. This surfaces a pre-existing bug in
  `autodisqualify_by_filetype()` where SQL NOT IN on NULL evaluates to
  NULL, silently skipping disqualification — left as a follow-up issue.
- Toolforge runs MariaDB 10.6.22 (partial unique indexes require 10.8+),
  so the migration SQL uses a plain non-unique index on `file_id`;
  import-time uniqueness enforced at the application layer.
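The correlated-subquery fix can be sketched against a toy schema (table and column names approximate the wikireplica file/filerevision layout; the real query in labs.py joins more tables and columns):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE file (file_id INTEGER PRIMARY KEY, file_name TEXT);
CREATE TABLE filerevision (
    fr_id INTEGER PRIMARY KEY, fr_file INTEGER,
    fr_deleted INTEGER, fr_timestamp TEXT);
INSERT INTO file VALUES (1, 'A.jpg'), (2, 'B.jpg');
INSERT INTO filerevision VALUES
    (10, 1, 0, '20250101000000'),
    (11, 1, 0, '20250201000000'),
    (20, 2, 1, '20250101000000'),  -- deleted: must be skipped
    (21, 2, 0, '20250301000000');
""")
# The correlated MIN(fr_id) subquery deterministically picks the earliest
# non-deleted revision per file (original uploader / original timestamp),
# unlike a GROUP BY whose picked row is engine-dependent.
rows = conn.execute("""
    SELECT f.file_id, fr.fr_id, fr.fr_timestamp
    FROM file f
    JOIN filerevision fr ON fr.fr_id = (
        SELECT MIN(fr2.fr_id) FROM filerevision fr2
        WHERE fr2.fr_file = f.file_id AND fr2.fr_deleted = 0)
    ORDER BY f.file_id
""").fetchall()
# file 1 -> fr 10 (earliest), file 2 -> fr 21 (fr 20 is deleted)
```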

Migration SQL: `.claude/features/migrate-image-to-filerevision/migration.sql`
Run on production before deploying.

Closes hatnote#504
Refs T28741 (https://phabricator.wikimedia.org/T28741)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
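The `autodisqualify_by_filetype()` NULL pitfall noted in the commit above is easy to reproduce in any SQL engine; a minimal sqlite3 demonstration (table and column names invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (name TEXT, mime TEXT)")
conn.executemany("INSERT INTO files VALUES (?, ?)", [
    ("a.jpg", "image/jpeg"),
    ("b.gif", "image/gif"),
    ("c.bin", None),  # unknown MIME, as the filetypes LEFT JOIN can produce
])
# Intent: find every file whose MIME is outside the allowed list.
hits = conn.execute(
    "SELECT name FROM files WHERE mime NOT IN ('image/gif')").fetchall()
# NULL NOT IN (...) evaluates to NULL, not TRUE, so c.bin is silently
# skipped and never disqualified: hits == [('a.jpg',)]

# A NULL-safe variant treats unknown MIME as "not in the allowed list":
safe = conn.execute(
    "SELECT name FROM files "
    "WHERE mime IS NULL OR mime NOT IN ('image/gif')").fetchall()
# safe == [('a.jpg',), ('c.bin',)]
```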
Add AND fr.fr_deleted = 0 to the main filerevision JOIN in both
get_files() and get_file_info(). In the old image table schema the
current revision was always live; in filerevision, file_latest can
theoretically point to a suppressed revision. Without this guard such
a file would be excluded from results by the inner JOIN rather than
returning NULL metadata — a regression introduced by the migration.

Also add a comment to make_entry() explaining that CSV-imported entries
intentionally receive file_id=NULL (file_id only comes from wikireplica
imports, not the CSV path).

Refs hatnote#505

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… 12-step procedure

migrate_beta_db.py:
- Check both column AND index for idempotency (previously only checked column)
- Wrap both DDL steps in `with c:` for atomicity — both succeed or neither does
- Add try/finally to ensure connection is closed
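The combined column-and-index idempotency check might look like the following sketch (the `entries` table and `ix_entries_file_id` index names are taken from later commits in this PR; the actual script may differ):

```python
import sqlite3

def needs_migration(conn):
    """True while either the file_id column or its index is missing."""
    cols = {r[1] for r in conn.execute("PRAGMA table_info(entries)")}
    idxs = {r[1] for r in conn.execute("PRAGMA index_list(entries)")}
    return "file_id" not in cols or "ix_entries_file_id" not in idxs

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (id INTEGER PRIMARY KEY, name TEXT)")
assert needs_migration(conn)

with conn:  # group both DDL steps, as the commit describes
    conn.execute("ALTER TABLE entries ADD COLUMN file_id INTEGER")
    conn.execute("CREATE INDEX ix_entries_file_id ON entries (file_id)")

assert not needs_migration(conn)  # re-running is now a no-op
```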

revert_beta_db.py:
- Fast path (SQLite ≥3.35): wrap DROP INDEX + DROP COLUMN in explicit BEGIN/COMMIT
  using isolation_level=None (required for manual transaction control)
- Slow path (<3.35): replace broken CREATE TABLE AS SELECT (loses PK/constraints/indexes)
  with proper SQLite 12-step schema-alteration procedure:
  reads original schema from sqlite_master, strips file_id, creates entries_new,
  copies data, drops old table, renames, recreates all other indexes/triggers,
  runs PRAGMA foreign_key_check, commits atomically
- Verified locally: pre-migration schema/indexes/data identical to post-revert
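Why the old CREATE TABLE AS SELECT revert was broken, and the shape of the fix, in a toy sqlite3 session (schema simplified; the real 12-step procedure also recreates indexes and triggers and runs PRAGMA foreign_key_check):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entries (id INTEGER PRIMARY KEY, name TEXT, file_id INTEGER)")

def pk_cols(table):
    # column 5 of PRAGMA table_info is the primary-key flag
    return [r[1] for r in conn.execute(f"PRAGMA table_info({table})") if r[5]]

# Broken revert: CREATE TABLE ... AS SELECT copies data but silently drops
# primary keys, constraints, defaults, and indexes.
conn.execute("CREATE TABLE entries_copy AS SELECT id, name FROM entries")
lost = pk_cols("entries_copy")  # [] -- the PK did not survive
kept = pk_cols("entries")       # ['id']

# Core of the 12-step fix: recreate from an explicit schema, copy, swap.
conn.execute("CREATE TABLE entries_new (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO entries_new SELECT id, name FROM entries")
conn.execute("DROP TABLE entries")
conn.execute("ALTER TABLE entries_new RENAME TO entries")
# entries now keeps its original PK and no longer has file_id
```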

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
python-graph-core uses pkg_resources namespace packages which requires
setuptools to be explicitly installed on Python 3.13 (it is no longer
bundled). Tracked in hatnote#421.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
setuptools 82.0.0 removed pkg_resources entirely. python-graph-core
declares pygraph as a namespace package via pkg_resources, causing
ModuleNotFoundError on startup. Pin to 81.0.0 until python-graph-core
is replaced or updated (hatnote#421).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
file_id is the stable Commons identifier introduced in the
image→filerevision migration (hatnote#504). Including it in the API
response makes it available to clients for deduplication and
rename tracking.

Inspired by @ayushshukla1807's work in hatnote#506.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Entry.to_details_dict() and to_export_dict() now include file_id,
  making it available in all API responses that return entry data
- Add test_import_entries_have_file_id: end-to-end regression test
  verifying file_id survives the full import pipeline
  (labs.py → make_entry() → DB → to_export_dict() → API response)

Inspired by @ayushshukla1807's work in hatnote#506.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add tools/migrate_prod_db.sql and revert_prod_db.sql for production
  MySQL migration (idempotent, exact inverses of each other)
- Add comment to labs.py clarifying that alias `oi` mirrors the old
  oldimage role but is a filerevision subquery, not the oldimage table
- Change fixture file_ids (conftest.py) for SELECTED and REUPLOAD from
  99999/88888 to 1/2, structurally below the FIXTURE_FILE_INFOS range
  (1000+) so collisions are impossible regardless of future additions

Part of hatnote#505.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Fix tablename: actual SQLAlchemy model uses `entries` not `entry`
- Rename index to ix_entries_file_id to match tablename convention
- Both scripts remain idempotent (IF NOT EXISTS / IF EXISTS)

Part of hatnote#505.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Use os.path.expanduser() instead of hardcoded /root path,
matching the pattern in labs.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Reports schema state (file_id column and index presence), entry
coverage, and get_files() query time against a live category.
Run before and after migrate_prod_db.sql for hatnote#514.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
No credentials needed as arguments — uses load_env_config()
to read config.beta.yaml, matching how the app connects.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Use PRAGMA instead of information_schema when db_url is SQLite,
so the script works before the MariaDB config switch.
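A sketch of the dialect switch (the helper name and db_url handling here are assumptions, not the script's actual code):

```python
import sqlite3

def has_file_id(conn, db_url):
    # Hypothetical helper; the real script's structure may differ.
    if db_url.startswith("sqlite"):
        rows = conn.execute("PRAGMA table_info(entries)").fetchall()
        return any(r[1] == "file_id" for r in rows)
    # MariaDB/MySQL path: information_schema is available instead of PRAGMA
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM information_schema.columns "
        "WHERE table_name = 'entries' AND column_name = 'file_id'"
    ).fetchone()
    return count > 0

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (id INTEGER PRIMARY KEY, file_id INTEGER)")
found = has_file_id(conn, "sqlite:///beta.db")  # exercises the SQLite path
```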

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Global pre-commit hook at ~/.claude/hooks/pre-commit blocks commits
containing detected secrets. This baseline suppresses false positives
for this repo.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Avoids embedding any credential placeholder in the script, removing
the need for a pragma comment that would suppress detection even for
real passwords. Related: hatnote#514

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Reads credentials from replica.my.cnf at runtime — no credentials
in the script. Avoids terminal line-wrapping on Toolforge.
Related: hatnote#514
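The runtime credential read likely follows the standard Toolforge pattern of parsing `~/replica.my.cnf`; a sketch (the helper name is an assumption, and a throwaway file stands in for the real one):

```python
import configparser
import os
import tempfile

def read_replica_credentials(path="~/replica.my.cnf"):
    """Read Toolforge replica DB credentials at runtime.

    Nothing is stored in the script itself, so there is no secret
    for a pre-commit scanner to flag.
    """
    cfg = configparser.ConfigParser()
    cfg.read(os.path.expanduser(path))
    return cfg["client"]["user"], cfg["client"]["password"]

# usage sketch with a throwaway stand-in file
with tempfile.NamedTemporaryFile("w", suffix=".cnf", delete=False) as f:
    f.write("[client]\nuser = s12345\npassword = secret\n")
user, password = read_replica_credentials(f.name)
# ('s12345', 'secret')
```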

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Without -h, mysql defaults to a local socket which doesn't exist
on the Toolforge bastion. Related: hatnote#514

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Drops all tables and recreates from current SQLAlchemy models.
Beta only — no real data to preserve. hatnote#514

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
lgelauff and others added 6 commits April 19, 2026 12:55
MariaDB enforces foreign key constraints during DROP TABLE unlike
SQLite. Disable before drop, re-enable after. hatnote#514

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Safer approach: use metadata.sorted_tables reversed to drop child
tables before parents. No need to disable FK constraints.
hatnote#514
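Why drop order matters, illustrated with stdlib sqlite3 (SQLite only enforces FKs once PRAGMA foreign_keys is enabled, which mimics the MariaDB behavior the commit hit; in SQLAlchemy, reversed(metadata.sorted_tables) produces this child-first order automatically):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FKs off by default
conn.execute("CREATE TABLE parent (id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE child (id INTEGER PRIMARY KEY, "
             "pid INTEGER REFERENCES parent(id))")
conn.execute("INSERT INTO parent VALUES (1)")
conn.execute("INSERT INTO child VALUES (1, 1)")

# Parent-first fails under FK enforcement (MariaDB always enforces this):
parent_first_failed = False
try:
    conn.execute("DROP TABLE parent")
except sqlite3.IntegrityError:
    parent_first_failed = True  # FOREIGN KEY constraint failed

# Child-before-parent succeeds without touching FK settings at all:
for table in ("child", "parent"):
    conn.execute(f"DROP TABLE {table}")
```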

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>