
Migrate loaders to support file_id and enforce wikireplica enrichment (#505, #509, #510, #513, #516)#520

Open
ayushshukla1807 wants to merge 36 commits into hatnote:master from ayushshukla1807:migrate-image-to-filerevision

Conversation

@ayushshukla1807

🔍 Context & Problem Statement

The ongoing file & filerevision dataset migration surfaced several limitations in the CSV import tooling and environment behavior:

  1. The VITE_API_ENDPOINT fallback failed during production builds (CORS and base-URL resolution failures).
  2. Full CSV imports bypassed labs.py, leaving imported entries with stale metadata and no file_ids at all.
  3. The CSV loaders indexed entries by filename string only, with no file_id lookup path.

🛠️ Solution

  1. Added a strict frontend/.env.production so production builds resolve the API base path correctly.
  2. Built chunked ID loaders (load_id_list and /utils/file_id) keyed on stable schema fields.
  3. Refactored load_full_csv so full imports run through the same wikireplica enrichment as the partial CSV pipeline, propagating file_id throughout.
  4. Added tools/backfill_file_id.py, a batch runner that backfills file_id on legacy entries missing identifiers.
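As a rough sketch of what a chunked ID loader can look like (the real load_id_list signature and schema keys may differ; `id_field` and `chunk_size` here are illustrative, not the PR's actual parameters):

```python
import csv
import io

def load_id_list(fobj, id_field="file_id", chunk_size=500):
    """Yield lists of at most chunk_size integer IDs from a CSV stream.

    Hypothetical sketch of the chunked loader described above; rows
    without a usable ID are skipped rather than failing the batch.
    """
    reader = csv.DictReader(fobj)
    chunk = []
    for row in reader:
        value = row.get(id_field)
        if not value:  # missing or empty ID: skip this row
            continue
        chunk.append(int(value))
        if len(chunk) >= chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

# usage sketch
sample = io.StringIO("file_id,filename\n1,A.jpg\n2,B.jpg\n,C.jpg\n3,D.jpg\n")
chunks = list(load_id_list(sample, chunk_size=2))
# chunks == [[1, 2], [3]] — the ID-less C.jpg row is skipped
```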

Closes #516
Closes #513
Closes #510
Closes #509
Closes #505

lgelauff and others added 30 commits April 17, 2026 19:28
The `image` and `oldimage` tables are being removed from Wikimedia
wikireplicas on 28 May 2026 (T28741). Rewrites `get_files()` and
`get_file_info()` in `labs.py` to use the new `file`, `filerevision`,
and `filetypes` tables. All output key aliases (`img_*`, `oi_archive_name`,
`rec_img_*`) are preserved so downstream code is unaffected.

Also adds `file_id` (the Commons file identity) as a nullable column on
the `Entry` model, threaded through `make_entry()` for all new imports.

Key implementation notes:
- GROUP BY non-determinism fixed: replaced with a correlated MIN(fr_id)
  subquery that deterministically selects the earliest non-deleted revision
  per file (original uploader / original timestamp).
- `get_files_legacy()` is a temporary verbatim copy of the old query,
  kept alive solely to power `test_get_files_parity` (xfail) while both
  table sets are live. Remove together with that test after 28 May 2026.
- The new `filetypes` LEFT JOIN can produce NULL MIME values (the old
  `image` table never did). Files with unknown MIME are kept (stored as
  NULL) rather than silently dropped. This surfaces a pre-existing bug in
  `autodisqualify_by_filetype()` where SQL NOT IN on NULL evaluates to
  NULL, silently skipping disqualification — left as a follow-up issue.
- Toolforge runs MariaDB 10.6.22 (partial unique indexes require 10.8+),
  so the migration SQL uses a plain non-unique index on `file_id`;
  import-time uniqueness enforced at the application layer.
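The correlated-subquery fix can be sketched against a toy schema (table and column names approximate the wikireplica file/filerevision layout; the real query in labs.py joins more tables and columns):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE file (file_id INTEGER PRIMARY KEY, file_name TEXT);
CREATE TABLE filerevision (
    fr_id INTEGER PRIMARY KEY, fr_file INTEGER,
    fr_deleted INTEGER, fr_timestamp TEXT);
INSERT INTO file VALUES (1, 'A.jpg'), (2, 'B.jpg');
INSERT INTO filerevision VALUES
    (10, 1, 0, '20250101000000'),
    (11, 1, 0, '20250201000000'),
    (20, 2, 1, '20250101000000'),  -- deleted: must be skipped
    (21, 2, 0, '20250301000000');
""")
# The correlated MIN(fr_id) subquery deterministically picks the earliest
# non-deleted revision per file (original uploader / original timestamp),
# unlike a GROUP BY whose picked row is engine-dependent.
rows = conn.execute("""
    SELECT f.file_id, fr.fr_id, fr.fr_timestamp
    FROM file f
    JOIN filerevision fr ON fr.fr_id = (
        SELECT MIN(fr2.fr_id) FROM filerevision fr2
        WHERE fr2.fr_file = f.file_id AND fr2.fr_deleted = 0)
    ORDER BY f.file_id
""").fetchall()
# file 1 -> fr 10 (earliest), file 2 -> fr 21 (fr 20 is deleted)
```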

Migration SQL: `.claude/features/migrate-image-to-filerevision/migration.sql`
Run on production before deploying.

Closes hatnote#504
Refs T28741 (https://phabricator.wikimedia.org/T28741)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
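The `autodisqualify_by_filetype()` NULL pitfall noted in the commit above is easy to reproduce in any SQL engine; a minimal sqlite3 demonstration (table and column names invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (name TEXT, mime TEXT)")
conn.executemany("INSERT INTO files VALUES (?, ?)", [
    ("a.jpg", "image/jpeg"),
    ("b.gif", "image/gif"),
    ("c.bin", None),  # unknown MIME, as the filetypes LEFT JOIN can produce
])
# Intent: find every file whose MIME is outside the allowed list.
hits = conn.execute(
    "SELECT name FROM files WHERE mime NOT IN ('image/gif')").fetchall()
# NULL NOT IN (...) evaluates to NULL, not TRUE, so c.bin is silently
# skipped and never disqualified: hits == [('a.jpg',)]

# A NULL-safe variant treats unknown MIME as "not in the allowed list":
safe = conn.execute(
    "SELECT name FROM files "
    "WHERE mime IS NULL OR mime NOT IN ('image/gif')").fetchall()
# safe == [('a.jpg',), ('c.bin',)]
```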
Add AND fr.fr_deleted = 0 to the main filerevision JOIN in both
get_files() and get_file_info(). In the old image table schema the
current revision was always live; in filerevision, file_latest can
theoretically point to a suppressed revision. Without this guard such
a file would be excluded from results by the inner JOIN rather than
returning NULL metadata — a regression introduced by the migration.

Also add a comment to make_entry() explaining that CSV-imported entries
intentionally receive file_id=NULL (file_id only comes from wikireplica
imports, not the CSV path).

Refs hatnote#505

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… 12-step procedure

migrate_beta_db.py:
- Check both column AND index for idempotency (previously only checked column)
- Wrap both DDL steps in `with c:` for atomicity — both succeed or neither does
- Add try/finally to ensure connection is closed
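The combined column-and-index idempotency check might look like the following sketch (the `entries` table and `ix_entries_file_id` index names are taken from later commits in this PR; the actual script may differ):

```python
import sqlite3

def needs_migration(conn):
    """True while either the file_id column or its index is missing."""
    cols = {r[1] for r in conn.execute("PRAGMA table_info(entries)")}
    idxs = {r[1] for r in conn.execute("PRAGMA index_list(entries)")}
    return "file_id" not in cols or "ix_entries_file_id" not in idxs

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (id INTEGER PRIMARY KEY, name TEXT)")
assert needs_migration(conn)

with conn:  # group both DDL steps, as the commit describes
    conn.execute("ALTER TABLE entries ADD COLUMN file_id INTEGER")
    conn.execute("CREATE INDEX ix_entries_file_id ON entries (file_id)")

assert not needs_migration(conn)  # re-running is now a no-op
```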

revert_beta_db.py:
- Fast path (SQLite ≥3.35): wrap DROP INDEX + DROP COLUMN in explicit BEGIN/COMMIT
  using isolation_level=None (required for manual transaction control)
- Slow path (<3.35): replace broken CREATE TABLE AS SELECT (loses PK/constraints/indexes)
  with proper SQLite 12-step schema-alteration procedure:
  reads original schema from sqlite_master, strips file_id, creates entries_new,
  copies data, drops old table, renames, recreates all other indexes/triggers,
  runs PRAGMA foreign_key_check, commits atomically
- Verified locally: pre-migration schema/indexes/data identical to post-revert
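Why the old CREATE TABLE AS SELECT revert was broken, and the shape of the fix, in a toy sqlite3 session (schema simplified; the real 12-step procedure also recreates indexes and triggers and runs PRAGMA foreign_key_check):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entries (id INTEGER PRIMARY KEY, name TEXT, file_id INTEGER)")

def pk_cols(table):
    # column 5 of PRAGMA table_info is the primary-key flag
    return [r[1] for r in conn.execute(f"PRAGMA table_info({table})") if r[5]]

# Broken revert: CREATE TABLE ... AS SELECT copies data but silently drops
# primary keys, constraints, defaults, and indexes.
conn.execute("CREATE TABLE entries_copy AS SELECT id, name FROM entries")
lost = pk_cols("entries_copy")  # [] -- the PK did not survive
kept = pk_cols("entries")       # ['id']

# Core of the 12-step fix: recreate from an explicit schema, copy, swap.
conn.execute("CREATE TABLE entries_new (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO entries_new SELECT id, name FROM entries")
conn.execute("DROP TABLE entries")
conn.execute("ALTER TABLE entries_new RENAME TO entries")
# entries now keeps its original PK and no longer has file_id
```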

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
python-graph-core uses pkg_resources namespace packages which requires
setuptools to be explicitly installed on Python 3.13 (it is no longer
bundled). Tracked in hatnote#421.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
setuptools 82.0.0 removed pkg_resources entirely. python-graph-core
declares pygraph as a namespace package via pkg_resources, causing
ModuleNotFoundError on startup. Pin to 81.0.0 until python-graph-core
is replaced or updated (hatnote#421).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
file_id is the stable Commons identifier introduced in the
image→filerevision migration (hatnote#504). Including it in the API
response makes it available to clients for deduplication and
rename tracking.

Inspired by @ayushshukla1807's work in hatnote#506.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Entry.to_details_dict() and to_export_dict() now include file_id,
  making it available in all API responses that return entry data
- Add test_import_entries_have_file_id: end-to-end regression test
  verifying file_id survives the full import pipeline
  (labs.py → make_entry() → DB → to_export_dict() → API response)

Inspired by @ayushshukla1807's work in hatnote#506.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add tools/migrate_prod_db.sql and revert_prod_db.sql for production
  MySQL migration (idempotent, exact inverses of each other)
- Add comment to labs.py clarifying that alias `oi` mirrors the old
  oldimage role but is a filerevision subquery, not the oldimage table
- Change fixture file_ids (conftest.py) for SELECTED and REUPLOAD from
  99999/88888 to 1/2, structurally below the FIXTURE_FILE_INFOS range
  (1000+) so collisions are impossible regardless of future additions

Part of hatnote#505.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Fix tablename: actual SQLAlchemy model uses `entries` not `entry`
- Rename index to ix_entries_file_id to match tablename convention
- Both scripts remain idempotent (IF NOT EXISTS / IF EXISTS)

Part of hatnote#505.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Use os.path.expanduser() instead of hardcoded /root path,
matching the pattern in labs.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Reports schema state (file_id column and index presence), entry
coverage, and get_files() query time against a live category.
Run before and after migrate_prod_db.sql for hatnote#514.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
No credentials needed as arguments — uses load_env_config()
to read config.beta.yaml, matching how the app connects.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Use PRAGMA instead of information_schema when db_url is SQLite,
so the script works before the MariaDB config switch.
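A sketch of the dialect switch (the helper name and db_url handling here are assumptions, not the script's actual code):

```python
import sqlite3

def has_file_id(conn, db_url):
    # Hypothetical helper; the real script's structure may differ.
    if db_url.startswith("sqlite"):
        rows = conn.execute("PRAGMA table_info(entries)").fetchall()
        return any(r[1] == "file_id" for r in rows)
    # MariaDB/MySQL path: information_schema is available instead of PRAGMA
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM information_schema.columns "
        "WHERE table_name = 'entries' AND column_name = 'file_id'"
    ).fetchone()
    return count > 0

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (id INTEGER PRIMARY KEY, file_id INTEGER)")
found = has_file_id(conn, "sqlite:///beta.db")  # exercises the SQLite path
```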

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Global pre-commit hook at ~/.claude/hooks/pre-commit blocks commits
containing detected secrets. This baseline suppresses false positives
for this repo.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Avoids embedding any credential placeholder in the script, removing
the need for a pragma comment that would suppress detection even for
real passwords. Related: hatnote#514

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Reads credentials from replica.my.cnf at runtime — no credentials
in the script. Avoids terminal line-wrapping on Toolforge.
Related: hatnote#514
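The runtime credential read likely follows the standard Toolforge pattern of parsing `~/replica.my.cnf`; a sketch (the helper name is an assumption, and a throwaway file stands in for the real one):

```python
import configparser
import os
import tempfile

def read_replica_credentials(path="~/replica.my.cnf"):
    """Read Toolforge replica DB credentials at runtime.

    Nothing is stored in the script itself, so there is no secret
    for a pre-commit scanner to flag.
    """
    cfg = configparser.ConfigParser()
    cfg.read(os.path.expanduser(path))
    return cfg["client"]["user"], cfg["client"]["password"]

# usage sketch with a throwaway stand-in file
with tempfile.NamedTemporaryFile("w", suffix=".cnf", delete=False) as f:
    f.write("[client]\nuser = s12345\npassword = secret\n")
user, password = read_replica_credentials(f.name)
# ('s12345', 'secret')
```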

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Without -h, mysql defaults to a local socket which doesn't exist
on the Toolforge bastion. Related: hatnote#514

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Drops all tables and recreates from current SQLAlchemy models.
Beta only — no real data to preserve. hatnote#514

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
lgelauff and others added 6 commits April 19, 2026 12:55
MariaDB enforces foreign key constraints during DROP TABLE unlike
SQLite. Disable before drop, re-enable after. hatnote#514

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Safer approach: use metadata.sorted_tables reversed to drop child
tables before parents. No need to disable FK constraints.
hatnote#514
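Why drop order matters, illustrated with stdlib sqlite3 (SQLite only enforces FKs once PRAGMA foreign_keys is enabled, which mimics the MariaDB behavior the commit hit; in SQLAlchemy, reversed(metadata.sorted_tables) produces this child-first order automatically):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FKs off by default
conn.execute("CREATE TABLE parent (id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE child (id INTEGER PRIMARY KEY, "
             "pid INTEGER REFERENCES parent(id))")
conn.execute("INSERT INTO parent VALUES (1)")
conn.execute("INSERT INTO child VALUES (1, 1)")

# Parent-first fails under FK enforcement (MariaDB always enforces this):
parent_first_failed = False
try:
    conn.execute("DROP TABLE parent")
except sqlite3.IntegrityError:
    parent_first_failed = True  # FOREIGN KEY constraint failed

# Child-before-parent succeeds without touching FK settings at all:
for table in ("child", "parent"):
    conn.execute(f"DROP TABLE {table}")
```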

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>