Migrate loaders to support file_id and enforce wikireplica enrichment (#505, #509, #510, #513, #516) #520
Open
ayushshukla1807 wants to merge 36 commits into hatnote:master from
The `image` and `oldimage` tables are being removed from Wikimedia wikireplicas on 28 May 2026 (T28741). Rewrites `get_files()` and `get_file_info()` in `labs.py` to use the new `file`, `filerevision`, and `filetypes` tables. All output key aliases (`img_*`, `oi_archive_name`, `rec_img_*`) are preserved so downstream code is unaffected. Also adds `file_id` (the Commons file identity) as a nullable column on the `Entry` model, threaded through `make_entry()` for all new imports.

Key implementation notes:

- GROUP BY non-determinism fixed: replaced with a correlated MIN(fr_id) subquery that deterministically selects the earliest non-deleted revision per file (original uploader / original timestamp).
- `get_files_legacy()` is a temporary verbatim copy of the old query, kept alive solely to power `test_get_files_parity` (xfail) while both table sets are live. Remove together with that test after 28 May 2026.
- The new `filetypes` LEFT JOIN can produce NULL MIME values (the old `image` table never did). Files with unknown MIME are kept (stored as NULL) rather than silently dropped. This surfaces a pre-existing bug in `autodisqualify_by_filetype()` where SQL NOT IN on NULL evaluates to NULL, silently skipping disqualification — left as a follow-up issue.
- Toolforge runs MariaDB 10.6.22 (partial unique indexes require 10.8+), so the migration SQL uses a plain non-unique index on `file_id`; import-time uniqueness is enforced at the application layer.

Migration SQL: `.claude/features/migrate-image-to-filerevision/migration.sql` — run on production before deploying.

Closes hatnote#504
Refs T28741 (https://phabricator.wikimedia.org/T28741)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
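The correlated-subquery selection described above can be sketched on a toy SQLite schema. Table and column names (`fr_id`, `fr_file`, `fr_deleted`) are taken from the commit text, not verified against the live wikireplica schema, and the real query in `labs.py` is more involved:

```python
import sqlite3

# Toy stand-ins for the wikireplica `file` / `filerevision` tables
# (column names assumed from the PR description).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE file (file_id INTEGER PRIMARY KEY, file_name TEXT);
CREATE TABLE filerevision (
    fr_id INTEGER PRIMARY KEY,
    fr_file INTEGER,
    fr_deleted INTEGER DEFAULT 0,
    fr_timestamp TEXT,
    fr_actor TEXT
);
INSERT INTO file VALUES (1, 'Example.jpg');
-- Three revisions; the earliest (fr_id=10) is deleted, so the
-- correlated subquery must pick fr_id=11, not the overall MIN.
INSERT INTO filerevision VALUES
    (10, 1, 1, '20200101000000', 'vandal'),
    (11, 1, 0, '20200102000000', 'uploader'),
    (12, 1, 0, '20210101000000', 'retoucher');
""")

# Deterministic "original upload" selection: instead of GROUP BY
# (whose non-aggregated columns are picked arbitrarily), join on the
# minimum non-deleted fr_id per file.
row = conn.execute("""
    SELECT f.file_name, fr.fr_actor, fr.fr_timestamp
    FROM file f
    JOIN filerevision fr ON fr.fr_file = f.file_id
    WHERE fr.fr_id = (
        SELECT MIN(fr2.fr_id)
        FROM filerevision fr2
        WHERE fr2.fr_file = f.file_id AND fr2.fr_deleted = 0
    )
""").fetchone()
print(row)  # ('Example.jpg', 'uploader', '20200102000000')
```

Unlike `GROUP BY fr_file` with non-aggregated columns, this always returns the actor and timestamp belonging to the same revision row.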
Add `AND fr.fr_deleted = 0` to the main filerevision JOIN in both `get_files()` and `get_file_info()`. In the old `image` table schema the current revision was always live; in `filerevision`, `file_latest` can theoretically point to a suppressed revision. Without this guard such a file would be excluded from results by the inner JOIN rather than returning NULL metadata — a regression introduced by the migration.

Also add a comment to `make_entry()` explaining that CSV-imported entries intentionally receive `file_id=NULL` (`file_id` only comes from wikireplica imports, not the CSV path).

Refs hatnote#505

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
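The deleted-revision guard interacts with join semantics in a way a toy example makes concrete: for a LEFT JOIN, putting the condition in the ON clause keeps the file row with NULL metadata, while the same condition in WHERE silently drops the row. Tables here are minimal stand-ins, not the actual `labs.py` query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE file (file_id INTEGER PRIMARY KEY, file_latest INTEGER);
CREATE TABLE filerevision (fr_id INTEGER PRIMARY KEY, fr_deleted INTEGER);
-- file_latest points at a suppressed revision
INSERT INTO file VALUES (1, 50);
INSERT INTO filerevision VALUES (50, 1);
""")

# Guard in the ON clause: the file survives with NULL metadata.
in_on = conn.execute("""
    SELECT f.file_id, fr.fr_id FROM file f
    LEFT JOIN filerevision fr
        ON fr.fr_id = f.file_latest AND fr.fr_deleted = 0
""").fetchall()

# Same guard in WHERE: NULL comparison filters the file out entirely.
in_where = conn.execute("""
    SELECT f.file_id, fr.fr_id FROM file f
    LEFT JOIN filerevision fr ON fr.fr_id = f.file_latest
    WHERE fr.fr_deleted = 0
""").fetchall()

print(in_on)     # [(1, None)]  (file kept, metadata NULL)
print(in_where)  # []           (file silently excluded)
```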
… 12-step procedure

migrate_beta_db.py:
- Check both column AND index for idempotency (previously only checked the column)
- Wrap both DDL steps in `with c:` for atomicity — both succeed or neither does
- Add try/finally to ensure the connection is closed

revert_beta_db.py:
- Fast path (SQLite ≥3.35): wrap DROP INDEX + DROP COLUMN in an explicit BEGIN/COMMIT using `isolation_level=None` (required for manual transaction control)
- Slow path (<3.35): replace the broken CREATE TABLE AS SELECT (loses PK/constraints/indexes) with the proper SQLite 12-step schema-alteration procedure: read the original schema from sqlite_master, strip `file_id`, create `entries_new`, copy data, drop the old table, rename, recreate all other indexes/triggers, run PRAGMA foreign_key_check, commit atomically
- Verified locally: pre-migration schema/indexes/data identical to post-revert

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
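A condensed sketch of the slow-path rebuild, dropping a column by recreating the table (the table layout is illustrative, not the real `Entry` model, and the real script also recreates indexes and triggers from `sqlite_master`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.isolation_level = None  # manual transaction control
conn.executescript("""
CREATE TABLE entries (id INTEGER PRIMARY KEY, name TEXT, file_id INTEGER);
CREATE INDEX ix_entries_file_id ON entries (file_id);
INSERT INTO entries VALUES (1, 'a', 10), (2, 'b', NULL);
""")

conn.execute("BEGIN")
try:
    # 1. create the replacement table without the column
    conn.execute("CREATE TABLE entries_new (id INTEGER PRIMARY KEY, name TEXT)")
    # 2. copy the surviving columns
    conn.execute("INSERT INTO entries_new (id, name) SELECT id, name FROM entries")
    # 3. drop the old table (its indexes go with it) and rename
    conn.execute("DROP TABLE entries")
    conn.execute("ALTER TABLE entries_new RENAME TO entries")
    # 4. sanity-check referential integrity before committing
    assert conn.execute("PRAGMA foreign_key_check").fetchall() == []
    conn.execute("COMMIT")
except Exception:
    conn.execute("ROLLBACK")
    raise

rows = conn.execute("SELECT * FROM entries ORDER BY id").fetchall()
print(rows)  # [(1, 'a'), (2, 'b')]
```

The rollback path is what the atomicity claim buys: a failure at any step leaves the original table, indexes, and data untouched.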
python-graph-core uses pkg_resources namespace packages, which require setuptools to be explicitly installed on Python 3.13 (it is no longer bundled). Tracked in hatnote#421. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
setuptools 82.0.0 removed pkg_resources entirely. python-graph-core declares pygraph as a namespace package via pkg_resources, causing ModuleNotFoundError on startup. Pin to 81.0.0 until python-graph-core is replaced or updated (hatnote#421). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
file_id is the stable Commons identifier introduced in the image→filerevision migration (hatnote#504). Including it in the API response makes it available to clients for deduplication and rename tracking. Inspired by @ayushshukla1807's work in hatnote#506. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Entry.to_details_dict() and to_export_dict() now include file_id, making it available in all API responses that return entry data - Add test_import_entries_have_file_id: end-to-end regression test verifying file_id survives the full import pipeline (labs.py → make_entry() → DB → to_export_dict() → API response) Inspired by @ayushshukla1807's work in hatnote#506. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
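A minimal sketch of the export shape described above, using a hypothetical stand-in for the real `Entry` model (the actual model has many more fields):

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical, simplified Entry: only shows how a nullable file_id
# is threaded into the export dict that API responses are built from.
@dataclass
class Entry:
    name: str
    file_id: Optional[int] = None  # NULL for CSV-imported entries

    def to_export_dict(self) -> dict:
        return {"name": self.name, "file_id": self.file_id}

wiki_entry = Entry("Example.jpg", 123).to_export_dict()
csv_entry = Entry("Legacy.jpg").to_export_dict()
print(wiki_entry)  # {'name': 'Example.jpg', 'file_id': 123}
print(csv_entry)   # {'name': 'Legacy.jpg', 'file_id': None}
```

Serializing the NULL explicitly (rather than omitting the key) lets clients distinguish "no identifier assigned" from "field not supported".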
- Add tools/migrate_prod_db.sql and revert_prod_db.sql for production MySQL migration (idempotent, exact inverses of each other) - Add comment to labs.py clarifying that alias `oi` mirrors the old oldimage role but is a filerevision subquery, not the oldimage table - Change fixture file_ids (conftest.py) for SELECTED and REUPLOAD from 99999/88888 to 1/2, structurally below the FIXTURE_FILE_INFOS range (1000+) so collisions are impossible regardless of future additions Part of hatnote#505. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Fix tablename: actual SQLAlchemy model uses `entries` not `entry` - Rename index to ix_entries_file_id to match tablename convention - Both scripts remain idempotent (IF NOT EXISTS / IF EXISTS) Part of hatnote#505. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Use os.path.expanduser() instead of hardcoded /root path, matching the pattern in labs.py. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Reports schema state (file_id column and index presence), entry coverage, and get_files() query time against a live category. Run before and after migrate_prod_db.sql for hatnote#514. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
No credentials needed as arguments — uses load_env_config() to read config.beta.yaml, matching how the app connects. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Use PRAGMA instead of information_schema when db_url is SQLite, so the script works before the MariaDB config switch. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
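The dialect dispatch might look like the following. Only the SQLite branch is exercised here; the `information_schema` query string is an assumption about what the script runs on MariaDB:

```python
import sqlite3

def has_column_sqlite(conn: sqlite3.Connection, table: str, column: str) -> bool:
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt, pk);
    # table name is interpolated, so only pass trusted identifiers.
    cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return any(row[1] == column for row in cols)

# Assumed MariaDB-side equivalent (parameterized, run via a DB-API cursor):
MYSQL_CHECK = """
    SELECT COUNT(*) FROM information_schema.columns
    WHERE table_schema = DATABASE() AND table_name = %s AND column_name = %s
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (id INTEGER PRIMARY KEY, file_id INTEGER)")
print(has_column_sqlite(conn, "entries", "file_id"))  # True
print(has_column_sqlite(conn, "entries", "missing"))  # False
```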
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Global pre-commit hook at ~/.claude/hooks/pre-commit blocks commits containing detected secrets. This baseline suppresses false positives for this repo. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Avoids embedding any credential placeholder in the script, removing the need for a pragma comment that would suppress detection even for real passwords. Related: hatnote#514 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Reads credentials from replica.my.cnf at runtime — no credentials in the script. Avoids terminal line-wrapping on Toolforge. Related: hatnote#514 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
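Toolforge's `replica.my.cnf` is a standard `[client]`-section INI file, so the runtime read can be a few lines of `configparser`. This is a sketch of the approach, not the script itself; the demo uses a throwaway file in place of the real path:

```python
import configparser
import os
import tempfile

def read_replica_creds(path):
    # Parse the MySQL option file and pull the [client] credentials,
    # keeping secrets out of the script and out of `ps` output.
    cfg = configparser.ConfigParser()
    cfg.read(path)
    return cfg["client"]["user"], cfg["client"]["password"]

# Stand-in for ~/replica.my.cnf
with tempfile.NamedTemporaryFile("w", suffix=".cnf", delete=False) as f:
    f.write("[client]\nuser = s12345\npassword = not-a-real-secret\n")
    path = f.name

user, password = read_replica_creds(path)
print(user)  # s12345
os.unlink(path)
```

Real option files can contain quoted values and `!include` directives that `configparser` does not handle; for those cases, shelling out to `mysql --defaults-file=...` sidesteps parsing entirely.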
Without -h, mysql defaults to a local socket which doesn't exist on the Toolforge bastion. Related: hatnote#514 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Drops all tables and recreates from current SQLAlchemy models. Beta only — no real data to preserve. hatnote#514 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
MariaDB enforces foreign key constraints during DROP TABLE unlike SQLite. Disable before drop, re-enable after. hatnote#514 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Safer approach: use metadata.sorted_tables reversed to drop child tables before parents. No need to disable FK constraints. hatnote#514 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
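The FK-safe drop order can be sketched with a toy two-table schema: `MetaData.sorted_tables` is ordered parents-first by foreign-key dependency, so iterating it reversed drops children before the parents they reference. Table names here are illustrative, not the real models:

```python
import sqlalchemy as sa

metadata = sa.MetaData()
parent = sa.Table("campaigns", metadata,
                  sa.Column("id", sa.Integer, primary_key=True))
child = sa.Table("entries", metadata,
                 sa.Column("id", sa.Integer, primary_key=True),
                 sa.Column("campaign_id", sa.Integer,
                           sa.ForeignKey("campaigns.id")))

engine = sa.create_engine("sqlite://")
metadata.create_all(engine)

# sorted_tables is dependency-ordered (parents first);
# reversed() yields children first, safe to drop under FK enforcement.
drop_order = [t.name for t in reversed(metadata.sorted_tables)]
print(drop_order)  # ['entries', 'campaigns']

with engine.begin() as conn:
    for table in reversed(metadata.sorted_tables):
        conn.execute(sa.text(f"DROP TABLE {table.name}"))
```

`metadata.drop_all(engine)` applies the same ordering internally; the explicit loop is useful when the drop needs per-table logging or filtering.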
…ile_id resolution (hatnote#509, hatnote#510)
🔍 Context & Problem Statement
The ongoing `file`/`filerevision` dataset migration surfaced several tooling limitations around CSV import methods and environment behavior:
- `VITE_API_ENDPOINT` fallback failed during production builds (CORS/URL rendering).
- `labs.py` resulting in stale metadata lacking `file_id`s completely.
- `file_id` lookups.

🛠️ Solution
- `frontend/.env.production` ensuring base paths mount securely.
- `load_id_list` & `/utils/file_id` using stable schema keys.
- `load_full_csv` ensuring dynamic propagation across wikireplica enrichment loaders mapping 1:1 with partial CSV pipelines.
- `tools/backfill_file_id.py` batch runner for legacy system entries missing identifiers.

Closes #516
Closes #513
Closes #510
Closes #509
Closes #505