Phase 2: Add extraction logic for Unicode test data#7213

Open
samyoon20 wants to merge 5 commits into
adoptium:masterfrom
samyoon20:phase2-extract-unicode-data

Phase 2: Extract and Use Downloaded Unicode Test Data

Summary

Adds extraction logic to the MBCS test build files so that they use the Unicode test data downloaded by the Phase 1 dependency management system.

Changes Made

  • functional/MBCS_Tests/codepoint/build.xml

    • Added extractUnicodeData target
    • Extracts 19 files: 9 UnicodeData.txt + 9 Unihan_IRGSources.txt + 1 GB18030
    • Handles special test file (u32FF) that stays in git
  • functional/MBCS_Tests/unicode/build.xml

    • Added extractUnicodeData target
    • Extracts 45 files (5 files × 9 Unicode versions)
    • Handles 4 special test files that stay in git
  • functional/MBCS_Tests/CLDR_11/build.xml

    • Added copyIcu4jDependencies target
    • Copies 2 ICU4J JARs from ${LIB_DIR}

Design Decisions

  • Extract ALL Unicode versions (10.0.0-17.0.0) for backward compatibility
  • Follow dacapo pattern for dependency management consistency
  • Comprehensive inline documentation explaining design choices
  • Minimal changes to existing code (only 17 lines modified)

Related Work

Testing Plan

Will test on Jenkins Grinder with these parameters:

For MBCSTest_codepoint_0:

- Add extraction targets to codepoint/build.xml for UCD and Unihan files
- Add extraction targets to unicode/build.xml for UCD files
- Add ICU4J copy targets to CLDR_11/build.xml
- Extract all Unicode versions (10.0.0-17.0.0) for compatibility
- Follow dacapo pattern for dependency management
- Comprehensive inline documentation of design decisions

Related to issue adoptium#5161

- Change from downloading Unihan ZIP archives (6-8 MB each) to individual Unihan_IRGSources.txt files (~1-2 MB each)
- Replace unzip operations with simple copy operations in codepoint/build.xml
- Reduces bandwidth by 75% for Unihan data (40% overall)
- Simpler code: a 1-line copy vs. a 6-line unzip block per version
- Faster execution: no decompression needed
- Requires corresponding TKG PR update to getDependencies.pl
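
The switch from unzip to copy could look roughly like the sketch below. This is not the PR's actual build code: the `data.dir` and `ucd.version` properties, the example version, and the per-version file names under `${LIB_DIR}` are assumptions made for illustration.

```xml
<project name="unihan-copy-sketch" default="extractUnihan">
  <!-- Hypothetical locations; the real build files may define these differently. -->
  <property name="LIB_DIR" location="lib"/>
  <property name="data.dir" location="data"/>
  <!-- Example version only. -->
  <property name="ucd.version" value="15.0.0"/>

  <target name="extractUnihan">
    <!-- Before (archive approach): pull a single member out of the Unihan ZIP.
    <unzip src="${LIB_DIR}/Unihan-${ucd.version}.zip" dest="${data.dir}">
      <patternset>
        <include name="Unihan_IRGSources.txt"/>
      </patternset>
    </unzip>
    -->
    <!-- After: the file is downloaded on its own, so a plain copy is enough. -->
    <copy file="${LIB_DIR}/Unihan_IRGSources-${ucd.version}.txt"
          tofile="${data.dir}/Unihan_IRGSources-${ucd.version}.txt"/>
  </target>
</project>
```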

Related to issue adoptium#5161

Implementation:
- Load JDK-to-Unicode mapping from UnicodeVers.properties
- Calculate version code (JDK_VERSION + '000000')
- Extract only the mapped Unicode version
- Conditional GB18030 copy based on mapping

Error handling:
- Validate mapping exists before extraction
- Fail fast with clear error if mapping missing
- Prevents silent failures or wrong version usage
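
A minimal sketch of the fail-fast check, assuming the mapping file uses keys of the form `unicode.<version code>` (inferred from the property name quoted in the later fix) and sits next to the build file. The project name, target name, and message wording are hypothetical.

```xml
<project name="mapping-check-sketch" default="checkMapping">
  <!-- Load the JDK-to-Unicode mapping; the file location is an assumption. -->
  <property file="UnicodeVers.properties"/>
  <!-- Version code = JDK_VERSION followed by six zeros, e.g. 21 -> 21000000. -->
  <property name="jdk.version.code" value="${JDK_VERSION}000000"/>

  <target name="checkMapping">
    <!-- Stop the build with a clear message if no mapping exists for this JDK. -->
    <fail message="No Unicode version mapped for JDK ${JDK_VERSION} (expected key unicode.${jdk.version.code} in UnicodeVers.properties)">
      <condition>
        <not>
          <isset property="unicode.${jdk.version.code}"/>
        </not>
      </condition>
    </fail>
  </target>
</project>
```

The `isset` check only needs one level of expansion (`${jdk.version.code}` inside the property name), so it works with plain Ant.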

Related to issue adoptium#5161

Add xmlns:if namespace declaration to support if:set attribute
for conditional GB18030 file copying.

Changes:
- Add xmlns:if='ant:if' to project tag in both build.xml files
- Enables conditional copy based on Unicode version mapping
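
For illustration, the namespace declaration and a conditional copy might be wired together as follows. The `gb18030.file` property and the directory properties are hypothetical stand-ins for whatever the mapping provides; `if:set` requires Ant 1.9.1 or later.

```xml
<project name="gb18030-copy-sketch" default="copyGb18030" xmlns:if="ant:if">
  <!-- Hypothetical locations; the real build files may define these differently. -->
  <property name="LIB_DIR" location="lib"/>
  <property name="data.dir" location="data"/>

  <target name="copyGb18030">
    <!-- Runs only when gb18030.file (a hypothetical property) is set,
         e.g. because the mapped Unicode version ships GB18030 data. -->
    <copy if:set="gb18030.file"
          file="${LIB_DIR}/${gb18030.file}"
          todir="${data.dir}"/>
  </target>
</project>
```

Without the `xmlns:if` declaration on the project tag, Ant would reject the `if:set` attribute as unsupported by the `copy` task.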

Related to issue adoptium#5161

Ant doesn't support nested property expansion like ${unicode.${jdk.version.code}}.
Changed to use the <propertycopy> task, which properly handles dynamic property names.

This fixes the build error:
src '/home/jenkins/externalDependency/lib/UCD-${unicode.${jdk.version.code}}.zip' doesn't exist.
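
A sketch of the `<propertycopy>` approach, assuming the ant-contrib JAR is on Ant's classpath and that `UnicodeVers.properties` holds keys such as `unicode.21000000`. The target name and the `unicode.version` property are hypothetical.

```xml
<project name="propertycopy-sketch" default="resolveUnicodeVersion">
  <!-- propertycopy comes from ant-contrib; this assumes its JAR is on Ant's classpath. -->
  <taskdef resource="net/sf/antcontrib/antlib.xml"/>

  <property file="UnicodeVers.properties"/>
  <property name="jdk.version.code" value="${JDK_VERSION}000000"/>

  <target name="resolveUnicodeVersion">
    <!-- Copies the value of unicode.<code> into unicode.version.
         The inner ${...} is expanded first, so no nested expansion is needed. -->
    <propertycopy name="unicode.version" from="unicode.${jdk.version.code}"/>
    <echo message="Using ${LIB_DIR}/UCD-${unicode.version}.zip"/>
  </target>
</project>
```

By default `<propertycopy>` fails the build when the source property is missing, which lines up with the fail-fast behavior described in the earlier commit.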

Related to issue adoptium#5161