Skip to content

Add pre commit and CI checks#5

Open
jjomier wants to merge 4 commits intomainfrom
jjomier/add_pre_commit
Open

Add pre commit and CI checks#5
jjomier wants to merge 4 commits intomainfrom
jjomier/add_pre_commit

Conversation

@jjomier
Copy link
Copy Markdown

@jjomier jjomier commented Feb 25, 2026

  • Add pre-commit script
  • Add CI checks
  • Fix linting

@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented Feb 25, 2026

Greptile Summary

This PR adds pre-commit hooks and CI checks to enforce code quality standards, and applies auto-formatting fixes across the codebase.

Major changes:

  • Added GitHub Actions CI workflow to run pre-commit checks on push/PR
  • Configured pre-commit with ruff (Python linting/formatting), markdownlint, and file checks
  • Applied ruff auto-formatting to all Python files (import sorting, whitespace cleanup, line length fixes)
  • Fixed markdown formatting issues in README and config files

Issues found:

  • Critical bug in train/vlm_sft_train.py:184: The mapped dataset is immediately discarded by calling .cast() on the original dataset instead of the mapped one, losing all image path transformations
  • Minor: Project name "NV-Generate-CTMR" in comments should be "NV-Reason-CXR" (.markdownlint.yaml, pyproject.toml)

Confidence Score: 2/5

  • Critical bug in SFT training will cause dataset transformations to be silently discarded
  • The PR adds valuable tooling infrastructure (pre-commit, CI) and applies consistent formatting. However, train/vlm_sft_train.py:184 contains a critical logic error where the mapped dataset is immediately overwritten by calling .cast() on the original unmapped dataset. This will cause image path updates and field transformations to be lost during SFT training, potentially breaking the training pipeline. The formatting changes are safe, but this bug must be fixed before merge.
  • train/vlm_sft_train.py requires immediate attention due to the dataset mapping bug on line 184

Important Files Changed

Filename Overview
.github/workflows/ci.yml Adds standard CI workflow for running pre-commit checks on push/PR to main
.pre-commit-config.yaml Configures pre-commit hooks for file checks, ruff linting/formatting, and markdown linting
.markdownlint.yaml Configures markdown linting with relaxed line length; contains wrong project name in comment
pyproject.toml Adds ruff configuration for Python linting and formatting; contains wrong project name in comment
train/vlm_sft_train.py Ruff auto-formatting applied, but critical bug remains: line 184 discards mapped dataset from line 183

Last reviewed commit: b405a44

@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented Feb 25, 2026

Greptile Summary

This PR adds a pre-commit framework and CI pipeline for code quality enforcement, along with lint/format fixes applied by ruff across the codebase.

  • Adds .pre-commit-config.yaml with hooks for trailing whitespace, YAML/JSON validation, ruff linting/formatting, and markdownlint
  • Adds .github/workflows/ci.yml to run pre-commit checks on push/PR to main
  • Adds pyproject.toml with ruff configuration (Python 3.11, line-length 150, standard lint rules)
  • Adds .markdownlint.yaml with relaxed line-length settings for existing docs
  • Applies ruff auto-formatting across all Python files (vlm_grpo_train.py, vlm_rewards.py, vlm_sft_train.py): import sorting, whitespace cleanup, quote normalization, type hint modernization
  • Cleans up trailing whitespace in README.md, configs/grpo_config.yaml, configs/sft_config.yaml, and images/README.md

Issues found:

  • Pre-existing bug in train/vlm_sft_train.py:183-184: the .map() result is immediately overwritten by .cast() on the original dataset, discarding the data transformations
  • Duplicate check-merge-conflict hook in .pre-commit-config.yaml
  • All four new config files reference "NV-Generate-CTMR" instead of the correct project name "NV-Reason-CXR"
  • .markdownlint.yaml is missing a trailing newline, which would be caught by its own end-of-file-fixer hook

Confidence Score: 4/5

  • This PR is safe to merge — the changes are predominantly formatting/linting with no new logic, though a pre-existing dataset bug should be addressed.
  • The vast majority of changes are whitespace/formatting fixes auto-applied by ruff, plus new CI/pre-commit configuration files. No new logic is introduced. The one logic issue (discarded .map() result in vlm_sft_train.py) is pre-existing and was not introduced by this PR, but is in the diff due to reformatting. The remaining issues (duplicate hook, wrong project name, missing newline) are minor.
  • train/vlm_sft_train.py has a pre-existing bug where the dataset .map() result is discarded on line 184. .pre-commit-config.yaml has a duplicate hook entry.

Important Files Changed

Filename Overview
.github/workflows/ci.yml New CI workflow running pre-commit on push/PR to main. References wrong project name in comment ("NV-Generate-CTMR").
.markdownlint.yaml New markdownlint config with relaxed line-length rule. Missing trailing newline; wrong project name in comment.
.pre-commit-config.yaml New pre-commit config with general hooks, ruff, and markdownlint. Has duplicate check-merge-conflict hook entry.
README.md Whitespace cleanup and markdown formatting fixes (trailing spaces, blank lines, heading syntax). No functional changes.
configs/grpo_config.yaml Trailing whitespace removed throughout. No functional changes to GRPO training configuration.
configs/sft_config.yaml Trailing whitespace removed throughout. No functional changes to SFT training configuration.
images/README.md Added proper markdown heading and trailing newline to placeholder README.
pyproject.toml New ruff configuration targeting Python 3.11 with standard lint rules. Wrong project name in comment.
train/vlm_grpo_train.py Ruff formatting applied: import reordering, whitespace fixes, quote normalization, type hint modernization. No logical changes.
train/vlm_rewards.py Ruff formatting applied: whitespace, blank line cleanup, list reformatting. No logical changes to reward functions.
train/vlm_sft_train.py Ruff formatting applied. Pre-existing bug preserved: .map() result on line 183 is discarded by immediate overwrite on line 184.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Push/PR to main] --> B[CI Workflow Triggered]
    B --> C[Checkout Code]
    C --> D[Setup Python 3.11]
    D --> E[Install pre-commit]
    E --> F[Run pre-commit hooks]
    F --> G[pre-commit-hooks]
    F --> H[ruff lint + format]
    F --> I[markdownlint]
    G --> G1[trailing-whitespace]
    G --> G2[end-of-file-fixer]
    G --> G3[check-yaml / check-json]
    G --> G4[check-merge-conflict x2]
    G --> G5[check-added-large-files]
    G --> G6[debug-statements]
    H --> H1[ruff --fix]
    H --> H2[ruff-format]
    I --> I1[markdownlint with .markdownlint.yaml]
Loading

Last reviewed commit: 7a456b5

Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

11 files reviewed, 4 comments

Edit Code Review Agent Settings | Greptile

Comment thread .pre-commit-config.yaml
Comment thread train/vlm_sft_train.py
Comment on lines 183 to +184
train_dataset = my_dataset.map(vlm_data_format_dict, fn_kwargs={"image_dir": script_args.image_dir})
train_dataset = my_dataset.cast(features)
train_dataset = my_dataset.cast(features)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mapped dataset is silently discarded

Line 183 maps my_dataset with vlm_data_format_dict and stores the result in train_dataset, but line 184 immediately overwrites train_dataset by calling my_dataset.cast(features) on the original (unmapped) dataset. This discards all transformations from the .map() call, meaning image paths are never updated and extra columns are never set to None.

This should likely be:

Suggested change
train_dataset = my_dataset.map(vlm_data_format_dict, fn_kwargs={"image_dir": script_args.image_dir})
train_dataset = my_dataset.cast(features)
train_dataset = my_dataset.cast(features)
train_dataset = my_dataset.map(vlm_data_format_dict, fn_kwargs={"image_dir": script_args.image_dir})
train_dataset = train_dataset.cast(features)

Comment thread .github/workflows/ci.yml Outdated
Comment thread .markdownlint.yaml Outdated
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

11 files reviewed, 3 comments

Edit Code Review Agent Settings | Greptile

Comment thread train/vlm_sft_train.py
)
train_dataset = my_dataset.map(vlm_data_format_dict, fn_kwargs={"image_dir": script_args.image_dir})
train_dataset = my_dataset.cast(features)
train_dataset = my_dataset.cast(features)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line 184 calls .cast() on the original my_dataset, discarding the mapped train_dataset from line 183. The image path updates from vlm_data_format_dict are lost.

Should be:

Suggested change
train_dataset = my_dataset.cast(features)
train_dataset = train_dataset.cast(features)

Comment thread .markdownlint.yaml
@@ -0,0 +1,7 @@
# Markdownlint config for NV-Generate-CTMR
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment references NV-Generate-CTMR but this repo is NV-Reason-CXR

Suggested change
# Markdownlint config for NV-Generate-CTMR
# Markdownlint config for NV-Reason-CXR

Comment thread pyproject.toml
@@ -0,0 +1,13 @@
# Minimal config for tooling (pre-commit, ruff). NV-Generate-CTMR has no installable package.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment references NV-Generate-CTMR but this repo is NV-Reason-CXR

Suggested change
# Minimal config for tooling (pre-commit, ruff). NV-Generate-CTMR has no installable package.
# Minimal config for tooling (pre-commit, ruff). NV-Reason-CXR has no installable package.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant