Release: mcpmark verified — pinned versions + stabilized standard verifiers#264

Merged: zjwu0522 merged 14 commits into main from pin-all-versions on Jun 12, 2026
@zjwu0522 zjwu0522 commented Jun 12, 2026

Collaborator

MCPMark Verified

MCPMark Verified is a stabilized version of MCPMark's standard task set. The tasks are unchanged. What changed is that every environment is pinned to a fixed server version, and every verification script has been reviewed and tightened so that a correct solution passes and an incorrect one fails, consistently across runs and over time.
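The property described above (a correct solution passes and an incorrect one fails, on every run) can be checked mechanically. The following is a minimal sketch of such a check, not MCPMark's actual review process: `verifier_is_stable`, the toy verifier, and the state dictionaries are hypothetical names introduced only for illustration.

```python
from typing import Any, Callable


def verifier_is_stable(
    verify: Callable[[Any], bool],
    good_state: Any,
    bad_state: Any,
    runs: int = 5,
) -> bool:
    """Return True only if `verify` accepts the known-good state and
    rejects the known-bad state on every one of `runs` repetitions."""
    for _ in range(runs):
        if not verify(good_state):
            return False  # false negative: a correct solution was rejected
        if verify(bad_state):
            return False  # false positive: an incorrect solution was accepted
    return True


# Hypothetical toy verifier: the task is "done" when exactly 3 rows exist.
def has_three_rows(state: dict) -> bool:
    return state.get("rows") == 3


print(verifier_is_stable(has_three_rows, {"rows": 3}, {"rows": 2}))
```

Repeating the check is what surfaces order-dependent or time-dependent verifiers; a single pass would not.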

This PR promotes the pin-all-versions integration branch to main for the release.

Pinned environments

| Environment | Pinned server |
| --- | --- |
| Filesystem | @modelcontextprotocol/server-filesystem@2025.12.18 |
| GitHub | ghcr.io/github/github-mcp-server:v0.15.0 |
| Notion | @notionhq/notion-mcp-server@1.9.1 |
| Playwright | @playwright/mcp@0.0.68 |
| Postgres | postgres-mcp==0.3.0 |

The evaluation harness is pinned as well (model call parameters, reasoning-effort handling, and the agent loop), so the model under test is the only variable across runs.
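As an illustration of what "pinned" means for the server specs listed above, the sketch below splits each spec into a name and a version and rejects anything that is not an exact version. The spec strings are copied from this PR; `split_spec`, `PINNED_SERVERS`, and the version pattern are assumptions made for this example and are not taken from the MCPMark codebase.

```python
import re

# Pinned server specs as listed in this PR description.
PINNED_SERVERS = {
    "filesystem": "@modelcontextprotocol/server-filesystem@2025.12.18",
    "github": "ghcr.io/github/github-mcp-server:v0.15.0",
    "notion": "@notionhq/notion-mcp-server@1.9.1",
    "playwright": "@playwright/mcp@0.0.68",
    "postgres": "postgres-mcp==0.3.0",
}

# Assumption: an exact version is dotted digits with an optional leading "v".
_EXACT_VERSION = re.compile(r"v?\d+(\.\d+)+")


def split_spec(spec: str) -> tuple[str, str]:
    """Split a spec into (name, version); raise ValueError if it is unpinned."""
    if "==" in spec:        # pip-style requirement
        name, version = spec.split("==", 1)
    elif ":" in spec:       # container image tag
        name, _, version = spec.rpartition(":")
    elif "@" in spec:       # npm package (possibly scoped, so split on the last "@")
        name, _, version = spec.rpartition("@")
    else:
        raise ValueError(f"no version found in {spec!r}")
    if not name or not _EXACT_VERSION.fullmatch(version):
        raise ValueError(f"{spec!r} is not pinned to an exact version")
    return name, version


for env, spec in PINNED_SERVERS.items():
    print(env, *split_spec(spec))
```

A floating spec such as `@playwright/mcp@latest`, or a bare `@playwright/mcp`, raises `ValueError` here, which is the kind of drift that pinning is meant to rule out.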

Verifier changes

All 127 standard tasks were reviewed. Fixes fall into two categories.

Major (a verifier or its fixture was rebuilt):

  • Postgres dba_vector_analysis: the 500-line vector fixture is now inlined into setup, and the verifier was rewritten for deterministic state.
  • Playwright extraction_table: regenerated the reference data and rewrote the extraction checks.
  • WebArena search_filtering_operations and the shopping-admin analytics tasks (fitness_promotion_strategy, marketing_customer_analysis, sales_inventory_analysis, customer_segmentation_setup): reworked logic and clarified descriptions.
  • Notion work_history_addition, hyperfocus_analysis_report, quarterly_review_dashboard: overhauled verifiers and descriptions on the most error-prone pages.

Minor (targeted robustness fixes):

  • GitHub: PR-title-aware squash detection, case-insensitive matching, pinned ESLint v8.
  • Postgres: tighter acceptance conditions and role cleanup.
  • Filesystem: clarified descriptions.
  • WebArena: smart-quote normalization across tasks.
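Two of the minor fixes, smart-quote normalization and case-insensitive matching, can be sketched roughly as follows. This is an assumed shape for illustration rather than the verifiers' actual code: `normalize` and `answers_match` are hypothetical helpers, and the whitespace collapsing is an extra assumption that the PR does not mention.

```python
# Map typographic ("smart") quotes to their plain ASCII equivalents.
_SMART_QUOTES = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark / apostrophe
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
})


def normalize(text: str) -> str:
    """Replace smart quotes, collapse whitespace runs (assumption), and casefold."""
    plain = text.translate(_SMART_QUOTES)
    return " ".join(plain.split()).casefold()


def answers_match(expected: str, actual: str) -> bool:
    """Compare two strings after normalization."""
    return normalize(expected) == normalize(actual)


print(answers_match("Don\u2019t Stop", "don't stop"))
print(answers_match("\u201cSale\u201d items", '"sale"  items'))
print(answers_match("Sale items", "Sold items"))
```

Normalizing both sides before comparing keeps a correct answer from failing only because a page or a model emitted curly quotes or different capitalization, while genuinely different text still fails.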

Models / reasoning effort

  • Added a public gpt-5.5 model entry, plus xhigh and max reasoning-effort levels.
  • LiteLLM config: enforcer_mode, think_mode, max_tokens, temperature.

Rolled-up PRs

#252, #255 (github) · #254 (playwright) · #260 (notion) · #262 (postgres) · #261 (reasoning effort) · #263 (github legacy_name)

🤖 Generated with Claude Code

Co-authored-by: xyliugo <liuxiangyan6@gmail.com>
Co-authored-by: dulingxiao <lxdu0314@gmail.com>

zjwu0522 and others added 12 commits April 14, 2026 09:34
…gement task

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Pin filesystem (@2025.12.18), postgres (0.3.0), and playwright (0.0.68)
versions. Also pin notion @1.9.1 in base_agent.py for consistency with
mcpmark_agent.py. GitHub (v0.15.0) and notion were already pinned in #246.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…teLLM config

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
mcpmark-cicd needs to be public for GitHub Actions workflows to work.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: dulingxiao <lxdu0314@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@xyliugo xyliugo left a comment

Collaborator

lgtm

zjwu0522 and others added 2 commits June 12, 2026 10:20
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@zjwu0522 zjwu0522 merged commit 84faaca into main Jun 12, 2026
2 checks passed
@zjwu0522 zjwu0522 deleted the pin-all-versions branch June 12, 2026 10:42
3 participants