feat(mimo): adapt MiMo-V2.5-TTS series with voicedesign and voiceclone support #8428

Open
xiangyuw1 wants to merge 1 commit into AstrBotDevs:master from xiangyuw1:feat/mimo-v2.5-tts-adaptation

xiangyuw1 commented May 29, 2026

Summary

This PR adapts the MiMo TTS provider to support the MiMo-V2.5-TTS series, including voicedesign and voiceclone models.

Changes

1. V2.5 Style Format Support

  • V2.5 models use parentheses (...) instead of <style>...</style> tags for style control
  • Added _is_v2_5() method to detect v2.5 models
  • Updated _build_style_prefix() to use the correct format based on model version
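
A minimal sketch of the format switch described above, assuming the style text is simply wrapped in the version-specific delimiter. The free functions `is_v2_5` and `build_style_prefix` are hypothetical stand-ins for the provider's `_is_v2_5()` and `_build_style_prefix()` methods, whose real bodies may differ, and the model names are illustrative only.

```python
def is_v2_5(model_name: str) -> bool:
    """Return True when the model name marks a v2.5 series model."""
    return "v2.5" in model_name


def build_style_prefix(model_name: str, style: str) -> str:
    """Wrap a style description in the delimiter the model version expects."""
    style = style.strip()
    if not style:
        return ""  # no style requested, so no prefix at all
    if is_v2_5(model_name):
        return f"({style})"  # v2.5 models: parentheses
    return f"<style>{style}</style>"  # earlier models: style tags


# Illustrative model names (assumptions, not taken from the MiMo docs).
v25_prefix = build_style_prefix("mimo-v2.5-tts", "cheerful")
v2_prefix = build_style_prefix("mimo-v2-tts", "cheerful")
```

Note that the substring check is case-sensitive, so a name written as `MiMo-V2.5-TTS` would not match in this sketch.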

2. Voicedesign Model Support

  • Added mimo-tts-user-prompt config field for custom user prompts
  • Required for voicedesign models to describe the desired voice via natural language
  • Falls back to seed text for other models when user prompt is empty
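
The fallback rule above might look roughly like the following. `resolve_user_prompt` is a hypothetical helper, not a function from this PR, and the description does not say how an empty prompt is handled for voicedesign models, so this sketch applies the same fallback to every model.

```python
from typing import Optional


def resolve_user_prompt(user_prompt: Optional[str], seed_text: str) -> str:
    """Prefer the configured user prompt; fall back to the seed text."""
    prompt = (user_prompt or "").strip()  # tolerate None and whitespace-only values
    return prompt or seed_text


# A voice description wins over the seed text; an empty prompt falls back.
described = resolve_user_prompt("A calm, low-pitched narrator", "seed")
fallback = resolve_user_prompt("   ", "seed")
```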

3. Voiceclone Model Support

  • Added mimo-tts-voice-audio-path config field for voice audio file path
  • Reads the audio file and encodes it as a data URL (data:audio/wav;base64,...) at runtime
  • Falls back to preset voice when no audio file is specified
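
As a rough illustration of the data URL encoding mentioned above, the sketch below builds the string from raw bytes and decodes it again. `to_data_url` is a hypothetical helper and the sample bytes are not a valid WAV file; they only demonstrate the round trip.

```python
import base64


def to_data_url(audio_bytes: bytes, mime: str = "audio/wav") -> str:
    """Encode raw audio bytes as a base64 data URL."""
    b64 = base64.b64encode(audio_bytes).decode("ascii")
    return f"data:{mime};base64,{b64}"


sample = b"RIFF\x00\x00\x00\x00WAVE"  # placeholder bytes, not a real recording
url = to_data_url(sample)

# Split at the first comma: header on the left, base64 payload on the right.
header, _, payload = url.partition(",")
decoded = base64.b64decode(payload)
```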

4. Metadata and i18n Updates

  • Updated config metadata with new fields and descriptions
  • Added translations for zh-CN, en-US, and ru-RU locales
  • Updated hints to explain v2/v2.5 format differences

5. Tests

  • Added comprehensive tests for all new features
  • 18 tests passing, covering v2.5 style, voicedesign, and voiceclone scenarios

Testing

  • All existing tests pass
  • New tests added for v2.5 style format, voicedesign user prompt, and voiceclone audio path
  • Tested with actual MiMo API calls

Summary by Sourcery

Adapt the MiMo TTS provider to support the MiMo-V2.5-TTS series, including new style formatting, voicedesign prompts, and voiceclone audio cloning, with corresponding config and test updates.

New Features:

  • Support MiMo V2.5 TTS models that use parentheses-based style and dialect prefixes instead of style tags.
  • Add custom user prompt configuration for voicedesign TTS models, falling back to seed text when absent.
  • Add voice cloning support for voiceclone TTS models by loading a configurable local audio file as the voice source, with fallback to preset voices.

Enhancements:

  • Extend default configuration and metadata with new MiMo TTS fields and hints describing v2 versus v2.5 formatting and model-specific options.
  • Update i18n config metadata for MiMo TTS options across zh-CN, en-US, and ru-RU locales.

Tests:

  • Add tests covering V2.5 style and singing behavior, voicedesign user prompt and seed text fallback, and voiceclone audio file and voice fallback handling.

…e support

- Add v2.5 style format support: use parentheses (...) instead of <style> tags for v2.5 models
- Add voicedesign model support: custom user prompt field for voice description
- Add voiceclone model support: voice audio file path field with DataURL encoding
- Update metadata and i18n translations (zh-CN, en-US, ru-RU)
- Add comprehensive tests for all new features
Copilot AI review requested due to automatic review settings May 29, 2026 22:50
dosubot (bot) added the labels size:M (this PR changes 30-99 lines, ignoring generated files) and area:provider (the bug / feature is about AI Provider, Models, LLM Agent, LLM Agent Runner) on May 29, 2026

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds support for MiMo TTS v2.5 series models, including voicedesign and voiceclone variants. The v2.5 models use parentheses (...) instead of <style>...</style> tags for style prefixes, accept a custom user prompt (required for voicedesign), and allow specifying a local audio file path for voice cloning.

Changes:

  • Updated mimo_tts_api_source.py to detect v2.5 models, route style prefix formatting, and add user-prompt / voice-audio handling.
  • Added two new config keys (mimo-tts-user-prompt, mimo-tts-voice-audio-path) with defaults, schema entries, and localized hints (zh-CN, en-US, ru-RU).
  • Added unit tests covering v2.5 style/singing parentheses, voicedesign user prompt fallback, and voiceclone audio base64 encoding.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| astrbot/core/provider/sources/mimo_tts_api_source.py | Core logic for v2.5 detection, parentheses style prefix, user prompt precedence, and voice audio reading. |
| astrbot/core/config/default.py | Adds new config defaults and schema metadata for the two new fields. |
| dashboard/src/i18n/locales/zh-CN/features/config-metadata.json | Updates Chinese hints and adds entries for new fields. |
| dashboard/src/i18n/locales/en-US/features/config-metadata.json | Updates English hints and adds entries for new fields. |
| dashboard/src/i18n/locales/ru-RU/features/config-metadata.json | Updates Russian hints and adds entries for new fields. |
| tests/test_mimo_api_sources.py | Tests for v2.5 parentheses, voicedesign, and voiceclone payload building. |

Comment on lines +129 to +132

    if "voiceclone" in self.model_name:
        voice_audio_b64 = self._read_voice_audio_base64()
        if voice_audio_b64:
            audio_params["voice"] = voice_audio_b64

Comment on lines +52 to +54

    def _is_v2_5(self) -> bool:
        """Check if the current model is a v2.5 series model."""
        return "v2.5" in self.model_name

    )
    try:
        payload = provider._build_payload("hello")
        import base64

Comment on lines +98 to +101

    try:
        suffix = path.suffix.lower().lstrip(".")
        mime_map = {"wav": "audio/wav", "mp3": "audio/mpeg", "ogg": "audio/ogg"}
        mime = mime_map.get(suffix, "audio/wav")


sourcery-ai (bot) left a comment

Hey - I've left some high level feedback:

  • Model-type checks are currently done via string containment (e.g., 'voicedesign' in self.model_name, 'voiceclone' in self.model_name, 'v2.5' in self.model_name); consider centralizing these into small helper methods (e.g., _is_voicedesign(), _is_voiceclone()) both to avoid typos and to make future model naming changes easier to accommodate.
  • In _read_voice_audio_base64, the broad except Exception can hide non-I/O bugs (e.g., programming errors); consider narrowing it to OSError (IOError is an alias of OSError in Python 3) or re-raising unexpected exceptions after logging so that genuine issues are not silently swallowed.
  • For _read_voice_audio_base64, you might want to validate that the path refers to a regular file (path.is_file()) rather than only checking exists(), to avoid confusing behavior if a directory or special file is provided.
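
One way the three suggestions above could fit together is sketched below, written as free functions rather than provider methods. The names `is_voicedesign`, `is_voiceclone`, and `read_voice_audio_base64` are illustrative, and the MIME type is fixed to `audio/wav` purely to keep the example short; none of this is the PR's actual code.

```python
import base64
import logging
import tempfile
from pathlib import Path

logger = logging.getLogger(__name__)


def is_voicedesign(model_name: str) -> bool:
    """Single place to change if voicedesign model naming changes."""
    return "voicedesign" in model_name


def is_voiceclone(model_name: str) -> bool:
    """Single place to change if voiceclone model naming changes."""
    return "voiceclone" in model_name


def read_voice_audio_base64(raw_path: str) -> str:
    """Return a data URL for the file, or "" when it cannot be used."""
    raw_path = raw_path.strip()
    if not raw_path:
        return ""
    path = Path(raw_path)
    if not path.is_file():  # rejects missing paths and directories alike
        logger.warning("Voice audio path is not a regular file: %s", path)
        return ""
    try:
        data = path.read_bytes()
    except OSError as exc:  # only I/O failures; other errors propagate
        logger.warning("Failed to read voice audio file %s: %s", path, exc)
        return ""
    return "data:audio/wav;base64," + base64.b64encode(data).decode("ascii")


# Usage: a directory is rejected, a regular file is encoded.
with tempfile.TemporaryDirectory() as tmp:
    from_dir = read_voice_audio_base64(tmp)
    sample = Path(tmp) / "voice.wav"
    sample.write_bytes(b"abc")
    from_file = read_voice_audio_base64(str(sample))
```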


gemini-code-assist (bot) left a comment

Code Review

This pull request introduces support for MiMo TTS v2.5 models, custom user prompts, and voice cloning, along with corresponding configuration updates, localizations, and unit tests. The review feedback recommends defensively handling potentially null configuration values to avoid AttributeError crashes, caching the base64-encoded voice audio to prevent blocking the asyncio event loop with repeated disk reads, and using the standard mimetypes module for more robust MIME type detection.

Comment on lines +47 to +48
self.user_prompt = provider_config.get("mimo-tts-user-prompt", "")
self.voice_audio_path = provider_config.get("mimo-tts-voice-audio-path", "")

medium

Defensively handle cases where mimo-tts-user-prompt or mimo-tts-voice-audio-path might be configured as None (e.g., when cleared in the UI or parsed from null JSON values). Using or "" ensures they are always initialized as strings, preventing potential AttributeError crashes when calling .strip() later. Also, initialize a cache variable for the base64-encoded voice audio to avoid repeated disk reads.

Suggested change

    - self.user_prompt = provider_config.get("mimo-tts-user-prompt", "")
    - self.voice_audio_path = provider_config.get("mimo-tts-voice-audio-path", "")
    + self.user_prompt = provider_config.get("mimo-tts-user-prompt") or ""
    + self.voice_audio_path = provider_config.get("mimo-tts-voice-audio-path") or ""
    + self._voice_audio_cache: str | None = None
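
The difference between the two forms in the suggestion can be seen with a plain dict. This is a small sketch with a hypothetical `read_str` helper; it assumes the provider config behaves like a regular mapping.

```python
def read_str(config: dict, key: str) -> str:
    """Return a string for `key`, treating a missing key and None alike."""
    return config.get(key) or ""


# A field cleared in the UI may be stored as an explicit null.
cleared = {"mimo-tts-user-prompt": None}

# The default is only used when the key is absent, so None comes through.
with_default = cleared.get("mimo-tts-user-prompt", "")

# Coalescing with `or` covers both the absent key and the explicit None.
coalesced = read_str(cleared, "mimo-tts-user-prompt")
missing = read_str({}, "mimo-tts-user-prompt")
```

The `or ""` idiom also maps other falsy values (such as `0` or `False`) to an empty string, which is harmless for a free-text field but worth keeping in mind.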

Comment on lines +91 to +106
    def _read_voice_audio_base64(self) -> str:
        if not self.voice_audio_path.strip():
            return ""
        path = Path(self.voice_audio_path.strip())
        if not path.exists():
            logger.warning("Voice audio file not found: %s", path)
            return ""
        try:
            suffix = path.suffix.lower().lstrip(".")
            mime_map = {"wav": "audio/wav", "mp3": "audio/mpeg", "ogg": "audio/ogg"}
            mime = mime_map.get(suffix, "audio/wav")
            b64 = base64.b64encode(path.read_bytes()).decode("utf-8")
            return f"data:{mime};base64,{b64}"
        except Exception as exc:
            logger.warning("Failed to read voice audio file %s: %s", path, exc)
            return ""

medium

Reading the voice audio file from disk and base64-encoding it on every single payload construction is highly inefficient and blocks the single-threaded asyncio event loop. Since the clone audio file is static for the lifetime of the provider, we should cache the base64-encoded result after the first read. Because this helper is a synchronous function, modifying the shared cache state is safe from race conditions in the single-threaded asyncio event loop.

Additionally, instead of hardcoding a limited set of audio formats in mime_map, we can use Python's standard library mimetypes module to dynamically and robustly guess the correct MIME type.

    def _read_voice_audio_base64(self) -> str:
        if self._voice_audio_cache is not None:
            return self._voice_audio_cache

        if not self.voice_audio_path.strip():
            self._voice_audio_cache = ""
            return ""
        path = Path(self.voice_audio_path.strip())
        if not path.exists():
            logger.warning("Voice audio file not found: %s", path)
            self._voice_audio_cache = ""
            return ""
        try:
            import mimetypes
            mime, _ = mimetypes.guess_type(str(path))
            mime = mime or "audio/wav"
            b64 = base64.b64encode(path.read_bytes()).decode("utf-8")
            self._voice_audio_cache = f"data:{mime};base64,{b64}"
            return self._voice_audio_cache
        except Exception as exc:
            logger.warning("Failed to read voice audio file %s: %s", path, exc)
            self._voice_audio_cache = ""
            return ""
References
  1. In a single-threaded asyncio event loop, synchronous functions (code blocks without 'await') are executed atomically and will not be interrupted by other coroutines. Therefore, they are safe from race conditions when modifying shared state within that block.
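
A short sketch of the `mimetypes` approach, with one caveat worth checking before adopting it: the type returned for a given extension depends on the Python version and the platform's MIME tables, so the string for `.wav` may not be exactly `audio/wav`. The helper `guess_audio_mime` is hypothetical and deliberately falls back to a default for anything that is not reported as an audio type.

```python
import mimetypes


def guess_audio_mime(path: str, default: str = "audio/wav") -> str:
    """Guess an audio MIME type from the file name, else return `default`."""
    mime, _encoding = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("audio/"):
        return default  # unknown extension or a non-audio type
    return mime


wav_mime = guess_audio_mime("voice.wav")  # exact value is platform-dependent
unknown_mime = guess_audio_mime("voice.unknownext123")
text_mime = guess_audio_mime("notes.txt")
```

If the API only accepts a fixed set of MIME strings, an explicit map like the one in the PR may still be the safer choice.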
