feat(mimo): adapt MiMo-V2.5-TTS series with voicedesign and voiceclone support #8428
xiangyuw1 wants to merge 1 commit into
…e support
- Add v2.5 style format support: use parentheses `(...)` instead of `<style>` tags for v2.5 models
- Add voicedesign model support: custom user prompt field for voice description
- Add voiceclone model support: voice audio file path field with DataURL encoding
- Update metadata and i18n translations (zh-CN, en-US, ru-RU)
- Add comprehensive tests for all new features
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds support for MiMo TTS v2.5 series models, including voicedesign and voiceclone variants. The v2.5 models use parentheses (...) instead of <style>...</style> tags for style prefixes, accept a custom user prompt (required for voicedesign), and allow specifying a local audio file path for voice cloning.
Changes:
- Updated `mimo_tts_api_source.py` to detect v2.5 models, route style prefix formatting, and add user-prompt / voice-audio handling.
- Added two new config keys (`mimo-tts-user-prompt`, `mimo-tts-voice-audio-path`) with defaults, schema entries, and localized hints (zh-CN, en-US, ru-RU).
- Added unit tests covering v2.5 style/singing parentheses, voicedesign user prompt fallback, and voiceclone audio base64 encoding.
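The style-prefix routing summarized above could look roughly like the following sketch. The function names, the example model names, and the handling of an empty style are assumptions made for illustration; the PR's actual helper may be structured differently.

```python
def is_v2_5(model_name: str) -> bool:
    """Return True for v2.5-series model names (plain substring check)."""
    return "v2.5" in model_name


def format_style_prefix(model_name: str, style: str) -> str:
    """Wrap a style description in the delimiter the model family expects.

    Hypothetical helper: v2.5 models get "(style)", older models get
    "<style>style</style>". An empty style yields no prefix at all.
    """
    style = style.strip()
    if not style:
        return ""
    if is_v2_5(model_name):
        return f"({style})"
    return f"<style>{style}</style>"


# The model names below are made up for the example.
print(format_style_prefix("mimo-v2.5-tts", "cheerful"))
print(format_style_prefix("mimo-v2-tts", "cheerful"))
```

One thing worth checking: the substring test is case-sensitive, so a model name spelled `MiMo-V2.5-TTS` would not match `"v2.5"` unless it is lowercased first.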
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| astrbot/core/provider/sources/mimo_tts_api_source.py | Core logic for v2.5 detection, parentheses style prefix, user prompt precedence, and voice audio reading. |
| astrbot/core/config/default.py | Adds new config defaults and schema metadata for the two new fields. |
| dashboard/src/i18n/locales/zh-CN/features/config-metadata.json | Updates Chinese hints and adds entries for new fields. |
| dashboard/src/i18n/locales/en-US/features/config-metadata.json | Updates English hints and adds entries for new fields. |
| dashboard/src/i18n/locales/ru-RU/features/config-metadata.json | Updates Russian hints and adds entries for new fields. |
| tests/test_mimo_api_sources.py | Tests for v2.5 parentheses, voicedesign, and voiceclone payload building. |

    if "voiceclone" in self.model_name:
        voice_audio_b64 = self._read_voice_audio_base64()
        if voice_audio_b64:
            audio_params["voice"] = voice_audio_b64

    def _is_v2_5(self) -> bool:
        """Check if the current model is a v2.5 series model."""
        return "v2.5" in self.model_name

    )
    try:
        payload = provider._build_payload("hello")
        import base64

    try:
        suffix = path.suffix.lower().lstrip(".")
        mime_map = {"wav": "audio/wav", "mp3": "audio/mpeg", "ogg": "audio/ogg"}
        mime = mime_map.get(suffix, "audio/wav")
Hey - I've left some high level feedback:
- Model-type checks are currently done via string containment (e.g., `'voicedesign' in self.model_name`, `'voiceclone' in self.model_name`, `'v2.5' in self.model_name`); consider centralizing these into small helper methods (e.g., `_is_voicedesign()`, `_is_voiceclone()`) both to avoid typos and to make future model naming changes easier to accommodate.
- In `_read_voice_audio_base64`, the broad `except Exception` can hide non-I/O bugs (e.g., programming errors); consider narrowing the exception type (e.g., to `OSError`/`IOError`) or re-raising unexpected exceptions after logging so that genuine issues are not silently swallowed.
- For `_read_voice_audio_base64`, you might want to validate that the path refers to a regular file (`path.is_file()`) rather than only checking `exists()`, to avoid confusing behavior if a directory or special file is provided.
Code Review
This pull request introduces support for MiMo TTS v2.5 models, custom user prompts, and voice cloning, along with corresponding configuration updates, localizations, and unit tests. The review feedback recommends defensively handling potentially null configuration values to avoid AttributeError crashes, caching the base64-encoded voice audio to prevent blocking the asyncio event loop with repeated disk reads, and using the standard mimetypes module for more robust MIME type detection.

    self.user_prompt = provider_config.get("mimo-tts-user-prompt", "")
    self.voice_audio_path = provider_config.get("mimo-tts-voice-audio-path", "")
Defensively handle cases where `mimo-tts-user-prompt` or `mimo-tts-voice-audio-path` might be configured as `None` (e.g., when cleared in the UI or parsed from `null` JSON values). Using `or ""` ensures they are always initialized as strings, preventing potential `AttributeError` crashes when calling `.strip()` later. Also, initialize a cache variable for the base64-encoded voice audio to avoid repeated disk reads.

    - self.user_prompt = provider_config.get("mimo-tts-user-prompt", "")
    - self.voice_audio_path = provider_config.get("mimo-tts-voice-audio-path", "")
    + self.user_prompt = provider_config.get("mimo-tts-user-prompt") or ""
    + self.voice_audio_path = provider_config.get("mimo-tts-voice-audio-path") or ""
    + self._voice_audio_cache: str | None = None
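The difference between the two forms can be seen in a short standalone example. The dictionary below is a stand-in for the real provider config, assumed here to contain an explicit `None` value.

```python
# Stand-in config where the key exists but holds None (e.g. JSON null).
provider_config = {"mimo-tts-user-prompt": None}

# The default passed to .get() only applies when the key is absent,
# so the stored None is returned unchanged.
with_default = provider_config.get("mimo-tts-user-prompt", "")
print(with_default is None)

# `or ""` replaces None (and any other falsy value) with an empty string,
# so a later .strip() call is safe.
normalized = provider_config.get("mimo-tts-user-prompt") or ""
print(repr(normalized.strip()))
```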

    def _read_voice_audio_base64(self) -> str:
        if not self.voice_audio_path.strip():
            return ""
        path = Path(self.voice_audio_path.strip())
        if not path.exists():
            logger.warning("Voice audio file not found: %s", path)
            return ""
        try:
            suffix = path.suffix.lower().lstrip(".")
            mime_map = {"wav": "audio/wav", "mp3": "audio/mpeg", "ogg": "audio/ogg"}
            mime = mime_map.get(suffix, "audio/wav")
            b64 = base64.b64encode(path.read_bytes()).decode("utf-8")
            return f"data:{mime};base64,{b64}"
        except Exception as exc:
            logger.warning("Failed to read voice audio file %s: %s", path, exc)
            return ""
Reading the voice audio file from disk and base64-encoding it on every single payload construction is highly inefficient and blocks the single-threaded asyncio event loop. Since the clone audio file is static for the lifetime of the provider, we should cache the base64-encoded result after the first read. Because this helper is a synchronous function, modifying the shared cache state is safe from race conditions in the single-threaded asyncio event loop.
Additionally, instead of hardcoding a limited set of audio formats in `mime_map`, we can use Python's standard library `mimetypes` module to dynamically and robustly guess the correct MIME type.

    def _read_voice_audio_base64(self) -> str:
        if self._voice_audio_cache is not None:
            return self._voice_audio_cache
        if not self.voice_audio_path.strip():
            self._voice_audio_cache = ""
            return ""
        path = Path(self.voice_audio_path.strip())
        if not path.exists():
            logger.warning("Voice audio file not found: %s", path)
            self._voice_audio_cache = ""
            return ""
        try:
            import mimetypes
            mime, _ = mimetypes.guess_type(str(path))
            mime = mime or "audio/wav"
            b64 = base64.b64encode(path.read_bytes()).decode("utf-8")
            self._voice_audio_cache = f"data:{mime};base64,{b64}"
            return self._voice_audio_cache
        except Exception as exc:
            logger.warning("Failed to read voice audio file %s: %s", path, exc)
            self._voice_audio_cache = ""
            return ""

References
- In a single-threaded asyncio event loop, synchronous functions (code blocks without 'await') are executed atomically and will not be interrupted by other coroutines. Therefore, they are safe from race conditions when modifying shared state within that block.
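To see what the `mimetypes` suggestion changes in practice, the following standalone sketch builds a data URL and prints the guessed type for a few file names. The function `build_data_url` and the sample file names are assumptions for illustration, not code from the PR.

```python
import base64
import mimetypes
from pathlib import Path


def build_data_url(path: Path, fallback_mime: str = "audio/wav") -> str:
    """Encode a file as a data URL, guessing the MIME type from its suffix."""
    mime, _encoding = mimetypes.guess_type(path.name)
    payload = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime or fallback_mime};base64,{payload}"


# guess_type works on the name alone; the files do not need to exist.
for name in ("sample.wav", "sample.mp3", "sample.ogg", "sample.zzzunknown"):
    print(name, mimetypes.guess_type(name)[0])
```

The guessed values depend on the Python version and on any system MIME tables it loads. In particular, `.wav` has often been registered as `audio/x-wav` rather than `audio/wav`, so switching from the hardcoded map may change the MIME string sent to the API; it is worth confirming which values the service accepts.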
Summary
This PR adapts the MiMo TTS provider to support the MiMo-V2.5-TTS series, including voicedesign and voiceclone models.
Changes
1. V2.5 Style Format Support
2. Voicedesign Model Support
3. Voiceclone Model Support
4. Metadata and i18n Updates
5. Tests
Testing
Summary by Sourcery
Adapt the MiMo TTS provider to support the MiMo-V2.5-TTS series, including new style formatting, voicedesign prompts, and voiceclone audio cloning, with corresponding config and test updates.
New Features:
Enhancements:
Tests: