feat(tools): add image caption fallback for FileReadTool when provider lacks image modality#8425

Open
kawayiYokami wants to merge 1 commit into AstrBotDevs:master from kawayiYokami:feat/file-read-image-fallback

Conversation

@kawayiYokami kawayiYokami commented May 29, 2026

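Motivation

Currently, when FileReadTool reads an image file, it returns ImageContent directly to the LLM. If the active provider does not support image modality, the image content is cached but the model cannot see it at all, so the read is wasted.

This PR adds fallback logic:

  1. If the current provider supports images → behavior is unchanged and ImageContent is returned.
  2. If it does not support images but an image caption model is configured (default_image_caption_provider_id) → the caption model is called and its text description is returned.
  3. If it does not support images and no image caption model is configured → a clear error message is returned: "Your provider does not support multimodal input, and no image caption model is currently configured, so the image cannot be read."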
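
The three cases above can be sketched as a small decision function. This is only an illustration under stated assumptions, not the PR's implementation: `ImageReadStrategy` and `choose_image_read_strategy` are hypothetical names, and only the setting key `default_image_caption_provider_id` comes from the PR.

```python
from enum import Enum


class ImageReadStrategy(Enum):
    """Hypothetical labels for the three outcomes described in the PR."""

    RETURN_IMAGE = "return_image"  # case 1: the provider can see images
    CAPTION = "caption"            # case 2: fall back to an image caption model
    ERROR = "error"                # case 3: nothing available can read the image


def choose_image_read_strategy(
    provider_supports_image: bool,
    provider_settings: dict,
) -> ImageReadStrategy:
    """Pick the fallback tier for an image read (illustrative only)."""
    if provider_supports_image:
        return ImageReadStrategy.RETURN_IMAGE
    # An empty or missing ID is treated as "not configured".
    if provider_settings.get("default_image_caption_provider_id", ""):
        return ImageReadStrategy.CAPTION
    return ImageReadStrategy.ERROR
```

Keeping the decision separate from the I/O (reading the file, calling the caption provider) would make each tier easy to test in isolation.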

Modifications

  • Modified astrbot/core/tools/computer_tools/fs.py

    • Added the _provider_supports_image() helper: checks whether the current provider supports image modality
    • Added the _caption_image_fallback() helper: calls the configured image caption provider to generate an image description
    • Modified FileReadTool.call(): checks provider capability before returning ImageContent and falls back when necessary
  • This is NOT a breaking change.

Screenshots or Test Results

11 passed (image test included), 3 pre-existing failures unrelated to this change

Checklist

  • My changes have been well-tested, and verification steps and screenshots have been provided above.
  • I have ensured that no new dependencies are introduced.
  • My changes do not introduce malicious code.

Summary by Sourcery

Add an image-captioning fallback for FileReadTool so image files remain usable when the active provider lacks image modality support.

New Features:

  • Provide automatic image captioning for image reads when a dedicated image caption provider is configured but the main provider does not support image modality.

Enhancements:

  • Detect provider image-modality support before returning ImageContent from FileReadTool and transparently degrade to text captions or clear error messages when unsupported.

@dosubot dosubot Bot added the size:M (this PR changes 30-99 lines, ignoring generated files) and area:core (the bug / feature is about astrbot's core, backend) labels May 29, 2026

@sourcery-ai sourcery-ai Bot left a comment

Hey - I've left some high level feedback:

  • In _provider_supports_image, swallowing all exceptions and defaulting to True can silently mask misconfigurations; consider at least logging the exception and/or defaulting to False when capability detection fails so that missing image modality is not accidentally treated as supported.
  • The broad except Exception in _caption_image_fallback will also catch programming errors (e.g., wrong types from text_chat); narrowing this to expected error types or re-raising after logging unexpected exceptions would make failures easier to diagnose.
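
One way to act on the first point is sketched below. It is an assumption-laden illustration, not the code in this PR: it supposes the provider object exposes a `provider_config` mapping with a `modalities` list (the real attribute names may differ), catches only the failure modes that lookup can plausibly raise, logs them, and defaults to `False`.

```python
import logging

logger = logging.getLogger(__name__)


def provider_supports_image(provider: object) -> bool:
    """Return True only when image support is positively detected.

    Assumes (hypothetically) that `provider.provider_config` is a mapping
    containing a `modalities` list such as ["text", "image"].
    """
    try:
        modalities = provider.provider_config.get("modalities", [])
        return "image" in modalities
    except (AttributeError, TypeError) as exc:
        # Expected failures: the attribute is missing, or the config has
        # an unexpected type (for example, `modalities` is None).
        logger.warning("Could not detect image modality support: %s", exc)
        return False
```

Defaulting to `False` sends ambiguous cases to the caption fallback rather than handing the model an image it may not be able to see. Whether a missing `modalities` entry should instead count as "supported" is a policy choice for the maintainers.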

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a fallback mechanism to caption images when the active provider does not support image modality. The review comments point out a critical issue: passing the file path directly to the caption provider will fail in sandbox environments because the host cannot access the sandbox filesystem. The reviewer suggests extracting the base64 image data from the tool result and passing it as a data URI instead, and recommends adding unit tests for this new functionality.

Comment on lines +233 to +277
async def _caption_image_fallback(
    context: ContextWrapper[AstrAgentContext],
    image_path: str,
) -> ToolExecResult:
    """Try to caption an image using the configured image caption provider.

    Returns the caption text or an error message if no caption provider is available.
    """
    from astrbot.core.provider.provider import Provider

    umo = context.context.event.unified_msg_origin
    cfg = context.context.context.get_config(umo=umo)
    provider_settings = cfg.get("provider_settings", {})
    caption_provider_id = provider_settings.get("default_image_caption_provider_id", "")

    if not caption_provider_id:
        return (
            "Error: your provider does not support image modality, "
            "and no image caption provider is configured. Unable to read image file."
        )

    caption_provider = context.context.context.get_provider_by_id(caption_provider_id)
    if caption_provider is None or not isinstance(caption_provider, Provider):
        return (
            "Error: your provider does not support image modality, "
            f"and the configured image caption provider `{caption_provider_id}` is not available. "
            "Unable to read image file."
        )

    caption_prompt = provider_settings.get(
        "image_caption_prompt", "Please describe the image."
    )

    try:
        llm_resp = await caption_provider.text_chat(
            prompt=caption_prompt,
            image_urls=[image_path],
        )
        caption = (llm_resp.completion_text or "").strip()
        if not caption:
            return "Error: image caption provider returned an empty description."
        return f"[Image description]: {caption}"
    except Exception as exc:
        logger.error(f"Image captioning failed: {exc}")
        return f"Error: failed to generate image description: {exc}"

high

In sandbox mode (local_env = False), the image file resides inside the sandbox container/environment, while the host running astrbot executes the provider's text_chat. Passing image_path (which is a sandbox path) directly to the provider will fail because the host cannot access the sandbox filesystem directly.

Since read_file_tool_result already reads and compresses the image into base64 format (returned as mcp.types.ImageContent), we should pass the base64 data URI (e.g., data:{mimeType};base64,{data}) to _caption_image_fallback instead of the file path. This avoids re-reading the file and works seamlessly in both local and sandbox environments.

async def _caption_image_fallback(
    context: ContextWrapper[AstrAgentContext],
    image_url: str,
) -> ToolExecResult:
    """Try to caption an image using the configured image caption provider.

    Returns the caption text or an error message if no caption provider is available.
    """
    from astrbot.core.provider.provider import Provider

    umo = context.context.event.unified_msg_origin
    cfg = context.context.context.get_config(umo=umo)
    provider_settings = cfg.get("provider_settings", {})
    caption_provider_id = provider_settings.get("default_image_caption_provider_id", "")

    if not caption_provider_id:
        return (
            "Error: your provider does not support image modality, "
            "and no image caption provider is configured. Unable to read image file."
        )

    caption_provider = context.context.context.get_provider_by_id(caption_provider_id)
    if caption_provider is None or not isinstance(caption_provider, Provider):
        return (
            "Error: your provider does not support image modality, "
            f"and the configured image caption provider `{caption_provider_id}` is not available. "
            "Unable to read image file."
        )

    caption_prompt = provider_settings.get(
        "image_caption_prompt", "Please describe the image."
    )

    try:
        llm_resp = await caption_provider.text_chat(
            prompt=caption_prompt,
            image_urls=[image_url],
        )
        caption = (llm_resp.completion_text or "").strip()
        if not caption:
            return "Error: image caption provider returned an empty description."
        return f"[Image description]: {caption}"
    except Exception as exc:
        logger.error(f"Image captioning failed: {exc}")
        return f"Error: failed to generate image description: {exc}"

Comment on lines +361 to +371
if (
    isinstance(result, mcp.types.CallToolResult)
    and result.content
    and any(
        isinstance(item, mcp.types.ImageContent) for item in result.content
    )
    and not _provider_supports_image(context)
):
    return await _caption_image_fallback(context, normalized_path)

return result

high

Update the fallback call to extract the base64 image data from the CallToolResult and pass it as a data URI to _caption_image_fallback. This ensures compatibility with sandbox environments where the host cannot directly access the local file path. Additionally, please ensure this new attachment handling functionality is accompanied by corresponding unit tests.

Suggested change
- if (
-     isinstance(result, mcp.types.CallToolResult)
-     and result.content
-     and any(
-         isinstance(item, mcp.types.ImageContent) for item in result.content
-     )
-     and not _provider_supports_image(context)
- ):
-     return await _caption_image_fallback(context, normalized_path)
- return result
+ if (
+     isinstance(result, mcp.types.CallToolResult)
+     and result.content
+     and any(
+         isinstance(item, mcp.types.ImageContent) for item in result.content
+     )
+     and not _provider_supports_image(context)
+ ):
+     image_item = next(
+         item for item in result.content if isinstance(item, mcp.types.ImageContent)
+     )
+     image_url = f"data:{image_item.mimeType};base64,{image_item.data}"
+     return await _caption_image_fallback(context, image_url)
+ return result
References
  1. New functionality, such as handling attachments, should be accompanied by corresponding unit tests.
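
A unit test for the data-URI step might look like the sketch below. To stay self-contained it uses a stand-in `FakeImageContent` dataclass instead of importing `mcp.types.ImageContent`, and `first_image_data_uri` is a hypothetical helper introduced only for illustration. A real test would exercise `FileReadTool.call()` with the project's own fixtures.

```python
import base64
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeImageContent:
    """Stand-in for mcp.types.ImageContent (only the fields used here)."""

    data: str  # base64-encoded image bytes
    mimeType: str


def first_image_data_uri(content: list) -> Optional[str]:
    """Hypothetical helper: build a data URI from the first image item."""
    for item in content:
        if isinstance(item, FakeImageContent):
            return f"data:{item.mimeType};base64,{item.data}"
    return None


def test_first_image_data_uri_uses_base64_payload() -> None:
    payload = base64.b64encode(b"not-a-real-png").decode("ascii")
    content = ["text item", FakeImageContent(data=payload, mimeType="image/png")]
    assert first_image_data_uri(content) == f"data:image/png;base64,{payload}"


def test_first_image_data_uri_without_image_returns_none() -> None:
    assert first_image_data_uri(["only text"]) is None
```

Because the helper works only on in-memory content, the same test covers both local and sandbox modes without touching the filesystem.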

@dosubot dosubot Bot added the lgtm (this PR has been approved by a maintainer) label May 30, 2026

Labels

  • area:core (the bug / feature is about astrbot's core, backend)
  • lgtm (this PR has been approved by a maintainer)
  • size:M (this PR changes 30-99 lines, ignoring generated files)
