Feature Request: Fal audio, speech, and music generation support #328

@tombeckenham

Feature Request

Description

Add audio, speech, and transcription adapters to the @tanstack/ai-fal package. The fal adapter currently supports image and video generation, but fal's platform also offers hundreds of models spanning:

  • Text-to-Speech (e.g., fal-ai/kokoro — multi-language TTS)
  • Audio generation (music, sound effects, audio-to-audio, voice-change, voice-clone, audio enhancement, separation, isolation, merge, understanding — e.g. fal-ai/stable-audio-25, which generates both music and sound effects)
  • Speech-to-Text (e.g., fal-ai/whisper, fal-ai/wizper — transcription)

Motivation

TanStack AI's fal adapter (@tanstack/ai-fal) currently implements falImage and falVideo adapters following the tree-shakeable adapter pattern. Adding audio/speech/transcription adapters would complete fal's media coverage and align with TanStack AI's goal of being a comprehensive, provider-agnostic AI SDK.

Proposed API

The proposal is a single broad falAudio adapter rather than a music/SFX split, because fal's audio catalog doesn't cleave along those lines: dozens of audio-to-audio, voice, enhancement, and understanding models live alongside music/SFX generators, and individual models (e.g. stable-audio-25) span both music and sound effects.

import { falAudio, falSpeech, falTranscription } from '@tanstack/ai-fal/adapters'

// Audio generation (music, sound effects, audio-to-audio, etc.)
const audioAdapter = falAudio('fal-ai/stable-audio-25')

// Text-to-Speech
const speechAdapter = falSpeech('fal-ai/kokoro')

// Speech-to-Text
const transcriptionAdapter = falTranscription('fal-ai/whisper')
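For illustration, a minimal sketch of what the falAudio factory could look like, following the same tree-shakeable pattern as the existing falImage/falVideo adapters. The adapter shape here (the kind/provider/modelId fields) is an assumption for this sketch, not the actual @tanstack/ai-fal interface:

```typescript
// Hypothetical adapter shape; the real @tanstack/ai-fal interface may differ.
interface FalAudioAdapter {
  kind: 'audio'
  provider: 'fal'
  modelId: string
}

// Tree-shakeable factory: importing falAudio pulls in only this function,
// mirroring how falImage and falVideo are structured today.
function falAudio(modelId: string): FalAudioAdapter {
  return { kind: 'audio', provider: 'fal', modelId }
}

const adapter = falAudio('fal-ai/stable-audio-25')
console.log(adapter.modelId) // 'fal-ai/stable-audio-25'
```

A broad factory like this keeps the single-adapter design: any of fal's audio models (music, SFX, audio-to-audio, enhancement) is addressed purely by model id rather than by a per-category adapter.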

Additional Context

  • Existing adapters use fal.subscribe() (image) and fal.queue (video) patterns
  • Audio generation may use either pattern depending on model latency
  • The fal SDK (@fal-ai/client) already supports audio responses with File/Audio output types
  • Model metadata types in model-meta.ts would need to be extended for audio, speech, and transcription models
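The second bullet's latency-based choice between the fal.subscribe() and fal.queue patterns could be sketched as a small dispatch helper. The model list and the "long-running" heuristic below are illustrative assumptions, not actual @tanstack/ai-fal behavior:

```typescript
// Illustrative sketch: which fal request pattern an audio adapter might pick.
type FalPattern = 'subscribe' | 'queue'

// Hypothetical set of audio models assumed to run long enough to need the
// queue pattern (as falVideo uses); real classification would live in
// model metadata.
const LONG_RUNNING_AUDIO_MODELS = new Set(['fal-ai/stable-audio-25'])

function pickPattern(modelId: string): FalPattern {
  // Low-latency models (e.g. TTS like fal-ai/kokoro) can block on
  // fal.subscribe(); long-running generation falls back to queue polling.
  return LONG_RUNNING_AUDIO_MODELS.has(modelId) ? 'queue' : 'subscribe'
}

console.log(pickPattern('fal-ai/stable-audio-25')) // 'queue'
console.log(pickPattern('fal-ai/kokoro'))          // 'subscribe'
```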

Note (2026-04-22)

An earlier iteration briefly split this into separate falMusic / falSoundEffects adapters and matching generateMusic / generateSoundEffects activities. That split was reverted once fal's full audio catalog (screenshot: dozens of audio-to-audio, voice-change, voice-clone, enhancement, separation, isolation, understanding, and merge-audios models) made it clear that the music/SFX binary is too narrow. A single generateAudio activity and falAudio adapter better match the reality.
