feat(cartesia): add ink-2 stt by charlotte-zhuang · Pull Request #5827 · livekit/agents

charlotte-zhuang · 2026-05-24T05:45:58Z

Adds the Cartesia Ink 2 STT model.

This model has a different API from Ink Whisper and supports different features, most importantly turn detection.

Both the new and the old API use the same cartesia.STT() class, so there is no confusion about which class to use.

transcribe_on_flush

I also added a way to use both ink-whisper and ink-2 in conjunction with stream.flush().

I'm not entirely sure how this works in LiveKit, but we have some users who want to tell the model when turns end rather than our model telling them when the turn ended. Those users will see increased latency and unexpected transcripts without this "transcribe_on_flush" behavior.

I couldn't use LegacyRecognizeStream because that implementation streams out FINAL_TRANSCRIPT in chunks as soon as they become available. LiveKit then concatenates all FINAL_TRANSCRIPT speech events with spaces:

self._audio_transcript += f" {transcript}"

This space concatenation can cause issues with the transcript by inserting spaces in unexpected places. My solution was to create a new opt-in behavior where FINAL_TRANSCRIPT is only emitted after stream.flush() is called and all audio up until that point is flushed. This way, the entire human turn is put in a single transcript and the space concatenation issue is avoided.
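To make the concatenation issue concrete, here is a minimal sketch in plain Python. It does not use any LiveKit or Cartesia code: `join_like_livekit` only mimics the `+= f" {transcript}"` line quoted above, `FlushBuffer` is a hypothetical stand-in for the opt-in behavior, and the chunk boundaries are invented for illustration.

```python
def join_like_livekit(final_chunks: list[str]) -> str:
    """Mimic the quoted accumulation: each FINAL_TRANSCRIPT is prefixed with a space."""
    audio_transcript = ""
    for transcript in final_chunks:
        audio_transcript += f" {transcript}"
    # strip() only tidies the leading space so the two results are easy to compare.
    return audio_transcript.strip()


class FlushBuffer:
    """Hypothetical stand-in: hold chunks and release one transcript on flush."""

    def __init__(self) -> None:
        self._chunks: list[str] = []

    def push(self, chunk: str) -> None:
        self._chunks.append(chunk)

    def flush(self) -> str:
        # The chunks already carry their own spacing, so join with no separator.
        text = "".join(self._chunks)
        self._chunks.clear()
        return text


# Invented chunk boundaries: splits fall mid-word and before punctuation.
chunks = ["Hel", "lo wor", "ld", "."]

# One FINAL_TRANSCRIPT per chunk: spaces land inside words.
chunked = join_like_livekit(chunks)

# One FINAL_TRANSCRIPT per flush: the turn arrives as a single event.
buffer = FlushBuffer()
for chunk in chunks:
    buffer.push(chunk)
single = join_like_livekit([buffer.flush()])

print(repr(chunked))
print(repr(single))
```

With these made-up chunks, the per-chunk path produces a transcript with stray spaces, while the flush path produces the turn intact.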

An alternative solution would be to add a flag to SpeechEvent like transcript_event_sep: str = " " so that plugins can specify how their transcripts should be concatenated. Then, LegacyRecognizeStream can just emit transcripts with transcript_event_sep="".
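A rough sketch of that alternative, assuming a simplified event type: `SketchSpeechEvent` and `accumulate` are hypothetical and are not the real SpeechEvent or LiveKit's accumulation code. Only the `transcript_event_sep` field name comes from the proposal above.

```python
from dataclasses import dataclass


@dataclass
class SketchSpeechEvent:
    """Hypothetical, simplified stand-in for SpeechEvent (not the real class)."""

    transcript: str
    # The proposed field; the default keeps the current space-joined behavior.
    transcript_event_sep: str = " "


def accumulate(events: list[SketchSpeechEvent]) -> str:
    """Join transcripts, letting each event choose the separator placed before it."""
    audio_transcript = ""
    for event in events:
        audio_transcript += f"{event.transcript_event_sep}{event.transcript}"
    return audio_transcript


# Default separator: same result as the existing f" {transcript}" accumulation.
default_text = accumulate([SketchSpeechEvent("Hel"), SketchSpeechEvent("lo.")])

# A plugin that streams chunks could opt out of the inserted space.
chunked_text = accumulate(
    [
        SketchSpeechEvent("Hel", transcript_event_sep=""),
        SketchSpeechEvent("lo.", transcript_event_sep=""),
    ]
)

print(repr(default_text))
print(repr(chunked_text))
```

The default value means existing plugins would not need to change; only plugins that emit partial chunks would set the separator to an empty string.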

Changes

These changes could be considered breaking in a way, but they should not affect expected usage patterns:

Default model for English is now ink-2
START_OF_SPEECH, INTERIM_TRANSCRIPT, PREFLIGHT_TRANSCRIPT, and END_OF_SPEECH events will be emitted
Aligned transcripts are no longer supported
update_options(model="...") is now a no-op. There is no reason to update your model mid-session, so I'm not sure why this was added in the first place. This should not negatively affect any users: before Ink 2 there was only a single Ink Whisper snapshot, which made the method even more bewildering.
SpeechStream is now an ABC to allow for different behavior based on the model. It is highly unexpected for users to be constructing their own cartesia.stt.SpeechStream instances.
Ink Whisper checks self._FlushSentinel and sends "finalize" commands to the server

These changes are purely additive:

TurnDetectingRecognizeStream, which is the default for ink-2
TranscribeOnFlushRecognizeStream, which makes stream.flush() control when the final transcript is emitted
cartesia.py example agent
AUDIO_ENCODING constant
doc strings
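The split into several stream implementations behind one abstract base could look roughly like the sketch below. Everything here is illustrative: the classes are local stand-ins that only borrow the names from this PR, and `make_stream` with its `transcribe_on_flush` flag is an assumed selection rule, not the plugin's actual code.

```python
from abc import ABC, abstractmethod


class SketchSpeechStream(ABC):
    """Hypothetical stand-in for the abstract SpeechStream."""

    @abstractmethod
    def describe(self) -> str: ...


class TurnDetectingRecognizeStream(SketchSpeechStream):
    def describe(self) -> str:
        return "the model decides when the turn ends"


class TranscribeOnFlushRecognizeStream(SketchSpeechStream):
    def describe(self) -> str:
        return "final transcript is emitted after stream.flush()"


class LegacyRecognizeStream(SketchSpeechStream):
    def describe(self) -> str:
        return "final transcript is streamed in chunks"


def make_stream(model: str, transcribe_on_flush: bool = False) -> SketchSpeechStream:
    """Assumed selection rule; the real plugin may choose differently."""
    if transcribe_on_flush:
        return TranscribeOnFlushRecognizeStream()
    if model == "ink-2":
        return TurnDetectingRecognizeStream()
    return LegacyRecognizeStream()


for model, on_flush in [("ink-2", False), ("ink-whisper", False), ("ink-whisper", True)]:
    stream = make_stream(model, on_flush)
    print(model, on_flush, "->", type(stream).__name__, "-", stream.describe())
```

Callers only ever see the base type, which is why constructing a stream directly is not an expected usage pattern.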

Testing

Tested using cartesia.py using both ink-2 and ink-whisper.

Also verified that make check passes and existing tests pass.

charlotte-zhuang · 2026-05-25T00:42:41Z

It looks like Test STT / test-stt is failing because:

Some errors with livekit.plugins.aws and livekit.plugins.fireworksai
OPENAI_API_KEY is not set
The GitHub action was unable to post a comment to this PR

FAILED tests/test_stt.py::test_recognize[livekit.plugins.openai] - openai.OpenAIError: Missing credentials. Please pass an `api_key`, `workload_identity`, `admin_api_key`, or set the `OPENAI_API_KEY` or `OPENAI_ADMIN_KEY` environment variable.
FAILED tests/test_stt.py::test_stream[livekit.plugins.fireworksai] - RuntimeError: livekit.plugins.fireworksai.stt.SpeechStream is closed
FAILED tests/test_stt.py::test_stream[livekit.plugins.openai] - openai.OpenAIError: Missing credentials. Please pass an `api_key`, `workload_identity`, `admin_api_key`, or set the `OPENAI_API_KEY` or `OPENAI_ADMIN_KEY` environment variable.
FAILED tests/test_stt.py::test_stream[livekit.plugins.aws] - smithy_core.exceptions.CallError: Unknown error for operation com.amazonaws.transcribestreaming#StartStreamTranscription - status: 403 - id: com.amazonaws.transcribestreaming#UnrecognizedClientException

POST /repos/livekit/agents/issues/5827/comments - 403 with id 8009:24BDD6:12551F54:4647D1C7:6A1310E4 in 157ms
RequestError [HttpError]: Resource not accessible by integration - https://docs.github.com/rest/issues/comments#create-an-issue-comment

tinalenguyen · 2026-05-26T17:44:56Z

-                samples_per_channel=samples_50ms,
-            )
+@dataclass
+class STTOptions:


do you foresee more parameters introduced in the future that would apply to both models? i'm a bit hesitant on deprecating this at the moment

I actually deprecated STTOptions since this class appears to be for internal use only.

The one public use was in SpeechStream.__init__(), but realistically users get a SpeechStream instance by calling STT.stream(), not by constructing their own instances. That was why I made SpeechStream an ABC, which removed the need for STTOptions.

tinalenguyen · 2026-05-26T17:55:23Z

+from .models import STTLanguages
+
+
+class CartesiaRecognizeStream(stt.RecognizeStream):


i wonder if it would be possible to condense this to something like:

CartesiaRecognizeStream: TypeAlias = TurnsRecognizeStream | LegacyRecognizeStream

what do you think of this?

Do you know if users call methods on the return of STT.stream()? i.e. is it important that all my RecognizeStream implementations have the same interface?

I just wanted to avoid a situation where you would need to do something like this

stt = cartesia.STT(model="ink-whisper")
stream = stt.stream()
cast(Any, stream).update_options(language="es")

So the CartesiaRecognizeStream ABC helps to make sure all our RecognizeStream implementations follow the same interface.
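As a small illustration of why an ABC gives a stronger guarantee than a type alias over a union, the sketch below uses Python's standard `abc` module. The class names are hypothetical stand-ins, not the plugin's classes; only the `update_options(language=...)` call shape is taken from the snippet above.

```python
from abc import ABC, abstractmethod


class SketchRecognizeStream(ABC):
    """Hypothetical shared base: every implementation must offer update_options."""

    @abstractmethod
    def update_options(self, *, language: str) -> None: ...


class CompleteStream(SketchRecognizeStream):
    def __init__(self) -> None:
        self.language = "en"

    def update_options(self, *, language: str) -> None:
        self.language = language


class IncompleteStream(SketchRecognizeStream):
    """Forgets to implement update_options, so it stays abstract."""


# No cast needed: the base class guarantees the method exists.
stream: SketchRecognizeStream = CompleteStream()
stream.update_options(language="es")

# Python refuses to instantiate a subclass that misses an abstract method.
try:
    IncompleteStream()
    instantiated = True
except TypeError:
    instantiated = False

print(instantiated)
```

A union alias would type-check only the methods that every member happens to share, whereas the abstract base makes a missing method fail as soon as the class is instantiated.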

If I were to implement this for the first time, I would probably make our SpeechStream.update_options method private.

tinalenguyen · 2026-05-26T17:56:12Z

/test-stt

github-actions · 2026-05-26T18:00:33Z

STT Test Results

Status: ✗ Some tests failed

Metric	Count
✓ Passed	22
✗ Failed	2
× Errors	1
→ Skipped	16
▣ Total	41
⏱ Duration	207.9s

Failed Tests

tests.test_stt::test_stream[livekit.plugins.nvidia]

stt_factory = <function parameter_factory.<locals>.<lambda> at 0x7fe267f21bc0>
request = <FixtureRequest for <Coroutine test_stream[livekit.plugins.nvidia]>>

    @pytest.mark.usefixtures("job_process")
    @pytest.mark.parametrize("stt_factory", STTs)
    async def test_stream(stt_factory: Callable[[], STT], request):
        sample_rate = SAMPLE_RATE
        plugin_id = request.node.callspec.id.split("-")[0]
        frames, transcript, _ = await make_test_speech(chunk_duration_ms=10, sample_rate=sample_rate)
  
        # TODO: differentiate missing key vs other errors
        try:
            stt_instance: STT = stt_factory()
        except ValueError as e:
            pytest.skip(f"{plugin_id}: {e}")
  
        async with stt_instance as stt:
            label = f"{stt.model}@{stt.provider}"
            if not stt.capabilities.streaming:
                pytest.skip(f"{label} does not support streaming")
  
            for attempt in range(MAX_RETRIES):
                try:
                    state = {"closing": False}
  
                    async def _stream_input(
                        frames: list[rtc.AudioFrame], stream: RecognizeStream, state: dict = state
                    ):
                        for frame in frames:
                            stream.push_frame(frame)
                            await asyncio.sleep(0.005)
  
                        stream.end_input()
                        state["closing"] = True
  
                    async def _stream_output(stream: RecognizeStream, state: dict = state):
                        text = ""
                        # make sure the events are sent in the right order
                        recv_start, recv_end = False, True
                        start_time = time.time()
                        got_final_transcript = False
  
                        async for event in stream:
                            if event.type == agents.stt.SpeechEventType.START_OF_SPEECH:

tests.test_stt::test_stream[livekit.plugins.speechmatics]

def finalizer() -> None:
        """Yield again, to finalize."""
  
        async def async_finalizer() -> None:
            try:
                await gen_obj.__anext__()
            except StopAsyncIteration:
                pass
            else:
                msg = "Async generator fixture didn't stop."
                msg += "Yield only once."
                raise ValueError(msg)
  
>       runner.run(async_finalizer(), context=context)

.venv/lib/python3.12/site-packages/pytest_asyncio/plugin.py:330: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <asyncio.runners.Runner object at 0x7f5f0ac02750>
coro = <coroutine object _wrap_asyncgen_fixture.<locals>._asyncgen_fixture_wrapper.<locals>.finalizer.<locals>.async_finalizer at 0x7f5f09f4c380>

    def run(self, coro, *, context=None):
        """Run a coroutine inside the embedded event loop."""
        if not coroutines.iscoroutine(coro):
            raise ValueError("a coroutine was expected, got {!r}".format(coro))
  
        if events._get_running_loop() is not None:
            # fail fast with short traceback
            raise RuntimeError(
                "Runner.run() cannot be called from a running event loop")
  
        self._lazy_init()
  
        if context is None:
            context = self._context
        task = self._loop.create_task(coro, context=context)
  
        if (threading.current_thread() is threading.main_thread()
            and signal.getsignal(signal.SIGINT) is signal.default_int_handler
        ):
            sigint_handler = functools.partial(self._on_sigint, main_task=task)
            try:
                signal.signal(signal.SIGINT, sigint_handler)
            except ValueError:
                # `signal.signal` may throw if `threading.main_thread` does
                # not support signals (e.g. embedded interpreter with signals
                # not registered - see gh-91880)
                sigint_handler = None

tests.test_stt::test_stream[livekit.agents.inference]

stt_factory = <function parameter_factory.<locals>.<lambda> at 0x7f5f0a3218a0>
request = <FixtureRequest for <Coroutine test_stream[livekit.agents.inference]>>

    @pytest.mark.usefixtures("job_process")
    @pytest.mark.parametrize("stt_factory", STTs)
    async def test_stream(stt_factory: Callable[[], STT], request):
        sample_rate = SAMPLE_RATE
        plugin_id = request.node.callspec.id.split("-")[0]
        frames, transcript, _ = await make_test_speech(chunk_duration_ms=10, sample_rate=sample_rate)
  
        # TODO: differentiate missing key vs other errors
        try:
            stt_instance: STT = stt_factory()
        except ValueError as e:
            pytest.skip(f"{plugin_id}: {e}")
  
        async with stt_instance as stt:
            label = f"{stt.model}@{stt.provider}"
            if not stt.capabilities.streaming:
                pytest.skip(f"{label} does not support streaming")
  
            for attempt in range(MAX_RETRIES):
                try:
                    state = {"closing": False}
  
                    async def _stream_input(
                        frames: list[rtc.AudioFrame], stream: RecognizeStream, state: dict = state
                    ):
                        for frame in frames:
                            stream.push_frame(frame)
                            await asyncio.sleep(0.005)
  
                        stream.end_input()
                        state["closing"] = True
  
                    async def _stream_output(stream: RecognizeStream, state: dict = state):
                        text = ""
                        # make sure the events are sent in the right order
                        recv_start, recv_end = False, True
                        start_time = time.time()
                        got_final_transcript = False
  
                        async for event in stream:
                            if event.type == agents.stt.SpeechEventType.START_OF_SPEECH:

Skipped Tests

Test	Reason
`tests.test_stt::test_recognize[livekit.plugins.assemblyai]`	universal-streaming-english@AssemblyAI does not support batch recognition
`tests.test_stt::test_recognize[livekit.plugins.speechmatics]`	enhanced@Speechmatics does not support batch recognition
`tests.test_stt::test_recognize[livekit.plugins.fireworksai]`	unknown@FireworksAI does not support batch recognition
`tests.test_stt::test_recognize[livekit.plugins.nvidia]`	unknown@unknown does not support batch recognition
`tests.test_stt::test_recognize[livekit.plugins.cartesia]`	ink-2@Cartesia does not support batch recognition
`tests.test_stt::test_recognize[livekit.plugins.soniox]`	stt-rt-v4@Soniox does not support batch recognition
`tests.test_stt::test_recognize[livekit.agents.inference]`	unknown@livekit does not support batch recognition
`tests.test_stt::test_recognize[livekit.plugins.azure]`	unknown@Azure STT does not support batch recognition
`tests.test_stt::test_recognize[livekit.plugins.aws]`	unknown@Amazon Transcribe does not support batch recognition
`tests.test_stt::test_recognize[ink-whisper@livekit.plugins.cartesia]`	ink-whisper@Cartesia does not support batch recognition
`tests.test_stt::test_recognize[livekit.plugins.deepgram.STTv2]`	flux-general-en@Deepgram does not support batch recognition
`tests.test_stt::test_stream[livekit.plugins.elevenlabs]`	scribe_v1@ElevenLabs does not support streaming
`tests.test_stt::test_recognize[livekit.plugins.gradium.STT]`	unknown@Gradium does not support batch recognition
`tests.test_stt::test_stream[livekit.plugins.fal]`	Wizper@Fal does not support streaming
`tests.test_stt::test_stream[livekit.plugins.mistralai]`	voxtral-mini-latest@MistralAI does not support streaming
`tests.test_stt::test_stream[livekit.plugins.openai]`	gpt-4o-mini-transcribe@api.openai.com does not support streaming

Triggered by workflow run #2165

charlotte-zhuang · 2026-05-27T04:24:40Z

@tinalenguyen I addressed all your comments!

I also ended up adding a third implementation of CartesiaRecognizeStream, which is why I ended up opting to keep it.

Copied from my PR description:

I also added a way to use both ink-whisper and ink-2 in conjunction with stream.flush().

I'm not entirely sure how this works in LiveKit, but we have some users who want to tell the model when turns end rather than our model telling them when the turn ended. Those users will see increased latency and unexpected transcripts without this "transcribe_on_flush" behavior.

I couldn't use LegacyRecognizeStream because that implementation streams out FINAL_TRANSCRIPT in chunks as soon as they become available. LiveKit then concatenates all FINAL_TRANSCRIPT speech events with spaces:

self._audio_transcript += f" {transcript}"

This space concatenation can cause issues with the transcript by inserting spaces in unexpected places. My solution was to create a new opt-in behavior where FINAL_TRANSCRIPT is only emitted after stream.flush() is called and all audio up until that point is flushed. This way, the entire human turn is put in a single transcript and the space concatenation issue is avoided.

An alternative solution would be to add a flag to SpeechEvent like transcript_event_sep: str = " " so that plugins can specify how their transcripts should be concatenated. Then, LegacyRecognizeStream can just emit transcripts with transcript_event_sep="".

CLAassistant · 2026-05-27T06:15:58Z

All committers have signed the CLA.

charlotte-zhuang added 4 commits May 23, 2026 22:00

feat(cartesia): add ink-2 stt

0be8da0

feat(cartesia): create cartesia example agent

7d0bcf5

docs(cartesia): add changeset

f4c1549

refactor(cartesia): add types for server messages

3fe673b

charlotte-zhuang added 7 commits May 23, 2026 23:08

fix(cartesia): raise api errors rather than swallowing

013ff24

fix(cartesia): tigthen which errors are ignored

682c1e2

ci(cartesia): add ink-whisper stt test

ce47013

docs(cartesia): move where the api doc url is

623ea64

fix(cartesia): handle flush sentinels with ink-whisper

2beb3ea

docs(cartesia): update changeset

1ebaa4d

ci(cartesia): run make fix

cc4747f

charlotte-zhuang added 2 commits May 24, 2026 07:34

fix(cartesia): correct error event schema

8d04745

fix(cartesia): send finalize when there is no audio

66ebad8

fix(stt): reset state on reconnect

d811bc0

tinalenguyen reviewed May 26, 2026

charlotte-zhuang added 5 commits May 26, 2026 13:20

style(cartesia): remove override decorators

02d380e

fix(cartesia): undo language type hint narrowing

5014d51

feat(cartesia): support stream flushing

383ec38

feat(cartesia): support stream.flush()

29f36f6

remove unnecessary changeset

52c3d3e

charlotte-zhuang requested a review from tinalenguyen May 27, 2026 04:39

charlotte-zhuang added 4 commits May 26, 2026 23:16

fix(cartesia): do not wait for keepalive task to finish before exiting

3ec9c64

docs(cartesia): add docstring to cartesia.py example

af750cd

docs(cartesia): add flush example

5501f65

docs(cartesia): add flush eval example

8d69915

charlotte-zhuang force-pushed the main branch from aabffe7 to 8d69915 · May 27, 2026 06:17

3 participants
