feat(cartesia): add ink-2 stt #5827
samples_per_channel=samples_50ms,
)
@dataclass
class STTOptions:
do you foresee more parameters introduced in the future that would apply to both models? i'm a bit hesitant on deprecating this at the moment
I actually deprecated STTOptions since this class appears to be for internal use only.
The one public use was in SpeechStream.__init__(), but realistically users get a SpeechStream instance by calling STT.stream(), not by constructing their own instances. That was why I made SpeechStream an ABC, which removed the need for STTOptions.
from .models import STTLanguages

class CartesiaRecognizeStream(stt.RecognizeStream):
i wonder if it would be possible to condense this to something like:
CartesiaRecognizeStream: TypeAlias = TurnsRecognizeStream | LegacyRecognizeStream
what do you think of this?
Do you know if users call methods on the return of STT.stream()? i.e. is it important that all my RecognizeStream implementations have the same interface?
I just wanted to avoid a situation where you would need to do something like this:
stt = cartesia.STT(model="ink-whisper")
stream = stt.stream()
cast(Any, stream).update_options(language="es")
So the CartesiaRecognizeStream ABC helps to make sure all our RecognizeStream implementations follow the same interface.
If I were to implement this for the first time, I would probably make our SpeechStream.update_options method private.
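To illustrate the point about a shared interface, here is a minimal sketch of the pattern being discussed. It does not reproduce the plugin's real classes: `RecognizeStreamBase`, `LegacyStream`, `TurnsStream`, and `make_stream` are hypothetical names used only for illustration, and the real streams carry far more state.

```python
from abc import ABC, abstractmethod


class RecognizeStreamBase(ABC):
    """Hypothetical stand-in for a shared stream interface."""

    def __init__(self, language: str) -> None:
        self.language = language

    @abstractmethod
    def update_options(self, *, language: str) -> None:
        """Every concrete stream must accept the same option updates."""


class LegacyStream(RecognizeStreamBase):
    def update_options(self, *, language: str) -> None:
        self.language = language


class TurnsStream(RecognizeStreamBase):
    def update_options(self, *, language: str) -> None:
        self.language = language


def make_stream(model: str, language: str = "en") -> RecognizeStreamBase:
    # The factory picks an implementation by model name (an assumption made
    # for this sketch), but callers only see the shared base type, so they
    # can call update_options() without a cast.
    if model == "ink-whisper":
        return LegacyStream(language)
    return TurnsStream(language)


stream = make_stream("ink-whisper")
stream.update_options(language="es")
```

With a plain union alias, a type checker only accepts a method call if every member of the union defines it compatibly. The abstract base makes that requirement explicit and fails at instantiation time if an implementation forgets the method.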
/test-stt
STT Test Results
Status: ✗ Some tests failed
Failed Tests
Skipped Tests
Triggered by workflow run #2165
@tinalenguyen I addressed all your comments! I also ended up adding a third implementation of
Copied from my PR description:
Adds the Cartesia Ink 2 STT model.
This model has a different API from Ink Whisper and supports different features, most importantly turn detection.
A decision was made to use the same cartesia.STT() class for both the new and old API to avoid confusion as to which class to use.
transcribe_on_flush
I also added a way to use both ink-whisper and ink-2 in conjunction with stream.flush().
I'm not entirely sure how this works in LiveKit, but we have some users who want to tell the model when turns end rather than our model telling them when the turn ended. Those users will see increased latency and unexpected transcripts without this "transcribe_on_flush" behavior.
I couldn't use LegacyRecognizeStream because that implementation streams out FINAL_TRANSCRIPT in chunks as soon as they become available. Then, LiveKit concatenates all FINAL_TRANSCRIPT speech events with spaces.
This space concatenation can cause issues with the transcript by inserting spaces in unexpected places. My solution was to create a new opt-in behavior where FINAL_TRANSCRIPT is only emitted after stream.flush() is called and all audio up until that point is flushed. This way, the entire human turn is put in a single transcript and the space concatenation issue is avoided.
An alternative solution would be to add a flag to SpeechEvent like transcript_event_sep: str = " " so that plugins can specify how their transcripts should be concatenated. Then, LegacyRecognizeStream can just emit transcripts with transcript_event_sep="".
Changes
These changes could be considered breaking in a way, but they should not affect expected usage patterns:
- ink-2: START_OF_SPEECH, INTERIM_TRANSCRIPT, PREFLIGHT_TRANSCRIPT, and END_OF_SPEECH events will be emitted
- update_options(model="...") is now a no-op. There is no reason to update your model mid-session so I'm not sure why this was added in the first place. This should not negatively affect any users since before Ink 2, there was only a single Ink Whisper snapshot, making the method even more bewildering.
- SpeechStream is now an ABC to allow for different behavior based on the model. It is highly unexpected for users to be constructing their own cartesia.stt.SpeechStream instances.
- self._FlushSentinel and sends "finalize" commands to the server
These changes are purely additive:
- TurnDetectingRecognizeStream, which is the default for ink-2
- TranscribeOnFlushRecognizeStream, which makes stream.flush() control when the final transcript is emitted
- cartesia.py example agent
- AUDIO_ENCODING constant
Testing
Tested using cartesia.py with both ink-2 and ink-whisper.
Also verified that make check passes and existing tests pass.
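As a rough illustration of the space concatenation issue described in the transcribe_on_flush section, the sketch below joins streamed chunks with and without a separator, and then buffers them until a flush. Everything here is an assumption made for illustration: `join_transcripts` and `FlushBuffer` are hypothetical helpers, not LiveKit or Cartesia code, and the sample chunks are invented rather than real model output.

```python
def join_transcripts(chunks: list[str], sep: str = " ") -> str:
    """Join streamed transcript chunks with a separator (hypothetical helper)."""
    return sep.join(chunks)


# Invented chunks that split mid-word and carry their own spacing.
chunks = ["Hel", "lo", " wor", "ld"]

spaced = join_transcripts(chunks)         # separator inserted between every chunk
exact = join_transcripts(chunks, sep="")  # chunks trusted to carry their own spacing


class FlushBuffer:
    """Collects chunks and releases a single transcript per flush()."""

    def __init__(self) -> None:
        self._chunks: list[str] = []

    def push(self, chunk: str) -> None:
        self._chunks.append(chunk)

    def flush(self) -> str:
        # Emit everything received so far as one transcript, then reset.
        text = "".join(self._chunks)
        self._chunks.clear()
        return text


buffer = FlushBuffer()
for chunk in chunks:
    buffer.push(chunk)
final_transcript = buffer.flush()
```

Here `spaced` comes out as "Hel lo  wor ld" while `exact` and `final_transcript` are both "Hello world". Both proposed fixes avoid letting the consumer insert a separator between partial chunks: the separator flag would do it by configuration, and the flush-driven stream does it by emitting only one event per turn.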