feat(plugin-openai): stream input_audio_transcription delta events #1581
Add support for `conversation.item.input_audio_transcription.delta` events on the OpenAI Realtime API. These are emitted by the new `gpt-realtime-whisper` transcription model (and any future delta-emitting STT model) as audio is processed; today the plugin's RealtimeSession event switch handles only `.completed` and `.failed`, so deltas are silently dropped and user transcripts surface to the client only after the entire utterance finalises.
🦋 Changeset detected
Latest commit: bd0bcca
The changes in this PR will be included in the next version bump. This PR includes changesets to release 33 packages.
hey brian! 👋 hope all is well - i have a quick PR for streaming text output within the open ai voice realtime model. when you get a moment, can you review? let me know if anything looks off or you'd like me to update anything. thanks!
We had a comment in python:

```python
elif event["type"] == "conversation.item.input_audio_transcription.delta":
    # currently incoming transcripts are transcribed only after the user stops speaking
    # it's not very useful to emit these as the transcribe process takes place within ~100ms
    # when they handle streaming transcriptions, we'll handle it then.
    pass
```

Is this not the case anymore based on your testing?
yea, fair question. but to that comment above -- https://openai.com/index/advancing-voice-intelligence-with-new-models-in-the-api/ lmk if i'm misinterpreting anything, though! thanks for the quick response
@toubatbrian just following up here after a long weekend to get back on your radar 👍
Description
Adds support for `conversation.item.input_audio_transcription.delta` events on the OpenAI Realtime API. These are emitted by the new `gpt-realtime-whisper` transcription model (and any future delta-emitting STT model) as audio is processed. Today the plugin's `RealtimeSession` event switch only handles `.completed` and `.failed`, so deltas are silently dropped and user transcripts surface to the client only after the entire utterance finalizes.

Changes Made
- `plugins/openai/src/realtime/api_proto.ts`: add `ConversationItemInputAudioTranscriptionDeltaEvent` to `ServerEventType` / `ServerEvent`.
- `plugins/openai/src/realtime/realtime_model.ts`: add a per-item delta accumulator (`inputTranscriptAccumulators`), wire `.delta` into the event switch, emit accumulated text as `isFinal: false`, and clear the accumulator on `.completed`. `remoteChatCtx` is intentionally only updated on `.completed`, since partials shouldn't mutate persistent chat history.
- `plugins/openai/src/realtime/realtime_model.test.ts`: 177 lines of new tests covering delta accumulation, ordering, accumulator cleanup on completion, mixed delta/non-delta sessions, and the back-compat path for models that only emit `.completed`.
- `examples/src/realtime_streaming_transcript.ts`: runnable demo using `gpt-realtime-whisper`, with inline notes on how partials replace (not append to) prior text per turn.

Pre-Review Checklist
Testing
- `restaurant_agent.ts` and `realtime_agent.ts` work properly (for major changes)

Manual repro:
`pnpm build && node ./examples/dist/realtime_streaming_transcript.js dev`, join the agent's room from the Playground, and speak a long sentence. Expect `[user transcript partial]` lines streaming to stdout as you speak, ending with one `[user transcript FINAL]`.

Additional Notes
Note to reviewers: Please ensure the pre-review checklist is completed before starting your review.
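
For readers without the diff open, here is a minimal sketch of the per-item accumulation described under Changes Made. It is not the code from `realtime_model.ts`: the class name `InputTranscriptTracker`, the `TranscriptUpdate` shape, and the trimmed-down event types are hypothetical and exist only for illustration. The event `type` strings and the `isFinal` flag come from the description above. The `item_id`, `delta`, and `transcript` field names are assumed to match the Realtime API payloads.

```typescript
// Hypothetical, simplified event shapes. The real definitions live in api_proto.ts
// and carry more fields than shown here.
type DeltaEvent = {
  type: 'conversation.item.input_audio_transcription.delta';
  item_id: string;
  delta: string;
};
type CompletedEvent = {
  type: 'conversation.item.input_audio_transcription.completed';
  item_id: string;
  transcript: string;
};
type TranscriptionEvent = DeltaEvent | CompletedEvent;

// Hypothetical payload surfaced to the client for each update.
interface TranscriptUpdate {
  itemId: string;
  transcript: string;
  isFinal: boolean;
}

class InputTranscriptTracker {
  // One accumulator per conversation item, keyed by item_id.
  private accumulators = new Map<string, string>();

  handle(event: TranscriptionEvent): TranscriptUpdate {
    if (event.type === 'conversation.item.input_audio_transcription.delta') {
      // Append the delta and emit the full text so far as a partial.
      const text = (this.accumulators.get(event.item_id) ?? '') + event.delta;
      this.accumulators.set(event.item_id, text);
      return { itemId: event.item_id, transcript: text, isFinal: false };
    }
    // Completed: drop the accumulator and emit the server's final transcript.
    // Models that never send deltas take this path with no accumulator present.
    this.accumulators.delete(event.item_id);
    return { itemId: event.item_id, transcript: event.transcript, isFinal: true };
  }

  // Number of items that still have an in-flight partial transcript.
  pendingCount(): number {
    return this.accumulators.size;
  }
}
```

Because each partial carries the whole accumulated text, a consumer would overwrite the previous partial for that item rather than append to it. Persistent chat history would only be touched when `isFinal` is true.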