feat(plugin-openai): stream input_audio_transcription delta events#1581

Merged
toubatbrian merged 1 commit into
livekit:mainfrom
jjsquillante:feat/realtime-input-audio-transcription-delta
May 26, 2026

Conversation

@jjsquillante
Contributor

@jjsquillante jjsquillante commented May 22, 2026

Description

Adds support for conversation.item.input_audio_transcription.delta events on the OpenAI Realtime API. These are emitted by the new gpt-realtime-whisper transcription model (and any future delta-emitting STT model) as audio is processed. Today the plugin's RealtimeSession event switch only handles .completed and .failed, so deltas are silently dropped and user transcripts surface to the client only after the entire utterance finalizes.
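To make the routing concrete, here is a minimal sketch of a switch over the three transcription event types once `.delta` has its own branch. The field names (`item_id`, `delta`, `transcript`) and the `describeEvent` helper are assumptions for illustration; they are not copied from `api_proto.ts` or the plugin's actual handler.

```typescript
// Hypothetical event shapes; only the `type` strings come from the PR description.
type TranscriptionEvent =
  | {
      type: 'conversation.item.input_audio_transcription.delta';
      item_id: string;
      delta: string;
    }
  | {
      type: 'conversation.item.input_audio_transcription.completed';
      item_id: string;
      transcript: string;
    }
  | {
      type: 'conversation.item.input_audio_transcription.failed';
      item_id: string;
    };

// Hypothetical helper: routes each event type to its own branch.
// Without the `.delta` case, partial events would have no handler at all.
function describeEvent(event: TranscriptionEvent): string {
  switch (event.type) {
    case 'conversation.item.input_audio_transcription.delta':
      return `partial ${event.item_id}: ${event.delta}`;
    case 'conversation.item.input_audio_transcription.completed':
      return `final ${event.item_id}: ${event.transcript}`;
    case 'conversation.item.input_audio_transcription.failed':
      return `failed ${event.item_id}`;
  }
}
```

Modeling the events as a discriminated union lets the compiler narrow each branch to the fields that event carries.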

Changes Made

  • plugins/openai/src/realtime/api_proto.ts: add ConversationItemInputAudioTranscriptionDeltaEvent to ServerEventType / ServerEvent.
  • plugins/openai/src/realtime/realtime_model.ts: add per-item delta accumulator (inputTranscriptAccumulators), wire .delta into the event switch, emit accumulated text as isFinal: false, and clear the accumulator on .completed. remoteChatCtx is intentionally only updated on .completed — partials shouldn't mutate persistent chat history.
  • plugins/openai/src/realtime/realtime_model.test.ts: 177 lines of new tests covering delta accumulation, ordering, accumulator cleanup on completion, mixed delta/non-delta sessions, and the back-compat path for models that only emit .completed.
  • examples/src/realtime_streaming_transcript.ts: runnable demo using gpt-realtime-whisper, with inline notes on how partials replace (not append to) prior text per turn.
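The accumulator behaviour described in the list above can be sketched roughly as follows. This is a simplified stand-in, not the code in `realtime_model.ts`: the `InputTranscriptTracker` class, its method names, and the `history` array (standing in for `remoteChatCtx`) are all hypothetical.

```typescript
// Shape of what gets emitted to listeners; field names are assumptions.
interface InputTranscription {
  itemId: string;
  transcript: string;
  isFinal: boolean;
}

class InputTranscriptTracker {
  // One running string per conversation item, keyed by item id.
  private accumulators = new Map<string, string>();
  // Stand-in for persistent chat history; only finals are written here.
  readonly history: { itemId: string; transcript: string }[] = [];

  // Append the delta to the item's running text and emit the full text so far.
  onDelta(itemId: string, delta: string): InputTranscription {
    const text = (this.accumulators.get(itemId) ?? '') + delta;
    this.accumulators.set(itemId, text);
    return { itemId, transcript: text, isFinal: false };
  }

  // Drop the accumulator, record the final transcript, and emit it as final.
  // Works the same whether or not any deltas arrived first (back-compat path).
  onCompleted(itemId: string, transcript: string): InputTranscription {
    this.accumulators.delete(itemId);
    this.history.push({ itemId, transcript });
    return { itemId, transcript, isFinal: true };
  }

  // Number of items that still have an in-flight partial transcript.
  pendingCount(): number {
    return this.accumulators.size;
  }
}
```

Keying by item id keeps interleaved deltas for different items from mixing, and deleting the entry on completion keeps the map from growing over a long session.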

Pre-Review Checklist

  • Build passes: All builds (lint, typecheck, tests) pass locally
  • AI-generated code reviewed: Removed unnecessary comments and ensured code quality
  • Changes explained: All changes are properly documented and justified above
  • Scope appropriate: All changes relate to the PR title, or explanations provided for why they're included
  • Video demo: A small video demo showing changes works as expected and did not break any existing functionality using Agent Playground (if applicable)

Testing

  • Automated tests added/updated (if applicable)
  • All tests pass
  • Make sure both restaurant_agent.ts and realtime_agent.ts work properly (for major changes)

Manual repro: pnpm build && node ./examples/dist/realtime_streaming_transcript.js dev, join the agent's room from the Playground, and speak a long sentence. Expect [user transcript partial] lines streaming to stdout as you speak, ending with one [user transcript FINAL].

[21:06:59.936] INFO (47252): playout completed with interrupt
    speech_id: "speech_73d3d7b6-440"
    message: "Hello! Please say a long sentence, and you’ll see your words appear on the screen as you speak."
[21:06:59.937] DEBUG (47252): Task.runTask: task AgentActivity.realtimeReply done
[21:07:01.446] INFO (47252): [user transcript partial]
    transcript: " Hey, um"
[21:07:02.242] INFO (47252): [user transcript partial]
    transcript: " Hey, um yes, so"
[21:07:03.231] INFO (47252): [user transcript partial]
    transcript: " Hey, um yes, so the quick"
[21:07:03.832] INFO (47252): [user transcript partial]
    transcript: " Hey, um yes, so the quick brown"
[21:07:04.433] INFO (47252): [user transcript partial]
    transcript: " Hey, um yes, so the quick brown fox jumped"
[21:07:05.031] INFO (47252): [user transcript partial]
    transcript: " Hey, um yes, so the quick brown fox jumped over the"
[21:07:06.229] INFO (47252): [user transcript partial]
    transcript: " Hey, um yes, so the quick brown fox jumped over the lazy dog"
[21:07:07.021] INFO (47252): [user transcript partial]
    transcript: " Hey, um yes, so the quick brown fox jumped over the lazy dog as the"
[21:07:08.225] INFO (47252): [user transcript partial]
    transcript: " Hey, um yes, so the quick brown fox jumped over the lazy dog as the text streamed"
[21:07:08.635] INFO (47252): [user transcript partial]
    transcript: " Hey, um yes, so the quick brown fox jumped over the lazy dog as the text streamed on the"
[21:07:09.024] INFO (47252): [user transcript partial]
    transcript: " Hey, um yes, so the quick brown fox jumped over the lazy dog as the text streamed on the screen"
[21:07:09.245] INFO (47252): onInputSpeechStopped
    userTranscriptionEnabled: true
[21:07:09.262] INFO (47252): Creating speech handle
    speech_id: "speech_73fc4af8-cf9"
[21:07:09.262] DEBUG (47252): Task.runTask: task AgentActivity.realtimeGeneration started
[21:07:09.263] DEBUG (47252): realtime generation started
    speech_id: "speech_73fc4af8-cf9"
    stepIndex: 1
[21:07:09.264] DEBUG (47252): Task.runTask: task AgentActivity.realtime_generation.read_messages started
[21:07:09.264] DEBUG (47252): Task.runTask: task AgentActivity.realtime_generation.read_tool_stream started
[21:07:09.264] DEBUG (47252): Task.runTask: task performToolExecutions started
[21:07:09.545] DEBUG (47252): Task.runTask: task performTextForwarding started
[21:07:09.546] DEBUG (47252): Task.runTask: task performAudioForwarding started
[21:07:09.704] INFO (47252): [user transcript FINAL]
    transcript: "Hey, um yes, so the quick brown fox jumped over the lazy dog as the text streamed on the screen"
[21:07:10.688] DEBUG (47252): Task.runTask: task performTextForwarding done
[21:07:10.689] DEBUG (47252): Closing generation channels in handleResponseDone
    messageCount: 1
[21:07:10.693] DEBUG (47252): Task.runTask: task AgentActivity.realtime_generation.read_tool_stream done
[21:07:10.698] DEBUG (47252): Task.runTask: task performToolExecutions done
[21:07:14.309] DEBUG (47252): Task.runTask: task performAudioForwarding done
[21:07:14.312] DEBUG (47252): Task.runTask: task AgentActivity.realtime_generation.read_messages done
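
Because each partial in the log above carries the full text so far, a consumer should overwrite its in-progress line rather than append to it. A minimal sketch of that handling, assuming a hypothetical `TranscriptView` class that receives `(transcript, isFinal)` pairs; this is not the code in the example file:

```typescript
class TranscriptView {
  // Finalized turns, oldest first.
  private finals: string[] = [];
  // Text of the turn currently being spoken; empty when no turn is in progress.
  private partial = '';

  update(transcript: string, isFinal: boolean): void {
    // The partials in the log have a leading space that the final lacks, so trim both.
    const text = transcript.trim();
    if (isFinal) {
      this.finals.push(text);
      this.partial = '';
    } else {
      // Replace, do not append: the partial already contains all earlier text.
      this.partial = text;
    }
  }

  // One line per finalized turn, plus the in-progress turn if there is one.
  render(): string {
    const lines = this.partial ? [...this.finals, this.partial] : this.finals;
    return lines.join('\n');
  }
}
```

Appending instead of replacing would repeat the earlier words on every partial, since the emitted text is cumulative.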

Additional Notes


Note to reviewers: Please ensure the pre-review checklist is completed before starting your review.

Add support for `conversation.item.input_audio_transcription.delta`
events on the OpenAI Realtime API. These are emitted by the new
`gpt-realtime-whisper` transcription model (and any future
delta-emitting STT model) as audio is processed; today the plugin's
RealtimeSession event switch handles only `.completed` and `.failed`,
so deltas are silently dropped and user transcripts surface to the
client only after the entire utterance finalises.
@changeset-bot

changeset-bot Bot commented May 22, 2026

🦋 Changeset detected

Latest commit: bd0bcca

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 33 packages
Name Type
@livekit/agents-plugin-openai Patch
@livekit/agents-plugin-anam Patch
@livekit/agents-plugin-cartesia Patch
@livekit/agents-plugin-cerebras Patch
@livekit/agents-plugin-elevenlabs Patch
@livekit/agents-plugin-fishaudio Patch
@livekit/agents-plugin-google Patch
@livekit/agents-plugin-hume Patch
@livekit/agents-plugin-inworld Patch
@livekit/agents-plugin-neuphonic Patch
@livekit/agents-plugin-perplexity Patch
@livekit/agents-plugin-rime Patch
@livekit/agents-plugin-sarvam Patch
@livekit/agents-plugin-xai Patch
@livekit/agents Patch
@livekit/agents-plugin-assemblyai Patch
@livekit/agents-plugin-baseten Patch
@livekit/agents-plugin-bey Patch
@livekit/agents-plugin-deepgram Patch
@livekit/agents-plugin-hedra Patch
@livekit/agents-plugin-lemonslice Patch
@livekit/agents-plugin-liveavatar Patch
@livekit/agents-plugin-livekit Patch
@livekit/agents-plugin-minimax Patch
@livekit/agents-plugin-mistral Patch
@livekit/agents-plugin-mistralai Patch
@livekit/agents-plugin-phonic Patch
@livekit/agents-plugin-resemble Patch
@livekit/agents-plugin-runway Patch
@livekit/agents-plugin-silero Patch
@livekit/agents-plugin-tavus Patch
@livekit/agents-plugin-trugen Patch
@livekit/agents-plugins-test Patch

Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 3 additional findings.

@jjsquillante
Contributor Author

@toubatbrian

hey brian! 👋 hope all is well - i have a quick PR for streaming input audio transcription deltas in the OpenAI Realtime plugin.

when you get a moment, can you review? let me know if anything looks off or you'd like me to update anything.

thanks!

@toubatbrian
Contributor

toubatbrian commented May 22, 2026

We had a comment in python:

                    elif event["type"] == "conversation.item.input_audio_transcription.delta":
                        # currently incoming transcripts are transcribed only after the user stops speaking
                        # it's not very useful to emit these as the transcribe process takes place within ~100ms
                        # when they handle streaming transcriptions, we'll handle it then.
                        pass

Is this not the case anymore based on your testing?

@jjsquillante
Contributor Author

> when they handle streaming transcriptions, we'll handle it then.

yea, fair question. but to that comment above -- gpt-realtime-whisper was released a few weeks ago so this PR now handles this exact case. The example in the PR description above shows how we're now able to handle partials as the user speaks, before the .completed event fires at the end of the turn.

https://openai.com/index/advancing-voice-intelligence-with-new-models-in-the-api/

lmk if i'm misinterpreting anything, though! thanks for the quick response

@jjsquillante
Contributor Author

@toubatbrian just following up here after a long weekend to get back on your radar 👍

Contributor

@toubatbrian toubatbrian left a comment

LGTM!

@toubatbrian toubatbrian merged commit f21488d into livekit:main May 26, 2026
6 checks passed
@github-actions github-actions Bot mentioned this pull request May 26, 2026
2 participants