Joint Speech-Language Model

Chris Hua | 2024-02-16

Tincans’ mission is to deploy end-to-end conversational voice agents – talking to a bot should feel like talking to a human. We’re applying these voice agents to automate and improve the sales funnel for our customers.

Today, many people have built conversational voice AI systems utilizing a cascaded approach: first, converting speech to text via an ASR system, generating a reply to the transcribed text with an LLM, and finally generating text to speech via some other system. While the results can be impressive - Tincans’s v0 internal cascaded system can consistently achieve 300ms time to respond without filler words or caching - there are several fundamental gaps that limit the ability of these systems to scale.

Broadly, we believe that there are two core execution problems to solve for effective audio agents: 1) they must be enjoyable to talk to, 2) they must follow instructions well. These objectives tend to trade off against each other: instruction following in LLM’s mostly requires additional compute (e.g. ‘think before you speak’ or ‘please please return the reply in this JSON format’ or ‘here’s some few shot examples’ or ‘generate 16 replies and pick the most common one’), while smooth conversation needs quicker responses.

Constraints breed creativity. Thankfully, our problem domain also has several important constraints: our customers want real-time chat that is tuned to their company’s specific workflows. This means that we do not need to try and solve for every functionality (like the big foundation model providers need to) and that we should prefer smaller, faster models (for speed). Additionally, speech and language are closely related domains, so we believe that a relatively limited amount of resources can be sufficient to bridge the gap.

We believe that the path forward is with combined speech-language models, that is, a model that takes in audio and outputs a reply in text. These models should have much faster latency, be more robust to speech disfluencies and errors, and be guidable towards our desired output formats.

Existing research has validated that speech-language models are possible; this work places a minimum bound on the cost associated with training one. Most work involves foundation size models and 100k hour datasets - eg Google’s SLM model has 15B+ params pretrained on 12 million hours of audio - but we wanted to see how far we could go with existing open source models and the limited data that we could clean in time. You can liken this to genome sequencing: in the early 2000’s, the Human Genome Project raised almost $3 billion, but now sequencing costs consumers around $1000. Of course this problem is solvable with massive GPU clusters, what can we do with just one spot instance?

To understand and derisk this approach, we’ve trained (or spiked)) a minimum viable model – it’s able to correctly handle audio input just like text and it’s very fast (time-to-first-token <100ms on consumer hardware). It’s not an intelligent model yet – but we believe that this will come with better instruction tuning. Of more interest to us are understanding the costs for solving the problem and the theoretical performance (a stronger model will infer in the same amount of time as a dumb one).

Our results lead us to believe that it’s possible to train and deploy real-time conversational voice AI with this general approach in the near future, for far less cost than a frontier lab. Because the computational cost of this approach is only marginally higher than running a Llama 7B model, which has been shown to run quickly on a variety of consumer edge devices, it also presents a reasonable path forward for running similar models and experiences on device.


We take a frozen wav2vec2 model and a frozen Llama 2 7B Chat model, and only train a projection layer between the audio representation and the text embeddings. Basically, we learn a representation for the sequence of audio, then map each audio output into the text embedding space. The audio had SpecAugment applied to it via the Wav2Vec2 module. We downsample 8x, so each audio ‘pseudo-token’ represents 80 milliseconds of speech. The projection layer is a simple MLP following Llava 1.5 (code in appendix).

We pretrain the model with a mix of ‘transcription’ and ‘conversational reply’ tasks. The dataset used for this run has about 1000 hours of speech, and each example is between 5 and 30 seconds long. We feed an instruction in chat format to the model, alternating between a few different task instructions for each type of task, and insert audio ‘pseudo-tokens’ corresponding to the audio at different points in the query.

A training example:

            {"role": "user", "content": "Listen to <|audio|> and respond to it"}
            {"role": "assistant", "content": example['response']"}

As part of the forward, we replace the audio token with multiple embedding vectors corresponding to the audio.

Training was completed on a single A100-40GB on Google Cloud. We targeted an effective batch size of 64 (true batch size 4, gradient accumulation steps 16) and one epoch. Learning rate followed a Noam schedule with 3% warmup and cosine decay, with a peak rate of 2e-3. We trained modules in bfloat16 whenever possible. The final pretrain run completed in around 6 hours with this setup.

We additionally ran a LoRA finetuning process for 100 steps with effective batch size 64 over the same dataset, with similar learning rate schedule but peak 2e-4. We report these results separately. The trainable parameters in this step are 163M, or 2.3% of the total 7B.


To level-set expectations again, this model was trained with less compute than the celebratory sushi dinner (both in terms of dollars and probably energy consumed to ship the fish around). We’re not trying to set SOTA benchmarks or grind for a paper submission, we know that with a ton of compute those things are possible, we are trying to validate if this is possible without frontier lab budgets.

We present results for the base and lightly finetuned models below. We used a mix of out of domain data, synthesized by Bark or recorded by the author, and sentences from Librispeech, The New York Times and The Ringer. These results are obviously cherry-picked but hopefully show the emergence of capabilities.

Instruction Input Base Output LoRA Output
Mr. Biden’s lawyers, who were present for Mr. Hur’s five-hour interview with Mr. Biden, have called those and similar comments in the report both inaccurate and gratuitous, noting that Mr. Hur found he could not prove a case against Mr. Biden anyway for other reasons. Mr. Biden's lawyers, who were present for Mr. Hunter's five-hour interview with Mr. Biden, have called those similar comments in the report both inaccurate and gratuitous, noting that Mr. Biden could not prove a case against Mr. Trump, Mr. Biden's lawyers said. Miss Birkenhead's lawyers, who were present for Miss Hers' five-hour interview with Miss Birden, have called those and similar comments in the report both inaccurate and gratuitous, noting that Miss Hersound he could not prove a case against Miss Birden anyway for another reason.
Repeat exactly what is written here:
Previously undisclosed letters show a testy exchange over whether the Justice Department violated its own guidelines in releasing a report with comments disparaging President Biden. Proudly on this close letter, show a testy exchange over whether the justices departed, violated its own guidelines. in releasing a report with comments disregarding precedent, violating its own guidelines. Previously on this close letter, show a testy exchange over whether the justices departed from their usual demeanor, violated its own guidelines in releasing a report with comments disregarding President Biden.
– summarize this
Previously undisclosed letters show a testy exchange over whether the Justice Department violated its own guidelines in releasing a report with comments disparaging President Biden. The article discusses the role of the United States in the global fight against COVID-19, particularly in the context of the ongoing pandemic in India. The author argues that the US has been slow to respond to the crisis, particularly in comparison to India, which has been praised for its quick and decisive action in containing the virus. The Closest Shave: A Testy Exchange Over Whether the Justices Departed, Violated It's Own Guidelines in Releasing a Report with Comments Disagreeing with Professor Frieden.
Respond to
What’s the meaning of life? I think that's a really interesting question, because I think that's something that a lot of people have been thinking about for a long time. And I think that the idea of the meaning of life is something that's been explored in philosophy and literature and religion and psychology and all these different fields. And I think that there are Well, it's a very interesting question. You know, I think that we, as humans, are always trying to find ways to make sense of the world around us. And the way that we do that is through stories. And so, I think that the story of the life of Jesus, as we find it in the New Testament, is one of the most powerful stories that we have.
Continue the conversation after
CHAPTER SIXTEEN I MIGHT HAVE TOLD YOU OF THE BEGINNING OF THIS LIAISON IN A FEW LINES BUT I WANTED YOU TO SEE EVERY STEP BY WHICH WE CAME I TO AGREE TO WHATEVER MARGUERITE WISHED Sure, I'd be happy to continue the conversation. So, you mentioned that Chapter Sixteen of the book talks about the beginning of the Liaison, which is a new line of fashion. But before that, you mentioned that you wanted to see every step by which we came to whatever magazine wished. Can you tell us more about that? So, I'm going to try to summarize what you've said. \n\nYou've told me that you've been in a relationship with this person for a while, but you've been feeling like you're in a rut. You're feeling like you're not growing, you're not evolving, you're not learning new things. You're feeling like you're stuck in a pattern, and it's not fulfilling for you.
And … will the quarterback market get reset — and how many times? And will the quarterback market get reset? And how many times will he have to reset it? And will the quarterback market get reset? And how many times?
– respond
And … will the quarterback market get reset — and how many times? Oh, my goodness. I think it's a great question. I think it's a great question. And I think it's something that we're all grappling with in different ways. And I think it's one of the things that I think we're all trying to figure out right now. I think that, you know, the idea of what it means to be a woman, what it means to be a feminist, what it means to be a leader, what it means to be a mother, what it means to be a wife, what it means to be a friend I think it's important to note that, you know, the term quarterback marker is not a term that I would use. It's not a term that I think is useful. I think it's a term that's been created to try to make quarterbacks seem like they're not quarterbacks. They're not quarterbacks. They're not the ones that make the plays.

In general, we think these results show some level of textual and acoustic understanding despite the low level of training. The model performs quite well on transcription and is passable at responding/summarizing/generally responding to out of domain responses. The replies are also a bit silly, perhaps, but reflect the source data of spontaneous conversation and natural disfluencies in how people talk, instead of how they write.

Our theory is that transcription is too ‘easy’ and that we instead need to focus on a broader mix of textual understanding tasks in the audio domain. Future work will focus more heavily on responding (see also Wang 2023 and Fathullah 2023) and generating a conversational reply. Additionally, generating better captions is promising for the tasks and for the assistant persona, or using multiple turns of conversation. Finally, we used a wav2vec2 model that was finetuned for ASR already, which may have guided our results too far towards ASR.

The pretraining loss curve shows some interesting results. We see understanding emerge quickly, after around 1k steps with our setup, which corresponds to around 350 hours of audio. Earlier checkpoints show no real learning. 350 hours is all you need??

We will release the model on HuggingFace if there is demand - the code was intended to only be run once and will take a bit of work to cleanup.

We're building with a select number of alpha partners. If you're interested in learning more, we'd love to hear from you.