Whose Line Is It Anyway? Creating AI That Accurately Separates Voices on Sales Calls

March 16, 2018

Shawn Parrotte

Chorus.ai is a platform that automatically records, transcribes, and summarizes sales and customer success conversations. Understanding who was on a call and who said what is an easy task for humans (most of the time!), but it's a much harder problem for algorithms than you might think. This post describes some of the challenges pertaining to data accuracy and the benefits of building a solution on our own tech stack rather than using a third-party solution.

Analysis performed by Orgad Keller, Amit Ashkenazi, Raphael Cohen (PhD), and Micha Breakstone (PhD)

Why would you want to know who said what during a meeting?

  • Talk-time ratio across reps and prospects is a good indicator of conversation quality (see graph below)
  • It’s helpful to listen to only what prospects said when reviewing a recording
  • Isolating a specific speaker makes it easy to see who was engaged in the conversation (e.g., the CFO didn’t say much, but I want to hear what she said)
  • Algorithms that identify important moments (like pain points) perform much better if you can isolate them to only what a prospect is saying
[Graph: talk-time ratio across reps and prospects as an indicator of conversation quality]
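Once each utterance has been attributed to a speaker, talk-time ratios reduce to simple arithmetic over labeled segments. A minimal sketch (the segment format and speaker labels here are illustrative, not Chorus's actual data model):

```python
# Each diarized segment: (speaker_label, start_sec, end_sec).
segments = [
    ("rep",      0.0,  42.0),
    ("prospect", 42.0, 70.5),
    ("rep",      70.5, 95.0),
    ("prospect", 95.0, 180.0),
]

def talk_time(segments):
    """Sum speaking time per speaker label."""
    totals = {}
    for speaker, start, end in segments:
        totals[speaker] = totals.get(speaker, 0.0) + (end - start)
    return totals

totals = talk_time(segments)
total_speech = sum(totals.values())
ratios = {s: t / total_speech for s, t in totals.items()}
print(ratios)  # each speaker's share of the conversation
```

With accurate diarization this is trivial; the hard part, as described below, is getting the speaker labels right in the first place.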

Identifying multiple speakers on a single channel audio recording, especially in conference calls with more than two speakers, is an extremely difficult challenge and is far from a solved problem, even in academic settings. Usually referred to as "speaker separation" or "speaker diarization", it’s generally considered even more difficult than speech recognition.

To attack this problem, our research team developed a patent-pending framework that uses Deep Learning to automatically generate a “voice fingerprint” for each sales rep from a combination of vocal characteristics. During the sales call itself, we cluster the audio signals based on those characteristics, with each cluster representing a speaker. The stored voice fingerprints play a crucial role not only in associating each speaker with the right cluster, but in the clustering process itself: the models we trained with the fingerprints allow us to learn and apply mathematical transformations to the audio that make the differences between speakers more distinct. See the before and after graphs below.
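The post doesn't disclose the model internals, but the general shape of fingerprint-based speaker assignment can be sketched: embed each few-second audio window into a fixed-length vector, then assign it to the stored voice fingerprint with the highest cosine similarity. The embeddings and fingerprints below are random stand-ins for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stored "voice fingerprints": one embedding per known rep (stand-in vectors;
# in practice these would come from a trained embedding model).
fingerprints = {
    "alice": rng.normal(size=128),
    "bob":   rng.normal(size=128),
}

def assign_speaker(segment_embedding, fingerprints):
    """Label a segment with the name of the most similar fingerprint."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(fingerprints,
               key=lambda name: cosine(segment_embedding, fingerprints[name]))

# A segment whose embedding is a slightly noisy copy of Alice's
# fingerprint should be assigned to Alice.
segment = fingerprints["alice"] + 0.1 * rng.normal(size=128)
print(assign_speaker(segment, fingerprints))  # alice
```

Unknown speakers (prospects without stored fingerprints) would instead be grouped by clustering their segment embeddings, which is where the learned transformations described above come in.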


Speaker Separation Before Applying Learned Transformations for a Sales Call with 5 Speakers

Each point represents features and statistics extracted from a few seconds of speech, and the color of the point represents the identity of the speaker. There are 5 different colors, representing the 5 different speakers who participated in the call. We applied the t-SNE algorithm to project the high-dimensional feature space into two dimensions.
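A projection like this can be reproduced with any t-SNE implementation. Here is a sketch using scikit-learn on synthetic per-segment feature vectors (five artificial "speakers" standing in for real acoustic features):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)

# Synthetic stand-in: 5 speakers, 20 few-second segments each,
# 64-dimensional feature vectors clustered around per-speaker means.
n_speakers, segs_per_speaker, dim = 5, 20, 64
means = rng.normal(scale=5.0, size=(n_speakers, dim))
features = np.vstack([
    means[s] + rng.normal(size=(segs_per_speaker, dim))
    for s in range(n_speakers)
])
labels = np.repeat(np.arange(n_speakers), segs_per_speaker)

# Project the 64-D features to 2-D for plotting; color points by `labels`.
projection = TSNE(n_components=2, perplexity=10,
                  random_state=0).fit_transform(features)
print(projection.shape)  # (100, 2)
```

Note that t-SNE is only a visualization tool here; the clustering itself operates on the original high-dimensional (transformed) features.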


Speaker Separation After Applying Learned Transformations for a Sales Call with 5 Speakers

We plotted the exact same data as above after applying our mathematical transformations. Notice how the differences between speakers are much clearer now. Our transformations made it easier for the algorithms to cluster the utterances into different speakers, and then to associate each cluster with the respective speaker. Machine learning algorithms are not perfect, and you can see some outliers, like the orange points within the purple cluster. In most cases these utterances were misclassified because they were extremely short, and hence they have a negligible effect on calculating talk time.

Using this approach, Chorus is able to better separate the voices of speakers, even in challenging circumstances, such as when two speakers have dialed in from the same conference room (e.g., your champion and the decision maker), or during in-person meetings in a noisy acoustic environment. We've submitted a patent for this approach, and as of now, no other provider on the market has developed a comparable solution.

There are other important statistics that come into play for understanding talk-time, such as identifying when a conversation actually begins.

Let’s look at a couple of examples illustrating this using screenshots from Chorus.ai.

In the first example both the rep and prospect joined the call late. Chorus marks the pre-call segment in gray. This allows us to identify the true length of the meeting.

[Screenshot: Chorus call timeline with the pre-call segment marked in gray]

In the second example the rep joined the call first. The prospect joined a few minutes later and asked to wait until her peer joined. Understanding this is critical for transcription accuracy. Other speech recognition frameworks would attempt to transcribe the silence and background noise and miscalculate talk time, producing inaccurate data.
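The post doesn't say how Chorus detects the true start of a conversation, but a simple heuristic illustrates the idea: treat the conversation as started only once both parties have spoken within a short window of each other. The segment format and the two-party rule below are assumptions for illustration, not Chorus's actual method:

```python
# Each speech segment: (speaker, start_sec, end_sec), sorted by start time.
segments = [
    ("rep",      30.0,  35.0),   # rep joins late, says "hello?"
    ("rep",     120.0, 125.0),   # rep again, still waiting
    ("prospect", 300.0, 305.0),  # prospect joins, asks to wait for her peer
    ("rep",     302.0, 310.0),   # rep replies
    ("prospect", 420.0, 900.0),  # conversation properly underway
]

def conversation_start(segments, window=60.0):
    """Return the time two different speakers first speak within
    `window` seconds of each other, or None if that never happens."""
    last_end = {}
    for speaker, start, end in segments:
        for other, other_end in last_end.items():
            if other != speaker and start - other_end <= window:
                return start
        last_end[speaker] = end
    return None

print(conversation_start(segments))  # 302.0 -- the first real exchange
```

Everything before the returned timestamp would be marked as pre-call (the gray segment in the screenshots) and excluded from transcription and talk-time statistics.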

[Screenshot: Chorus call timeline showing the waiting period before the prospect's peer joined]

As simple as this may seem for a human, it's surprisingly complex for algorithms to do at scale, yet it is critical for creating the high-quality data our customers and users rely on.

At Chorus.ai, our goal is to automatically summarize and surface insights across all of your business conversations and get those insights to the people who need them. Our investment in R&D and in building a technology stack optimized for sales and customer success conversations allows us to capture 100% of meetings and provide the highest quality data possible — from transcript accuracy, to engagement in the conversation, to who is saying what. In future posts we'll share more on how these insights impact sales performance and serve as indicators of the likelihood that a deal will close or renew.

Are You Ready to Experience Chorus?

Start driving tangible performance improvements in your Revenue Org today.