Speech-to-text apps: Microsoft vs Google - which is the best for dictation?

Speech-to-text software has come a long way in recent years. Many of the gains in speed and accuracy come from improvements in artificial intelligence, which underpins these apps.

So it should come as little surprise that two of the biggest names in AI, Microsoft and Google, are also major players in developing voice-to-text apps. Microsoft Azure Speech Service and Google Cloud Speech-to-Text are leading platforms for voice typing, transcription, and productivity.

But when push comes to shove and you have to choose one of these platforms over the other, which is better? In this guide, we’ll compare the Microsoft and Google speech-to-text apps to help you decide.

Features

Microsoft Azure Speech Service and Google Cloud Speech-to-Text overlap if you need basic audio transcription. But for more advanced voice dictation applications, the two platforms have different strengths. 

Google’s software stands out for its multi-language support. Speech-to-Text can transcribe audio in any of 120 languages. By comparison, Microsoft’s speech-to-text software supports only 29 languages at this time. Google’s platform will even automatically detect the language of the recording and will recognize proper nouns, so you don’t have to worry about fixing formatting and capitalization later on.

Google Cloud Speech-to-Text supports punctuation and recognizes multiple speakers in recordings. (Image credit: Google)

Microsoft Azure Speech Service is more feature-rich when it comes to getting your transcription exactly right. You can feed the software a custom speech model to improve accuracy for a single speaker or for speakers with a regional accent. Speech Service also supports acoustic models that you can use to cancel out noise in your recordings. This is especially helpful if you frequently experience audio noise in a conference room or over a headset.

Speech Service’s API also enables you to code real-time feedback. So, if the software is having trouble recognizing words, it could prompt the speaker to talk more slowly or clearly to achieve better results.
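
To make the idea concrete, here is a minimal Python sketch of that kind of feedback loop. It does not call Microsoft's SDK. It assumes only that your recognizer reports a confidence score between 0 and 1 for each recognized segment, and the ClarityMonitor class, its window size, and its threshold are hypothetical names and values chosen for illustration.

```python
from collections import deque


class ClarityMonitor:
    """Tracks recent recognition confidence and suggests when to prompt the speaker.

    Hypothetical helper: it expects one confidence value (0 to 1) per
    recognized segment, however your speech service exposes that number.
    """

    def __init__(self, window=3, threshold=0.6):
        self.scores = deque(maxlen=window)  # keeps only the most recent scores
        self.threshold = threshold

    def update(self, confidence):
        """Record one segment's confidence; return a prompt string or None."""
        self.scores.append(confidence)
        window_full = len(self.scores) == self.scores.maxlen
        average = sum(self.scores) / len(self.scores)
        if window_full and average < self.threshold:
            # Reset so the same prompt is not repeated on every new segment.
            self.scores.clear()
            return "Please speak more slowly or clearly."
        return None


if __name__ == "__main__":
    monitor = ClarityMonitor()
    # Made-up confidence values standing in for live recognizer output.
    for confidence in [0.9, 0.4, 0.3, 0.8]:
        prompt = monitor.update(confidence)
        if prompt:
            print(prompt)
```

Averaging over a short window, rather than reacting to a single low score, keeps one mumbled word from interrupting the speaker.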

Both Microsoft’s and Google’s platforms automatically detect when there are multiple speakers in a recording. So, you can easily use either of these speech-to-text apps for transcribing meetings and conference calls.

Performance

For straightforward audio transcription, Microsoft Azure Speech Service tends to perform better than Google Cloud Speech-to-Text. The difference is that Microsoft’s software uses AI to make sure that what it’s transcribing makes linguistic sense. Since this software can accept custom speech models, it also handles accents, lisps, and other speech impediments significantly better than Google’s Speech-to-Text platform.

Google largely sticks to recognizing words based on their audio signatures and stringing them together. This means that when the software is struggling with audio quality or interpreting an accent, the transcription quality can suffer quite a bit.

All that said, getting better results from Microsoft’s software depends on using high-quality speech and acoustic models. If you skip this step, you may find that the two platforms are much more comparable in their accuracy when transcribing difficult recordings. Feeding Speech Service poor models can also hurt your transcription and leave you with a less accurate result.

You can try Microsoft Azure Speech Service for free before committing to the app. (Image credit: Microsoft)

We found that the two apps are also very comparable when it comes to recognizing multiple speakers. This feature isn’t always perfectly accurate if you have two people with a similar tone and a less than crisp recording. But most of the time, both Speech Service and Speech-to-Text were able to differentiate speakers on a conference call within the transcribed text.

Support

Google Cloud Speech-to-Text doesn’t come with much support by default. You’ll find some basic troubleshooting tips online, but otherwise Google directs you to ask the community for help on Stack Overflow or Slack. You can purchase a support plan from Google if you need to talk to a tech. Options start at $100 per user per month.

Microsoft offers more online documentation for its Speech Service software, including how-to videos and example code for the platform API. But you’ll also need to pay extra if you want support from Microsoft techs. Email-only support plans start at $29 per user per month, while phone support plans start at $100 per user per month.

Support plans for Microsoft Azure Speech Service. (Image credit: Microsoft)

Pricing and plans

On its face, Microsoft Azure Speech Service is significantly cheaper than Google Cloud Speech-to-Text. Microsoft offers five hours of free transcription per month and then charges $1 per hour of audio after that. Google provides just one hour of free transcription, after which the service costs $1.44 per hour of audio.

Pricing for Google Cloud Speech-to-Text. (Image credit: Google)

That said, pricing with either of these services can be complex. Google offers a discount of roughly 30% if you allow the company to log your audio data on its servers. In that case, Speech-to-Text is slightly cheaper than Microsoft’s Speech Service. At the same time, Google charges $2.16 per hour if you want to use the ‘Enhanced’ speech model. Microsoft raises its price to $1.40 per hour of audio if you supply custom speech or acoustic models.
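
To see how these numbers play out for your own workload, a short Python sketch like the one below can help. It uses only the rates and free allowances quoted in this section. The monthly_cost function and the PLANS table are our own illustrative names, and the sketch makes three assumptions: Google's free hour renews monthly, the free allowance still applies on the custom and Enhanced tiers, and the logging discount is a flat 30% off the standard rate. Taken literally, that 30% works out to about $1.01 per hour, so the logged Google rate and Microsoft's $1 rate are very close. Check the current price lists before relying on the exact figures.

```python
def monthly_cost(hours, free_hours, rate_per_hour, discount=0.0):
    """Return the cost in USD of transcribing `hours` of audio in one month."""
    billable_hours = max(0.0, hours - free_hours)  # free allowance comes off first
    return round(billable_hours * rate_per_hour * (1.0 - discount), 2)


# Rates as quoted in this article. Applying the free tier to every plan
# and the flat 30% logging discount are assumptions, not confirmed pricing.
PLANS = {
    "Microsoft standard": dict(free_hours=5, rate_per_hour=1.00),
    "Microsoft custom models": dict(free_hours=5, rate_per_hour=1.40),
    "Google standard": dict(free_hours=1, rate_per_hour=1.44),
    "Google with data logging": dict(free_hours=1, rate_per_hour=1.44, discount=0.30),
    "Google Enhanced": dict(free_hours=1, rate_per_hour=2.16),
}

if __name__ == "__main__":
    hours = 100  # example monthly volume
    for name, plan in PLANS.items():
        print(f"{name}: ${monthly_cost(hours, **plan):.2f}")
```

At 100 hours of audio in a month, this gives $95.00 for Microsoft's standard rate and $142.56 for Google's standard rate.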

Verdict

For most cases in which you need to transcribe speech to text, we recommend Microsoft Azure Speech Service. It’s significantly cheaper than Google Cloud Speech-to-Text if you have many hours of audio. We also found that it can be much more accurate if you take the time to supply custom speech and acoustic models with your recordings.

That said, Microsoft’s language support is very limited compared to Google’s. So, if you want one app that can handle recordings in nearly any language, Google Cloud Speech-to-Text may be the better option.

Michael Graw

Michael Graw is a freelance journalist and photographer based in Bellingham, Washington. His interests span a wide range from business technology to finance to creative media, with a focus on new technology and emerging trends. Michael's work has been published in TechRadar, Tom's Guide, Business Insider, Fast Company, Salon, and Harvard Business Review.