What is Web Speech Recognition?
What is a "voice-driven" web app? It's a web app that can activate the microphone and turn the user's speech into text that it can process. That text can either be displayed to the user or interpreted as commands. That means controlling the web with nothing but your voice!
How can a web developer build such a web app? There are two ways:
- Use the native support for speech recognition that's built in to some web browsers
- Connect their web app to a cloud platform that provides speech recognition services. For example: Speechly, AWS Transcribe, Azure Cognitive Services
Native speech recognition
Browsers expose native speech recognition through the Web Speech API. Chrome has the most mature support, probably because:
- Google contributed heavily to the specification for these APIs
- Google own a speech recognition platform (Google Cloud Speech-to-Text API) and are able to bake a client for this API into Chrome
Other browsers are slowly catching up, notably Safari, which now offers a Siri-based equivalent.
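As a rough illustration of what using the native API involves, the sketch below looks up the browser's recognition constructor and listens for finalised phrases. The `SpeechRecognition` and prefixed `webkitSpeechRecognition` globals, and the `continuous`, `interimResults`, `lang` and `onresult` members, come from the Web Speech API. The helper names `findRecognizer` and `listen`, the locally declared types and the `en-US` language are assumptions made for this example.

```typescript
// Minimal structural types for the parts of the Web Speech API used here.
// They are declared locally (an assumption of this sketch) because not every
// TypeScript setup ships typings for the API.
interface RecognitionAlternative { transcript: string }
interface RecognitionResult { isFinal: boolean; 0: RecognitionAlternative }
interface RecognitionEvent {
  resultIndex: number;
  results: ArrayLike<RecognitionResult>;
}
interface Recognizer {
  continuous: boolean;
  interimResults: boolean;
  lang: string;
  onresult: ((event: RecognitionEvent) => void) | null;
  start(): void;
  stop(): void;
}
type RecognizerConstructor = new () => Recognizer;
interface SpeechScope {
  SpeechRecognition?: RecognizerConstructor;
  webkitSpeechRecognition?: RecognizerConstructor;
}

// Returns the native constructor, or null when the browser has none.
function findRecognizer(scope: SpeechScope): RecognizerConstructor | null {
  return scope.SpeechRecognition ?? scope.webkitSpeechRecognition ?? null;
}

// Starts listening and reports each finalised phrase to the callback.
function listen(
  Ctor: RecognizerConstructor,
  onPhrase: (text: string) => void,
): Recognizer {
  const recognizer = new Ctor();
  recognizer.continuous = true;      // keep listening after the first phrase
  recognizer.interimResults = false; // only report finalised results
  recognizer.lang = "en-US";         // hypothetical choice of language
  recognizer.onresult = (event) => {
    // Only the results from resultIndex onwards are new in this event.
    for (let i = event.resultIndex; i < event.results.length; i++) {
      const result = event.results[i];
      if (result && result.isFinal) onPhrase(result[0].transcript.trim());
    }
  };
  recognizer.start();
  return recognizer;
}

// In a browser this could be wired up as:
//   const Ctor = findRecognizer(globalThis as unknown as SpeechScope);
//   if (Ctor) listen(Ctor, (text) => console.log(text));
```

When `findRecognizer` returns `null`, the app would need to hide its voice features or offer a fallback.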
Cloud speech recognition
Although the Web Speech API offers a convenient way to start building voice features in your web app, it does have some limitations:
- Browser support is limited. On other browsers, voice features won't work natively and will need to be replaced by fallback features. This can result in a fragmented user experience across browsers. One example is Duolingo, which only offers its voice exercises on Chrome
- Across the browsers that do support it, implementations are different:
  - Words may be transcribed correctly by some browsers but not by others
  - Words may be transcribed incorrectly in different ways
  - Words may be formatted differently
- Browsers upgrade their implementations on different cadences, so something that works in one browser today might stop working after its next upgrade
So what is the alternative? How can we make a voice-driven web app that works consistently across all browsers? The answer is to pick your favourite cloud vendor and use whatever speech recognition service they offer. That means you will need a web client that can do the following:
- Get audio from the microphone
- Stream that audio to the service
- Process the transcription stream that comes back
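The three steps above can be sketched as follows. This is a minimal outline under stated assumptions, not any vendor's real client. The `AudioSource` and `ServiceConnection` interfaces, the `connectClient` and `TranscriptBuffer` names, and the JSON message shape `{ text, isFinal }` are all hypothetical. Real services define their own protocols, audio formats and authentication.

```typescript
// Hypothetical message shape; every real service defines its own.
interface TranscriptMessage { text: string; isFinal: boolean }

// Step 1: something that emits microphone audio in chunks. In a browser this
// could wrap navigator.mediaDevices.getUserMedia and a MediaRecorder.
interface AudioSource {
  onChunk: ((chunk: Uint8Array) => void) | null;
  start(): void;
}

// Step 2: a two-way connection to the service, e.g. a thin WebSocket wrapper.
interface ServiceConnection {
  send(chunk: Uint8Array): void;
  onMessage: ((raw: string) => void) | null;
}

// Step 3: fold the stream of messages into a single transcript string.
class TranscriptBuffer {
  private finalText = "";
  private interimText = "";

  apply(message: TranscriptMessage): string {
    if (message.isFinal) {
      // Final text is appended and never revised again.
      this.finalText = `${this.finalText} ${message.text}`.trim();
      this.interimText = "";
    } else {
      // Interim text replaces the previous guess for the current phrase.
      this.interimText = message.text;
    }
    return `${this.finalText} ${this.interimText}`.trim();
  }
}

// Wires the three steps together and reports the transcript as it evolves.
function connectClient(
  source: AudioSource,
  connection: ServiceConnection,
  onTranscript: (text: string) => void,
): void {
  const buffer = new TranscriptBuffer();
  source.onChunk = (chunk) => connection.send(chunk);
  connection.onMessage = (raw) => {
    const message = JSON.parse(raw) as TranscriptMessage;
    onTranscript(buffer.apply(message));
  };
  source.start();
}
```

Keeping the microphone and the network behind small interfaces means the transcript handling can be exercised without a browser or a live service.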
Building this yourself can be tricky, so you may want to use a web client written by the cloud vendor (if they have one!) or use a polyfill for the Web Speech API. What's a polyfill though?
A polyfill is a piece of code that implements missing browser functionality. The Web Speech API doesn't exist on many browsers, so there are polyfills that fill in that gap, using a specific cloud vendor to provide the speech recognition under the hood. The polyfill API itself will more or less meet the specification. This means that a polyfill can easily be swapped for the native implementation on browsers that support the API, or for a polyfill backed by a different vendor if you choose to switch.
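Because a polyfill and the native implementation share the same shape, the code that uses them does not need to know which one it was given. The sketch below illustrates that idea. The `chooseRecognizer` and `startListening` helpers, the `preferNative` flag and the local types are assumptions made for this example, and the polyfill constructor stands in for whatever a vendor's polyfill package exports.

```typescript
// Anything that follows the Web Speech API shape can be used interchangeably.
interface Recognizer {
  lang: string;
  start(): void;
  stop(): void;
}
type RecognizerConstructor = new () => Recognizer;

interface SpeechScope {
  SpeechRecognition?: RecognizerConstructor;
  webkitSpeechRecognition?: RecognizerConstructor;
}

interface Choice {
  Recognizer: RecognizerConstructor;
  source: "native" | "polyfill";
}

// Prefers the browser's own implementation and falls back to the polyfill.
// Pass preferNative = false to use the polyfill in every browser.
function chooseRecognizer(
  scope: SpeechScope,
  polyfill: RecognizerConstructor,
  preferNative: boolean = true,
): Choice {
  const native = scope.SpeechRecognition ?? scope.webkitSpeechRecognition;
  if (preferNative && native) {
    return { Recognizer: native, source: "native" };
  }
  return { Recognizer: polyfill, source: "polyfill" };
}

// The calling code is identical whichever implementation was chosen.
function startListening(choice: Choice, lang: string): Recognizer {
  const recognizer = new choice.Recognizer();
  recognizer.lang = lang;
  recognizer.start();
  return recognizer;
}
```

Using the polyfill everywhere trades the convenience of native support for consistent behaviour across browsers.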
Building React apps with speech recognition
How can you integrate speech recognition in your web app? If you're a React developer, this is easy and doesn't require you to build any speech recognition client or know how to use the Web Speech API. The most popular solution is a package called react-speech-recognition. This is a React hook that passes transcripts from the microphone into your React app, and allows you to specify voice commands. It is compatible with any implementation of the Web Speech API, meaning you just need to plug in the polyfill for your chosen cloud vendor and it will work consistently across all browsers. It also works with native implementations (e.g. Chrome) out of the box and tells your React app when the browser doesn't support the API natively. This is handy for building prototypes without needing to set up a cloud vendor account.
You can get started with a basic example here.
Web speech recognition still has a young ecosystem, with some notable gaps:
- Lack of browser support
- Limited selection of polyfills
- Limited support for other web frameworks such as Vue
The goal of this website is to educate web developers about web speech recognition and encourage them to fill these gaps. The easier it is to build voice-driven web apps, the more of them will be built and the higher their quality will be. This will improve the technology and best practices in this area, ultimately leading to better user experiences!