Engineering a Secure Audio CAPTCHA Implementation

August 2013

Image Captcha challenges are an essential defense against automated spam submissions: they confirm a human presence by testing for capabilities of human vision still unavailable to OCR software. Accessibility demands an audio Captcha alternative, so that visually impaired visitors can complete the form when Captcha images would present an insurmountable barrier. But Captcha sounds should provide such an alternative without also letting bots in – the code pronunciation should be understandable to human hearing, yet remain beyond the capabilities of simple voice recognition software...

Achieving this required level of audio Captcha security involves a carefully designed mixture of multiple disciplines, such as:

  • modern techniques for sound waveform analysis and manipulation,
  • sound design adapted to unique qualities of the human sense of hearing, and of course
  • general principles of solid security engineering.

While a detailed examination of this topic could provide enough material for several PhD theses, we'll settle for a quick rundown of the major aspects and secure audio Captcha design decisions.

The Anatomy of an Individual CAPTCHA Sound

Each individual audio Captcha sound file includes code pronunciation, mixed with various types of audible noise, and processed by diverse audio effects.

Generating Clean CAPTCHA Code Pronunciation Audio

The starting point for each individual audio Captcha sound file is the Captcha code pronunciation in the selected language. At this elementary stage, the main security concern of audio Captcha generation is to avoid making the pronunciation predictable:

  • The Captcha code should be completely random, and not a dictionary word. If we use dictionary words as Captcha codes, voice recognition that successfully matches all but one Captcha code character can be paired with a dictionary lookup to make recognizing that last character (and consequently, the whole Captcha code) much easier. Since using dictionary words would make the audio Captcha more predictable, it should be avoided.
  • The periods of silence before, after and in between pronunciations of individual characters should also be randomized. Evenly spaced (hence, predictable) pronunciation would make it trivial to segment the sound file into parts containing individual character pronunciations. Ideally, we also want to make segmentation harder by sometimes using no silence between characters at all. Furthermore, knowing the length of the sound file shouldn't automatically reveal the Captcha code length as well.
  • Using multiple source pronunciations of each Captcha code character, with different speakers and voices (and pronunciation speeds and accents and...), and randomly switching between them can also reduce the overall predictability of Captcha audio.
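
The three points above can be sketched in a few lines of Python. This is only a minimal illustration, not BotDetect's actual implementation: the alphabet, the speaker names, and all the ranges are assumptions chosen for the example.

```python
import secrets

# Hypothetical character set and recording sets, chosen for this sketch only.
ALPHABET = "ABCDEFGHJKLMNPRSTUVWXYZ23456789"
SPEAKERS = ["voice_a", "voice_b", "voice_c"]

_rng = secrets.SystemRandom()  # OS-backed randomness instead of a seeded PRNG


def random_code(min_len=4, max_len=6):
    """Return a random code of random length, never a dictionary word."""
    length = _rng.randint(min_len, max_len)
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def pronunciation_plan(code, max_gap_ms=600, no_gap_chance=0.25):
    """Return (lead_in_ms, [(char, speaker, gap_after_ms), ...]).

    Gaps are random, and are sometimes zero so that adjacent characters
    run together and are harder to segment.
    """
    lead_in_ms = _rng.randint(0, max_gap_ms)
    steps = []
    for char in code:
        if _rng.random() < no_gap_chance:
            gap_ms = 0
        else:
            gap_ms = _rng.randint(50, max_gap_ms)
        steps.append((char, secrets.choice(SPEAKERS), gap_ms))
    return lead_in_ms, steps


if __name__ == "__main__":
    code = random_code()
    print(code, pronunciation_plan(code))
```

Because the code length, the lead-in silence, and every gap are all drawn at random, the total duration of the sound file says little about how many characters it contains.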

Combining Clean CAPTCHA Pronunciation with Audio Noise

Making the clean audio Captcha pronunciation less predictable is a good first step – but as long as the resulting sound file contains only Captcha code pronunciation and silence, voice recognition software will still have a relatively easy time analyzing it.

To make recognizing the Captcha code harder, the clean pronunciation should be mixed with other sounds, i.e. different kinds of audio noise. Examples range from simple white noise, audio tones of varying length and frequency, various environmental noises such as recordings of footstep sounds and similar sounds, to quiet reversed pronunciations of unrelated strings.

While the possible types of noises suitable for audio Captcha generation are numerous, they should all be carefully chosen and engineered not to interfere with human comprehension – while staying bothersome enough to trip up automated voice recognition.

Of course, any predictable noise is easy to filter out, so noise duration, pitch, placement in the audio Captcha sound file and other parameters should be randomized to have any security impact at all. Audio noises such as these are a clear equivalent of drawing noises used by Captcha image generation, such as random lines, circles, dots, bezier curves, glyphs etc., and serve the same purpose.
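
As a rough sketch of randomized noise, the following Python mixes a few tone bursts into a list of speech samples. The sample rate, the parameter ranges, and the function names are assumptions made for this example; a real generator would use many more noise types.

```python
import math
import random

SAMPLE_RATE = 8000  # assumed sample rate in Hz


def random_tone(rng, total_samples):
    """Build a tone burst with random pitch, length, volume and placement."""
    freq_hz = rng.uniform(300.0, 3000.0)
    length = min(rng.randint(SAMPLE_RATE // 10, SAMPLE_RATE // 2), total_samples)
    start = rng.randint(0, total_samples - length)
    volume = rng.uniform(0.05, 0.2)
    burst = [
        volume * math.sin(2.0 * math.pi * freq_hz * n / SAMPLE_RATE)
        for n in range(length)
    ]
    return start, burst


def mix_noise(speech, rng=None, tone_count=3):
    """Return a copy of `speech` (floats in [-1, 1]) with tone bursts added."""
    rng = rng or random.SystemRandom()
    mixed = list(speech)
    for _ in range(tone_count):
        start, burst = random_tone(rng, len(mixed))
        for i, value in enumerate(burst):
            # Clamp so the mixed signal stays inside the valid sample range.
            mixed[start + i] = max(-1.0, min(1.0, mixed[start + i] + value))
    return mixed
```

Every parameter a filter could key on (pitch, length, volume, position) is drawn from a range, so no two generated files carry the same noise.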

Transforming Combined Audio CAPTCHA with Sound Effects

After mixing clean Captcha code pronunciation with random noise, we should make it harder to automatically separate noise from data – where we consider the Captcha code pronunciation as "data" or a kind of "plaintext" we're attempting to obfuscate.

For this purpose, the mixed sound output can be transformed by different kinds of audio effects. For example, mixing the sound with delayed copies of itself can produce a whole range of effects, familiar to musicians and sound engineers everywhere: echo, reverb, chorus, flanger, phaser... Other examples are sound effects that transform the waveform by manipulating sound amplitude, such as fade-ins and fade-outs, compression or expansion, tremolo, various kinds of distortion...

Random application of such sound effect combinations can be designed to introduce further roadblocks to automated sound analysis, while still keeping the overall sound comprehensible to human visitors. A Captcha image drawing analogue can be found in the various kinds of image processing effects, such as wave, skew or perspective transforms, median filtering, blur etc.
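
Two of the effects mentioned above, echo (a delayed copy mixed back in) and tremolo (periodic amplitude change), are simple enough to sketch directly. The parameter ranges and the sample rate below are assumptions for illustration, not tested values.

```python
import math
import random

SAMPLE_RATE = 8000  # assumed sample rate in Hz


def add_echo(samples, delay_samples, decay):
    """Mix the signal with one delayed, attenuated copy of itself."""
    out = list(samples)
    for i in range(delay_samples, len(samples)):
        out[i] += decay * samples[i - delay_samples]
    return out


def add_tremolo(samples, rate_hz, depth):
    """Vary the amplitude periodically; gain swings between 1 - depth and 1."""
    out = []
    for n, sample in enumerate(samples):
        wave = math.sin(2.0 * math.pi * rate_hz * n / SAMPLE_RATE)
        gain = 1.0 - depth * 0.5 * (1.0 + wave)
        out.append(sample * gain)
    return out


def apply_random_effects(samples, rng=None):
    """Apply echo, then tremolo, with randomly drawn parameters."""
    rng = rng or random.SystemRandom()
    out = add_echo(samples, rng.randint(400, 1600), rng.uniform(0.2, 0.5))
    out = add_tremolo(out, rng.uniform(2.0, 8.0), rng.uniform(0.1, 0.4))
    # Scale down only if the echo pushed the peak outside [-1, 1].
    peak = max((abs(s) for s in out), default=0.0)
    return [s / peak for s in out] if peak > 1.0 else out
```

Since the effects run after mixing, they transform pronunciation and noise together, which is what makes the two harder to separate afterwards.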

Designing Systematically Hard-to-analyze Audio CAPTCHA

When we understand what an individual Captcha sound file is composed of, we can move on to thinking about the broader picture of audio Captcha security vs. automated analysis.

When speaking of "security" and "engineering" in a Web context, your natural first association is likely to include encryption. However, encryption is based on mathematically-provable cryptographic principles and techniques, none of which really apply when talking about Captcha protection. As previously mentioned, we're primarily interested in obfuscating the Captcha code enough to make automated analysis hard – and the primary risk we have to handle is predictability.

The 1st Rule of Audio CAPTCHA Security: Randomize, Randomize, Randomize

Consequently, randomization is the first principle we'll adhere to when implementing audio Captcha generation. Silences between spoken characters; noise volume, placement, and duration; effect amplitude and duration: all are examples of values that benefit from randomization. Obfuscating "data" to make it hard to automatically tell apart from "noise" is much easier when both "data" and "noise" are random by nature, and their randomization characteristics overlap.

Thinking "what tone frequency should be used for this kind of noise" leads to noise that can easily be filtered out of sound files by recognizing that chosen frequency, and removing it usually won't impact the pronunciation waveform much even if the pronunciation also contains that exact frequency. Thinking "what tone frequency range should be used for this kind of noise" leads to obfuscation that is much harder to bypass: it makes automated recognition of noise segments by frequency harder, and removing a whole range has a higher chance of damaging the pronunciation waveform as well.
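
To see why a single fixed frequency is weak, consider how cheaply an attacker can test for it. The sketch below uses the Goertzel algorithm, a standard way to measure signal power at one frequency; the sample rate and the example frequencies are assumptions picked so that each tone falls on an exact analysis bin.

```python
import math

SAMPLE_RATE = 8000  # assumed sample rate in Hz


def goertzel_power(samples, freq_hz):
    """Signal power at the analysis bin nearest to freq_hz (Goertzel)."""
    n = len(samples)
    k = round(n * freq_hz / SAMPLE_RATE)
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev2 ** 2 + s_prev ** 2 - coeff * s_prev * s_prev2


def tone(freq_hz, count=800):
    """A pure sine tone, standing in for a noise burst."""
    return [
        math.sin(2.0 * math.pi * freq_hz * n / SAMPLE_RATE)
        for n in range(count)
    ]


if __name__ == "__main__":
    # A detector tuned to 1000 Hz lights up on fixed-frequency noise...
    print(goertzel_power(tone(1000.0), 1000.0))
    # ...but reports almost nothing for a tone drawn elsewhere in a range.
    print(goertzel_power(tone(1370.0), 1000.0))
```

A detector tuned to the one known frequency finds the noise immediately. Once the frequency is drawn from a range, the attacker has to scan the whole range, and anything removed from it is more likely to take part of the pronunciation along.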

Taming Audio CAPTCHA Randomization with Sound Style Design

In theory, randomizing everything over a large enough range of values would produce audio Captchas most resistant to automated analysis. However, always randomizing every possible value would also increase the probability of generating a sound file that even humans have a hard time comprehending – certain combinations of noises and effects will work well together, while others will clash and produce incomprehensible results.

These two conflicting concerns naturally lead to an audio Captcha system design based on a number of "sound styles", i.e. combinations of noises and effects whose parameter ranges have been carefully chosen and tested to produce sound files that are easy enough for human hearing but challenging enough for automated analysis.
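
One way to picture a sound style is as a named bundle of parameter ranges, from which concrete values are drawn for each generated sound. The field names, the style name, and the numbers below are all hypothetical; they only illustrate the idea of constrained randomization.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SoundStyle:
    """A named set of (min, max) ranges tested to work well together."""
    name: str
    noise_volume: tuple   # relative to pronunciation volume
    tone_hz: tuple        # frequency range for tone noise
    echo_delay_ms: tuple  # delay range for the echo effect
    gap_ms: tuple         # silence range between characters

    def sample(self, rng):
        """Draw one concrete parameter set from within the style's ranges."""
        return {
            "noise_volume": rng.uniform(*self.noise_volume),
            "tone_hz": rng.uniform(*self.tone_hz),
            "echo_delay_ms": rng.randint(*self.echo_delay_ms),
            "gap_ms": rng.randint(*self.gap_ms),
        }


# Hypothetical style; real ranges would come from listening tests.
RADIO_STATIC = SoundStyle(
    name="radio_static",
    noise_volume=(0.1, 0.3),
    tone_hz=(400.0, 2500.0),
    echo_delay_ms=(40, 120),
    gap_ms=(0, 500),
)

if __name__ == "__main__":
    print(RADIO_STATIC.sample(random.SystemRandom()))
```

Each value is still random, but only within limits that were checked against human listeners, which is how the style keeps randomization from producing incomprehensible audio.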

The 2nd Rule of Audio CAPTCHA Security: Randomize Some More

Since the "non-analyzability" of audio Captcha is based on designed obfuscation instead of mathematically proven unbreakability, we have to think about making the overall security of the system reasonably resilient against the possibility of each individual measure being bypassed.

The fact that a particular audio Captcha sound style is currently good enough at resisting automated analysis doesn't guarantee it will remain good enough in the face of increasing software capabilities. And upping sound style "analysis difficulty" by simply increasing the amount of audio noise and effect obfuscation is not a reasonable option, because it will quickly lead to Captcha audio that machines have an easier time understanding than actual humans do – the very opposite of audio Captcha design goals.

Optimizing Audio CAPTCHA Security with Randomly Chosen Sound Styles

Instead, we'll focus on getting the most obfuscation payoff possible out of the least amount of hassle for human users. And we'll achieve this by having a large pool of sound styles available, and using a randomly chosen one for each requested Captcha sound. That way, not only will automated analysis attempts have to be able to subvert one or more of the sound styles, but they'll also have to be able to recognize which sound style they're dealing with in the first place.

This approach leads to overall audio Captcha security that is a product of sound style "identification difficulty" and average sound style "analysis difficulty", and that is easily increased by adding more sound styles to the pool of available ones.
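
Choosing a style per request takes one line; the toy estimate next to it is only an assumed model of the "product" relationship described above, with made-up probabilities, not a measured result. The style names are hypothetical.

```python
import secrets

# Hypothetical pool; a real one would hold many tested sound styles.
STYLE_POOL = ["radio_static", "busy_street", "cave_echo", "robot_voice"]


def pick_style(pool=STYLE_POOL):
    """Choose the sound style for one requested Captcha sound."""
    return secrets.choice(pool)


def attacker_success_estimate(p_identify, p_solve_per_style):
    """Toy model: chance of identifying the style times the average
    chance of solving a style once it has been identified."""
    average_solve = sum(p_solve_per_style) / len(p_solve_per_style)
    return p_identify * average_solve


if __name__ == "__main__":
    print(pick_style())
    # Illustrative numbers only.
    print(attacker_success_estimate(0.5, [0.2, 0.4]))
```

Under this toy model, adding styles that are hard to tell apart lowers the first factor without making any single style harder for humans to understand.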

The 3rd Rule of Audio CAPTCHA Security: Err on the Side of Usability

While focusing on a security mindset, it's easy to get carried away and create an audio Captcha implementation that is more obfuscated than an average spy thriller – and unfairly hard on human visitors. So it is important to always keep in mind that a blind person is trying to use the website, and you are in their way.

The necessity of Captcha protection is an unfortunate fact of our current Internet reality, brought into existence by spammer malice and thoughtless disregard. And if, from the perspective of a blind person, your solution is characterized by similar malice and thoughtless disregard – how is that solution any better than the problem it set out to solve?

Thoughtful Application of CAPTCHA Protection

Every Captcha challenge should only be placed where it's really needed, and only be as obtrusive as is absolutely required. Captcha implementation on a form should always be evaluated both from a security perspective ("Does it effectively stop spam bots?") and a usability perspective ("Does it present a small enough barrier to all of our visitors?").

For example, only showing a Captcha challenge on your login form after 3 failed authentication attempts will be 99+% as effective at preventing brute force attacks, but will significantly reduce legitimate user annoyance.
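
The login form example can be sketched as a small failure counter. The class and method names are hypothetical, and keying the counter by username (rather than by IP address or session) is an assumption of this sketch.

```python
CAPTCHA_THRESHOLD = 3  # failed attempts before the Captcha appears


class LoginGuard:
    """Tracks failed logins and decides when to show a Captcha."""

    def __init__(self, threshold=CAPTCHA_THRESHOLD):
        self.threshold = threshold
        self._failures = {}

    def captcha_required(self, username):
        """True once the account has reached the failure threshold."""
        return self._failures.get(username, 0) >= self.threshold

    def record_failure(self, username):
        self._failures[username] = self._failures.get(username, 0) + 1

    def record_success(self, username):
        # A successful login clears the counter for that account.
        self._failures.pop(username, None)


if __name__ == "__main__":
    guard = LoginGuard()
    for _ in range(3):
        guard.record_failure("alice")
    print(guard.captcha_required("alice"))
```

Legitimate users who type their password correctly never see the challenge, while a brute force script hits it after its third guess.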

Conclusions & Further Reading

Implementing audio Captcha protection that is both accessible to visually impaired visitors and effective against bots requires careful balancing of randomization-powered obfuscation (making it unpredictable against automated analysis), and user-focused usability testing.

If you want to check how BotDetect lives up to these high standards, test Captcha audio here.

Sound Advice: All About Audio CAPTCHA