Redphone + Speex = ?

18 Apr 2014

I posted on Twitter a picture with a slightly too cryptic message. That post got far more views than I expected, which leads me to believe that people think I’ve found a real vulnerability.

This is not the case - I posted it because I think there’s something there, but since I’m not a full-time security researcher or cryptographer I don’t have the time or budget to take this further - especially if I want to do anything else in my spare time.

My goal was to see if one of the security researchers who follow me wanted to check it out.

Unfortunately, while I was trying to get some of the Matasano guys to have a play with it, Matthew Green noticed it and retweeted - giving it a far larger circulation than it deserved. (Doesn’t he normally just favourite stuff?)

What does the picture mean?

The picture shows the raw encode timing of the Speex codec, configured as it is in Redphone, for two sets of samples: one is all-zeros (blue), the other is some data I gathered from /dev/urandom (red). (Both were read from pre-generated files by my test program.)
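A measurement like this can be sketched as a small harness that times each frame separately. This is not the original test program: `time_frames`, `FRAME_BYTES` and the frame size are assumptions for illustration, and `encode` stands for whatever Speex binding is available (none is assumed here).

```python
import os
import time

def time_frames(encode, data, frame_bytes):
    """Return the encode time in nanoseconds for each consecutive frame of data."""
    timings = []
    for offset in range(0, len(data) - frame_bytes + 1, frame_bytes):
        frame = data[offset:offset + frame_bytes]
        start = time.perf_counter_ns()
        encode(frame)  # the call whose duration is being measured
        timings.append(time.perf_counter_ns() - start)
    return timings

# Assumed frame size: 20 ms of 16-bit mono audio at 8 kHz.
FRAME_BYTES = 320

# The two inputs described above: all zeros, and random bytes.
zeros = bytes(FRAME_BYTES * 100)
noise = os.urandom(FRAME_BYTES * 100)

# Usage, with a hypothetical encoder function supplied by the reader:
#   blue = time_frames(my_speex_encode, zeros, FRAME_BYTES)
#   red = time_frames(my_speex_encode, noise, FRAME_BYTES)
```

Plotting the two lists against frame index gives the same kind of picture. On a desktop OS the numbers will include scheduler noise, so many runs are needed before reading anything into a difference.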

The reason I performed this test is that Speex (sort-of?) uses CELP, which basically amounts to searching through previous audio data to find a match, then offering a back-reference rather than raw audio data.

This signal is possibly equivalent to looking at the size of packets from a VBR codec, which seems to be exactly what Redphone was trying to avoid by setting the codec to CBR.

Can this be exploited?

I’d love to know the answer to that question - that’s why I didn’t just delete the work I’ve done and pretend it never happened. This has implications for other VoIP systems too (such as WebRTC).

The answer to that is almost certainly hardware-specific. The CPUs on Android phones change speed according to load, the kernel could preempt the task, or any number of other things could be happening.

The CPU and bus speed of the phone would play a major part in this - definitely worth trying on an older phone (or setting the CPU scheduler to powersave) when first testing.

A more likely problem would be how the network hardware schedules transmissions - is it sending at fixed intervals, or does sending a packet wake up the chip? Some wireless protocols send RTS (request-to-send) packets - so even if the network packet itself is delayed, somebody in radio range could see when the sender was ready to send and simply count that as the right time.

In any case - I don’t know. The answer might be “no” for all current technology but change in some future technologies.

What can be done?

I suggested to Moxie that double-buffering could be used. This just means sending the previous packet immediately after reading from the sound chip, so the send time no longer depends on how long the encode took. He balked at it, since he’d just finished a ton of work reducing voice latency.
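A minimal sketch of the double-buffering idea, assuming a callback that fires once per frame read from the sound chip. The class and parameter names (`DelayedSender`, `encode`, `send`) are hypothetical and not taken from Redphone.

```python
class DelayedSender:
    """Send the packet encoded during the previous frame period, then encode the new one."""

    def __init__(self, encode, send):
        self._encode = encode  # variable-time encoder, e.g. a Speex wrapper
        self._send = send      # puts one packet on the network
        self._pending = None   # packet encoded during the previous period

    def on_frame(self, pcm_frame):
        # Send first, so the transmit instant follows the audio clock
        # rather than the time the encoder happened to take.
        if self._pending is not None:
            self._send(self._pending)
        # Encode afterwards; the result waits one frame period before going out.
        self._pending = self._encode(pcm_frame)


sent = []
sender = DelayedSender(encode=lambda frame: b"enc:" + frame, send=sent.append)
sender.on_frame(b"a")  # nothing is sent yet
sender.on_frame(b"b")  # the packet for frame "a" goes out now
```

The cost is one extra frame period of latency, which is the objection mentioned above. It also only helps if the encode reliably finishes within one frame period.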

My professional opinion (since my real job is working full-time on VoIP systems) is that a G.711 codec should be used.

The bandwidth requirement for this codec is under 100 kbit/s, of which 64 kbit/s is the audio itself (different network technologies have different overheads, and your connection might actually be a stack of Layer 2 protocols).
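As a rough check on that figure, the arithmetic can be written out. The 20 ms ptime and the header sizes (RTP 12 bytes, UDP 8, IPv4 20, with no SRTP authentication tag and no Layer 2 framing) are assumptions for illustration, not values taken from Redphone.

```python
SAMPLE_RATE_HZ = 8000  # G.711 sample rate
BITS_PER_SAMPLE = 8    # one companded byte per sample
PTIME_MS = 20          # assumed packetisation interval

# Assumed per-packet headers: RTP (12) + UDP (8) + IPv4 (20).
HEADER_BYTES = 12 + 8 + 20

def g711_bandwidth_bps(ptime_ms=PTIME_MS, header_bytes=HEADER_BYTES):
    """Bits per second on the wire for one direction of a G.711 stream."""
    payload_bytes = SAMPLE_RATE_HZ * ptime_ms // 1000 * BITS_PER_SAMPLE // 8
    packets_per_second = 1000 / ptime_ms
    return (payload_bytes + header_bytes) * 8 * packets_per_second

print(g711_bandwidth_bps())             # 80000.0
print(g711_bandwidth_bps(ptime_ms=10))  # 96000.0
```

Shorter ptimes lower the latency but send more packets, so the fixed header cost takes up a larger share of the total.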

Since each sample is independent there’s no problem implementing it in constant time (the sound chip probably does it natively anyway).
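To illustrate the per-sample independence, here is a sketch of the mu-law variant of G.711 using the widely used bias-and-segment formulation. Each output byte depends only on its own input sample, so the whole codec collapses into one table lookup per sample. Python itself gives no timing guarantees, so this shows the structure only; `ULAW_TABLE` and `encode_frame` are illustrative names.

```python
BIAS = 0x84   # 132, the standard mu-law bias for 16-bit input
CLIP = 32635  # largest magnitude that cannot overflow once BIAS is added

def linear_to_ulaw(sample):
    """Encode one signed 16-bit PCM sample as one G.711 mu-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    exponent = (magnitude >> 7).bit_length() - 1  # segment number, 0..7
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF

# Precompute all 65536 outputs once; the index is the sample viewed as unsigned 16-bit.
ULAW_TABLE = bytes(
    linear_to_ulaw(s - 0x10000 if s >= 0x8000 else s) for s in range(0x10000)
)

def encode_frame(samples):
    """Encode signed 16-bit samples with one table lookup each, with no search over past audio."""
    return bytes(ULAW_TABLE[s & 0xFFFF] for s in samples)

print(encode_frame([0, -1, 32767, -32768]).hex())  # ff7f8000
```

Unlike the Speex case, there is no search over earlier audio here, so there is nothing for the content of the call to speed up or slow down.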

The coder latency is extremely low (basically just the ptime).

Recovering from packet loss in G.711 can be done a number of ways on the client side, and it generally sounds fine in relatively low packet loss environments - certainly better than the current Speex codec does.
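One of the simplest client-side approaches is to replay the last good frame at reduced volume for each lost packet. The sketch below works on decoded PCM frames; the function name, the `None`-for-lost convention and the attenuation factor are assumptions for illustration. Real implementations are usually more elaborate.

```python
def conceal_losses(frames, frame_len, attenuation=0.5):
    """Replace lost frames (None) with a faded copy of the previous output frame.

    frames: list of PCM frames (lists of ints), with None marking a lost packet.
    frame_len: samples per frame, used for silence if the first packet is lost.
    """
    last = [0] * frame_len  # start from silence
    out = []
    for frame in frames:
        if frame is None:
            # Repeat the previous output, quieter each time the loss continues.
            frame = [int(s * attenuation) for s in last]
        out.append(frame)
        last = frame
    return out

print(conceal_losses([[100, -100], None, None, [8, 8]], frame_len=2))
# [[100, -100], [50, -50], [25, -25], [8, 8]]
```

Because every G.711 packet decodes on its own, a lost packet only affects its own 20 ms or so of audio. The decoder has no prediction state to resynchronise afterwards.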


Sorry to the Open Whisper Systems guys - I didn’t intend to spread FUD about your product (although I feel like that was the outcome).

Also sorry to anyone who mistook my Tweet as having more meaning than it really does.