Archive for July, 2009

The (un)holly explanation

July 25, 2009

As the title says, this post is all about the reason why my board sometimes used to produce garbled audio. As of this writing, I am in the midst of fixing the bug, and in some cases I am already able to listen to the complete 15-second audio file from the Asterisk IVR collection (“press one to record your unavailable message, press two to record your busy message” etc.) without a single audible hickup. This is “a small step for the project, but a giant leap for me and my own good mood”! I was nearly depressed with this bug resisting arrest all this long time, but the morale of the story is that in the end, no bug can last forever. An optimistic message, isn’t it?

To recapitulate: until my previous post, I was in doubt whether something was wrong with the board’s firmware or the issue of garbled audio was caused by libusb. Well, as one would expect, the answer is not so straightforward. The problem has a dual cause: it is was caused by a bug in the firmware which was triggered by libusb occasionally delaying a OUT packet.

So, I am going to explain the issue here. If all this is too technical for you, you may skip to the next post (whenever this comes out). There’s absolutely no need for you to go through my bug analysis session. But, if you find it instructive (or amusing, or both) to read stories on how other people blew it, you are on the right track here.

The libusb side of the problem is easy to understand. Ocasionally, libusb was delaying an OUT packet a bit. I am not sure yet why this is happening, but I am not complaining: in a real-life VoIP environment, a lost or delayed packet is something usual. Although this should not happen in a USB environment (with just a one-foot shielded cable intervening between the software and my board), the truth is that it is allowed to happen. Isochronous USB I/O clearly allows packets to be lost in the way. Thus, my firmware should clearly allow for a packet to be missed or delayed. Did it really?

Again, the answer is not so straightforward. My PIC USB configuration employs something called ping-pong buffering, which uses two sets of USB endpoint control descriptors, a set for even transmissions, and another set for odd ones. In my TMR1 ISR code, the thing was designed to work as follows: I had two sets of OUT/IN buffers, one OUT/IN pair for the even ping-pong phase, and one for the odd one. The reasoning was that I would “arm” two USB I/O operations (and OUT and an IN one) using, say, the “even” endpoint control descriptors and the even set of buffers; then, while waiting for these I/O operations to complete, I would do PCM I/O to and from the 3210 transferring data from and to the odd buffers. Then, at the next ping-pong phase, the “armed” even I/O operations for the even buffers would have completed, so I could “arm” new I/O operations using the odd endpoint control descriptors and the odd buffers, while doing PCM I/O to and from the 3210 using the even buffers. And so forth.

Of course, my code provided for packet loss in the OUT direction (that is, OUT from the PC to the board, which is the input direction for the board — confusing, I know, but one gets used to it after some time). If, at the time that the ISR was expecting an OUT packet, no such packet had shown up yet, the code would just ignore the packet loss and fill the respective (even or odd) buffer with PCM “silence” (a bunch of 0xFF’s).

When writing the code, this looked like the right thing to do; however, as usual, doing the right thing turns out to be more complicated than one thought. The first question to ask is, what will happen if libusb misses one packet? So, instead of, say, sending packets 1, 2, 3, 4, 5, 6, …, libusb sends packets 1, 2, 4, 5, 6, … — what will happen then? Supposing that packet 1 corresponds to an even buffer, this is the sequence of events:

1 --> E
2 --> O
 X --> E

, and then, guess what: packet number 4 will not be received, because it’s still the turn of an even OUT operation! The PIC queues up to four USB packets internally, so packet 4 is queued to be received by the next even OUT. However, the ISR tries an odd OUT and sees no packet there. Thus, the next events are

[4 -Q> Ε]
X --> O
4 --> E
 5 --> O

This doesn’t look so bad, does it? And, it should not sound too bad either: only two milliseconds of silence, and then everything is back in order, queued a bit behind.

While all this is fine so far, what will happen if libusb delays a packet but does not miss it? In other words, what if libusb transmits instead of packets 1, 2, 3, 4, 5, 6, …, packets 1, 2, X, 3&4, 5, 6, … ? This means that packet 3 is delayed a bit, so that when the ISR checks for the packet, it doesn’t see it, but the packet is sent shortly after, within the same SOF frame, and the PIC sees it and queues it. Packet 4 then comes on-time, so no loss occurs. Here is what happens in this case:

1 --> E
2 --> O
X --> E
[3 -Q> E]
4 --> O
[5 -Q> E]
3 --> E
 6 --> O

Wow! This looks quite different, doesn’t it? The effect is that all packets are reordered, and they are now being “played” in the opposite order than the intended one, like this: 1, 2, 4, 3, 6, 5, …, until eventually, the same thing happens again, and the order is fixed. This was the notorious bug that had been producing garbled audio (and causing me endless hours of headache and nights of fruitless debugging)!

To resolve this, I am working towards decoupling the odd-even endpoint descriptor handling from the odd-even buffers. So, if a (say) even phase OUT yields no packet, in the next round I re-try the even USB endpoint descriptor, but this time data will be pointed to the “odd” buffer (it does not really make sense to call the buffers odd and even anymore). This seems to work right most of the time, but the firmware still needs more fixes here and there. But things look (and sound!) definitely better now than before.

This was the (un)holy explanation behind a difficult and elusive bug. It certainly feels nice to leave such things behind and go on. So, what’s next? But of course it’s Linux, and Linux-wards my footsteps are driving me! But more on this in my next post(s).

Update July 26 (later on): here is an example of mis-sync between the PC and my board, as captured by the USB sniffer (you may need to click on the image to make it appear full-size):


In each row sniffed, you can see two IN packets, each starting with the “magic sequence” 0xBA, 0xBE (I call them “the babes” :-)). Next to that, you can see a 0xDD for “odd” packets or a 0xEE for “even” ones (the change I made does not affect IN packets, so this corresponds to the ping-pong phase of the PIC USB system). Then, byte #3 (starting from zero, the first boxed column of packets is #3) mirrors the sequence number of the respective OUT packet (so, under normal conditions,  it is incremented on every IN packet). Then, bytes #4 and #5 (on odd packets only) reflect the value of TMR3 at a specific point during the TMR1 ISR sequence which occurs once before an odd packet is transmitted (the value is in little endian byte order). Subtracting each number in this sequence from the previous one yields the constant number 0x5DC0, or 24000 decimal (which, multiplied by the 83,333+ ns that each PIC clock cycle takes, yields 2 milliseconds!). The next byte (not boxed, has the value 0x5) on odd packets reflects the number of OUT packet misses since last reset of the board. The last boxed column, byte #7 on even packets only is a IN sequence number, which is incremented independently on each even packet.

Now, observe that the buffer contains only 31 packets, whereas it should contain 32. The 19th packet, albeit correctly an odd one, contains a wrong sequence number (thick gray box, value 0x12 — the expected value is 0x52). The next packet is again an odd one! This is definitely a mis-sync or other similar issue on the side of libusb, because the firmware cannot repeat transmission of an odd packet. This second odd IN packet continues correctly the sequence of mirrored OUT packets (0x52)! In other words, no OUT packet was missed (and byte #6 of even packets is constantly 5, or else it should have been incremented). But now, subtracting 0x64F9 (TMR3 value on last correct odd packet appearing on the right-hand side) from 0x2079 (TMR3 value on first odd packet appearing on the left-hand side), one gets 0xBB80 (decimal 48000, or 4 milliseconds)! This means that an even IN packet is missing altogether, and this is also proved by the sequence numbers of even packets (the boxed sequence on the left-hand side ends at 0xAB, and the one on the right-hand side starts at 0xAD).

In my opinion, this shows quite clearly a mis-sync situation between the board and libusb. In this case, the board has recovered without any audible distortion. What is still puzzling is why no packet appears to have been lost in the OUT direction. If, by reading this, you can suggest an explanation, I ‘d be happy to hear about it.