While contemplating my next step (to finish off with a simple channel driver, or to try planting my device driver into the Zaptel/Dahdi family), I couldn’t help but feel a bit angry at my newly-born dongle board, because at first glance it seemed not as perfect as I had expected it to be. Besides the heat dissipation issue that I reported in my last post, I had also noticed — and reported in my initial dongle-is-ready post — that the dongle was producing bad-quality sound in recording. This time, though, I was determined not to give up so easily. You see, with the dongle, I had accumulated just too many boards with sound problems, and attributing these problems to bad materials on all of them just did not make sense. So, I decided to subject the dongle to a series of torture tests.
Surprisingly, I was to find that occasionally the recording quality was perfect. This meant that the problem could not have to do with the components, but rather with some condition in the digital part of the circuit or, even more probably, with some I/O glitch. But what could that be?
It was thus (and very reluctantly, believe me) that I decided to postpone the channel driver work and go back again into debugging bad audio quality problems. What new things could I find by debugging now? Well, in contrast with my early development days, when the only tool I had to debug packet loss, reordering, etc. was a USB sniffer, now I had a complete kernel driver to play with. So, I decided to go and plant some monitoring code into the driver. I’ll cheat a bit by telling you right up front that the debugging findings proved to be surprisingly interesting.
To understand the debugging, remember that audio data packets contain a header. OUTgoing packets (PC to board) contain a 1-byte sequence number. INcoming packets (board to PC) contain two of those: a sequence number of their own (incremented independently on odd and even packets) and a mirror of the sequence number in the last OUT packet received by the board. By making use of these sequence numbers, debugging packet loss and reordering issues would be easy. Apart from packet loss and reordering, there were two other conditions that I needed to debug as well: input overrun (when nobody picks up data delivered by the board) and output underrun (when there is no data available when it’s time to send some to the board). The driver knew all about these cases and could report them right away.
So, I implemented some statistics counters in the driver, adding of course the necessary sophistication salt by instructing open() to reset all the counters, so that I would get a fresh picture each time a user program opened the device anew. I ran my output-only and input-only test programs again, only this time I had inserted a piece of code instructing the programs to report the collected statistics roughly every second (every 8,192 samples, to be exact).
An excerpt of the output messages from the output-only test program follows. I guess it would help you understand what’s happening if I decoded the runes for you: IN_OVR stands for “input overrun” conditions (each overrun is reported once, regardless of the number of packets lost), IN_MSS is “missed IN packet sequence numbers” (deduced from gaps in incoming sequence numbers), IN_BAD represents “bad” packets (reported by the USB core to be received incorrectly), OUTUND stands for “output underrun” conditions (again, each underrun is reported once) and OUTMSS is for “missed OUT packet sequence numbers” (sequence numbers not mirrored back as expected).
IN_OVR: 494,  IN_MSS: 0, IN_BAD: 0, OUTUND: 250, OUTMSS: 0
IN_OVR: 750,  IN_MSS: 2, IN_BAD: 0, OUTUND: 250, OUTMSS: 3
IN_OVR: 1006, IN_MSS: 4, IN_BAD: 1, OUTUND: 250, OUTMSS: 89
IN_OVR: 1262, IN_MSS: 4, IN_BAD: 1, OUTUND: 250, OUTMSS: 113
IN_OVR: 1518, IN_MSS: 4, IN_BAD: 1, OUTUND: 250, OUTMSS: 113
IN_OVR: 1774, IN_MSS: 4, IN_BAD: 1, OUTUND: 250, OUTMSS: 113
IN_OVR: 2030, IN_MSS: 4, IN_BAD: 1, OUTUND: 250, OUTMSS: 113
IN_OVR: 2286, IN_MSS: 4, IN_BAD: 1, OUTUND: 250, OUTMSS: 113
IN_OVR: 2542, IN_MSS: 4, IN_BAD: 1, OUTUND: 250, OUTMSS: 113
IN_OVR: 2798, IN_MSS: 4, IN_BAD: 1, OUTUND: 250, OUTMSS: 113
IN_OVR: 3054, IN_MSS: 6, IN_BAD: 2, OUTUND: 250, OUTMSS: 189
IN_OVR: 3310, IN_MSS: 6, IN_BAD: 2, OUTUND: 250, OUTMSS: 189
IN_OVR: 3566, IN_MSS: 6, IN_BAD: 2, OUTUND: 250, OUTMSS: 189
The results were quite interesting (well, at least they looked interesting to me): since the program was not reading any of the samples that the board was sending, the IN_OVR statistics increased steadily. That was OK. Every now and then, a couple of IN packets were missed. This was definitely not OK, and I would have to investigate further why it happened. Judging by the IN_BAD column, it looked like every now and then the USB core received a “bad” IN packet (which it ignored) and the IN packet next to it was also lost. Hmmmm… why would that happen in pairs? I needed to think it over… The OUTUND column reported a constant number of underruns, which presumably occurred between the time the program open()s the device and the time it starts sending audio. That was perfectly OK. The rightmost column was more intriguing. It looked as if, every now and then, there was a burst of lost sequence numbers. A DEBUG statement in the driver helped me remember that I had been “flexible” with OUT packets in my ISR code, allowing odd and even packets to be reversed when there are delays. So, while the driver was expecting, say, mirrored OUT sequence #10, it got back sequence #11. It then set itself to expect sequence #12 (11 + 1), but received instead sequence #10 that it had been missing. It then expected #11, but received #13. And so forth, until one packet got late and the sequence numbers got reversed again.
What was by far more interesting was that OUT delays resulting in sequence number reversals, which in turn manifested themselves as bursts of missing sequence numbers, seemed to occur at the same time as bad/lost IN packets. Hmmmm… I’d better make a note of this.
Here were the results I got by running my input-only test program.
IN_OVR: 254, IN_MSS: 0, IN_BAD: 1, OUTUND: 498,  OUTMSS: 75
IN_OVR: 254, IN_MSS: 0, IN_BAD: 1, OUTUND: 754,  OUTMSS: 75
IN_OVR: 254, IN_MSS: 0, IN_BAD: 1, OUTUND: 1010, OUTMSS: 75
IN_OVR: 254, IN_MSS: 0, IN_BAD: 1, OUTUND: 1266, OUTMSS: 75
IN_OVR: 254, IN_MSS: 0, IN_BAD: 1, OUTUND: 1522, OUTMSS: 75
IN_OVR: 254, IN_MSS: 2, IN_BAD: 1, OUTUND: 1778, OUTMSS: 76
IN_OVR: 254, IN_MSS: 2, IN_BAD: 1, OUTUND: 2034, OUTMSS: 76
As was to be expected, the IN_OVR column reported a steady number of (initial) IN overruns, while OUTUND showed an increasing number of OUT underruns, since nobody was feeding the driver with data. IN_MSS showed two missed IN sequence numbers, while only one OUT sequence number was lost. All this meant that no substantial issues were observed, although one IN packet was lost somewhere along the way.
While it was too early to tell for sure, I suspected that all these — especially the “bad” IN packets, where each “bad” packet resulted in a pair of missed IN sequence numbers — were signs of bad synchronization between the ISR and the PC’s USB host controller. A possible explanation was that, while the ISR was clocking the I/O between the PIC and the 3210 diligently at (its own idea of) 256 kHz, the ISR synchronized (or so I thought, more on this later on…) with the host-originated USB SOF frame only once in a lifetime. Thus, if a small drift existed between the USB clock and the board’s clock, eventually the IN/OUT isochronous traffic from the board drifted too, until the IN packet coincided with SOF (hence, it was lost), and the OUT packet was expected too early or too late (hence, an OUT packet was received “late” and the OUT odd/even ping-pong buffers were reversed).
If the above scenario was true, there were actually two caveats in my initial naive design. The first was the assumption that it suffices to synchronize the SOF time between the PC and the ISR code once and only once (or so I thought, more on this later on…), and let the two clocks run in parallel thereafter. The second was the assumption that the device could work its internal audio I/O at a rate independent of the USB SOF rate (without some sort of feedback-adaptation mechanism, like the feedback endpoint and the variable packet size in audio-class USB devices).
My first step in debugging was to confirm the hypothesis that these two caveats were indeed the cause behind missed packets. Thus, I implemented a simple new function in the firmware to report, in a 32-byte packet, fifteen successive values of TMR3, each sampled right after the SOF bit is asserted by the PIC’s USB engine [with TMR1 interrupts temporarily disabled and without actually modifying TMR3, only sampling its value, so that I could deduce the exact elapsed time by subtracting two successive values from one another]. Then, I added a few lines of code to my Windows test driver program to print out the differences between successive TMR3 values (remember that TMR3 runs at one tick per instruction). Here is the result:
  01   02   03   04   05   06   07   08   09   10   11   12   13   14
2EE3 2EE0 2EE0 2EE3 2EE0 2EE0 2EE3 2EE0 2EE3 2EE0 2EE0 2EE3 2EE0 2EE3
  01   02   03   04   05   06   07   08   09   10   11   12   13   14
2EE0 2EE0 2EE3 2EE0 2EE3 2EE0 2EE0 2EE3 2EE0 2EE3 2EE0 2EE0 2EE3 2EE0
  01   02   03   04   05   06   07   08   09   10   11   12   13   14
2EE0 2EE0 2EE3 2EE0 2EE0 2EE3 2EE0 2EE3 2EE0 2EE0 2EE3 2EE0 2EE3 2EE0
  01   02   03   04   05   06   07   08   09   10   11   12   13   14
2EE0 2EE3 2EE0 2EE3 2EE0 2EE0 2EE3 2EE3 2EE0 2EE0 2EE0 2EE3 2EE0 2EE3
These differences correspond most of the time to 12,000 instructions (0x2EE0) which, multiplied by the PIC’s nominal instruction cycle time Tcy, is exactly one millisecond; sometimes, though, one gets 12,003 instructions (0x2EE3), which is slightly more than one millisecond (1.00025 ms, to be exact). This meant that the two clocks were indeed drifting with respect to one another, with the PC’s USB clock being just a liiiiitle bit slower than the board’s clock. QED! Since I ran this test against my large-form-factor prototype #2, I thought about running it against my dongle board as well. So, here are the results for the dongle:
  01   02   03   04   05   06   07   08   09   10   11   12   13   14
2EE0 2EDD 2EE0 2EDD 2EE0 2EE0 2EDD 2EE0 2EE0 2EDD 2EE0 2EE0 2EDD 2EE0
  01   02   03   04   05   06   07   08   09   10   11   12   13   14
2EE0 2EE0 2EDD 2EE0 2EE0 2EE0 2EDD 2EDD 2EE0 2EE0 2EDD 2EE0 2EE0 2EDD
  01   02   03   04   05   06   07   08   09   10   11   12   13   14
2EE0 2EDD 2EE0 2EE0 2EDD 2EE0 2EE0 2EDD 2EE0 2EDD 2EE0 2EE0 2EDD 2EE0
  01   02   03   04   05   06   07   08   09   10   11   12   13   14
2EDD 2EE0 2EE0 2EDD 2EE0 2EDD 2EE0 2EE0 2EDD 2EE0 2EDD 2EE0 2EE0 2EDD
Wow! As it turned out, this time the PC USB host controller’s clock was a bit too fast compared to that of the dongle board. No wonder my two working boards (prototype #2 and dongle) showed slightly different sound problems, then… One less mystery to solve! So, now that the culprit for missed packets was revealed, the question was how to fix the bug.
To return to an old and long-lasting theme in this blog, this is exactly the reason why people have come up with the audio class in USB: clocks between the host and the device are not the same and the obvious way to synchronize them is to adapt the sending rate. However, I felt — and I still feel — that modifying the firmware in order to support a composite device with two audio-class devices would be an unnecessary complication. Why? Because, unlike the case of an audio playback environment where the device instructs the host to adapt its sending rate to the device’s playback rate and there is an infinite buffer of source data, in a VoIP environment there is no globally accepted master clock (there are the remote-side endpoint, the USB host and the local-side USB device, and all these have different clocks), so data overruns and underruns will be inevitable in the end. However, why create one more such source of bad audio by not synchronizing the USB device to the host?
But wait a moment — why should we synchronize the sending rate to the USB clock by varying the amount of data per packet? If the skew between the USB host and the device clocks is not significant, it could be just as well that the USB device, instead of keeping a constant clock and adapting the amount of data per packet, adapts its audio recording and playout rate to the rate of the host! This would probably be impossible if the device had a pure hardware clock; however, with a firmware-generated clock, like the one in Open USB FXS, it did not sound impossible at all!
To take that road, I would have to resynchronize my ISR’s notion of time with the SOF signal not just once, but once every 256 ISR invocations (that is, every 1 ms, at the rate of SOF). Unfortunately, adding a simple wait-for-SOF check in my ISR code would not suffice, because this would work only if the ISR were faster than the actual SOF rate; in the opposite case, SOF would arrive earlier than my test, and the test would always succeed without synchronizing the two rates… After torturing my mind a bit, I came up with the following fix: what if I “stole” a few instructions out of the ISR’s schedule, once per 256 ISR invocations, so that each round of 256 invocations would purposely last slightly less than the nominal 1 ms? Say, 6 PIC instructions, which would amount to 0.5 microseconds. Then, I could create a “rendez-vous” point in time between the ISR and the SOF signal by means of a simple tight loop that would spin waiting for SOF to appear. What was so devilish about this method was that it did not require any initial synchronization action: initially, SOF would in all likelihood appear somewhere else in the ISR cycle, so my test would find the SOF bit already asserted. This would result in short-cycle execution, bringing the overall execution time of a 256-invocation round down to 0.9995 ms. Thus, the relative position of SOF would gradually “advance” over time, occurring later and later in the cycle, until a point was reached where the test code in the ISR would find SOF not yet asserted. In that case, the ISR would loop tightly, waiting for the SOF to come. This would then happen over and over, every 256 ISR cycles (once per SOF). Hence I would achieve synchronization between SOF and my board’s clock, which was what I was after!
All this was easier said than coded (in tight-time-profiled PIC assembly). First, I had to make room for my rendezvous point somewhere in the ISR code. I already had a special case for cycle #31 (counting starts from zero: cycle #31 is the 32nd cycle), so I moved FSYNC pulsing from that place to cycle #0 of the ISR. Setting TXS and RXS back to zero (instead of 1) was also required, but this worked fine (and passed the audio test without worsening the audio quality in recording or playback).
Then it was the turn of the SOF test. I first removed a similar tight loop from the code, which was supposed to synchronize SOF with the ISR exactly once. I am saying “was supposed to” because, as I found out later, that test was actually doing nothing, due to a stupid assembly bug: I had used a BTFSC instruction instead of BTFSS, thus testing for a clear bit condition instead of a set one, so the test always came out true (it’s incredibly easy to make such stupid errors when coding in assembly, and it’s also incredible how long these errors can live until they finally get spotted). OK, that removed, I planted the once-every-256-invocations test into cycle #31 and, after carefully profiling all the timing, I added the following tight loop for SOF synchronization:
SOFloop  BTFSS  UIR, SOFIF, ACCESS  ; C:1/2 break loop if SOF is set
         BRA    SOFloop             ; C:2   otherwise loop waiting
         BCF    UIR, SOFIF, ACCESS  ; C:1   clear SOFIF for next time
These three lines of code cost me two afternoons of debugging in frustration and despair. Why? Because this code just froze the board: no signs of activity, no LED flashing, no USB device, no nothing. By #ifdef’ing out the three instructions above, everything came back to normal, so it was clear that the PIC was sitting in a tight loop waiting for a SOF that never appeared. Looking at the disassembly of the user-level, compiler-generated code showed me that the instructions the compiler had produced for profiling the frequency of SOF (which I mentioned earlier in this post, and which worked fine) were the exact same ones I had used in my ISR — only in the ISR these very instructions were causing the board to stall. Stated differently, the exact same code was working in one place, but freezing the board in another. Shoot…
I almost went crazy for two days, trying to figure out why my code locked up the board. Finally, after lots of working hypotheses that proved wrong, enlightenment came: the ISR was starting almost immediately after the board powered up, before any USB negotiation. At that time, there was obviously no live USB connection, and therefore there was no SOF either. Thus, the processor just sat in a tight loop inside an ISR, doing nothing to proceed with the USB negotiation (which, in Microchip’s stack, is situated in userland). Of course, since the processor was doing nothing, no negotiation ever took place, and no SOF ever appeared. Bingo! I quickly added a test to activate the tight loop only after at least one SOF had been seen (hence, negotiation had been successful), and the board sprang back to life!
Now it was time for the infallible audio test. I ran the test and — pfzzzt! — failure… I kept seeing missed packets here and there. This felt like madness, so I buried myself in debugging again. I added and removed stuff here and there, in the driver and in the ISR, but to no avail: both boards kept crackling from time to time, and the statistics reported missed packets in perfect accordance with the audible problems. However, I quickly noticed a difference, and it was a big one: “bad” IN packets had disappeared. This meant that my fix had worked after all, and all I had to worry about were the missed sequence numbers.
It was then that my suspicions finally turned in the right direction. What if all this was VMware’s problem? A quick test confirmed the suspicion: increasing the priority of the VMware process to the maximum almost eliminated packet losses and the like. Just to be totally sure, I quickly installed Wubi and tested my driver on a natively-running Ubuntu Linux system. And the truth shone: no packet losses at all, no crackles, no clicks, perfect sound for as long as I wished to listen!
A good question is, why hadn’t I seen these issues before? It seems that there are two good explanations: first, by adding more and more code (debugging, statistics, etc.) to the driver, I increased the load to be carried out, so occasionally VMware could not cope with that load (whereas native Linux proved to have no problem). The other explanation is that I was reluctant to admit that there were still things to fix in my so-well-thought-of ISR code, so I tended to attribute occasional sound problems to bad hardware. I had to see similar problems appearing on more boards in order to admit the ugly truth and sit down to debug and fix the firmware.
But it was not over yet… The two test boards (the dongle more frequently than the large-form prototype) were now occasionally producing distorted audio on input. Yes, only in the input direction. With the priority increase in VMware, my statistics were showing no packet losses at all. However, sound was from time to time recorded with distortion, somewhat like listening to audio equipment that clips the signal. It took me still another day to think of associating this issue with the FSYNC pulse.
Misled by the PCM timing diagram on p. 16 of the Si3210 datasheet, I had coded FSYNC to be asserted shortly before the falling edge of PCLK on the first ISR cycle. However, the timing diagrams on p. 55 of the datasheet clearly show FSYNC rising right before the previous rising PCLK edge. But then again, why didn’t that problem show up earlier, too? Well, the first thing is that it did appear occasionally on some boards. I guess that my recent SOF synchronization changes introduced some jitter that the 3210 apparently did not like very much. My assumption is that the 3210 was thus occasionally hastening or delaying the high-order bit of DTX. This bit being the sign bit of an 8-bit PCM sample byte, and depending on when and how often the bit was lost, the loss resulted either in a clip-like audio effect or in something like a vocoder-modulated signal, much like “Darth Vader’s voice”, as I have reported in some of my earlier posts. [Note: I am not sure of this explanation; it’s just the best one I could come up with. Maybe some experienced 3210 user or some other audio/DSP specialist will offer a better explanation than mine. I am all ears.]
So I attempted several fixes, mostly trying to relocate the rising edge of FSYNC somewhere else in the code, at a point where it would be better understood by the 3210. The fix that seemed to work best was to move FSYNC back to the 31st cycle, as early as possible before the falling edge of PCLK. This produced tolerable (though not 100% crystal-perfect) audio quality in recording on both boards (prototype #2 and dongle). Since the two boards behave quite the opposite in terms of SOF synchronization (the former is faster and the latter is slower than the USB host controller’s clock), it seems that I may have covered all major cases. Yet another mystery solved!
It goes without saying that, while attempting to relocate the FSYNC pulse, I forgot several times to raise or lower FSYNC here and there in the code, and this resulted in some very unpleasant surprises. One such surprise was that, on occasion, the 3210’s DR11, which is expected to always contain 51 decimal, was found to contain nonsense. By far the worst surprise, however, came when the board suddenly refused to power up the DC-DC converter. Finally, I tracked this down to a forgotten FSYNC pulse in some code branch: the 3210 seems to update its registers that contain sampled values (but also its output PWM signal that drives the converter) at a rate clocked by FSYNC! So, because FSYNC was erratic, convergence of the DC-DC converter to the desired 65 V was very slow. Incredible!
Concluding, what is it that I have (hopefully) accomplished after all this debugging? Quite a few things: one, I have added the SOF profiling code, which, as I discuss right below, is very useful; two, I have managed to synchronize the board’s clock to the USB clock without resorting to the complication of an audio-class USB stack (there is still one caveat: the skew between the two clocks must be within 0.05%, or else the board’s crystal must be replaced, especially if the board’s clock is slower; but the skew can now be measured, thanks to the SOF profiling code); three, I have attributed missed packets to VMware and tested my driver in a native Linux environment, where it works perfectly; finally, I have traced the source of distortion (audio clipping) in the IN direction, and hopefully have fixed it, both for slower- and faster-than-USB board clocks.
Unless some other annoying bug shows up, this time I have to promise myself that the next thing I’ll do is the Asterisk code. Let’s see whether I can keep my promise (OK, I may deviate a bit from that schedule in order to upload all recent changes in the driver and the firmware to the project’s Google Code page). Meanwhile, if you think that you can help with development, please reply to the poll below with your preferred flavor of board. I am repeating the poll right here, for those who missed it in my previous post (and I am also repeating that the purpose behind the poll is to help form a small developer community — not to make a profit by selling boards and kits):