Two good old friends

Sit back, relax… 

While visitors from Elektor were starting to visit my blog in waves, I was supposed to sit back, relax and enjoy the popularity of the site. Halas, seldom do things go exactly as they are supposed to, do they? By the time I was to relax, two user reports came that got me into a mad race to fix issues before they affected any other users too. But after all, that was the deal, right? Bugs were discovered. I had to fix them. A user community working in practice — exactly what I was hoping that my project would enjoy. Now that the race has ended and both reported issues are fixed, it’s time that readers heard the story.

DTMF intermittency

Back from the early days of the first Open USB FXS dongle boards, some of them were exhibiting an annoying bug. On occasion, and without any sign as to what exactly this “on occasion” coincided with, Asterisk refused to recognize DTMF tones from keys pressed on the phone. This was a very, very annoying bug, in that it made it impossible to dial out. Unplugging and replugging the board fixed the issue sometimes. Once the issue was fixed, it kept working OK for a lifetime — meaning, until the board was next unplugged or the computer was next rebooted (a good proof that a lifetime is a very short time). While I had not worried much about this issue, a user report made me think otherwise. Yes, to keep users happy, I had to fix the damn thing.

Sometimes I was able to cure the problem by adding or removing an echo canceller and/or tweaking the gain on the board and/or changing the “relaxdtmf” setting in /etc/asterisk/chan_dahdi.conf and/or changing the dahdi tonezone. But I could not get a consistent result. Googling about the issue did not help either. I came upon the same and the same advices: tweak relaxdtmf, check the gain, etc. There seemed to be people around hitting the same problem with different hardware, but there seemed to be no definite answers as to what caused the problem.

Even now that I have fixed this issue, I do not know for sure why it occurs. My understanding is that this issue is caused by some combination of echo, gain and timing settings. A possible explanation is that the output audio signal (dial tone) gets somehow echoed back together with the input signal (DTMF tone) and garbles it. This assumption looks more plausible than others, because the issue occurs less often with some single-frequency tonezones (gr, de, and others) than with dual-frequency ones (us). But I am sure that I am still missing something here, because this assumption alone cannot explain why, once DTMF works right, it works for a “lifetime”. I decided anyway to postpone this discussion for some other day, whenever I find time to really look into the root cause of the issue and fix that. For now, a reliable workaround would suffice.

Eventually, what helped me resolve the issue was the code for some other FXS board drivers in dahdi. There, I found a very interesting call to a function dahdi_qevent_lock(), which I ‘ll explain in a second. The board already implemented DTMF recognition in hardware; the recognized DTMF status had always been reported to the host piggybacked into the headers of isochronous USB packets, along with audio samples. BTW, on-board DTMF recognition has always worked reliably, but somehow I did not know what to do with it so far.

After discovering about dahdi_qevent_lock(), I found out that one can send DTMF events to the “dahdi layer”. These events are either interpreted locally by Asterisk, or they are translated into control events (in SIP and other VoIP protocols) to be sent to the remote end. One must also provide for an ioctl to “mute” the qevent functionality when “bridging” locally in dahdi (e.g., when one connects from a local FXS port to another local FXO port — the latter could be the PTT line, and muting the qevent functionality serves to pass the DTMF codes as plain audio to e.g. this PTT line once a call gets connected through it).

To cut a long story short, I added calls to dahdi_qevent_lock() and – wonder! The DTMF issue vanished. Now my board dials reliably each and every time. The downside of this is that the 3210 reportedly does not understand the DTMF systems of some Asian countries. Well, no problem, the on-board DTMF recognition functionality is optional: I have added a new module parameter “hwdtmf” (default value is 1), which can be turned off if you happen to live in China and the 3210 does not recognize correctly your DTMF tones.

The sound of silence

The second very, very annoying issue that I became aware of was this: when the board was left plugged into some systems and the system rebooted, one out of two reboots or so a terrible noise was coming out of the phone instead of a dial tone (hmm… why did this feel like a déjà vu?).

I first tried to reproduce the issue on my laptop. Well, I could not. No matter how many reboots I tried, there was no noise. But then, I tried on different hardware and, surprise: yes, I could hear the noise, too. It sounded like a jigsaw working in the next room — a jigsaw cutting into small pieces all my so far boasting about crystal-clear audio and the like.

An interesting detail was that, when Asterisk was stopped so that there was no dial tone, there was no noise. Silence transmitted fine. But what use could my board be if it could only transmit silence? I agree, it could make it fine into one of these April’s fool day circuits that Elektor magazine used to publish once, like a “fuse blower” (blew reliably every fuse you put into it) or a “solar torch” (did not have any batteries, and thus worked only under direct sunlight). Yeah, the “FXS circuit that transmits silence correctly” made a very good candidate new member of that family.

Coming back to debugging, what was it that caused the noise? It could be that the Linux driver was misbehaving, that dahdi was not working, or what? I quickly crossed out all these assumptions. A good deal of combinations of different kernel and dahdi versions all exhibited the issue. Moreover, a plain copy of an asterisk u-law-encoded audio file to the device also came out with noise. In that path, there was no Asterisk, no dahdi, nothing else than my driver and my board. And a good deal of despair, too — hadn’t I crossed out all these annoying bugs? When would they finally go away for good?

I first tried in vain to capture differences in the device status between the two states, when noise was present and absent. Nope. Si3210 register dumps seemed identical, other than a few insignificant differences in some analog values and calibration results; these differences could not be the root cause of the problem. The dahdi_diag utility also proved to output identical output, modulo the gain (I think that this is just the address of a gain buffer, so it is normal to see differences there, but I do not know for sure).

The next thing I tried was usbmon. This is a nice kernel debug uitility that sniffs USB packets as they go in and out the USB layer. I stopped Asterisk in order to let the driver send “silence” (0x7f’s) to the board, and then I cat’ed /sys/kernel/debug/usb/usbmon/2t (2 is a number that depends on the specific USB port in use), and I got this:

f6dad180 1406459225 S Zo:003:02 -115 16 = 00000001 1a400100 7f7f7f7f 7f7f7f7f
f6d183c0 1406459230 C Zi:003:02 0 16 = badd80ef 00001583 7e7efffd ff7d7efe
f6d183c0 1406459233 S Zi:003:02 -115 16 <
f6dad240 1406460220 C Zo:003:02 0 16 >
f6dad240 1406460226 S Zo:003:02 -115 16 = 00000002 1a400100 7f7f7f7f 7f7f7f7f
f6f8d000 1406460231 C Zi:003:02 0 16 = baee00f0 00000084 7d7c7eff 7e7e7eff
f6f8d000 1406460234 S Zi:003:02 -115 16 <
f6dadf00 1406461218 C Zo:003:02 0 16 >
f6dadf00 1406461224 S Zo:003:02 -115 16 = 00000003 1a400100 7f7f7f7f 7f7f7f7f
f6f8d3c0 1406461229 C Zi:003:02 0 16 = badd80f1 00001584 fe7e7e7d 7d7d7e7d
f6f8d3c0 1406461232 S Zi:003:02 -115 16 <
f6860180 1406462219 C Zo:003:02 0 16 >
f6860180 1406462225 S Zo:003:02 -115 16 = 00000004 1a400100 7f7f7f7f 7f7f7f7f
f6f8d480 1406462230 C Zi:003:02 0 16 = baee00f2 00000085 fffcfcff fefe7d7e
f6f8d480 1406462234 S Zi:003:02 -115 16 <
f6860cc0 1406463221 C Zo:003:02 0 16 >
f6860cc0 1406463230 S Zo:003:02 -115 16 = 00000005 1a400100 7f7f7f7f 7f7f7f7f
f6f8da80 1406463235 C Zi:003:02 0 16 = badd80f3 00001585 fefefffe ff7e7e7d
f6f8da80 1406463241 S Zi:003:02 -115 16 <
f6860000 1406464222 C Zo:003:02 0 16 >
f6860000 1406464229 S Zo:003:02 -115 16 = 00000006 1a400100 7f7f7f7f 7f7f7f7f
f6f8d900 1406464234 C Zi:003:02 0 16 = baee00f4 00000086 7d7dfffd fdfdff7d
f6f8d900 1406464238 S Zi:003:02 -115 16 <
f6860900 1406465219 C Zo:003:02 0 16 >
f6860900 1406465225 S Zo:003:02 -115 16 = 00000007 1a400100 7f7f7f7f 7f7f7f7f
f6f8d240 1406465230 C Zi:003:02 0 16 = badd80f5 00001586 fe7e7e7d 7c7e7d7e
f6f8d240 1406465233 S Zi:003:02 -115 16 <
f68603c0 1406466218 C Zo:003:02 0 16 >
f68603c0 1406466224 S Zo:003:02 -115 16 = 00000008 1a400100 7f7f7f7f 7f7f7f7f
f6f8d9c0 1406466229 C Zi:003:02 0 16 = baee00f6 00000087 fdfffffe 7dff7e7e

You can see output (Zo) and input (Zi) USB “packets”, consisting of eight bytes of header data and another eight bytes of audio data. This dump, produced with the noise issue absent, looked fine. Then I tried to capture a dump with real audio (some Asterisk tone output), when noise was heard. I got this:

f6900e40 3916897398 S Zo:003:02 -115 16 = 00000001 e3400100 15131720 3eaf9d96
f6d89000 3916897407 C Zi:003:02 0 16 = badd80ee 00003314 c6dc7e7e 72696b68
f6d89000 3916897416 S Zi:003:02 -115 16 <
f6900cc0 3916898377 C Zo:003:02 0 16 >
f6900cc0 3916898390 S Zo:003:02 -115 16 = 00000002 e3400100 94979fb5 3e241c1a
f6d899c0 3916898397 C Zi:003:02 0 16 = baee00ef 00000015 fb664c5e e9faddee
f6d899c0 3916898404 S Zi:003:02 -115 16 <
f69003c0 3916899369 C Zo:003:02 0 16 >
f69003c0 3916899376 S Zo:003:02 -115 16 = 00000003 e3400100 1c2333dc b0a8a5a8
f6d89540 3916899381 C Zi:003:02 0 16 = badd80ee 00003415 fb736362 5a63677d
f6d89540 3916899385 S Zi:003:02 -115 16 <
f69000c0 3916900370 C Zo:003:02 0 16 >
f69000c0 3916900376 S Zo:003:02 -115 16 = 00000004 e3400100 aebde94b 424653fe
f6d89780 3916900381 C Zi:003:02 0 16 = baee00f1 00000016 e1e35a64 ed68eaf8
f6d89780 3916900384 S Zi:003:02 -115 16 <
f6900540 3916901369 C Zo:003:02 0 16 >
f6900540 3916901376 S Zo:003:02 -115 16 = 00000005 e3400100 dde4593f 332d2b2f
f6d89c00 3916901381 C Zi:003:02 0 16 = badd80f2 00003516 7461edcf d3ccd5dd
f6d89c00 3916901384 S Zi:003:02 -115 16 <
f6900c00 3916902369 C Zo:003:02 0 16 >
f6900c00 3916902375 S Zo:003:02 -115 16 = 00000006 e3400100 3eceaea2 9d9ca0ad
f68906c0 3916902380 C Zi:003:02 0 16 = baee00f5 00000017 74584e49 4b4f5eee
f68906c0 3916902383 S Zi:003:02 -115 16 <
f6900300 3916903370 C Zo:003:02 0 16 >
f6900300 3916903376 S Zo:003:02 -115 16 = 00000007 e3400100 6d2b1c17 161a2447
f6890780 3916903380 C Zi:003:02 0 16 = badd80f4 00003517 cbc0d0de dc79f36e
f6890780 3916903384 S Zi:003:02 -115 16 <
f6900480 3916904369 C Zo:003:02 0 16 >
f6900480 3916904376 S Zo:003:02 -115 16 = 00000008 e3400100 ad9c9593 969eb637
f6890240 3916904380 C Zi:003:02 0 16 = baee00f5 00000018 655e5757 5873efd7

Just a pile of junk data? No, I do not agree. These two dumps contain a cornucopia of debug information and show clearly the cause of the problem (though not the root cause). At first, I was lost too in that sea of numbers. But then I looked back into my PCM packet headers, and enlightment came at once. First, let us isolate some Zo (OUT, from host to board) lines from the first dump and look at them (I removed the usbmon information and kept the data only for clarity; the excerpt is taken from the first dump, but the discussion below applies likewise to the second one as well):

00000001 1a400100 7f7f7f7f 7f7f7f7f
00000002 1a400100 7f7f7f7f 7f7f7f7f
00000003 1a400100 7f7f7f7f 7f7f7f7f
00000004 1a400100 7f7f7f7f 7f7f7f7f
00000005 1a400100 7f7f7f7f 7f7f7f7f
00000006 1a400100 7f7f7f7f 7f7f7f7f

One can’t help but noticing the increasing numbers 1, 2, 3, 4 in the fourth byte of the first word; yes, these are sequence numbers, which I built in on purpose, exactly for debugging (the other header information is irrelevant, so I won’t deal with that). Now, these get mirrored back into input packets (with an offset depending on {r,w}packsperurb and {r,w}urbsinflight. Here is a sequence of IN (board to host) data from the fist dump:

badd80ef 00001583 7e7efffd ff7d7efe
baee00f0 00000084 7d7c7eff 7e7e7eff
badd80f1 00001584 fe7e7e7d 7d7d7e7d
baee00f2 00000085 fffcfcff fefe7d7e
badd80f3 00001585 fefefffe ff7e7e7d
baee00f4 00000086 7d7dfffd fdfdff7d
badd80f5 00001586 fe7e7e7d 7c7e7d7e
baee00f6 00000087 fdfffffe 7dff7e7e

At the same position, fourth byte, you will see again an increasing sequence number. This is the mirrored OUT sequence. These headers contain other information too, some of which is relevant. If the second byte is 0xee, this packet was sent over the even USB endpoint, if it is 0xdd, over the odd one (odd and even endpoints are a PIC idiosynchracy that helps avoiding double buffering). Moreover, at the end of the second word  (8th byte), one can see some other sequence numbers increasing every two packets (0x83, 0x84, etc.) — this is normal. The first trace shows a perfect USB communication. No lost sequence numbers, no nothing. Let us now look at the IN (Zi) lines from the second trace:

badd80ee 00003314 c6dc7e7e 72696b68
baee00ef 00000015 fb664c5e e9faddee
badd80ee 00003415 fb736362 5a63677d
baee00f1 00000016 e1e35a64 ed68eaf8
badd80f2 00003516 7461edcf d3ccd5dd
baee00f5 00000017 74584e49 4b4f5eee
badd80f4 00003517 cbc0d0de dc79f36e
baee00f5 00000018 655e5757 5873efd7

Uh-oh… What do we have here instead of an increasing sequence? The weird sequence 0xee, 0xef (correct), 0xee (repeated, instead of the expected 0xf0, which is missing altogether), then 0xf1, 0xf2 (fine), 0xf5 (where did 0xf3 and 0xf4 go?), 0xf4 (so 0xf3 was lost altogether), and 0xf5 (again, so another duplicate instead of the lost 0xf3).

So this dump was showing fine what was happening down on the board: the PIC had gotten the sequences of IN packets all mixed up, and was sending audio with duplicate, lost, and re-ordered samples to the telephony chip. No wonder this was coming out as a jigsaw-like noise, then…

OK, this was a good explanation, but why was this happening in the first place? Why did this occur only on reboot? There, I made an assumption and tried to prove it correct. According to the assumption, the reason was that the PIC did not reset when the computer rebooted; it kept its previous state, and re-initialization caused this havoc. This was (relatively) easy to prove: I planted some special values to be echoed back with IN USB packet headers into several locations in the firmware. Among others, I added a counter that was incremented at each initialization step that I was testing. It took me 15 re-flashes of the firmware, but in the end I found it: the PIC was executing the USB SET_CONFIGURATION procedure twice! And this coincided exactly with noise.

OK, now I had a reasonable explanation as to what was triggering this havoc in data sequencing. But this did not really explain why the havoc was created, neither how to fix it. At that point, I made several assumptions. The assumption that made most sense was that, during SET_CONFIGURATION, the firmware “arms” an OUT (this means “receive” for the board, I know it’s confusing) operation to get the isochronous engine started. If the engine has already been started, this would confuse things. I thus tried to bypass this “arming” operation if SET_CONFIGURATION had already been executed once before. That did not work.

At that point, I decided to give a radical solution. Too many things in my firmware’s initialization sequence depend on the assumption that the PIC has just been powered on. So, what if I told it to reset if I found something I did not like? I thus changed the little test in the SET_CONFIGURATION initialization function to read: “if you have been here before, reset”. And this made the magic work. The noise that pested me for weeks went away for good. Just like that!

A question that I was left with after all this was, why did the issue only occur with some hardware (and not with my laptop). The answer I found had to do with power. My laptop seems to remove USB power when rebooting, and to take quite a few seconds to restore it. Other PCs seemingly either do not cut power completely (I think that the USB standard specifies a low-but-non-zero-voltage condition to signal a device powerdown?) or do cut power, but not for long. The latter had puzzled me because, on some rare occasions, when noise occurred, I had managed to unplug the board, re-plug it, and still listen to noise. But how could that be? The answer lies in the DC decoupling electrolytics. When the board is unplugged, the voltage drops, but not to zero, at least, not immediately. The DC-DC converter stops working, and maybe the 3210 powers off, but the PIC does not! PICs are designed to go into a power-saving mode, in which they can survive with peanut power, so a few hundreds of millivolts and a few microamps suffice for the PIC to keep its state for a few seconds, until the power stored in the electrolytics drains out and the voltage drops to a near-zero value.

So there was it, the answer to the puzzle. Some hardware does not cut off power completely, or reboots quite fast so that the PIC is not reset during reboot. Thus, before my latest fix, when the host rebooted, it started normally the USB initialization sequence on my board which however had kept its prior state. The PIC then got the OUT (receive) sequence all mixed up, and this was creating packet losses, duplication and reordering. And that was of course causing noise. Elementary, my dear Watson!

Although this fix of mine corrects the issue, I am not very happy with it, in that the firmware should somehow recover better from such a situation. I still do not know what exact part of the re-initialization causes the problem, and perhaps the problem could re-appear under different conditions. But, until then, I think I can live with this quick (sort of…) fix that I came up with.

The morale of the story

Is there a morale in this story? I don’t know. If yes, it must be that bugs are just like good old friends. Just when you thought you had forgotten all about them, and they had forgotten about you too, here they are — knocking on your door late in the night, right when you were contemplating a good sleep. And though you think you ‘d be far better off on your own, company — debugging — proves to be fun in the end. Until you finally get rid of them, just like when you ‘d show your good friends the door by 2:00 am. Only the bugs are not really your friends: two simple bugs like these — the board being unable to dial and outputing a terrible noise — could ruin this project altogether. Anyway, this project has so far survived from quite a few such visits from these “good old friends”. Let’s hope it is going to always make it!


4 Responses to “Two good old friends”

  1. Mihail Says:

    I patch the problem with terrible noise (packets reordering) and ‘metallic sound’ on some hosts and problem with set_configuration. I use microchip usb_device.c version 2.9i

  2. Angelos Varvitsiotis Says:

    Hi Mihail,

    Thank you very, very much for your contribution to the project! It fills
    me with delight to see people not only getting involved with the project,
    but also spotting bugs and providing fixes and patches!

    I would also be delighted to have your fix included (and/or merged, if I
    can do that) with the project’s source on googlecode. I would only like
    to kindly ask you to elaborate a bit on why the problem was occurring —
    in other words, what was the actual bug? what was triggering it? why did
    it occur only on some hosts? — and on how your patch fixes it. You may
    do that by replying to this comment, or via a private email to me (at
    avarvit -at- gmail /dot/ com). In the latter case, if you wish me to, I
    will write a short post introducing your fix.

    Thanks again very much, on behalf of myself and all the others who have
    gotten involved with Open USB FXS and whom your contribution I am sure
    will help very much!



  3. Mihail Says:

    The problem is related to the violation of the synchronization between the use of even and odd buffers in hardware controllers and software buffers. That is, for example buffer even poisoned paired with the previous odd. The problem arises when initialization due to the fact that different hosts at the start of transmission of isochronous data is lost or even or odd number of packets. I solved this problem by synchronizing ppindex and LastPingPong, forcibly copying LastPingPong in ppindex

  4. Angelos Varvitsiotis Says:

    Hi Mihail,

    Thanks again for your explanation. As I understand it, you are picking the value of the PPBI bit (bit 1 of the USTAT register, to which you are referring as ‘LastPingPong’) and checking this instead of (my) ppindex for OUT transactions. I think I get the idea, though I still need to study your code to make sure I understand every detail (there seem to be more fixes than just this one).

    If you do not object, I can upload your patches to the project’s google code repository for others to try.

    Thanks again for contributing this important fix!



Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: