[Note: I have not yet finished rewriting the firmware. Along the way, however, I have run into a great deal of interesting information that I am sure I would otherwise forget, so I decided to document the rewriting process in this post as I go. When I finish, I’ll probably edit out this first paragraph.]
I decided to rewrite the firmware because of the following test: one second’s worth of 64 kbps PCM data, sampled at 8 kHz using 8-bit samples (be it linear, μ- or A-law), consists of 8000 bytes. Using 8-byte chunks, this corresponds to 1000 chunks of data being exchanged between a USB device and the PC within a single second. So, I decided to measure how long my current implementation takes to transfer this amount of data from the PC to the Open USB FXS board. And the result was: 32 whole seconds.
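As a quick sanity check of the arithmetic above, here is a minimal Python sketch. The constant names are mine, chosen only for illustration; the numbers are the ones quoted in the text.

```python
# Figures taken from the text; names are illustrative only.
SAMPLE_RATE_HZ = 8000      # PCM sampling rate
BITS_PER_SAMPLE = 8        # linear, mu-law or A-law
CHUNK_BYTES = 8            # payload carried by one USB exchange
MEASURED_SECONDS = 32.0    # measured time to push one second of audio

bytes_per_second = SAMPLE_RATE_HZ * BITS_PER_SAMPLE // 8
bitrate_kbps = SAMPLE_RATE_HZ * BITS_PER_SAMPLE / 1000
chunks_per_second = bytes_per_second // CHUNK_BYTES

# Time available per chunk for real-time audio vs. time actually spent.
budget_ms_per_chunk = 1000 / chunks_per_second
measured_ms_per_chunk = MEASURED_SECONDS * 1000 / chunks_per_second
slowdown = measured_ms_per_chunk / budget_ms_per_chunk

print(f"{bitrate_kbps} kbps, {bytes_per_second} bytes/s, "
      f"{chunks_per_second} chunks/s, {slowdown}x too slow")
```

In other words, the budget is one millisecond per 8-byte chunk, while the old firmware was spending about 32 milliseconds on each.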
In other words, I would need to make my board perform 32 times faster. The current firmware used USB polling and request servicing in a serial manner, so I judged that this was never going to work. I had to make drastic changes.
One thing that I could not help noticing right away was that I had been using the old version of Microchip’s firmware (1.x); however, the company has moved to version 2.x, which is very different in its philosophy. Version 2.x is much cleaner and more abstract than 1.x; the framework files are used from the compiler’s tree and not copied over to the user code (actually, this is not a framework change, but rather an IDE enhancement); the cumbersome multi-level directory structure of 1.x is now gone. Porting my code to the new version proved easy enough, although it took me quite some time.
Porting to 2.x came with a pleasant surprise: the firmware was much, much faster now. My previous test (sending over 1 second’s worth of PCM data) now took only 2 seconds instead of ~32. One good explanation is that the new version uses USB-specific RAM as buffer space, thus eliminating data copy operations between USB-specific and “user” RAM. Later, I tried to both send and receive the same amount of data; this took slightly more than 4 seconds. Which, as I was soon to find out, was the best one could hope for.
Soon after, I started scratching my head about isochronous USB transfer. I read and re-read the standard and looked through the firmware and the C++ examples from Microchip for something that could serve as a model. In the course of studying isochronous transfer, I found out why my four seconds was the lower bound for bulk transfer: the USB bulk transfer mode requires an ACK for each data packet, and a stream of packets is only initiated after a SOF (start-of-frame) packet. The USB host sends one SOF per millisecond. So, sending over a data chunk of 8 bytes and receiving back the upstream equivalent takes four milliseconds (at best): one for the OUT data, one for the ACK back to the host, one for the IN data and one for the ACK back to the device. It was clear that I would never be able to go faster than that unless I increased my chunk size. With four milliseconds per round trip and 8 bytes of audio due every millisecond, a bare minimum of 32 bytes per chunk was needed. Not what I had in mind.
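The lower bound described above can be put into a small model. This is only a sketch of my own reasoning, assuming (as I observed) that each of the four steps of a round trip costs one 1 ms frame; the function and constant names are made up for illustration.

```python
import math

FRAME_MS = 1                 # the host sends one SOF per millisecond
FRAMES_PER_ROUND_TRIP = 4    # OUT data, ACK, IN data, ACK (assumed one frame each)
AUDIO_BYTES_PER_MS = 8       # 8000 bytes/s of PCM in each direction

def transfer_seconds(total_bytes, chunk_bytes):
    """Best-case time to exchange total_bytes in both directions."""
    chunks = math.ceil(total_bytes / chunk_bytes)
    return chunks * FRAMES_PER_ROUND_TRIP * FRAME_MS / 1000

def min_chunk_bytes_for_real_time():
    """Smallest chunk that carries one round trip's worth of audio."""
    return AUDIO_BYTES_PER_MS * FRAMES_PER_ROUND_TRIP * FRAME_MS

print(transfer_seconds(8000, 8))        # one second of audio, 8-byte chunks
print(min_chunk_bytes_for_real_time())  # chunk size needed to keep up
```

Under these assumptions, one second of audio in 8-byte chunks cannot take less than four seconds, which matches what I measured, and only chunks of 32 bytes or more would keep up with real time.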
This made me search further for isochronous USB transfer. This mode does not require a handshake (ACK packets). Again, the upper bound is one packet per frame, but this was exactly as much as I needed. Maybe I could provide something like a self-clocked synchronous transfer method as follows: the firmware provides a SOF callback function, which is invoked on every SOF. Then, inside that function, I would send out a chunk using an isochronous IN packet (note that “IN” and “OUT” are host-side terms, so a USB device sends IN packets and receives OUT ones), and I would receive an isochronous OUT packet. Lost packets would never incur any timeouts etc. On the host side, I would wait for the IN packet (clocked at the rate of one packet per ms by the pace of the SOF) and reply with an OUT packet. Later on, once this tested OK, I would get rid of the ring data structures in my ISR and arrange the ISR code to work directly on the USB packet buffers without copying (and also to synchronize with the SOF 1-kHz “heartbeat”).
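To make the idea concrete, here is a toy Python model of the device side of this scheme. It is not firmware code: the class and method names are hypothetical, and the only point is to show that each SOF triggers exactly one IN packet and accepts at most one OUT packet, with a lost packet simply skipped rather than retried.

```python
from collections import deque

CHUNK_BYTES = 8  # one millisecond of 8 kHz, 8-bit PCM

class IsoDeviceModel:
    """Toy model of the SOF-clocked exchange; all names are hypothetical."""

    def __init__(self, upstream_samples):
        self.upstream = deque(upstream_samples)  # bytes waiting to go to the host
        self.downstream = bytearray()            # bytes received from the host
        self.frames = 0

    def on_sof(self, out_packet):
        """Called once per 1 ms frame; out_packet is None if it was lost."""
        self.frames += 1
        count = min(CHUNK_BYTES, len(self.upstream))
        in_packet = bytes(self.upstream.popleft() for _ in range(count))
        if out_packet is not None:
            self.downstream += out_packet
        # No ACK and no retry: the next frame proceeds regardless.
        return in_packet

device = IsoDeviceModel(range(16))
first = device.on_sof(b"\x01" * CHUNK_BYTES)  # normal frame
second = device.on_sof(None)                  # OUT packet lost, IN still sent
print(first.hex(), second.hex(), len(device.downstream), device.frames)
```

The frame counter advances by one per SOF whether or not a packet was lost, which is exactly the “heartbeat” property I was after.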
My tests using the Microchip Generic USB host-side device driver all failed. As I found out soon thereafter, this device driver does not provide isochronous transfer primitives. Using bulk data primitives instead caused the primitives to halt waiting for an ACK which never came; then, of course, the primitives were timing out. No, definitely not good. I had to look further.
Googling the subject led me to some interesting discussions. Most notably, this thread on the Microchip users forum shows the efforts of some good men to make the PIC speak isochronous. In a nutshell, they all seem to agree that (1) both the PIC and the firmware can do fine with isochronous transfers, and (2) Microchip’s Generic Host-side device driver is inadequate because it does not support isochronous transfer primitives. One example also mentioned libusb-win32 (available here) as an isoc-capable driver. So, I tried changing the device driver in my host-side controller.
[BTW: the Linux kernel supports isochronous transfers, so I am on the safe side with that; maybe it’s time to move to Linux? We’ll see…]
Apart from some minor issues (that I overlooked for the time being), porting my host-side Windows source to libusb-win32 was not that hard. However, the tests I tried produced mixed results. Many times the IN isochronous primitives were failing, and there was no good explanation for that. If I only did IN without OUT transfers, then one in every two primitives worked OK, while the other one failed. When I intermixed OUT and IN transfers, the IN success rate was much, much lower, whereas the OUT success rate stayed at 1:2. As of now, I have not yet resolved the issue.
In the meantime, I re-read the thread I mentioned above and noticed that some people report success using the Cypress CyUSB host-side device driver. This is another direction for testing: I would need to port my code once more, this time to Cypress. Having done this once, I thought I might be able to do it again.
Update, April 29: partial success using libusb-win32! One thing I had gotten wrong was that libusb expects isochronous transfer requests to be submitted using large buffers that the library fills in (or drains, for the OUT direction) at its own pace [BTW, checking out CyUSB, it seems to work in a similar manner]. An interesting question then is, how can the user-level program synchronize with these asynchronous primitives? Obviously, one cannot wait until a large buffer is filled in and then proceed, since this would defeat the purpose of using small-size isochronous transfers. Libusb offers a nice (though totally undocumented) way of doing it: one can ask the driver about the amount of data transferred so far (with the usb_reap_async_nocancel() primitive); if the number returned is larger than last time, the user program can safely assume that at least one more chunk has been transferred.
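The polling idea can be sketched in a few lines of Python. This is not the libusb-win32 API: read_progress is a hypothetical stand-in for the driver’s progress query, and I am assuming it returns the running byte count of the pending transfer, as described above.

```python
CHUNK_BYTES = 8

def new_chunks(read_progress, last_seen, chunk_bytes=CHUNK_BYTES):
    """Return (number of whole chunks completed since last poll, new last_seen).

    read_progress is a hypothetical callable standing in for the driver's
    progress query; it is assumed to return the bytes transferred so far.
    """
    done = read_progress()
    fresh = (done - last_seen) // chunk_bytes
    return fresh, last_seen + fresh * chunk_bytes

# Fake driver: the byte counter as seen on four successive polls.
fake_counter = iter([0, 8, 8, 24])
seen = 0
history = []
for _ in range(4):
    fresh, seen = new_chunks(lambda: next(fake_counter), seen)
    history.append(fresh)

print(history, seen)
```

Each poll reports how many whole chunks arrived since the previous one, so the user-level code can process audio chunk by chunk without waiting for the large buffer to complete.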
A delicate point with this method (and one of the library’s inadequacies) is that, if the IN-pipe misses a packet for some reason, then the returned value is not updated (although the pointer in the provided buffer is incremented, so a missed packet creates an unidentifiable “hole” in the buffer). Even after receiving the whole buffer, other than by using packet integrity checksums etc., one cannot really tell where the “hole” is, because there is no signaling mechanism from the driver to user space to report that a packet was missed.
So, I devised a method whereby I query the OUT-pipe for its progress, and assume that the IN-pipe will be ready at the same pace. This seems to work, in that the OUT-pipe is synchronized with the SOF frames and thus the transmitted byte count is incremented at a steady pace. This is nice, in that every millisecond I get the chance to run the relevant user-level code that checks for IN-packet data in a near-synchronous manner.
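A minimal sketch of this trick follows, in Python for brevity. The function name and the buffer layout are assumptions of mine: I assume that slot k of the large IN buffer corresponds to the frame in which OUT chunk k was sent, so the OUT byte count can serve as a clock for reading IN slots.

```python
CHUNK_BYTES = 8

def drain_in_slots(out_bytes_done, in_buffer, slots_read, chunk_bytes=CHUNK_BYTES):
    """Use OUT-pipe progress as a clock for reading the IN buffer.

    For every OUT chunk completed since the last call, read the IN slot of
    the same index (hypothetical layout: one slot per 1 ms frame).
    """
    ticks = out_bytes_done // chunk_bytes
    chunks = []
    while slots_read < ticks:
        start = slots_read * chunk_bytes
        chunks.append(bytes(in_buffer[start:start + chunk_bytes]))
        slots_read += 1
    return chunks, slots_read

# Three frames' worth of IN buffer; the middle packet was missed,
# so its slot still holds the initial zeros (the "hole").
in_buffer = bytearray(3 * CHUNK_BYTES)
in_buffer[0:8] = b"\x11" * 8
in_buffer[16:24] = b"\x33" * 8

chunks, slots_read = drain_in_slots(24, in_buffer, 0)
print([c.hex() for c in chunks], slots_read)
```

Note that the missed packet shows up only as a slot of stale data; nothing in this scheme flags it, which is the limitation discussed above.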
However, my success with this method was only partial: using a USB sniffer program I observed many missed IN-packets. It is really hard to tell what causes this. There are two suspects: the firmware and the driver. The firmware might be late in transmitting packets, or the driver may miss them altogether even though they are transmitted OK. A reason for the first scenario is the large portion of its time that the PIC spends in the TMR1 ISR, so it might respond late to the SOF. Clearly, I need more trial-and-error work here; I’ll report again later, by updating this post. [Nevertheless, I am not too worried about all these little problems; a driver in kernel space should be able to handle isochronous transfer in much more efficient ways; for the time being, I only need to make sure that the culprit is not my firmware.]
Quick update, later on the same day: bypassing my ISR (by adding an immediate return instruction) makes the above test work fine with libusb! Now I can even rely on usb_reap_async_nocancel() on the IN pipe catching all packets, so the isochronous IN-pipe can be used reliably for timing purposes in the user code! So I guess the explanation is that, because the USB is serviced in poll mode and the ISR in its current form takes much time, the firmware sometimes gets to transmit the isochronous IN packets late. But, as I have already said somewhere earlier, I need to rewrite the ISR anyway and get rid of much of the code in there (ring buffer management and so on), so I’ll probably come up with an acceptable tradeoff between ISR time and USB polling. If not, I can still devote some ISR cycles to firing up the isochronous USB transfer from within the ISR in a synchronized manner (see next paragraph).
Moreover (and this is something I have been keeping in mind while carefully sweeping it under the rug so far), I need to synchronize the ISR to the SOF frames anyway. This is because I will need to swap USB/PCM buffers around (preferably using the PING-PONG buffering method of the PIC, which I still need to try out) after the first 8 PCLK-cycles of my 32-cycle ISR (remember that PCM audio I/O between the PIC and the 3210 occurs during these first 8 cycles, so the buffers had better stay untouched at that stage). All this sounds like quite a lot of work; only this time it seems doable, whereas up to now, things felt more like a survival-in-the-jungle bootcamp. The “Hello, World” milestone now seems closer than ever before!
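The buffer swap I have in mind can be illustrated with a small software model. This is only a sketch of the general ping-pong (double-buffering) idea, not of the PIC’s hardware mechanism, which I have not tried yet; the class and attribute names are hypothetical.

```python
CHUNK_BYTES = 8

class PingPong:
    """Software model of double buffering; names are hypothetical."""

    def __init__(self, size=CHUNK_BYTES):
        self.buffers = [bytearray(size), bytearray(size)]
        self.usb_side = 0  # index of the buffer currently owned by USB

    @property
    def pcm_side(self):
        """Index of the buffer the PCM code may touch right now."""
        return self.usb_side ^ 1

    def swap(self):
        """Exchange ownership; meant to run only after the PCM I/O cycles."""
        self.usb_side ^= 1

pp = PingPong()
# During the PCM cycles, audio is written into the PCM-side buffer only.
pp.buffers[pp.pcm_side][:] = b"ABCDEFGH"
before = bytes(pp.buffers[pp.usb_side])  # USB side is still untouched
pp.swap()                                # done after the PCM cycles
after = bytes(pp.buffers[pp.usb_side])   # fresh audio is now USB-owned
print(before, after)
```

The point of the design is that the two sides never touch the same buffer at the same time, so the swap is a single index flip instead of a copy.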
Update, May 7: lots and lots of re-planning, strategy changes, rewrites from scratch and finally, isochronous INs (from the board to the PC) work fine! To cut a long story short, I decided to drop my early ideas of mixing and matching interrupts and USB polling, and to rewrite my TMR1 ISR from scratch in order to make it fully capable of handling isochronous PCM I/O. Early trials were disappointing, in that I saw no packets at all coming from the board. However, when I added code to synchronize once between the SOF frame and the IN isochronous packets, I started seeing some packets on the PC. With the aid of a USB sniffer, I noted that, sooner or later, my ISR was missing the right time frame to send a packet. This led me to a very tiresome and difficult debugging of my ISR, until finally I trimmed it so that it does not miss a single clock cycle. It now works fine! So I plan to write a post dedicated to the ISR code, while in the meantime I will be moving forward with my PCM audio trials.