How I found a bug in a USB Ethernet driver
I have an old USB ethernet device. It’s an ADMtek ADM8511 “Pegasus II”. Apparently I am the only person left on the planet with this device because it has been causing kernel panics since Linux 3.10 (released around 3 years ago).
I had been using this device at work (to connect things to my computer without letting them access the office network) but when we finally upgraded from Ubuntu 10.04 (to Ubuntu 14.04), it started locking up the machine. Since it was work and time is money, I just bought a USB ethernet adapter that used a different chipset and put this one in a drawer.
At home, I have a machine that I’d like to dual home for… reasons. It’s also running Ubuntu 14.04. I got out the adapter and tried it out and sure enough, it caused lockups. But this time, I was willing to put a bit of time into investigating it since having it work was not critical.
I raised a bug and included as much information I could get from the system. One of the more interesting things I found out was netconsole, where a Linux system sends its logs out via UDP (and another machine captures this to a file). This is handy when the machine is kernel panicking because you don’t need to take a photo of the message and things that don’t fit on one screen can also be reported.
One annoying thing was that the bug triage person insisted I update my BIOS and re-test. Which meant I had to get Windows on the machine. I ended up making a “portable Windows” USB stick on an old (slow) SD card so that I could run the BIOS update program.
Next up was running a current kernel, v4.5 to verify that the issue had not been fixed yet. Then I had to establish roughly when the bug was introduced. I did a manual bisect of the releases (downloading 33 kernels in the process) to establish that it was a Saucy kernel that introduced the bug. Specifically, kernel 3.9.0-7.15 was good while the next kernel 3.10.0-0.6 was bad.
Then, I had to download a git repo and bisect the code. I was a bit shocked to find there was around 14,000 commits between those two tags! Thankfully, I only had to do 11 builds to identify the problem (though my poor machine could only manage one build every 2 hours and I had to be physically at the machine for the test so I could only get in 2 or 3 tests per day).
Interestingly, there were 3 commits in a row specifically for the pegasus driver. But it was the first one that caused the bug, or was it? Upon inspection, the commit looked ok (it was replacing an apparently unnecessary pool of buffers with demand-allocated buffers). It did look like the new code would consistently allocate 2 bytes less than the old code but I think that was more or less a padding/alignment thing that should have ended up equivalent. The real problem was elsewhere in the driver where the code was reading PEGASUS_MTU + 8 bytes into the buffer, which had been allocated to PEGASUS_MTU bytes. WTF?! I can only guess the pool allocation pattern meant that the overflow did not clobber important memory but once it switched to demand allocation, it started doing so. I did try git blame to see how this code got this way but apparently, it happened before Linux was in git.
I wonder if my proposed fix (changing the read to use PEGASUS_MTU bytes) can make it into the kernel with my name still on it? That would be cool. I have never pushed code to the upstream Linux kernel before.
It got in. And my name is there 🙂