In which I fix a really odd problem
29 04 2008
There’s this old story about a guy who comes to fix a problem. Spends a day looking at the widget, does something to it and sends a huge bill over to the widget’s owner. Said owner protests, saying “the fix cost $1 – please justify the rest of it”. The guy replies, “$1: fix. $ObsceneAmount – $1: knowing what to fix”.
That anecdote has real implications for Apple and the free software movement and Linux in particular.
For example, today I took a look at a Vista laptop that was bluescreening intermittently. Every few sleeps (not every sleep, nor every fixed x number of sleeps), it would start and immediately crash. And the question is, why?
Here are the steps I followed:
- Look at the log. Find out what the exact crash number was: 0x0000009F.
- Look at the Internet. Determine that 0x0000009F is DRIVER_POWER_STATE_FAILURE, a driver crash tied to power state changes. So, it looks like a driver that’s not suspending properly.
- Look at the reliability monitor. What was installed about the time the first crash started happening? Interesting – no driver installs.
- Some more digging around logs shows the Realtek ethernet driver is crashing just as the machine is going into sleep.
- What? The laptop is connected via the Broadcom WiFi driver.
- Go back to the log to see when the driver was last updated. Nothing – still the original Vista driver for it.
- But what is this? “ZoneAlarm Free”?
- AHA!
- Fix it by installing a patch.
Total time needed to fix: 10 minutes. Maybe less.
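For the curious, the first two steps (read the stop code from the log, then translate it into something meaningful) can be sketched in a few lines of Python. This is only an illustration: the table is a small hand-picked subset of well-known Windows stop codes, and `describe_stop_code` is a name I made up for the example, not part of any Windows tool.

```python
# Small, hand-picked table of Windows stop (bug check) codes.
# It is nowhere near complete; it only illustrates the lookup step.
STOP_CODES = {
    0x0000000A: "IRQL_NOT_LESS_OR_EQUAL",
    0x00000050: "PAGE_FAULT_IN_NONPAGED_AREA",
    0x0000007E: "SYSTEM_THREAD_EXCEPTION_NOT_HANDLED",
    0x0000009F: "DRIVER_POWER_STATE_FAILURE",
    0x000000D1: "DRIVER_IRQL_NOT_LESS_OR_EQUAL",
}


def describe_stop_code(text):
    """Turn a hex string from a crash log into 'code: symbolic name'."""
    code = int(text, 16)  # accepts "0x0000009F" as well as "9f"
    name = STOP_CODES.get(code, "UNKNOWN (look it up)")
    return f"0x{code:08X}: {name}"


print(describe_stop_code("0x0000009F"))
# prints "0x0000009F: DRIVER_POWER_STATE_FAILURE"
```

Of course, the name only tells you that some driver mishandled a power transition. Working out which one, and why, is the part that takes the arcane knowledge.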
Right now you’re thinking: ZoneAlarm? That firewall program that runs just fine on the XP machine sitting next door?
Of course, this requires some explanation … and some arcane knowledge of Windows. It’s probably pretty clear to you that ZoneAlarm works by getting in between the user side and hardware side of transmitting and receiving stuff from networks. Exactly where doesn’t really matter in this case. However, this was updated for Vista, wasn’t it? This is the Vista version, so the changes in the networking design were accounted for!
However (and this is the bit of arcane knowledge you need to know to make the connection between ZoneAlarm and a blue screen), one major, non-networking change was made to Vista. You may recall setting an XP machine to go to sleep, or telling it to restart, or ordering it to shut down, and coming back x units of time later and finding that Word or Notepad or Paint was waiting for a response, so the XP machine hadn’t done what it was told. What Microsoft did was build in a maximum time that Vista would wait for an application to respond to such a request to sleep, restart or shut down. Then it would simply terminate the program or freeze the contents of the memory.
And now you see what is happening: ZoneAlarm, like every well-designed program, unhooks itself from the networking stack when it receives such a request. It’s wise and I wish more programs did so: get out of the way when the OS is doing something ridiculously complicated like suspending dozens of devices and hundreds of programs and preparing the computer to start right back up again. The problem is, when such a thing is happening, there’s a lot to do – and a lot depends on random factors like when was the last message from the network connection, where the computer is writing stuff to the disk and so on. So if ZoneAlarm was waiting for the final “goodbye” message from the router and it arrived after Vista had decided time was up, the computer suspended with ZoneAlarm still hooked into the network stack. Not good.
That is why it was so random: sometimes the wireless router responded in time, sometimes not. And depending on exactly when a late reply arrived, the network driver would crash and messily take the rest of the system with it.
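If you want to see why a race like this looks random from the outside, here is a toy model in Python. Everything in it is an assumption made for illustration: the 2 second timeout, the 0 to 3 second range of reply delays, and the function names are all invented, and none of it reflects how Vista or ZoneAlarm is actually implemented. It only shows the shape of the problem: a fixed deadline racing against a reply that arrives at an unpredictable time.

```python
import random

# Hypothetical value: how long the OS waits before suspending anyway.
SUSPEND_TIMEOUT_S = 2.0


def suspend_outcome(reply_delay_s, timeout_s=SUSPEND_TIMEOUT_S):
    """Model one sleep: did the firewall unhook before the deadline?"""
    if reply_delay_s <= timeout_s:
        return "unhooked"      # reply arrived in time, clean suspend
    return "still hooked"      # deadline passed first, trouble on resume


def simulate_sleeps(n, seed=0, max_delay_s=3.0):
    """Model n sleeps with reply delays drawn uniformly at random."""
    rng = random.Random(seed)  # seeded so a run can be repeated
    return [suspend_outcome(rng.uniform(0.0, max_delay_s)) for _ in range(n)]


for number, outcome in enumerate(simulate_sleeps(10), start=1):
    print(f"sleep {number}: {outcome}")
```

With these made-up numbers, roughly one sleep in three ends up "still hooked", with no pattern to which ones. That matches the symptom: not every sleep, nor every fixed x number of sleeps.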
Now there are good arguments that the network driver should be protected from doing something like corrupting the core of the OS – and there is a great deal of isolation already – but sometimes, crap happens. The question is how do you learn to fix it?
And so now you see the problem with Mac OS or Linux: if something similar happened in either of those operating systems, besides searching the Internet in the hopes of an answer, I would have NEVER figured out a firewall was causing the computer to crash when it was resuming. To me, it would have smelled of a bad driver or bad hardware – exactly what happened here.
Thus even though power users are by nature the ones most fascinated by other operating systems (I have a Wubi installation of Ubuntu on this very Windows machine, and an Apple Mac Mini), they are also the ones with the most to lose: all those years of knowledge gathering about the internals of Windows, obscure settings and arcane know-how all go to waste. There’s no way to get around it.
So if Mac OS or Linux want to be successful, then they have to get people early, before they go down the Windows path. Even if Windows users are willing to unlearn the Windows way of doing things, as I am, it takes a long time to rebuild a knowledge base to fix things that go wrong. Since things will always go wrong, and people will always build up such a base, consciously or unconsciously, the only alternative is an early catch so people don’t get frustrated that their knowledge is going to waste.
In the meantime, I look forward to another crash-free Windows laptop.
Categories : bug, computers, linux, mac os, platform independence, windows





