Wednesday 31 March 2010

Hardware Fuckup

Most of the grumbling herein refers to software not acting the way it should. Of course, computer fuck ups are not solely related to software. I therefore take this opportunity to present the saga that was my recent hardware failure fuck up.

There comes a time in a PC owner's life when the game plays like a slide show, video takes an age to encode and Windows takes about three ice ages to boot. It is time for an upgrade. This year, I reached such a time. My general rule is that if it doesn't double performance, it's not worth buying. I was running a 2.2GHz Athlon X2, with 2GB of RAM and a 9600GT graphics card. I was also using a 750GB HDD.

When I purchased my last motherboard I wanted to keep the RAM I had, so I went for a socket 939 board instead of the newer AM2. This was, ultimately, a mistake, because it meant that this time around I could not just buy a brand new AM2/AM3 Athlon/Phenom. The only 939 chips still on sale are only marginally faster than the chip that I had. So, new platform then.

At the time of writing there is a clear leader in home PC enthusiast CPUs. It is the Intel i5 750. This is a quad-core processor running at 2.66GHz. It has a nifty facility whereby one can reduce the number of cores in use and boost the speed up to a max of 3.2GHz. In other words, you get the best of both the single-core and multi-core worlds.

The motherboard I went for is the Gigabyte GA-P55M-UD2. It comes highly recommended for the i5 750. To this I added 4GB of G.Skill 12800 DDR3 RAM. All that guff means that it is rated to run at 12800/8 (bytes per transfer), or 1600MHz.
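For anyone who wants to check the sum, here is a minimal sketch of that rating arithmetic. It assumes the usual convention that the 12800 figure is peak bandwidth in MB/s and that a module moves 8 bytes (64 bits) per transfer. The function name is just made up for illustration.

```python
# Assumption: a standard memory module is 64 bits wide, i.e. 8 bytes per transfer.
BYTES_PER_TRANSFER = 8


def rated_speed_from_bandwidth(bandwidth_mb_s: int) -> float:
    """Convert a module's peak bandwidth rating (MB/s) to its rated speed."""
    return bandwidth_mb_s / BYTES_PER_TRANSFER


# The 12800 rating from the post works out to the quoted 1600 figure.
print(rated_speed_from_bandwidth(12800))  # 1600.0
```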

My biggest decision was whether to get a new graphics card (specifically the ATI 5770) or a Solid State Disk. A 64GB SSD is about the same price as that graphics card. It is also the first size of SSD that actually makes sense. You can install Windows and a couple of games in 64GB. Large video files, iTunes libraries and so forth do not require a speedy SSD, so they can happily still take up room on the 750GB HDD.

In the end I went for the SSD. I picked up an OCZ Agility drive. I have to say it is not a decision I have regretted. The drive has the fancy new 'Trim' enabled firmware and has performed flawlessly. Until you have actually used an SSD based PC you do not really have an appreciation for how much of a speed increase you get.

As for the graphics, it turns out that the CPU was something of a bottleneck in my old system because the new platform gets much more out of the 9600GT. I can now run the games which really tested my previous set up (Crysis, GTA IV, World in Conflict and Assassin's Creed) on mostly full settings. Bear in mind that my monitor is a 19" 1440x900 job, so the 9600GT does not have to pump around too many pixels.

Anyway, the system ran perfectly sensibly for a month. Then, all of a sudden, it just stopped. As in power off, no start up when button pressed. Now, I admit I had applied a modest overclock to the CPU. If I recall it was running at about 3.3GHz instead of 2.66GHz. You will find plenty of reports about the chip running at 4GHz, so I thought I was a-ok.

I am no stranger to overclocking failures, so I tried the usual method of restoring the machine. Step one, wait for the system to cool down. Nothing. Step two, clear the CMOS chip. Nothing. Ah. Something bad has happened. Try another power supply. Oh - lights and the CPU fan spins! For 500 milliseconds. Bad. However the other power supply only had the 20 pin power cable, not the beefier 24 pin. Plus it only had a 4 pin 12V cable and the motherboard needs an 8 pin connection. So, test the first power supply on another machine. Nothing - A HA! Dead power supply. Odd, but not impossible. You see I did not change power supplies when I upgraded the kit. The existing one was a beefy 500W Enermax unit, with all the necessary connections. However it looked like it just wasn't up to the job of overclocking the new kit. So, I ordered a new 600W power supply. It arrived next day and I fired it up. It behaved in exactly the same way as the older PSU - lights, brief fan spin. Oh bollocks.

I now have RAM, a CPU and a motherboard. All could be faulty. I have no other machines which use this platform so I cannot test any of them. On balance, and because it is the cheapest to replace, I opt to send back the motherboard first. I bought the stuff from ebuyer.com. Their policy is not to accept any returns if you have had the kit for more than 30 days. They just direct you to the manufacturer. This is extremely frustrating. And not entirely legal. There is nothing in the Sale of Goods Act which allows them to adopt this rigid attitude. However, when you have been overclocking you cannot really complain too much. So I emailed Gigabyte. I have not used a manufacturer's RMA system before, but I have to say that Gigabyte were excellent. Highly recommended. They asked for some details, and gave me an address and reference number. I boxed up the board and sent it recorded delivery to Gigabyte. Cost was £5.

I sent it to them on a Monday, and by Friday it was back in my hands. They had identified some components on the board which had failed and they had replaced them. Good stuff think I - problem found and fixed. The components were the VRMs which also explained why the PSU had failed.

So, I install the board back into the machine, hook up the new PSU and let rip - and it works. Brilliant. I get all my stupid DRM licences re-activated and get back to the video encoding I was doing when it all went south the first time. I do not, however, overclock the machine. I installed Speedfan to monitor the CPU temperature. I had set the alarm temperature in the BIOS to 60C. I noted that Speedfan seemed to be reporting a temperature of 80C which was a little alarming. I left it encoding for a while, and came back to find that Speedfan's graphs had recorded a temperature under load of more than 100C. However, no alarms were going off, and the system seemed fine. So I ran another encoding task. This time when I returned, however, the system was dead again. Exactly the same symptoms as the first time. Dead power supply, brief fan spin up. Oh, fuck.

I had had the PSU for less than 30 days, so I got straight on to ebuyer and reported it had failed. They asked me to return it to them, and it was picked up on the following Monday morning. I also went back to Gigabyte and reported a further failure with the board. I have to say at this stage that I had no idea what the problem was. I asked Gigabyte to carry out more detailed burn in checks with the board. I sent it back to them the same Monday.

My concern at this point was that the CPU had been fundamentally damaged by the original overclock, to the point where it was killing other components. Gigabyte actually asked me to send them the CPU but said they could take no responsibility for its safety. On balance I wasn't prepared to risk it - it could still be the board after all.

I spent some time researching the operating temperatures of the i5 750 and found out that 80C and 100C were way beyond what it should have been running at. My suspicions turned to the cooling system. I was using the reference Intel heat sink and fan that had come with the chip. It uses this plastic push pin system to attach it to the motherboard. It struck me as badly designed and flimsy when I was installing it, but what do I know about heat sink design? I formed the view that this was the problem and needed replaced asap. The biggest and baddest heat sink and fan for the i5 750 is the Titan Fenrir. This is a frankly ludicrous 120mm fan and heat sink. So I ordered one while the board and PSU were away.

Both ebuyer.com and Gigabyte technical support had the components back in my hands by the Friday, which I do consider to be excellent service. In particular Gigabyte sent me photographs of the board being thoroughly tested and remaining stable.

I got all the components back and reassembled the machine with the Titan Fenrir. I was extremely apprehensive but fired it up. It happily loaded Windows, and I checked the temperatures asap. 25C. Not 80C - 25C. The new fan had made a significant difference. Since then I have had no further failures.

So what was the actual problem with this? I put it down to the stupid Intel Cooler design. It is too easy not to get a decent fit to the motherboard. The Titan Fenrir, by contrast, has a backing plate and bolts, meaning that when it is fixed, you know it is fixed.

There is another possibility though. Remember the memory I got was rated to run at 1600MHz? Well, the bus speed of the i5 750 is 133MHz. The memory runs at 10x the bus speed, which gives roughly 1333MHz. So how does the 1600MHz work? It uses a system called XMP - which has a particular profile built into the memory chip which the BIOS on the motherboard can read. I had activated this (before each failure) without thinking too much about it. I looked more closely at it after the second failure. It turns out that what it does is raise the CPU bus speed to 160MHz which then allows the memory to run at 1600MHz. This means that if you then further overclock the chip you are going way beyond the original settings. This meant that when I did the first, modest overclock, I was actually stressing the system more than I had thought. This time around I have not activated the XMP. So far, so good.
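To put rough numbers on that, here is a small sketch that takes the figures above at face value. The 10x memory multiplier and the 133 and 160 bus speeds are the ones quoted in this post. The CPU multiplier of 20 is my own assumption, worked back from 2.66GHz divided by 133MHz, and the function names are made up for illustration.

```python
MEMORY_MULTIPLIER = 10  # from the post: memory runs at 10x the bus speed
CPU_MULTIPLIER = 20     # assumption: 2.66GHz / 133MHz, rounded


def clocks(bus_mhz: int) -> dict:
    """Return the memory and CPU clocks (MHz) implied by a given bus speed."""
    return {
        "memory_mhz": bus_mhz * MEMORY_MULTIPLIER,
        "cpu_mhz": bus_mhz * CPU_MULTIPLIER,
    }


def bus_increase_pct(stock_mhz: int, raised_mhz: int) -> float:
    """Percentage by which the bus speed has been raised over stock."""
    return (raised_mhz - stock_mhz) / stock_mhz * 100


stock = clocks(133)  # rounded bus figure, so memory shows 1330 rather than 1333
xmp = clocks(160)

print(stock)  # {'memory_mhz': 1330, 'cpu_mhz': 2660}
print(xmp)    # {'memory_mhz': 1600, 'cpu_mhz': 3200}
print(bus_increase_pct(133, 160))
```

On these assumptions the XMP profile alone raises the bus by about a fifth and puts the CPU at around 3.2GHz before any manual overclock is applied.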

Lessons learned? Intel heat sinks and fans are crap on the i5 750. XMP memory profiles may not be good.

Oh, and my i5 chip must be made out of asbestos. Its max temperature is 72C. It went way beyond that to the point where the power draw was enough to (presumably) fry the motherboard (twice) and 2 power supplies. At no point did it shut down to protect itself - or the other components.
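As a rough illustration of how far out those readings were, here is a tiny sketch that checks temperatures against the 72C limit. The readings are the approximate figures mentioned in this post (25C with the new cooler, 80C and 100C with the stock one), not an actual Speedfan log, and the function name is hypothetical.

```python
MAX_TEMP_C = 72  # limit quoted in the post for the i5 750


def readings_over_limit(readings_c, limit_c=MAX_TEMP_C):
    """Return the readings that exceed the given temperature limit."""
    return [t for t in readings_c if t > limit_c]


# Approximate figures from the post, not real logged data.
readings = [25, 80, 100]
print(readings_over_limit(readings))  # [80, 100]
```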

Finally, it turned out my case did not have a speaker, so the BIOS was probably beeping away its warnings to no avail because no speaker was attached to the jumper. I have now attached a small standalone speaker which should mean I get the warnings in the future.