GPU computing 
26th-Jan-2013 06:36 pm
Thanks to an article on SemiAccurate I learned about a new AMD gadget called, wait for it, Gizmo. You can see it here.
Gizmo was most likely inspired by the success of the Raspberry Pi: dev boards like this existed before, but they were never this cheap, or even available to the general public. Let's compare it to the RPi and Intel products then :P

- It's a PC (it even runs Windows 7, which is what the AMD guys used to measure the silicon temperatures under stress).

First, it needs a 3V "button" lithium battery, which is mandatory but apparently not part of the kit. In fact it has to be a battery with wires and a small plug, like in laptops, so forget about buying it in TESCO. Tsk.
Then you'll need a SATA hard drive or SSD, so again forget about cheap SD cards. I suppose a CF card with an IDE-to-SATA interface might do the trick if you don't need performance.
Lastly, it will obviously need more power than a USB phone charger can provide, much more. The good news is it will accept anything from 9 to 24 volts, so it can be run on a 12V lead-acid (car) battery, for example.

So, compared to RPi it's not really that great for small projects. It's on par with some of the Intel N-series Atom ITX boards, like D945GSEJT or DN2800MT. Its form factor places it somewhere between RPi and ITX.

- It needs cooling (unless used in a lab environment).

While the board is entirely passively cooled, the docs clearly state that this is just enough for 25°C ambient temperature, and only without a case. If you want to put it in a case or use it at higher temperatures you'll need to add a fan to the CPU radiator. There is a fan connector on the PCB for that purpose, though I would've liked bigger heatsinks.
The CPU itself is rated at 6.4W but there's also the companion chip (the "south bridge") to consider. The VRM section is also going to generate some heat but I assume it can deal with it in most situations.

Again, a win for the RPi and possibly also for the two Atom boards I mentioned, since these will work in a case if there is enough convection present. I've seen fanless cases for these Atom boards so it can be done. Obviously though it depends a lot on where that case will be put :) It might work in an air-conditioned room but not otherwise in summer heat. It's not black and white here.

- It's an APU.

And now we're talking. It's not that much smaller than an ITX board and possibly runs hotter, so does it have any good sides to it? Yup, the computing power available.
It's a dual-core, fully out-of-order AMD64 architecture CPU clocked at 1GHz. That might not look very impressive compared to the 1.86GHz N2800 Atom, which is also 64-bit capable and dual-core, with Hyper-Threading to boot, but Atoms are an in-order architecture. Turns out it's difficult to produce code that doesn't choke in-order CPUs. The compilers are to blame, although some code (semi-random branching, for example) is just not predictable enough to optimize properly.
The APU is not just a CPU though, it's also the GPU next to it: a Radeon HD 6250 in this particular case, with 80 shaders clocked at 280MHz.

So why exactly is a measly mobile GPU, the lowest of all AMD has to offer, that much of a win? Because its 80 shaders make up 1 compute unit (CU), and you can do other stuff with it than just drive a VGA output.

To make a point here I've run some tests. My code was trying to brute-force crack an M4-type encryption key from dumped NAOMI data. These keys are only 32 bits long and the encryption algorithm is not even that complicated once you see it - again, thanks to Andreas Naive for making "obvious" things actually obvious to us, mere mortals :)
I wrote a cracker in C that, given a key, will decode 8 bytes of data and compare it with a known pattern to check for a match. To scan the entire key space you need to run this code 4294967296 times. A typical, simple approach would be to create a cracking procedure that takes a key value as an argument and then make a loop that will call this procedure 2^32 times, checking the result. Here's how long it takes:

* Intel Core2 Duo E6600
- 1 core @ 2400MHz (2nd core not used)
- full out-of-order architecture
- Windows 7 Professional 64-bit
- 64-bit code (MinGW64 4.5.3 -O2)
+ 415s

* AMD Athlon XP processor 1700+
- 1 core @ 1466.909MHz
- full out-of-order architecture
- Debian Linux, 2.6.32 kernel
- 32-bit code (gcc 4.4.5 -O2)
+ 937s

* Intel Atom N270
- 1 core @ 1596.095MHz (HT not used)
- in-order architecture
- Debian Linux, 2.6.32 kernel
- 32-bit code (gcc 4.4.5 -O2)
+ 2799s

* Raspberry Pi ARM11
- 1 core @ 900MHz (O/C, core @ 450MHz, SDRAM @ 450MHz)
- ARMv6 architecture
- Raspbian Linux, 3.2.27 kernel
- 32-bit code (gcc 4.6.3 -O2)
+ 3378s
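
To take the clock speed out of the picture, the timings above can be turned into a rough "cycles per key" figure. This is only a sketch: the clocks and times are copied from the list, the helper name cycles_per_key is made up for illustration, and the numbers are not strictly comparable since the compilers and the 32/64-bit targets differ between machines.

```python
# Rough normalisation of the timings listed above: clock cycles spent per key.
# Clock speeds (MHz) and run times (s) are copied from the list; the helper
# name is hypothetical and only used for this illustration.
KEYS = 2 ** 32  # size of the 32-bit key space

results = {
    "Core2 Duo E6600": (2400.0, 415),
    "Athlon XP 1700+": (1466.909, 937),
    "Atom N270": (1596.095, 2799),
    "Raspberry Pi ARM11": (900.0, 3378),
}


def cycles_per_key(mhz, seconds, keys=KEYS):
    """Total clock cycles during the run divided by the number of keys tried."""
    return mhz * 1e6 * seconds / keys


for name, (mhz, seconds) in results.items():
    print(f"{name:20s} {cycles_per_key(mhz, seconds):7.0f} cycles/key")
```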

As you can see it takes some time, and the in-order Atom and RPi ARM are especially bad at it. And my RPi is running overclocked; the typical values are 700MHz for the CPU, 250MHz for the core and 400MHz for the SDRAM, so in reality it's even worse. Obviously you don't want to run crackers on your small dev board, but what if this was face/shape recognition based on images from a small camera on a robot? That does seem like a plausible use case.
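
For reference, the simple "procedure plus loop" approach described earlier has roughly the shape below. This is a minimal sketch, not the actual cracker: toy_decrypt is a hypothetical XOR stand-in (the real M4 algorithm is not reproduced here), the 8-byte pattern and the key are invented, and the key space is cut down to 16 bits so the loop finishes quickly.

```python
# Shape of the naive brute-force search: decode 8 bytes with every candidate
# key and compare against a known pattern.
# ASSUMPTION: toy_decrypt is a made-up XOR cipher, NOT the M4 algorithm.

def toy_decrypt(key, block):
    """Hypothetical cipher: XOR each byte with one of the four key bytes."""
    return bytes(b ^ ((key >> (8 * (i % 4))) & 0xFF) for i, b in enumerate(block))


def find_keys(ciphertext, known_plaintext, key_space):
    """Try every key in key_space and return the ones that decode to a match."""
    return [k for k in key_space if toy_decrypt(k, ciphertext) == known_plaintext]


plain = b"NAOMI\x00\x00\x00"         # invented 8-byte known pattern
secret = 0xBEEF                      # invented key
cipher = toy_decrypt(secret, plain)  # XOR is its own inverse

# The real search covers range(2 ** 32); 2 ** 16 keeps the demo short.
found = find_keys(cipher, plain, range(2 ** 16))
print([hex(k) for k in found])
```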

Now there's this stuff called OpenCL which lets you distribute your computation-heavy tasks over multiple CPU cores, and also GPU compute units. I used the same cracker, except the main loop was thrown out and replaced by the OpenCL framework. Here's how it went:

* Intel Core2 Duo E6600
- 2 cores @ 2400MHz
- full out-of-order architecture
- Windows 7 Professional 64-bit
- OpenCL code (AMD APP 2.6)
+ 105s

* AMD/ATI Radeon HD 5770
- 10 compute units @ 850MHz
- GPU architecture
- Windows 7 Professional 64-bit
- OpenCL code (AMD APP 2.6)
+ 6s

Yeah, that's a whole 6 seconds. Not all code gets that much of a boost on a GPU; this one was integer based with some logic operations but didn't have many branches in it. Even the CPU version got about twice as fast per core as the simple C code (105s on two cores versus 415s on one), most likely due to aggressive compiler optimizations - most loops had just 4 passes so it's a great place to unroll and use SSE2 vectorization.
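
The idea of throwing out the main loop can be sketched like this: the per-key check becomes a "kernel" that handles a single key, and a framework decides how to spread the keys over the available workers. The sketch below only imitates that structure with a thread pool; it is not OpenCL, the names kernel, scan_chunk and parallel_scan are invented, the cipher is the same hypothetical XOR stand-in rather than M4, and CPython threads will not actually speed up CPU-bound work.

```python
# Data-parallel shape of the search: one kernel call per key, with the key
# space split into chunks handed to a pool of workers.
# ASSUMPTIONS: toy cipher instead of M4, threads instead of OpenCL work items.
from concurrent.futures import ThreadPoolExecutor


def toy_decrypt(key, block):
    """Hypothetical cipher: XOR each byte with one of the four key bytes."""
    return bytes(b ^ ((key >> (8 * (i % 4))) & 0xFF) for i, b in enumerate(block))


def kernel(key, ciphertext, known):
    """One work item: test a single key (the key plays the role of a global ID)."""
    return toy_decrypt(key, ciphertext) == known


def scan_chunk(start, stop, ciphertext, known):
    """Run the kernel for every key in [start, stop)."""
    return [k for k in range(start, stop) if kernel(k, ciphertext, known)]


def parallel_scan(ciphertext, known, total_keys, workers=4):
    """Split the key space into contiguous chunks and scan them with a pool."""
    step = (total_keys + workers - 1) // workers
    bounds = [(s, min(s + step, total_keys)) for s in range(0, total_keys, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(
            lambda b: scan_chunk(b[0], b[1], ciphertext, known), bounds))
    return [k for part in parts for k in part]


plain = b"NAOMI\x00\x00\x00"         # invented 8-byte known pattern
cipher = toy_decrypt(0xBEEF, plain)  # invented key
print(parallel_scan(cipher, plain, 2 ** 16))
```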

Now, my 5770 has 10 CUs clocked at 850MHz, so in total 8500 PUs - "power units". It ran for 6 seconds, so it needed 8500 * 6 = 51000 PU-seconds to complete the task. The 6250 has only 1 CU at 280MHz, so 280 PUs total. 51000/280 = 182 seconds. In reality probably a bit more due to slower data transfers. Compare that to the Atom results and you'll see why having that GPU is important :)
With a dual-core CPU you can easily run a lot of data processing and offload the really heavy stuff to the GPU, so it appears to be a great dev board for more advanced projects.
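
The back-of-the-envelope "power units" estimate can be written down as a couple of lines of arithmetic. The function names are made up for this sketch, and the model assumes run time scales linearly with compute units times clock, which ignores memory and transfer speed.

```python
# "Power units" estimate: PU = compute units * clock in MHz.
# ASSUMPTION: run time scales linearly with PU; data transfers are ignored.

def power_units(compute_units, mhz):
    return compute_units * mhz


def estimate_seconds(ref_cus, ref_mhz, ref_seconds, target_cus, target_mhz):
    """Scale a measured run time from a reference GPU to a target GPU."""
    work = power_units(ref_cus, ref_mhz) * ref_seconds  # PU-seconds needed
    return work / power_units(target_cus, target_mhz)


# Radeon HD 5770 (10 CUs @ 850MHz, 6s measured) -> Radeon HD 6250 (1 CU @ 280MHz)
est = estimate_seconds(10, 850, 6, 1, 280)
print(f"estimated HD 6250 run time: {est:.0f}s")
print(f"Atom N270 took 2799s, roughly {2799 / est:.0f}x longer")
```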

Now why did I bother with this long-winded explanation? Well, it looks like AMD has got all three next-gen consoles in the bag. We've had a lot of "insider leaks" lately; most of it is wishful thinking taken for gospel, especially when it comes to fanboys. Silly people. It's not about raw power anymore. Consoles will not be able to beat PCs on the numbers, not unless you want them to draw 1kW of power and cost the same as a rack full of servers. It's about being smart with what limited resources you have. One can argue that's always been the case but this generation will show it even more. A typical PC that can run games in 1080p in 3D at 60fps would need some 300-400 Watts of power. Next-gen consoles are promising the same level of fidelity (well, we shall see about that I guess) at half that power. This is what I find most interesting. I couldn't care less if the CPUs are 1.8 or 3.2GHz and how many gigabytes of RAM there are inside.

BTW, I've made some additional calculations. My RPi runs on 5V and draws 0.5A, so it used up 5V * 0.5A * 3378s = 8445Ws to get the calculations done. My Radeon 5770 has a 108W TDP, so let's assume I actually hit that, and that the rest of my PC drew 150W, which is a VERY safe assumption as the CPU was idle and so were the HDDs. (108W + 150W) * 6s = 1548Ws. So not only was it faster, it also used less energy :) Nice things, these compute units. With 16 thousand 128-bit wide registers it's no wonder each takes so much silicon space.
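
The same energy comparison as a quick calculation, using only the figures from the paragraph above (the 108W and 150W draws are the assumptions stated there, not measurements, and the function name is invented for the sketch):

```python
# Energy used for the full key scan: power (W) * time (s) = watt-seconds.
# The PC figures are the assumed worst case from the text, not measured.

def energy_ws(watts, seconds):
    return watts * seconds


rpi_ws = energy_ws(5 * 0.5, 3378)  # 5V * 0.5A for 3378s
pc_ws = energy_ws(108 + 150, 6)    # GPU TDP + rest of the PC for 6s

print(f"RPi: {rpi_ws:.0f}Ws, PC with Radeon 5770: {pc_ws:.0f}Ws")
print(f"the RPi used about {rpi_ws / pc_ws:.1f}x more energy")
```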