In the first part of this article, we looked at the fundamental building blocks of a CPU: logical units, instruction sets, and architectures. These have formed the foundation of CPUs right from the earliest days of personal computing, and still govern the way today's PCs, laptops, phones, tablets, appliances and accessories are developed. While the scope and scale of CPU design has changed over the years, these concepts are timeless.
32-bit vs 64-bit
In the last few years, we've witnessed a slow transition from 32-bit computing to 64-bit computing. Hardware has been 64-bit capable since 2003, when AMD's Athlon 64 was first released, but software support only really matured with Windows 7 in 2009. Windows XP had served as a sort of experimental 64-bit platform, but not enough device drivers were ready to make the jump a practical one. Windows Vista didn't gain enough traction itself, so it was Windows 7 that ushered in the 64-bit era, with drivers, OS and processor hardware all fully compatible with one another.
Apple also started moving away from 32-bit systems after Snow Leopard; all subsequent versions of Mac OS X require 64-bit processors. More recently, Apple announced the A7, the first 64-bit chip to ship in phones and tablets. The A7 is based on the ARMv8 architecture, and its CPU cores are codenamed "Cyclone". Intel's own 64-bit-capable Silvermont microarchitecture (sold to the public under the Atom brand) reached mobile devices the same year.
So what is this 64-bit thingy, and why is it such a big deal?
Remember those registers we discussed in Part 1? On a hardware level, 64-bit registers can hold 64 bits (binary digits, 0s and 1s), so in each clock cycle a 64-bit register can hold twice as much data as a 32-bit one.
The other half of the puzzle is software, which must be built to take advantage of 64-bit hardware, so that the instructions the CPU receives are 64-bit aware and can work with blocks of data up to 64 bits in size.
All this translates into additional speed for programs that can take advantage of the extra resources. A wider architecture also leaves room for more opcodes and, just as importantly, larger memory addresses: a 32-bit processor can address at most 4GB of RAM, whereas a 64-bit one can address vastly more.
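As a rough sketch of what those extra bits buy, here is the arithmetic implied by register width. The function name and structure are just for illustration; the numbers follow directly from the definitions above:

```python
# Illustrative sketch: the value range and address space implied by a
# register's width in bits.

def register_limits(bits):
    """Return (max unsigned value, addressable bytes) for a given width."""
    max_value = 2**bits - 1   # largest unsigned integer that fits
    address_space = 2**bits   # bytes reachable with one address
    return max_value, address_space

m32, a32 = register_limits(32)
m64, a64 = register_limits(64)

print(f"32-bit max value: {m32:,}")                    # 4,294,967,295
print(f"32-bit address space: {a32 / 2**30:.0f} GiB")  # 4 GiB
print(f"64-bit address space: {a64 / 2**60:.0f} EiB")  # 16 EiB
```

That 4GiB ceiling is exactly why 32-bit systems cannot make use of more than 4GB of RAM.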
Most programs designed for 32-bit environments can run just fine on a 64-bit operating system (and 64-bit hardware) thanks to a compatibility layer that presents a 32-bit environment; in other words, the program never realises it isn't running on 32-bit hardware and software. Unfortunately, this does not work the other way around.
On the other hand, device drivers need to be 64-bit on a 64-bit OS. Drivers are low-level software interfaces that help the operating system recognise and use the full capabilities of hardware such as graphics cards, scanners, printers, etc.
Single-core, dual-core, quad-core and beyond
So what are these "cores" everyone keeps talking about? A CPU in the old days would just follow the basic outline we described in Part 1. Today, a lot has changed. CPUs now need to process multiple streams of instructions at once, and therefore have multiple "cores", each of which has its own resources for performing calculations, almost equivalent to an entire processor. Core designs vary between architectures, but each core typically has its own arithmetic and logic unit (ALU) as well as a floating point unit (FPU).
Additionally, each core gets a block of super-fast cache memory, which we'll talk about in the next section. Other fixed-function logic may also be integrated into a core. These logic blocks help speed up very specific algorithms and operations such as encryption and video encoding, so that such tasks are offloaded from the main logic.
The cores are fed by an instruction pipeline, supported by prediction and prefetch logic that pre-loads instructions from memory, keeping them handy so no part of the CPU is left waiting idle. There's also a common memory controller and various other blocks.
CPUs, by nature, process blocks of data sequentially and in rapid succession. While the CPU cores can perform up to a few billion instructions per second, main memory usually can't be read fast enough to keep up, so a CPU can spend a lot of time waiting for data to be fetched from RAM. Registers keep the most immediate data inside the cores, but there can only be a handful of them, so larger blocks of memory called caches are used.
There are multiple cache levels, numbered in order of proximity to the cores. Typically, each core gets its own Level 1 cache, pairs of cores often share a Level 2 cache, and the entire cluster gets a common Level 3 cache.
CPU die space is at a premium, and the closer you go to the cores, the more expensive it becomes to add memory caches there. However, the closer you are, the faster you can move data. Registers inside the CPU have the lowest capacity, but also the highest speed. L1 cache memory is a little bigger but slower. The L2 and L3 caches follow the same pattern. Beyond these capacities, data and instructions need to be stored in and retrieved from RAM outside the CPU, which has much more space but is also much slower.
Good programmers will try to fit most of their code and data within the size limits of the caches, so that load/store calls to main memory are minimised. The CPU will also try to predict what data will be needed and make sure it is stored in the cache ahead of time, to improve performance.
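As a rough, hedged illustration of why access patterns matter, the Python sketch below sums a 2-D grid in two orders. Row order walks memory sequentially and plays nicely with caches; column order jumps around. Pure Python adds a lot of interpreter overhead that masks much of the effect, so treat the timings as indicative only:

```python
# Rough illustration of cache-friendly access order. Row-by-row iteration
# touches memory sequentially (good locality); column-by-column iteration
# touches scattered addresses (poor locality). Exact timings depend on
# the machine, but row order is typically the faster of the two.
import time

N = 1000
grid = [[1] * N for _ in range(N)]

def sum_rows(g):
    return sum(g[i][j] for i in range(N) for j in range(N))

def sum_cols(g):
    return sum(g[i][j] for j in range(N) for i in range(N))

t0 = time.perf_counter()
row_total = sum_rows(grid)
t1 = time.perf_counter()
col_total = sum_cols(grid)
t2 = time.perf_counter()

print(f"row-major: {t1 - t0:.3f}s, column-major: {t2 - t1:.3f}s")
assert row_total == col_total == N * N  # both orders compute the same sum
```

In a lower-level language like C, where the loop itself costs almost nothing, the gap between the two orders is far more dramatic.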
So now that we know all this, how do we know how to compare processor performance? Well, there are two metrics that you'll see in marketing materials, and a third that you won't.
The first is the number of cores. Usually, the more cores the better, all else being equal. What isn't usually stated is that software must be able to use those cores effectively. For normal consumer applications, a fast dual-core CPU is usually good enough, and four cores is a luxury.
Simultaneous multi-threading is a way to make an operating system believe that one physical core is in fact two. This lets a CPU make more efficient use of logic circuits that would otherwise sit idle on each core. Multi-threading provides about 20-25 percent more performance per core, with comparatively little increase in power consumption. Intel calls this Hyper-Threading, and not all CPUs support it. For example, desktop Core i3 CPUs typically have two cores with Hyper-Threading, while Core i5s have four physical cores without it.
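To put that figure in perspective, here is a back-of-the-envelope sketch. The 25 percent uplift is simply the top of the range quoted above, not a measured value:

```python
# Rough arithmetic for the Hyper-Threading uplift quoted in the text:
# if SMT adds roughly 20-25% per core, a 2-core/4-thread chip behaves
# like about 2.5 "effective" cores -- still short of 4 physical cores.

SMT_UPLIFT = 0.25  # assumed: upper end of the 20-25% range

def effective_cores(physical_cores, smt_enabled):
    return physical_cores * (1 + SMT_UPLIFT if smt_enabled else 1)

print(effective_cores(2, True))   # 2.5 -- e.g. a dual-core i3 with HT
print(effective_cores(4, False))  # 4   -- e.g. a quad-core i5 without it
```

This is why a quad-core i5 generally outruns a dual-core i3 in multi-threaded work, even though both present four threads to the operating system.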
CPUs with multiple cores are more expensive to produce, and while companies like Intel can certainly stuff more than four physical cores into CPUs for the consumer market, there is little or no benefit in terms of performance. The "prosumer" and professional markets have different demands, and so six-core and greater CPUs do exist.
More CPU cores can speed up some tasks, such as video editing, scientific computing, developing complex software applications, and running a high-performance data server. Other tasks, such as games, do not benefit much beyond four cores. For regular word processing, office tasks and browsing the Web, you'll be fine with a fast dual-core CPU, with Hyper-Threading if possible.
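One common way to model this drop-off, though not named in the text above, is Amdahl's law: the speedup from extra cores is capped by the fraction of a program that must run serially. A quick sketch, assuming a hypothetical program that is 75 percent parallelisable:

```python
# Amdahl's law: speedup = 1 / (serial fraction + parallel fraction / cores).
# With 25% of the work stuck running serially, even an unlimited number of
# cores can never deliver more than a 4x speedup.

def amdahl_speedup(parallel_fraction, cores):
    serial = 1 - parallel_fraction
    return 1 / (serial + parallel_fraction / cores)

for cores in (2, 4, 8, 16):
    print(cores, round(amdahl_speedup(0.75, cores), 2))
# 2 cores -> 1.6x, 4 -> 2.29x, 8 -> 2.91x, 16 -> 3.37x: rapidly
# diminishing returns, which is why more cores often isn't better.
```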
Clock speed is perhaps the best-known measure of performance. A clock is basically a regular digital pulse generated by a vibrating quartz oscillator on the motherboard. That clock signal is sent to every component in the computer, including the CPU, and is used to keep all units in sync.
This clock signal is received and multiplied by the CPU, which steps through its instructions in time with the resulting pulses. The higher the multiplier, the more instructions processed per second, but the more energy consumed. The multiplier can scale up and down between various performance states or "p-states" depending on the CPU's workload, to save power when full speed isn't required. Conversely, speed can be raised in bursts, a feature Intel brands "Turbo Boost", when it makes more sense to apply more power to a task and get it done quicker.
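As a simplified sketch of the arithmetic involved (the 100MHz base clock is an assumption typical of recent desktop platforms, not a universal constant):

```python
# Simplified model of CPU frequency: a base clock multiplied by a
# p-state-dependent multiplier. The 100 MHz base clock is assumed.

BASE_CLOCK_MHZ = 100  # assumption: common desktop base clock

def core_frequency_ghz(multiplier):
    return BASE_CLOCK_MHZ * multiplier / 1000

print(core_frequency_ghz(35))  # 3.5 GHz at the base multiplier
print(core_frequency_ghz(39))  # 3.9 GHz in a turbo burst
print(core_frequency_ghz(8))   # 0.8 GHz in a power-saving p-state
```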
A higher clock speed results in greater performance, if all other factors such as the microarchitecture and core count are constant. Clock speeds are not directly comparable across architectures. With the same microarchitecture but with different core counts, performance will depend on the software being run: programs that can't take advantage of multiple cores will exhibit the same performance irrespective of the core count. In such cases, a dual-core CPU with a higher clock speed will outperform a slower quad-core CPU.
Instructions Per Clock (IPC) is a measure of how many instructions can be executed in each clock cycle. This varies between microarchitecture implementations, and increasing it becomes very, very hard after a point. This number is rarely mentioned in marketing materials, and is usually only evident through specific tests and benchmarks.
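The three metrics can be tied together in a rough model: theoretical throughput is approximately IPC x clock speed x core count. The figures below are invented for illustration, not real chip specifications:

```python
# Rough throughput model: instructions/second = IPC x clock (Hz) x cores.
# This is a ceiling that assumes software keeps every core fully busy.

def instructions_per_second(ipc, clock_ghz, cores):
    return ipc * clock_ghz * 1e9 * cores

# A hypothetical 4-core chip at 3 GHz vs a 2-core chip at 4 GHz, same IPC:
a = instructions_per_second(2, 3.0, 4)  # 24 billion instructions/s
b = instructions_per_second(2, 4.0, 2)  # 16 billion instructions/s
print(a / b)  # 1.5 -- but only if software can actually use all four cores
```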
Benchmarks are performance tests that return standard results for known configurations. Results from any other configuration can then be compared against the standard to see how much better or worse things get with each change.
The other way to look at benchmarks is in terms of time: if CPU A takes one minute to complete the benchmark and CPU B takes two minutes, then CPU A is twice as fast. Games are commonly used as benchmarks because the number of image frames rendered per second can be used as a metric: the higher the number, the higher the performance.
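The arithmetic behind both comparisons is straightforward:

```python
# Speedup from completion times, and average FPS from frame counts.
# The numbers mirror the examples in the text; they are illustrative only.

def speedup(time_a_seconds, time_b_seconds):
    """How many times faster A is than B, given their completion times."""
    return time_b_seconds / time_a_seconds

def average_fps(frames_rendered, elapsed_seconds):
    """Average frames per second over a benchmark run."""
    return frames_rendered / elapsed_seconds

print(speedup(60, 120))       # 2.0 -- CPU A is twice as fast as CPU B
print(average_fps(3600, 60))  # 60.0 frames per second
```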
Where one must be careful with benchmarks is in knowing exactly which variables are changing and affecting the result. For example, a benchmark might show that gaming performance does not improve when moving from a low-end CPU to a high-end one because it is being limited by an external factor: the GPU's performance. In this case, the purpose of measuring the difference between CPUs is defeated.
On the other hand, an application that uses only one CPU core will favour higher clock speeds (assuming the same or similar microarchitectures); while one that uses as many cores as possible will likely post better results with a higher core count despite a lower clock speed.