In the previous article I measured random memory accesses and found them to roughly follow O(√N). But why? Obviously the direct culprit is the cache hierarchy. With smaller problem sizes we can use the quicker caches. As my memory consumption goes up, I have to rely on slower and slower memory to get the job done, ultimately swapping to disk.
Now you may be thinking that this is all trivial. Surely I (or someone richer than me) could purchase enough fast L1 type memory to fit all the data, and that would yield a flat graph of O(1) memory accesses. Sadly, 6 GiB of L1 memory is not only expensive, but would also be way more than could fit on a CPU die. It would need to be further away, increasing the latency of the data. And if the problem grew even more we would need even more RAM taking up even more space requiring it to be even further from my CPU, making it slower still. But how much slower? Let’s do some thinking.
The round library
Allow me to set up a thought experiment! Let us say you are a librarian working in a circular library, with your desk at the center. The time it takes to to fetch any one book is now bounded by the distance you have to walk, and the worst case scenario will be walking to the edge of the library, i.e. the full radius.
Now let's say your sister is working in another library which is also circular, but with twice the radius. She may be required to walk twice the distance to retrieve a book. However, her library has four times the area of your library and thus can contain four times as many books. In general, the amount of books N that fits in a library is proportional to the square of the radius r of the library, and we write N∝ r². And since the time T it takes to retrieve a book is proportional to r we can write N∝ T² or T∝√N or T=O(√N).
This is roughly analogous to a CPU which needs to retrieve a piece of memory from it's library: the RAM. Of course the speed of the "librarian" matters, but that is bounded by the speed of light, and there is not much we can do about that. For example, within one clock cycle of a 3GHz CPU, light reaches about 10 cm. So to roundtrip, any memory that should be instantly accessible can be at most 5cm (2 inches) from the CPU.
So how much information can we fit within a certain distance r from the cpu? Above we assumed a circular flat library - but what if it was spherical? The amount of memory that could fit within a radius r would then be proportional to r³. In practice computers are actually rather flat - this is due partly to form-factor, but also due to problems with cooling. Maybe one day we will figure out how to build and cool three dimensional blocks of memory, but for now the practical limit of the amount of information N within a radius r seems to be N ∝ r². This also holds true for even more distant memory, such as data centers (which are spread out on the two-dimensional surface of the earth).
But can we, in theory, do better? To answer that, we need to learn a bit about black holes and quantum physics!
The theoretical limit
The amount of information that can fit within a sphere with radius r can be calculated using the Bekenstein bound, which says that the amount of information that can be contained by a sphere is directly proportional to the radius and mass: N ∝ r·m. So how massive can a sphere get? Well, what is the most dense thing in existence? A black hole! It turns out that the mass of a black hole is directly proportional to its radius: m ∝ r. This means that the amount of information that can fit inside a sphere of radius r is N ∝ r². And so we come to the conclusion that the amount of information contained in a sphere is bounded by the area of that sphere - not the volume!
In short: if you try to squeeze too much L1 cache onto your CPU it will eventually collapse into a black hole, and that would make it awkward to get the results of the computation back to the user.
So we come to the conclusion that N ∝ r² is not only a practical limit, but also the theoretical limit! This means the laws of physics sets a limit of the latency of memory: To access any of N bits of data, you will need to communicate a distance that is proportional to O(√N). In other words, for each 100-fold increase in problem size, we expect to see a 10-fold increase in the time it takes to access one element. And that is exactly what our data showed in part one!
So we now have our two pieces of evidence: practical and theoretical. So what can we learn from it? Continue to the concluding Myth of RAM, part III