The Myth of RAM, part I

Preface

This article is the first of four in a series in which I argue that thinking of a memory access as O(1) is generally a bad idea, and that we should instead think of it as taking O(√N) time. In Part I, I lay out a hand-wavy argument based on a benchmark. In Part II, I build up a mathematical argument based on theoretical physics, and in Part III I investigate some implications. Part IV is a FAQ in which I answer some common questions and misunderstandings.

(This preface was added on August 29, 2016)


Intro

If you have studied computing science, then you know how to do complexity analysis. You'll know that iterating through a linked list is O(N), binary search is O(log(N)) and a hash table lookup is O(1). What if I told you that all of the above is wrong? What if I told you that iterating through a linked list is actually O(N√N) and hash lookups are O(√N)?

Don't believe me? By the end of this series of articles you will. I will show you that accessing memory is not an O(1) operation but an O(√N) one. This is a result that holds up both in theory and in practice. Let's start with the latter:

Measuring it

First of all, let's get our definitions straight. Big O notation can be applied to many things, from memory usage to instructions executed. For the purpose of this series of articles, I'll use O(f(N)) to mean that f(N) is an upper bound (worst case) on the time it takes to accomplish a task that accesses N bytes of memory (or, equivalently, N equally sized elements). The fact that I use Big O to analyze time, and not operations, is important. As we'll see, the CPU spends a lot of time waiting for memory. Personally, I do not care what the CPU does while it waits for memory - I only care about how long something takes, hence the definition above.
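Spelled out a bit more formally: if T(N) is the (worst-case) time it takes to accomplish the task when it accesses N bytes, then the statement used throughout this series is

```latex
% T(N): worst-case time to complete the task when it accesses N bytes
% (equivalently, N equally sized elements).
T(N) = O\big(f(N)\big)
  \iff
  \exists\, c > 0,\ N_0 \ \text{such that} \ T(N) \le c \cdot f(N) \ \text{for all } N \ge N_0 .
```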

EDIT: Another clarification: the "RAM" in the title refers to random memory accesses in general, not to a specific type of memory. In this installment I measure the time it takes to access a piece of memory, be it in cache, in DRAM, or in swap. In Part II I will extend this even further - but I'm getting ahead of myself.

Here's a simple program that iterates through a linked list of length N, ranging from 64 elements up to about 420 million elements. Each node consists of a 64-bit pointer and 64 bits of dummy data. The nodes of the linked list are jumbled around in memory so that each memory access is random. I measure the time it takes to iterate through the same list a few times, and then plot the time taken per element. If random memory access were O(1), the plot would be flat. Below is a minimal sketch of the benchmark, followed by what we actually measure.
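The sketch is not the exact program I used for the measurements - the node layout (a 64-bit pointer plus 64 bits of dummy data) and the shuffling follow the description above, while the random seed, the repeat count and the size range are arbitrary choices:

```cpp
// linked_list_bench.cpp - a minimal sketch of the traversal benchmark described above.
// Build with e.g.: g++ -O2 -std=c++17 linked_list_bench.cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

struct Node {
    Node*    next;   // 64-bit pointer
    uint64_t dummy;  // 64 bits of dummy data
};

// Link the nodes together in a shuffled order, so that following `next`
// results in effectively random memory accesses. Returns the head of the list.
static Node* build_shuffled_list(std::vector<Node>& nodes, size_t n, std::mt19937_64& rng) {
    nodes.resize(n);
    std::vector<size_t> order(n);
    std::iota(order.begin(), order.end(), size_t{0});
    std::shuffle(order.begin(), order.end(), rng);
    for (size_t i = 0; i + 1 < n; ++i) {
        nodes[order[i]].next  = &nodes[order[i + 1]];
        nodes[order[i]].dummy = i;
    }
    nodes[order[n - 1]].next  = nullptr;
    nodes[order[n - 1]].dummy = n - 1;
    return &nodes[order[0]];
}

int main() {
    std::mt19937_64 rng(12345);
    // Raise the upper bound toward ~4*10^8 elements if you have the RAM (and patience).
    for (size_t n = 64; n <= (size_t{1} << 27); n *= 2) {
        std::vector<Node> nodes;
        Node* head = build_shuffled_list(nodes, n, rng);

        uint64_t  sum     = 0;
        const int repeats = 4;  // iterate through the same list a few times
        const auto t0 = std::chrono::steady_clock::now();
        for (int r = 0; r < repeats; ++r) {
            for (const Node* p = head; p != nullptr; p = p->next) {
                sum += p->dummy;  // touch each node so the traversal isn't optimized away
            }
        }
        const auto t1 = std::chrono::steady_clock::now();
        const double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
        std::printf("%zu elements: %.2f ns/element (checksum %llu)\n",
                    n, ns / (double(n) * repeats), (unsigned long long)sum);
    }
    return 0;
}
```

And this is what we actually get: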

The cost of accessing one node in a linked list of a given size. Accessing a random element in a 100MB list is approximately 100 times slower than accessing an element in a 10kB list.

Note that this is a log-log graph, so the differences are actually pretty huge. We go from about one nanosecond per element all the way up to a microsecond! But why? The answer, of course, is caching. System memory (RAM) is actually pretty slow and far away, so to compensate, clever computer designers have added a hierarchy of faster, closer and more expensive caches. My computer has three levels, called L1, L2 and L3, of 32 kiB, 256 kiB and 4 MiB respectively. I have 8 GiB of RAM, but when I ran the experiment only about 6 GiB was free - so in the last runs there was some swapping to disk (an SSD). Here's the same plot again, but with my RAM and cache sizes added as vertical lines:

The vertical lines correspond to L1=32 kiB, L2=256 kiB, L3=4 MiB and 6 GiB of free RAM.
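If you want to look up the corresponding sizes on your own machine, one way on Linux with glibc is to ask sysconf. This is just a convenience sketch; the _SC_LEVEL*_CACHE_SIZE names are glibc extensions (not POSIX), so sysconf may return 0 or -1 for them elsewhere:

```cpp
// cache_sizes.cpp - print cache and RAM sizes as reported by sysconf (glibc/Linux).
#include <cstdio>
#include <unistd.h>

int main() {
    const long l1  = sysconf(_SC_LEVEL1_DCACHE_SIZE);  // L1 data cache, in bytes
    const long l2  = sysconf(_SC_LEVEL2_CACHE_SIZE);
    const long l3  = sysconf(_SC_LEVEL3_CACHE_SIZE);
    const long ram = sysconf(_SC_PHYS_PAGES) * sysconf(_SC_PAGESIZE);
    std::printf("L1d: %ld kiB\n", l1 / 1024);
    std::printf("L2:  %ld kiB\n", l2 / 1024);
    std::printf("L3:  %ld kiB\n", l3 / 1024);
    std::printf("RAM: %ld MiB\n", ram / (1024 * 1024));
    return 0;
}
```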

This graph should really drive home the importance of caches - each level of the hierarchy is many times faster than the next, larger one. This is the reality of modern CPU architectures, be it on smartphones, laptops or mainframes. But what is the pattern? Can we fit some simple equation to the plot? It turns out we can! If you look at the plot you'll see that between 10 kB and 1 MB there is roughly a 10x slowdown. Between 1 MB and 100 MB there is another 10x slowdown - and again between 100 MB and 10 GB! It seems that for each 100-fold increase in memory usage we get a 10-fold slow-down. Let's superimpose that on the graph:

The blue line is O(√N).

The blue line corresponds to an O(√N) cost for each memory access. Seems like a pretty good fit, no? Of course this is on my machine, and it may look quite different on yours. Still, the equation is very simple to remember, so maybe we could use it as a rule of thumb.
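For the record, here is the one-line check that the pattern we read off the plot - a ten-fold slowdown for every hundred-fold increase in size - is exactly what a square-root law predicts, along with the rule of thumb written out. The 1 ns at 10 kB baseline is read off my plot and will differ on your machine:

```latex
% If the time per access grows as T(N) = c \sqrt{N}, then
T(100N) = c\sqrt{100N} = 10\,c\sqrt{N} = 10\,T(N),
% i.e. a 100-fold larger N costs exactly 10 times as much per access.
%
% As a rule of thumb, scale from any measured baseline (N_0, T(N_0)):
T(N) \approx T(N_0)\,\sqrt{N / N_0}
% e.g. T(10\ \mathrm{kB}) \approx 1\ \mathrm{ns}
%      \Rightarrow T(100\ \mathrm{MB}) \approx 1\ \mathrm{ns} \cdot \sqrt{10^4} = 100\ \mathrm{ns}.
```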

(EDIT) You may now wonder what will happen once we hit the right side of the graph. Will it keep climbing, or will it plateau? Well, it would plateau for a while, until we could no longer fit the data on the SSD and would need to go to an HDD, then a disk server, then a data center far away, and so on. Each such jump would create a new plateau, but the upward trend in latency, I will argue, would continue. The reason I did not run the experiment further is simply lack of time and lack of access to a big data center.

"But", I hear you say, "empiricism is no way to establish a Big-O bound!". It certainly isn't! Could there perhaps be a theoretical limit to the latency of a memory access? The plot thickens in The Myth of RAM, part II.