I wrote this article for my own reference, to better understand how memory and CPU caches work in general. The notes are drawn from the document “What Every Programmer Should Know About Memory”. Please note that I am a beginner who wants to learn more about this field; if this article catches your attention, I highly suggest reading the entire PDF. This article covers just the basics (the first few pages of the document) that I find important to grasp. Big thanks to Ulrich Drepper, the author of the document, for sharing this amazing knowledge.
- Programs are loaded into memory. To modify or manipulate data, the CPU needs to fetch it from RAM. Memory access (the process of getting this data from RAM) is, and will likely remain, the bottleneck for most programs: CPU cores keep getting faster and more numerous while memory has not kept pace, which translates into long CPU wait times. Much of the time the CPU has to wait (sit idle) until the data it needs is pulled from RAM.
- This article explains why the CPU cache was developed and how it works, and what programmers can do to minimize CPU wait time, optimize memory access, and get the best performance out of the CPU cache.
- Back then, computers were much simpler because components such as the CPU, RAM, HDD and network interfaces were quite balanced in their performance. Nowadays, some components have improved much more than others, which creates bottlenecks in data transfer. For example, CPU speeds have improved hugely, whereas storage devices are falling behind. Even the newest and fastest SSDs cannot match the speed of CPUs, and RAM in turn is far faster than the fastest SSD. These gaps in speed are the reason bottlenecks appear, which translates into slower data transfer between computer components.
- To mitigate these bottlenecks, people started creating software techniques to reduce this wait time. This is where caching comes in. The operating system caches (temporarily stores) the most frequently used data in RAM, where it can be accessed much faster when needed. With this method the HDD/SSD (slow) is not queried for the data; instead, the CPU reads it from RAM (faster). If the data is not found in RAM, the operating system fetches it from the HDD/SSD. Furthermore, cache storage has been added to the HDDs/SSDs themselves (I suppose this lets the device serve data immediately instead of having to look for it first).
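The "check RAM first, fall back to the HDD/SSD" idea from the last bullet can be sketched in a few lines of Python. This is only a toy model, not how an operating system actually implements its cache: the `ReadThroughCache` class, the `SLOW_STORAGE` dictionary and the block names are all made up for illustration, and there is no eviction of rarely used data.

```python
# Hypothetical stand-in for an HDD/SSD: a plain dictionary of blocks.
SLOW_STORAGE = {"block-1": b"alpha", "block-2": b"beta"}


class ReadThroughCache:
    """Keep blocks that were read before in memory; go to storage on a miss."""

    def __init__(self, storage):
        self.storage = storage
        self.cached = {}   # blocks currently held in "RAM"
        self.hits = 0
        self.misses = 0

    def read(self, key):
        if key in self.cached:        # fast path: served from memory
            self.hits += 1
            return self.cached[key]
        self.misses += 1              # slow path: ask the storage device
        value = self.storage[key]
        self.cached[key] = value      # remember it for the next read
        return value


cache = ReadThroughCache(SLOW_STORAGE)
cache.read("block-1")   # miss: loaded from storage
cache.read("block-1")   # hit: served from memory
print(cache.hits, cache.misses)   # prints: 1 1
```

The second read never touches the slow dictionary, which is the whole point of the technique.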
1. Hardware today
Personal computers and servers standardized on a chipset (a collection of chips) with two parts: the Northbridge and the Southbridge.
All CPUs (CPU1 and CPU2 in the image above) are connected to the Northbridge via a common bus, the Front-Side Bus (FSB). The Northbridge contains the memory controller, and its implementation determines the type of RAM chips the computer can use. Different types of RAM (DRAM, Rambus, SDRAM…) require different memory controllers (inside the Northbridge).
To reach other system devices, the Northbridge communicates with the Southbridge. The Southbridge handles communication with devices through a variety of buses (PCI, PCIe, SATA and USB, for example). Older systems had AGP slots attached directly to the Northbridge. This was done for performance reasons, because the connection between the Northbridge and the Southbridge had its own bottlenecks. Today, the PCIe slots are all connected to the Southbridge.
All data communication from one CPU to another must travel over the same bus used to communicate with the Northbridge, that is, through the FSB. All communication with RAM must pass through the Northbridge. Communication between a CPU and a device attached to the Southbridge is also routed through the Northbridge.
What are the problems with this design?
The first bottleneck appears when devices need access to RAM. In the early days, all communication with devices on either bridge had to pass through the CPU, which hurt overall system performance. Nowadays, almost all devices connected to any type of bus (PCI, PCIe, SATA, USB…) support direct memory access (DMA), which lets them (with the help of the Northbridge) store and receive data in RAM directly, without the intervention of the CPU. This reduces the workload of the CPU but increases contention at the Northbridge, as DMA requests compete with the CPUs' own requests for RAM access.
The second bottleneck involves the bus from the Northbridge to the RAM. On older systems there is just one bus to all RAM chips, so parallel access (e.g. two CPUs trying to access RAM at the same time) was not possible. Recent RAM types use two or more separate buses (channels), which doubles or triples the available bandwidth. Since bandwidth is still limited, memory access must be scheduled to minimize delays. As we will see, processors are much faster than memory and have to wait for it, because memory is slow and needs time to deliver the stored data. If multiple CPUs or cores try to access memory at the same time, the wait times are even longer. The same is true for DMA requests to the RAM (remember, devices attached to the Southbridge can use DMA to access RAM without going through the CPU).
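To put rough numbers on the effect of extra memory channels, here is a small back-of-the-envelope calculation. The function name `peak_bandwidth_gb_s` is made up, and the module figures (800 million transfers per second over a 64-bit bus) are an assumed example, not taken from the source. The result is a theoretical peak; real programs see less.

```python
def peak_bandwidth_gb_s(transfers_per_second, bus_width_bits, channels):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = bus_width_bits // 8
    return transfers_per_second * bytes_per_transfer * channels / 1e9


# Assumed example module: 800 million transfers/s on a 64-bit bus.
single = peak_bandwidth_gb_s(800e6, 64, channels=1)
dual = peak_bandwidth_gb_s(800e6, 64, channels=2)
print(single, dual)   # prints: 6.4 12.8
```

Doubling the channels doubles the peak, but every CPU, core and DMA-capable device still has to share that total.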
On some systems, the Northbridge does not actually contain the memory controller. Instead, the Northbridge is connected to a number of external memory controllers (MC[1..4]), as shown below.
The great thing about this architecture is that more than one memory bus exists (4 in this case), so the total available bandwidth increases. This design also supports more memory. Multiple memory accesses can happen at the same time, and with more bandwidth available they cause less waiting. This is especially true when multiple processors are connected directly to the Northbridge, as in the image above (there are two FSBs now).
Using multiple external memory controllers is not the only way to increase memory bandwidth. Another increasingly popular way is to integrate memory controllers into the CPUs and attach memory to each CPU. In this image, each CPU has its own memory controller (MC), and RAM is attached to every MC. The CPUs are also connected to each other.
However, this architecture has disadvantages. Because the system still has to make all memory accessible to all CPUs, memory is no longer uniform (hence NUMA, Non-Uniform Memory Architecture). Memory attached to a CPU (like RAM1 with CPU1) can be accessed by that CPU at the usual speed. The situation changes when CPU1 wants to access data in RAM2: that memory is attached to another processor (CPU2), so the request has to travel over the interconnect between the processors. Accessing memory attached to CPU2 from CPU1 requires crossing one interconnect. When the same CPU accesses memory attached to CPU4, two interconnects have to be crossed.
Each interconnect has a cost. We talk about “NUMA factors” when we describe the extra time needed to access remote memory (as in the example above).
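The interconnect counting in the example above can be modelled with a short breadth-first search. This is only a sketch: the `INTERCONNECTS` layout (four CPUs in a square, each linked to two neighbours) is my assumption about the picture, and the latency numbers `LOCAL_NS` and `PER_HOP_NS` are invented purely to show how the cost grows with every hop.

```python
from collections import deque

# Assumed layout: four CPUs in a square, each linked to two neighbours.
INTERCONNECTS = {
    1: [2, 3],
    2: [1, 4],
    3: [1, 4],
    4: [2, 3],
}

LOCAL_NS = 100    # made-up latency of a local memory access
PER_HOP_NS = 50   # made-up extra cost per interconnect crossed


def hops(src, dst):
    """Fewest interconnects between two CPUs (breadth-first search)."""
    queue = deque([(src, 0)])
    seen = {src}
    while queue:
        cpu, distance = queue.popleft()
        if cpu == dst:
            return distance
        for neighbour in INTERCONNECTS[cpu]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, distance + 1))
    raise ValueError("no path between CPUs")


def access_cost_ns(cpu, memory_owner):
    """Cost for `cpu` to read memory attached to `memory_owner`."""
    return LOCAL_NS + PER_HOP_NS * hops(cpu, memory_owner)


for owner in (1, 2, 4):
    print(f"CPU1 -> RAM{owner}: {hops(1, owner)} hop(s), "
          f"{access_cost_ns(1, owner)} ns")
```

With these invented numbers, local memory costs 100 ns, RAM2 costs 150 ns and RAM4 costs 200 ns, so a program on CPU1 is better off keeping its data in RAM1.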
3. CPU Caches
In the early days, the frequency of the CPU core was roughly equal to the frequency of the memory bus, and memory access was only a bit slower than register access. This changed when CPU designers increased the frequency of the CPU core while the frequency of the memory bus and the performance of RAM chips did not increase proportionally. Faster RAM can be built, but it is not economical.
A computer can have a small amount of high-speed SRAM and a large amount of DRAM. One possible implementation would be to dedicate a certain area of the processor's address space to the SRAM and the rest to the DRAM. In this scenario, the SRAM would serve as an extension of the processor's register set, as it is much faster than DRAM. In practice this is not how SRAM is used: instead of being managed by programs, it is used and administered transparently by the processor as a cache.
- In the previous examples the CPU was attached to the memory controller, but now it is ‘attached’ to the CPU cache, and all reads and writes must go through the cache.
- The connection between the CPU and the cache is a special, fast connection. The cache is connected to the same Front-Side Bus as main memory. The topology can be pictured as: CPU Core > Cache > Front-Side Bus > Northbridge > RAM channels > RAM.
- Even though most computers follow the ‘von Neumann architecture’ (code and data share the same memory), experience has shown that it is better to separate the caches used for code (CPU instructions) and for data. Intel has used separate code and data caches since 1993.
- After the L1 cache was introduced, an L2 cache (bigger but slower) followed soon after. Simply increasing the size of L1 was not an option for economic reasons.
In addition, we have processors with multiple cores, where each core has multiple threads. The difference between a core and a thread is that separate cores have separate copies of (almost) all the hardware resources. The cores can run completely independently unless they use the same resources, for example the connections to the outside, at the same time. Threads, on the other hand, share almost all of the processor's resources. Intel's implementation of threads has separate registers for each thread, and even that is limited: some registers are shared. The picture for a modern CPU looks like this:
The figure above, explained:
- We have two processors (larger grey rectangles)
- Each CPU has two cores (smaller grey rectangles)
- Each core has two threads (orange rectangles)
- Threads share L1 caches (light green rectangles)
- All cores of the CPU share higher level caches (darker green rectangles represent L2 and L3 caches)
- The two processors do not share any caches (each processor has its own cache)
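The sharing rules in the list above can be written down as a tiny function. This is just a model of the figure as described here (two processors, two cores each, two threads per core); the `HwThread` type and the `shared_cache` function are names I made up, and real CPUs can arrange their caches differently.

```python
from typing import NamedTuple


class HwThread(NamedTuple):
    """One hardware thread, identified by where it sits in the figure."""
    processor: int
    core: int
    thread: int


def shared_cache(a: HwThread, b: HwThread) -> str:
    """Closest cache level that two hardware threads have in common."""
    if a.processor != b.processor:
        return "none"     # different processors share no cache
    if a.core != b.core:
        return "L2/L3"    # cores of one processor share the higher levels
    return "L1"           # threads of one core share its L1 caches


t0 = HwThread(processor=0, core=0, thread=0)
print(shared_cache(t0, HwThread(0, 0, 1)))   # prints: L1
print(shared_cache(t0, HwThread(0, 1, 0)))   # prints: L2/L3
print(shared_cache(t0, HwThread(1, 0, 0)))   # prints: none
```

The closer two threads sit in this hierarchy, the faster the cache through which they can exchange data.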
Cache operation at a high level
- By default, all data read or written by the CPU cores is stored in the cache!
- There are memory regions which cannot be cached, and there are also instructions which allow the programmer to deliberately bypass certain caches.
- If the CPU needs a data word, the data caches are searched first. The cache cannot hold the content of the entire main memory, but since all memory addresses are cacheable, each cache entry is tagged using the address of the data word in main memory. This way, a request to read or write an address can search the caches for a matching tag. The address in this context can be either the virtual or the physical address, depending on the cache implementation.
- When memory content is needed by the processor, the entire cache line is loaded into the L1d (the L1 data cache). The memory address of each cache line is computed by masking the address value according to the cache line size. For a 64-byte cache line this means the low 6 bits are zeroed. The discarded bits are used as the offset into the cache line. The remaining bits are in some cases used to locate the line in the cache and as the tag. In practice an address value is split into three parts. For a 32-bit address these are, from the highest bits to the lowest: the tag, the cache set index, and the offset into the cache line.
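The address masking and tag matching described in the last two bullets can be tried out with a few bit operations. The 64-byte line size comes from the text; the 64 cache sets (`SET_BITS = 6`) and the direct-mapped `DirectMappedCache` class (one line per set) are assumptions I picked to keep the example small, so treat this as a sketch rather than how any particular CPU does it.

```python
OFFSET_BITS = 6   # 64-byte cache line, as in the text: low 6 bits
SET_BITS = 6      # assumed: 64 cache sets


def split_address(addr):
    """Split an address into (tag, set index, offset into the line)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    set_index = (addr >> OFFSET_BITS) & ((1 << SET_BITS) - 1)
    tag = addr >> (OFFSET_BITS + SET_BITS)
    return tag, set_index, offset


def line_address(addr):
    """Address of the cache line: the low offset bits are zeroed."""
    return addr & ~((1 << OFFSET_BITS) - 1)


class DirectMappedCache:
    """Toy cache with one line per set; it stores only the tags."""

    def __init__(self):
        self.tags = [None] * (1 << SET_BITS)

    def access(self, addr):
        """Return True on a hit; on a miss, load the line and return False."""
        tag, set_index, _ = split_address(addr)
        if self.tags[set_index] == tag:
            return True
        self.tags[set_index] = tag   # replaces whatever line was there
        return False


addr = 0x12345678
tag, set_index, offset = split_address(addr)
print(hex(tag), set_index, offset)   # prints: 0x12345 25 56
print(hex(line_address(addr)))       # prints: 0x12345640

cache = DirectMappedCache()
print(cache.access(addr))            # prints: False (first touch, a miss)
print(cache.access(0x12345640))      # prints: True (same cache line)
```

The second access hits even though the address differs, because both addresses fall into the same 64-byte line and therefore carry the same tag and set index.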