CPU cache - WikiHQ

A CPU cache is a hardware cache used by the central processing unit (CPU) of a computer to reduce the average cost (time or energy) to access data from the main memory. A cache is a smaller, faster memory, located closer to a processor core, which stores copies of the data from frequently used main memory locations, avoiding the need to always refer to main memory which may be tens to hundreds of times slower to access.

Cache memory is typically implemented with static random-access memory (SRAM), which requires multiple transistors to store a single bit. This makes it expensive in terms of the area it takes up, and in modern CPUs the cache is typically the largest part by chip area. The size of the cache needs to be balanced with the general desire for smaller chips which cost less. Some modern designs implement some or all of their cache using the physically smaller eDRAM, which is slower to use than SRAM but allows larger amounts of cache for any given amount of chip area.

Most CPUs have a hierarchy of multiple cache levels (L1, L2, often L3, and rarely even L4), with separate instruction-specific (I-cache) and data-specific (D-cache) caches at level 1. The different levels are implemented in different areas of the chip; L1 is located as close to a CPU core as possible and thus offers the highest speed due to short signal paths, but requires careful design. L2 caches are physically separate from the CPU and operate slower, but place fewer demands on the chip designer and can be made much larger without impacting the CPU design. L3 caches are generally shared among multiple CPU cores.

Other types of caches exist (that are not counted towards the "cache size" of the most important caches mentioned above), such as the translation lookaside buffer (TLB) which is part of the memory management unit (MMU) which most CPUs have. Input/output sections also often contain data buffers that serve a similar purpose.

Overview

To access data in main memory, a multi-step process is used and each step introduces a delay. For instance, to read a value from memory in a simple computer system the CPU first selects the address to be accessed by expressing it on the address bus and waiting a fixed time to allow the value to settle. The memory device with that value, normally implemented in DRAM, holds that value in a very low-energy form that is not powerful enough to be read directly by the CPU. Instead, it has to copy that value from storage into a small buffer which is connected to the data bus. It then waits a certain time to allow this value to settle before reading the value from the data bus.

By locating the memory physically closer to the CPU the time needed for the buses to settle is reduced, and by replacing the DRAM with SRAM, which hold the value in a form that does not require amplification to be read, the delay within the memory itself is eliminated. This makes the cache much faster both to respond and to read or write. SRAM, however, requires anywhere from four to six transistors to hold a single bit, depending on the type, whereas DRAM generally uses one transistor and one capacitor per bit, which makes it able to store much more data for any given chip area.

Implementing some memory in a faster format can lead to large performance improvements. When trying to read from or write to a location in the memory, the processor checks whether the data from that location is already in the cache. If so, the processor will read from or write to the cache instead of the much slower main memory.

Many modern desktop, server, and industrial CPUs have at least three independent levels of caches (L1, L2 and L3) and different types of caches:

; Translation lookaside buffer (TLB): Used to speed up virtual-to-physical address translation for both executable instructions and data. A single TLB can be provided for access to both instructions and data, or a separate Instruction TLB (ITLB) and data TLB (DTLB) can be provided. However, the TLB cache is part of the memory management unit (MMU) and not directly related to the CPU caches.

;Instruction cache:

:; MicroOp-cache:

:; Branch target buffer:

:; Instruction cache (I-cache): Used to speed executable instruction fetch

;Data cache:

:; Data cache (D-cache): Used to speed data fetch and store; the data cache is usually organized as a hierarchy of more cache levels (L1, L2, etc.; see also multi-level caches below).

History

thumb|[[Motherboard of a NeXTcube computer (1990). At the lower edge of the image left from the middle, there is the CPU Motorola 68040 operated at 25 MHz with two separate level 1 caches of 4 KiB each on the chip, one for the instructions and one for data. The board has no external L2 cache.]]

Early examples of CPU caches include the Atlas 2 and the IBM System/360 Model 85 in the 1960s. The first CPUs that used a cache had only one level of cache; unlike later level 1 cache, it was not split into L1d (for data) and L1i (for instructions). Split L1 cache started in 1976 with the IBM 801 CPU, became mainstream in the late 1980s, and in 1997 entered the embedded CPU market with the ARMv5TE. As of 2015, even sub-dollar SoCs split the L1 cache. They also have L2 caches and, for larger processors, L3 caches as well. The L2 cache is usually not split, and acts as a common repository for the already split L1 cache. Every core of a multi-core processor has a dedicated L1 cache and is usually not shared between the cores. The L2 cache, and lower-level caches, may be shared between the cores. L4 cache is currently uncommon, and is generally dynamic random-access memory (DRAM) on a separate die or chip, rather than static random-access memory (SRAM). An exception to this is when eDRAM is used for all levels of cache, down to L1. Historically L1 was also on a separate die, however bigger die sizes have allowed integration of it as well as other cache levels, with the possible exception of the last level. Each extra level of cache tends to be smaller and faster than the lower levels. and Intel Ice Lake-based processors from 2018, having 48 KiB L1 data cache and 48 KiB L1 instruction cache. In 2020, some Intel Atom CPUs (with up to 24 cores) have (multiple of) 4.5 MiB and 15 MiB cache sizes.

Operation

Cache entries

Data is transferred between memory and cache in blocks of fixed size, called cache lines or cache blocks. When a cache line is copied from memory into the cache, a cache entry is created. The cache entry will include the copied data as well as the requested memory location (called a tag).

When the processor needs to read or write a location in memory, it first checks for a corresponding entry in the cache. The cache checks for the contents of the requested memory location in any cache lines that might contain that address. If the processor finds that the memory location is in the cache, a cache hit has occurred. However, if the processor does not find the memory location in the cache, a cache miss has occurred. In the case of a cache hit, the processor immediately reads or writes the data in the cache line. For a cache miss, the cache allocates a new entry and copies data from main memory, then the request is fulfilled from the contents of the cache.

Policies

Replacement policies

To make room for the new entry on a cache miss, the cache may have to evict one of the existing entries. The heuristic it uses to choose the entry to evict is called the replacement policy. The fundamental problem with any replacement policy is that it must predict which existing cache entry is least likely to be used in the future. Predicting the future is generally difficult, so there is no perfect method to choose among the variety of replacement policies available. One popular replacement policy, least-recently used (LRU), replaces the least recently accessed entry.

Marking some memory ranges as non-cacheable can improve performance, by avoiding caching of memory regions that are rarely re-accessed. This avoids the overhead of loading something into the cache without having any reuse. Cache entries may also be disabled or locked depending on the context.

Write policies

If data are written to the cache, at some point they must also be written to main memory; the timing of this write is known as the write policy. In a write-through cache, every write to the cache causes a write to main memory. Alternatively, in a write-back or copy-back cache, writes are not immediately mirrored to the main memory, with locations been written over being marked as dirty, being written back to the main memory only when they are evicted from the cache. For this reason, a read miss in a write-back cache may sometimes require two memory accesses to service: one to first write the dirty location to main memory, and then another to read the new location from memory. Also, a write to a main memory location that is not yet mapped in a write-back cache may evict an already dirty location, thereby freeing that cache space for the new memory location.

There are intermediate policies as well. The cache may be write-through, but the writes may be held in a store data queue temporarily, usually so multiple stores can be processed together (which can reduce bus turnarounds and improve bus utilization).

Cached data from the main memory may be changed by other entities (e.g., peripherals using direct memory access (DMA) or another core in a multi-core processor), in which case the copy in the cache may become out-of-date or stale. Alternatively, when a CPU in a multiprocessor system updates data in the cache, copies of data in caches associated with other CPUs become stale. Communication protocols between the cache managers that keep the data consistent are known as cache coherence protocols.

Cache performance

Cache performance measurement has become important in recent times where the speed gap between the memory performance and the processor performance is increasing exponentially. The cache was introduced to reduce this speed gap. Thus knowing how well the cache is able to bridge the gap in the speed of processor and memory becomes important, especially in high-performance systems. The cache hit rate and the cache miss rate play an important role in determining this performance. To improve the cache performance, reducing the miss rate becomes one of the necessary steps among other steps. Decreasing the access time to the cache also gives a boost to its performance and helps with optimization.

CPU stalls

The time taken to fetch one cache line from memory (read latency due to a cache miss) matters because the CPU will run out of work while waiting for the cache line. When a CPU reaches this state, it is called a stall. As CPUs become faster compared to main memory, stalls due to cache misses displace more potential computation; modern CPUs can execute hundreds of instructions in the time taken to fetch a single cache line from main memory.

Various techniques have been employed to keep the CPU busy during this time, including out-of-order execution in which the CPU attempts to execute independent instructions after the instruction that is waiting for the cache miss data. Another technology, used by many processors, is simultaneous multithreading (SMT), which allows an alternate thread to use the CPU core while the first thread waits for required CPU resources to become available.

Associativity

thumb|upright=1.7|An illustration of different ways in which memory locations can be cached by particular cache locations

The placement policy decides where in the cache a copy of a particular entry of main memory will go. If the placement policy is free to choose any entry in the cache to hold the copy, the cache is called fully associative. At the other extreme, if each entry in the main memory can go in just one place in the cache, the cache is direct-mapped. Many caches implement a compromise in which each entry in the main memory can go to any one of N places in the cache, and are described as N-way set associative. For example, the level-1 data cache in an AMD Athlon is two-way set associative, which means that any particular location in main memory can be cached in either of two locations in the level-1 data cache.

Choosing the right value of associativity involves a trade-off. If there are ten places to which the placement policy could have mapped a memory location, then to check if that location is in the cache, ten cache entries must be searched. Checking more places takes more power and chip area, and potentially more time. On the other hand, caches with more associativity suffer fewer misses (see conflict misses), so that the CPU wastes less time reading from the slow main memory. The general guideline is that doubling the associativity, from direct mapped to two-way, or from two-way to four-way, has about the same effect on raising the hit rate as doubling the cache size. However, increasing associativity more than four does not improve hit rate as much, and are generally done for other reasons (see virtual aliasing). Some CPUs can dynamically reduce the associativity of their caches in low-power states, which acts as a power-saving measure.

In order of worse but simple to better but complex:

Direct mapped cachegood best-case time, but unpredictable in the worst case
Two-way set associative cache
Two-way skewed associative cache
Four-way set-associative cache
Eight-way set-associative cache, a common choice for later implementations
12-way set associative cache, similar to eight-way
Fully associative cachethe best miss rates, but practical only for a small number of entries

Direct-mapped cache

In this cache organization, each location in the main memory can go in only one entry in the cache. Therefore, a direct-mapped cache can also be called a "one-way set associative" cache. It does not have a placement policy as such, since there is no choice of which cache entry's contents to evict. This means that if two locations map to the same entry, they may continually knock each other out. Although simpler, a direct-mapped cache needs to be much larger than an associative one to give comparable performance, and it is more unpredictable. Let be block number in cache, be block number of memory, and be number of blocks in cache, then mapping is done with the help of the equation .

Two-way set associative cache

If each location in the main memory can be cached in either of two locations in the cache, one logical question is: which one of the two? The simplest and most commonly used scheme, shown in the right-hand diagram above, is to use the least significant bits of the memory location's index as the index for the cache memory, and to have two entries for each index. One benefit of this scheme is that the tags stored in the cache do not have to include that part of the main memory address which is implied by the cache memory's index. Since the cache tags have fewer bits, they require fewer transistors, take less space on the processor circuit board or on the microprocessor chip, and can be read and compared faster. Also LRU algorithm is especially simple since only one bit needs to be stored for each pair.

Speculative execution

One of the advantages of a direct-mapped cache is that it allows simple and fast speculation. Once the address has been computed, the one cache index which might have a copy of that location in memory is known. That cache entry can be read, and the processor can continue to work with that data before it finishes checking that the tag actually matches the requested address.

The idea of having the processor use the cached data before the tag match completes can be applied to associative caches as well. A subset of the tag, called a hint, can be used to pick just one of the possible cache entries mapping to the requested address. The entry selected by the hint can then be used in parallel with checking the full tag. The hint technique works best when used in the context of address translation, as explained below.

Two-way skewed associative cache

Other schemes have been suggested, such as the skewed cache, Additionally, when it comes time to load a new line and evict an old line, it may be difficult to determine which existing line was least recently used, because the new line conflicts with data at different indexes in each way; LRU tracking for non-skewed caches is usually done on a per-set basis. Nevertheless, skewed-associative caches have major advantages over conventional set-associative ones.

Pseudo-associative cache

A true set-associative cache tests all the possible ways simultaneously, using something like a content-addressable memory. A pseudo-associative cache tests each possible way one at a time. A hash-rehash cache and a column-associative cache are examples of a pseudo-associative cache.

In the common case of finding a hit in the first way tested, a pseudo-associative cache is as fast as a direct-mapped cache, but it has a much lower conflict miss rate than a direct-mapped cache, closer to the miss rate of a fully associative cache. is to use the set index to map to a cache set as a conventional set associative cache does, and to use the added tag bits to index a way in the set. For example, in a 4-way set associative cache, the two bits are used to index way 00, way 01, way 10, and way 11, respectively. This double cache indexing is called a "major location mapping", and its latency is equivalent to a direct-mapped access. Extensive experiments in multicolumn cache design Intel's way-predicting cache memory, IBM's reconfigurable multi-way associative cache memory and Oracle's dynamic cache replacement way selection based on address tab bits.

Cache entry structure

Cache row entries usually have the following structure:

{| style="width:30%; text-align:center" border="1"

| tag || data block || flag bits

The data block (cache line) contains the actual data fetched from the main memory. The tag contains (part of) the address of the actual data fetched from the main memory. The flag bits are discussed below.

The "size" of the cache is the amount of main memory data it can hold. This size can be calculated as the number of bytes stored in each data block times the number of blocks stored in the cache. (The tag, flag and error correction code bits are not included in the size, although they do affect the physical area of a cache.)

An effective memory address which goes along with the cache line (memory block) is split (MSB to LSB) into the tag, the index and the block offset.

{| style="width:30%; text-align:center" border="1"

| tag || index || block offset

The index describes which cache set that the data has been put in. The index length is <math>\lceil \log_2(s) \rceil</math> bits for cache sets.

The block offset specifies the desired data within the stored data block within the cache row. Typically the effective address is in bytes, so the block offset length is <math>\lceil \log_2(b) \rceil</math> bits, where is the number of bytes per data block.

The tag contains the most significant bits of the address, which are checked against all rows in the current set (the set has been retrieved by index) to see if this set contains the requested address. If it does, a cache hit occurs. The tag length in bits is as follows:

::<code>tag_length = address_length - index_length - block_offset_length</code>

Some authors refer to the block offset as simply the "offset" or the "displacement".

Example

The original Pentium 4 processor had a four-way set associative L1 data cache of 8 KiB in size, with 64-byte cache blocks. Hence, there are 8 KiB / 64 = 128 cache blocks. The number of sets is equal to the number of cache blocks divided by the number of ways of associativity, what leads to 128 / 4 = 32 sets, and hence 2<sup>5</sup> = 32 different indices. There are 2<sup>6</sup> = 64 possible offsets. Since the CPU address is 32 bits wide, this implies 32 − 5 − 6 = 21 bits for the tag field.

The original Pentium 4 processor also had an eight-way set associative L2 integrated cache 256 KiB in size, with 128-byte cache blocks. This implies 32 − 8 − 7 = 17 bits for the tag field.