|
|
|
|
|
With Intel Skylake SP (and its successor Cascadelake SP), Intel introduced a change in the cache hierarchy. The sizes of the layers changed (the L2 grew from 256 KiB to 1 MiB per core, the L3 shrank from 2.5 MiB to 1.375 MiB per core) and the L3 is now a victim cache (non-inclusive cache). This also requires a different approach to measuring the traffic between L2, L3 and memory.
|
|
|
|
|
|
|
|
|
## What is a [victim cache](https://en.wikipedia.org/wiki/Victim_cache)
|
|
|
On all architectures before Intel Skylake SP (SKX), like Intel Broadwell EP (BDX), the caches are (mostly?) inclusive. This means that all cache lines currently in L1 are contained in the L2 and L3 caches as well (and, likewise, all lines in L2 are also present in L3). With Intel Skylake SP, the L3 cache became a victim cache (non-inclusive), while L1 and L2 remain inclusive.
|
|
|
|
|
|
If a core requests data from memory, it is loaded directly into L2 (and then into L1), bypassing the L3 cache (**). If a cache line needs to be evicted from L2, the current line state is checked and, based on heuristics that include the probability of reuse and the sharing between cores and chips, the line is either evicted to L3, evicted to memory, or dropped.
|
So let's check full measurements to show the difference between BDX and SKX.
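As a point of reference, here is a minimal sketch of the kind of load-only streaming kernel such measurements rely on (a plain sum reduction; the actual benchmark, e.g. a likwid-bench `load` kernel, is vectorized and more careful, but the traffic pattern is the same):

```c
#include <stddef.h>

/* Load-only streaming kernel (sketch): every 64 Byte cache line of
 * `data` is read exactly once per sweep, and no data is written back.
 * Comparing the counted byte volume against n * sizeof(double) per
 * sweep is what the table below does for growing data set sizes. */
double load_kernel(const double *data, size_t n, int sweeps)
{
    double sum = 0.0;
    for (int s = 0; s < sweeps; s++)
        for (size_t i = 0; i < n; i++)
            sum += data[i];
    return sum;
}
```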
|
|
|
|
|
| BDX | SKX |
|-----------|-----------|
| As soon as the data set almost fills the L2 cache, the bytes loaded from L3 (`L2_LINES_IN_ALL*64`) rise until they reach 64 Byte per iteration. | When the L2 cache is almost full, the data is loaded from L3 (`L2_LINES_IN_ALL*64`) until we read 64 Byte per iteration. Same as for BDX. |
| The benchmark does not evict any data, hence the `L2_TRANS_L2_WB*64` volume stays zero. | `L2_TRANS_L2_WB*64` rises similarly to `L2_LINES_IN_ALL*64`. This is because the cache lines are evicted from L2, but they can either be evicted to L3, evicted to memory, or dropped. |
| When the data set approaches the L3 size (11 MByte, as Cluster-on-Die is enabled), the data fetched from memory (`SUM(CAS_COUNT_RD)*64`) increases until it reaches 64 Byte per iteration. | When the data set approaches the L3 size (the full 28 MByte are usable although SNC is enabled), the data fetched from memory (`SUM(CAS_COUNT_RD)*64`) increases until it reaches 64 Byte per iteration. |
| `SUM(CAS_COUNT_WR)*64` stays zero. | `SUM(CAS_COUNT_WR)*64` stays zero. |
|
|
|
|
|
|
The problem becomes visible here already. Since the L2 cache lines are commonly not contained in the L3, cache lines evicted from L2 have to be moved to L3, which for the load benchmark doubles the measured data volume: every line is counted once on its way into L2 (`L2_LINES_IN_ALL`) and once more on its way back out (`L2_TRANS_L2_WB`). Moreover, because the event `L2_TRANS_L2_WB` counts anything that is written back out of L2, it is unclear what actually happens to the cache lines: eviction to L3, eviction to memory, or dropping.
|
|
|
|
The memory traffic can be measured properly and with high accuracy assuming 64 Byte per cache line.
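For example, LIKWID's stock `MEM` performance group is built on the `CAS_COUNT_*` events of the memory controllers, so measuring a binary pinned to core 0 is as simple as:

```
likwid-perfctr -C 0 -g MEM ./a.out
```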
|
|
It is probably possible to use L3 events (part of the Uncore) to retrieve the counts for data flowing into L3, data being loaded by the L2 and the evictions to memory. But the Uncore is socket-specific and consequently does not allow the attribution of a single core's data consumption. Furthermore, there are quite a few LLC units (CBOXes), and each unit has to be programmed and read separately.
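LIKWID does support custom event sets on Uncore counters, so in principle an invocation like the following could be used, one event per CBOX (the event name here is a placeholder; the right SKX CBOX/CHA event and its filters would still have to be chosen):

```
likwid-perfctr -C 0 -g LLC_LOOKUP_DATA_READ:CBOX0C0,LLC_LOOKUP_DATA_READ:CBOX1C0 ./a.out
```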
|
|
|
|
|
|
After a meeting with Intel, we got a list of events:
|
|
|
|
|
|
* MEM_INST_RETIRED.ALL_LOADS (r81d0)
|
|
|
* MEM_INST_RETIRED.ALL_STORES (r82d0)
|
|
|
* MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT (r02d2)
|
|
|
* MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM (r04d2)
|
|
|
* MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS (r01d2)
|
|
|
* MEM_LOAD_L3_HIT_RETIRED.XSNP_NONE (r08d2)
|
|
|
* MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM (r01d3)
|
|
|
* MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM (r02d3)
|
|
|
* MEM_LOAD_L3_MISS_RETIRED.REMOTE_L4 (**)
|
|
|
* MEM_LOAD_MISC_RETIRED.NON_DRAM (**)
|
|
|
|
* MEM_LOAD_MISC_RETIRED.UC (r04d4)
|
|
|
* MEM_LOAD_MISC_RETIRED.UNKNOWN_SOURCE (**)
|
|
|
|
|
|
All events marked with (**) are not published and consequently not usable by LIKWID. We tried the other events, but for some it was clear beforehand that they wouldn't work. E.g., the `MEM_INST_RETIRED.ALL_*` events count the number of loads and stores, respectively, that are issued, executed and retired (completed) by the core, hence several units away from L2, L3 and memory. Moreover, there are cases where an instruction triggers data movement in the background (e.g. a read-for-ownership for stores whose destination cache line is not present in the L1 cache) that is not covered by these two events.
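Since the raw encodings are listed above, the published events can also be checked independently of LIKWID via `perf stat`'s raw event syntax (`rUUEE` = umask `UU`, event code `EE`); a sketch:

```
perf stat -e r81d0,r82d0,r02d2,r01d3 ./a.out
```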
|
The results show:
|
|
2. The L2 writeback path can be characterized. No information about L3 writebacks.
|
|
|
3. No information about the load path (##)
|
|
|
|
|
|
So, if we leave out `L2_TRANS_L2_WB` from the `L3` performance group, we can include both `IDI_MISC_WB*` events and still have one counter register left. With this counter we could measure the L3 hits to characterize the load path. Unfortunately, the `MEM_LOAD_L3_*` events are likely to end up in the specification updates (errata); this has not happened for Intel Skylake SP yet, but it has for Intel Skylake Desktop.
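A sketch of such a modified group in LIKWID's performance-group file format (custom groups can be placed under `$HOME/.likwid/groups/<architecture>/`). The counter mapping, the metric formulas and the choice of `MEM_LOAD_L3_HIT_RETIRED_XSNP_NONE` for the L3-hit slot are assumptions, not the shipped `L3` group:

```
SHORT L2<->L3 traffic with split L2 writebacks (sketch)

EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0  L2_LINES_IN_ALL
PMC1  IDI_MISC_WB_UPGRADE
PMC2  IDI_MISC_WB_DOWNGRADE
PMC3  MEM_LOAD_L3_HIT_RETIRED_XSNP_NONE

METRICS
Runtime (RDTSC) [s] time
L3 load bandwidth [MBytes/s]  1.0E-06*PMC0*64.0/time
L2 to L3 evict bandwidth [MBytes/s]  1.0E-06*PMC1*64.0/time
Dropped cache lines bandwidth [MBytes/s]  1.0E-06*PMC2*64.0/time
L3 hit loads  PMC3

LONG
Sketch only: L2_LINES_IN_ALL counts lines coming into L2,
IDI_MISC_WB_UPGRADE counts lines written back to L3 on L2 eviction,
IDI_MISC_WB_DOWNGRADE counts lines dropped on L2 eviction, and the
PMC3 event counts retired loads that hit in L3.
```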
|
|
|
|
|
|
(##) Commonly, all data should be loaded from memory directly into L2, unless the LLC prefetcher is active (as it is in this case). One might assume that all cache lines evicted to L3 for re-use are also loaded again from L3, but that would mean the heuristics always make the optimal decision.
|
|