# L2/L3/MEM traffic on Intel Skylake SP/Cascadelake SP

With Intel Skylake SP (and its successor Cascadelake SP), Intel introduced a change in the cache hierarchy. The sizes of the layers changed (L2 larger, L3 smaller) and the L3 is now a victim cache (non-inclusive cache). This also requires a different approach to measuring the traffic between L2, L3 and memory.
## What is a victim cache
On all architectures before Intel Skylake SP, the caches are (mostly) inclusive. This means that all cache lines that are currently in L1 are contained in the L2 and L3 caches as well (and all lines in L2 are also present in L3). With Intel Skylake SP, the L3 cache became a victim cache (non-inclusive), while L1 and L2 continue to be inclusive.

If a core requests data from memory, it is loaded directly into L2 (and then into L1), bypassing the L3 cache (**). If a cache line needs to be evicted from L2, its current state is checked and, based on some heuristics that include the probability of reuse and the sharing between cores and chips, one of three things happens:

* the cache line is dropped (makes sense for clean cache lines),
* the cache line is evicted to L3 (makes sense for modified and shared cache lines), or
* the cache line is evicted directly to memory.

The exact heuristics are not published by Intel.
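As a rough mental model only (purely illustrative C, with names and logic that are my assumptions, not Intel's documented policy), the decision on an L2 eviction might look like this:

```c
/* Purely illustrative sketch of the three possible outcomes when a line
 * leaves the SKX L2 cache -- the real heuristic is undocumented. */
enum line_state   { LINE_CLEAN, LINE_SHARED, LINE_MODIFIED };
enum evict_target { EVICT_DROP, EVICT_TO_L3, EVICT_TO_MEMORY };

enum evict_target l2_evict_decision(enum line_state state, int likely_reused)
{
    if (state == LINE_CLEAN && !likely_reused)
        return EVICT_DROP;      /* silent drop: a valid copy exists in memory */
    if (likely_reused || state == LINE_SHARED)
        return EVICT_TO_L3;     /* keep a copy close by in the victim L3      */
    return EVICT_TO_MEMORY;     /* modified and unlikely to be reused         */
}
```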
(**) Unless the LLC prefetcher is active and pulls some cache lines from memory, which it is on the Intel Skylake SP test system. But as we will see later, the currently known events cannot differentiate whether the L2 load traffic comes from L3 or from memory. The prefetcher accelerates the loading of data for streaming accesses, so we probably measure a higher load bandwidth due to the prefetcher, but the analysis is based on data volume per iteration, leaving out the factor time.
## What is the difference for measurements?
For the CPU architectures before Intel Skylake SP, LIKWID uses two events for loaded (L2_LINES_IN_ALL, r07f1 (--)) and evicted (L2_TRANS_L2_WB, r40f0 (--)) cache lines. This was enough to achieve a high accuracy because all data coming from memory and going to memory has to flow through L2 and L3. With Intel Skylake SP the situation changed: the event L2_LINES_IN_ALL is the sum of loads from L3 **and** memory (simply all cache lines coming into L2, independent of their source). The same is true for the L2_TRANS_L2_WB event. There is no differentiation between evicts to L3 and evicts to memory, and also no event for counting dropped cache lines.

(--) For both architectures, Intel Broadwell EP and Intel Skylake SP. There are other usable events, like `L2_LINES_OUT_SILENT` (r01F2) and `L2_LINES_OUT_NON_SILENT` (r02F2), for Intel Skylake SP.
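The plots below show data volumes and bandwidths derived from these counts. As a reference, here is a minimal sketch of how the `L3` group (linked in the next paragraph) turns the two event counts into its metrics, assuming 64 Byte per cache line (the struct and function names are mine, not LIKWID's):

```c
/* Rough sketch of the L3 group's derived metrics: count * 64 Byte per
 * transferred cache line, divided by the measured runtime for bandwidths. */
typedef struct {
    double lines_in;   /* L2_LINES_IN_ALL        */
    double lines_wb;   /* L2_TRANS_L2_WB         */
    double runtime;    /* measured runtime [s]   */
} l3_counts;

static double l3_load_bandwidth_MBs(l3_counts c)  { return 1.0e-6 * c.lines_in * 64.0 / c.runtime; }
static double l3_evict_bandwidth_MBs(l3_counts c) { return 1.0e-6 * c.lines_wb * 64.0 / c.runtime; }
static double l3_data_volume_GB(l3_counts c)      { return 1.0e-9 * (c.lines_in + c.lines_wb) * 64.0; }
```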
<p align="center">
<img width="49%" src="https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/cache_layers_bdx.png" alt="Cache layers of Intel Broadwell EP processors">
<img width="49%" src="https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/cache_layers_skx.png" alt="Cache layers of Intel Skylake SP processors">
</p>

In order to show the difference at the hardware level, I took some measurements (LIKWID version 4.3.4) with the current `L3` group ([BDX](https://github.com/RRZE-HPC/likwid/blob/4.3.0/groups/broadwellEP/L3.txt), [SKX](https://github.com/RRZE-HPC/likwid/blob/4.3.0/groups/skylakeX/L3.txt)) using a single core on an Intel Broadwell EP (Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz, CPU clock fixed to 2.30 GHz) and an Intel Skylake SP (Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz, CPU clock fixed to 2.40 GHz). As benchmark application I used `likwid-bench` with the following benchmarks:

* [`load`](https://github.com/RRZE-HPC/likwid/blob/4.3.0/bench/x86-64/load.ptt): register = A[i]
* [`store`](https://github.com/RRZE-HPC/likwid/blob/4.3.0/bench/x86-64/store.ptt): A[i] = constant
* [`copy`](https://github.com/RRZE-HPC/likwid/blob/4.3.0/bench/x86-64/copy.ptt): A[i] = B[i]
* [`stream`](https://github.com/RRZE-HPC/likwid/blob/4.3.0/bench/x86-64/stream.ptt): A[i] = B[i] * c + C[i]
* [`triad`](https://github.com/RRZE-HPC/likwid/blob/4.3.0/bench/x86-64/triad.ptt): A[i] = B[i] * C[i] + D[i]

All benchmarks work on double precision floating-point arrays using scalar operations.
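For readers who do not want to open the `.ptt` files, the access patterns correspond roughly to the following scalar C loops (an approximation; the real kernels are hand-written assembly, and the load kernel uses a reduction here only so the compiler keeps the loads):

```c
/* Approximate C versions of the likwid-bench kernels used above. */
double bench_load(const double *A, long n)            /* register = A[i]  */
{
    double s = 0.0;
    for (long i = 0; i < n; i++) s += A[i];           /* reduction keeps the loads alive */
    return s;
}
void bench_store(double *A, double c, long n)         /* A[i] = constant  */
{
    for (long i = 0; i < n; i++) A[i] = c;
}
void bench_copy(double *A, const double *B, long n)   /* A[i] = B[i]      */
{
    for (long i = 0; i < n; i++) A[i] = B[i];
}
void bench_stream(double *A, const double *B, const double *C, double c, long n)
{
    for (long i = 0; i < n; i++) A[i] = B[i] * c + C[i];
}
void bench_triad(double *A, const double *B, const double *C, const double *D, long n)
{
    for (long i = 0; i < n; i++) A[i] = B[i] * C[i] + D[i];
}
```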
### Comparison for load benchmark
<p align="center">
<img width="49%" src="https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/SKX_L3MEM_load.png" alt="Data volume per loop iteration of L3 and memory controller for the `load` benchmark on Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz">
</p>

For BDX the behavior is as expected, while SKX behaves differently:

| BDX | SKX |
|-----------|-----------|
| As soon as the data size fills the L2 cache almost completely, the volume loaded from L3 (`L2_LINES_IN_ALL*64`) rises until it reaches 64 Byte per iteration. | When the L2 cache is almost full, cache lines are loaded from L3 (`L2_LINES_IN_ALL*64`) until we read 64 Byte per iteration. Same as for BDX. |
| The benchmark does not evict any data, hence `L2_TRANS_L2_WB*64` stays zero. | `L2_TRANS_L2_WB*64` rises to the same extent as `L2_LINES_IN_ALL*64`. The cache lines are evicted from L2, but they can be either evicted to L3, evicted to memory or dropped; the event does not differentiate. |
| `SUM(CAS_COUNT_RD)*64` stays zero as long as the data fits into L3. | `SUM(CAS_COUNT_RD)*64` stays zero as long as the data fits into L3. |
| When the data size comes closer to the L3 size (11 MByte as Cluster-on-Die is enabled), the volume fetched from memory (`SUM(CAS_COUNT_RD)*64`) increases until it reaches 64 Byte. | When the data size comes closer to the L3 size (the full 28 MByte are usable although SNC is enabled), the volume fetched from memory (`SUM(CAS_COUNT_RD)*64`) increases until it reaches 64 Byte. |
| `SUM(CAS_COUNT_WR)*64` stays zero. | No data is written back to memory, `SUM(CAS_COUNT_WR)*64` stays zero. |
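As a quick sanity check of what this means for the existing `L3` group on SKX: for this pure load stream it adds up incoming and outgoing lines, so it reports roughly twice the volume the kernel actually reads, without telling where the outgoing lines went. A small back-of-the-envelope sketch (the helper is mine, values taken from the plot above):

```c
/* Back-of-the-envelope check for the SKX load benchmark beyond the L2 size:
 * the group sums incoming and outgoing lines, so it reports ~128 B per
 * iteration for a kernel that reads 64 B per iteration. */
#include <assert.h>

static void skx_load_check(void)
{
    double lines_in = 1.0, lines_wb = 1.0;            /* per iteration, from the plot */
    double reported = (lines_in + lines_wb) * 64.0;   /* what the L3 group adds up    */
    double actual   = lines_in * 64.0;                /* what the kernel reads        */
    assert(reported == 2.0 * actual);                 /* 128 B reported vs 64 B read  */
}
```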
### Comparison for store benchmark
<p align="center">
<img width="49%" src="https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/SKX_L3MEM_store.png" alt="Data volume per loop iteration of L3 and memory controller for the `store` benchmark on Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz">
</p>

| BDX | SKX |
|-----------|-----------|
| As expected, 32B of data are loaded as soon as the vector size reaches the L2 size. These 32B are the read-for-ownership traffic for the stores. | As expected, 32B of data are loaded as soon as the vector size reaches the L2 size. These 32B are the read-for-ownership traffic for the stores. |
| 32B are evicted from L2 to L3 when the vector size is larger than L2. | 32B are evicted from L2 to L3 when the vector size is larger than L2. There are no writebacks to memory as long as only about half of the L3 cache is used. Above half of the L3, the cache lines are either evicted from L2 to L3 (which forwards some cache lines to memory) or evicted directly to memory. |
| The memory loads of 32B start shortly after the L3 size is reached, so data is streamed from memory through the L3 into L2. | Although the L3 size is not fully reached, some loads (RFO traffic) are already served by memory. On the other hand, it takes much larger sizes than the L3 (100-200 MB) until the full 32B are read from memory. |
| The memory stores of 32B behave like the memory loads, resulting in a 1:1 ratio of reads and writes. | The load and store memory traffic behave similarly. It also takes larger sizes than the L3 until the full 32B are written back to memory. |
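The read-for-ownership loads in the first row follow from write-allocate caching: before a line can be modified, the full 64 Byte line has to be read. A minimal sketch of that accounting (struct and function names are mine):

```c
/* Write-allocate / read-for-ownership accounting for a pure store stream:
 * every stored line is first read in full, so read and write volume match. */
#include <stddef.h>

struct mem_traffic { double read_bytes; double write_bytes; };

static struct mem_traffic store_stream_traffic(size_t n_doubles)
{
    struct mem_traffic t;
    t.write_bytes = (double)n_doubles * sizeof(double); /* the stored data       */
    t.read_bytes  = t.write_bytes;                      /* RFO of the same lines */
    return t;
}
```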
### Comparison for copy benchmark
<p align="center">
<img width="49%" src="https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/BDX_L3NEW_copy.png" alt="Data volume per loop iteration of L3 and memory controller for the `copy` benchmark on Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz">
<img width="49%" src="https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/SKX_L3MEM_copy.png" alt="Data volume per loop iteration of L3 and memory controller for the `copy` benchmark on Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz">
</p>

| BDX | SKX |
|-----------|-----------|
| When the L2 is completely used, the loads from L3 rise until they read 64B (32B application data + 32B read-for-ownership). | When the L2 is completely used, the loads from L3 rise, and at around 50% of the L3 cache size the loads start coming from memory (either directly or through the L3, as the LLC prefetcher is active on this machine). |
| BDX evicts | SKX evicts |
### Comparison for stream benchmark
<p align="center">
<img width="49%" src="https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/BDX_L3NEW_stream.png" alt="Data volume per loop iteration of L3 and memory controller for the `stream` benchmark on Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz">