|
|
With Intel Skylake SP (and its successor Cascadelake SP), Intel introduced a change in the cache hierarchy. The sizes of the layers changed (L2 larger, L3 smaller) and the L3 is now a victim cache (non-inclusive cache). This also requires a different approach to measuring the traffic between L2, L3 and memory.
|
|
|
|
|
|
## What is a victim cache?
|
|
|
|
|
|
On all architectures before Intel Skylake SP (SKX), like Intel Broadwell EP (BDX), the caches are (mostly?) inclusive. This means that all cache lines currently in L1 are also contained in the L2 and L3 caches (and likewise, all lines in L2 are also present in L3). With Intel Skylake SP, the L3 cache became a victim cache (non-inclusive), while L1 and L2 remain inclusive.
|
|
|
|
|
|
If a core requests data from memory, it is loaded directly into L2 (and then into L1), bypassing the L3 cache (**). If a cache line needs to be evicted from L2, the current line state is checked and, based on some heuristics which include the probability of reuse and sharing between cores and chips, one of the following happens:
|
|
|
* the cache line is dropped,
* the cache line is evicted to L3, or
* the cache line is evicted to memory.
|
|
|
|
|
The exact heuristics are not published by Intel.
|
|
|
|
|
|
<p align="center">
|
|
|
<img width="49%" src="https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/cache_layers_bdx.png" alt="Cache layers of Intel Broadwell EP processors">
|
|
|
<img width="49%" src="https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/cache_layers_skx.png" alt="Cache layers of Intel Skylake SP processors">
|
|
|
</p>
|
|
|
|
|
|
(**) Unless the LLC prefetcher is active and pulls some cache lines from memory, which it is on the Intel Skylake SP test system. But as we will see later, the currently known events cannot differentiate whether the cache lines loaded into L2 come from L3 or from memory. The prefetcher accelerates the loading of data for streaming accesses, so we probably measure a higher load bandwidth due to the prefetcher, but the analysis is based on data volume per iteration, leaving out the factor time.
|
|
|
|
|
|
## What is the difference for measurements?
|
|
|
|
|
|
|
|
|
For the CPU architectures before Intel Skylake SP, LIKWID uses two events for loaded (`L2_LINES_IN_ALL`, rf107, (--)) and evicted (`L2_TRANS_L2_WB`, r40f0, (++)) cache lines. This was enough to achieve a high accuracy because all data coming from memory and going to memory has to flow through L2 and L3. With Intel Skylake SP the situation changed: the event `L2_LINES_IN_ALL` is the sum of loads from L3 **and** memory (simply all cache lines coming into L2, independent of the source). The same is true for the `L2_TRANS_L2_WB` event: there is no differentiation between evicts to L3 and evicts to memory, and there is also no event for counting dropped cache lines; the event simply counts all cache lines written back by the L2.
|
|
|
|
|
|
|
|
|
(--, ++) For both architectures: Intel Broadwell EP and Intel Skylake SP.
|
|
|
|
|
|
(++) There are other usable events, like `L2_LINES_OUT_SILENT` (r01F2) and `L2_LINES_OUT_NON_SILENT` (r02F2) for Intel Skylake SP.
|
|
|
|
|
|
|
|
|
|
|
|
| Event | BDX | SKX |
|-------|-----|-----|
| L2_LINES_IN_ALL | This event counts the number of L2 cache lines filling the L2. Counting does not cover rejects. | Counts the number of L2 cache lines filling the L2. Counting does not cover rejects. |
| L2_TRANS_L2_WB | This event counts L2 writebacks that access L2 cache. | Counts L2 writebacks that access L2 cache. |
| L2_LINES_OUT_SILENT | - | Counts the number of lines that are silently dropped by L2 cache when triggered by an L2 cache fill. These lines are typically in Shared or Exclusive state. A non-threaded event. |
| L2_LINES_OUT_NON_SILENT | - | Counts the number of lines that are evicted by L2 cache when triggered by an L2 cache fill. Those lines can be either in modified state or clean state. Modified lines may either be written back to L3 or directly written to memory and not allocated in L3. Clean lines may either be allocated in L3 or dropped. |
|
|
|
|
|
|
Source: [BDX](https://download.01.org/perfmon/BDX/broadwellx_core_v14.json), [SKX](https://download.01.org/perfmon/SKX/skylakex_core_v1.12.json)
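These raw events can also be placed on the general-purpose counters directly, outside of the predefined performance groups, using likwid-perfctr's custom event set syntax. The command below is only a sketch: the measured core (0) and the application are placeholders.

```bash
# Count the two events the L3 group is based on, pinned to core 0.
# EVENT:COUNTER pairs form a custom event set; ./your_app is a placeholder.
likwid-perfctr -C 0 -g L2_LINES_IN_ALL:PMC0,L2_TRANS_L2_WB:PMC1 ./your_app
```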
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
In order to show the difference at hardware level, I took some measurements (LIKWID version 4.3.4) with the current `L3` group ([BDX](https://github.com/RRZE-HPC/likwid/blob/4.3.0/groups/broadwellEP/L3.txt), [SKX](https://github.com/RRZE-HPC/likwid/blob/4.3.0/groups/skylakeX/L3.txt)) using a single core on an Intel Broadwell EP (Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz, CPU clock fixed to 2.30 GHz) and an Intel Skylake SP (Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz, CPU clock fixed to 2.40 GHz). As benchmark application I used `likwid-bench` with the benchmarks:
|
|
|
* [`load`](https://github.com/RRZE-HPC/likwid/blob/4.3.0/bench/x86-64/load.ptt): register = A[i]
|
|
|
* [`store`](https://github.com/RRZE-HPC/likwid/blob/4.3.0/bench/x86-64/store.ptt): A[i] = constant
|
|
|
* [`copy`](https://github.com/RRZE-HPC/likwid/blob/4.3.0/bench/x86-64/copy.ptt): A[i] = B[i]
|
* [`stream`](https://github.com/RRZE-HPC/likwid/blob/4.3.0/bench/x86-64/stream.ptt): A[i] = B[i]*c + C[i]
* [`triad`](https://github.com/RRZE-HPC/likwid/blob/4.3.0/bench/x86-64/triad.ptt): A[i] = B[i] + C[i]*D[i]
|
|
|
|
|
All benchmarks work on double precision floating-point arrays using scalar operations.
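A single measurement point can then be taken as sketched below. The working-set size, the thread domain, and the use of the Marker API are illustrative choices, not necessarily the exact settings used for the plots.

```bash
# Run the likwid-bench 'load' kernel single-threaded on a 2 GB working set
# in NUMA domain S0 and measure the L3 group on the same core.
# -m enables the Marker API regions built into likwid-bench.
likwid-perfctr -C S0:0 -g L3 -m likwid-bench -t load -w S0:2GB:1
```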
|
|
|
|
|
|
I mentioned that there are different events to measure the evict/writeback traffic from L2, so we first test whether our selection, the `L2_TRANS_L2_WB` event, provides better or worse results than the `L2_LINES_OUT_*` events.
|
|
|
<p align="center">
|
|
|
<img src="https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/SKX_L3_COMPARE_load.png" alt="Comparison of data volume per loop iteration using the events `L2_LINES_OUT_SILENT`, `L2_LINES_OUT_NON_SILENT` and `L2_TRANS_L2_WB` running the `load` benchmark on Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz"><br>
|
|
|
</p>
|
|
|
|
|
|
For readability, the plots for the remaining benchmarks are not shown directly, but here are the links:
|
|
|
|
|
|
* [store](https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/SKX_L3_COMPARE_store.png)
|
|
|
* [copy](https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/SKX_L3_COMPARE_copy.png)
|
|
|
* [stream](https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/SKX_L3_COMPARE_stream.png)
|
|
|
* [triad](https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/SKX_L3_COMPARE_triad.png)
|
|
|
|
|
|
These plots show no real difference between the measurements of `L2_LINES_OUT_NON_SILENT` and `L2_TRANS_L2_WB` for these workloads. The `L2_LINES_OUT_SILENT` event stays at zero. There might be workloads that produce a different picture. Note that the measurements were performed in two application runs: first with the `L2_TRANS_L2_WB` event and second with all `L2_LINES_OUT_*` events.
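The two passes could, for example, look like the following sketch (core, working-set size and kernel are placeholders):

```bash
# Pass 1: the writeback event used by the L3 group
likwid-perfctr -C S0:0 -g L2_TRANS_L2_WB:PMC0 -m \
    likwid-bench -t load -w S0:2GB:1

# Pass 2: the alternative L2 eviction events available on Intel Skylake SP
likwid-perfctr -C S0:0 -g L2_LINES_OUT_SILENT:PMC0,L2_LINES_OUT_NON_SILENT:PMC1 -m \
    likwid-bench -t load -w S0:2GB:1
```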
|
|
|
|
|
|
So let's check full measurements to show the difference between BDX and SKX.
|
|
|
|
|
|
### Comparison for load benchmark
|
|
|
<p align="center">
|
|
|
<img width="49%" src="https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/BDX_L3NEW_load.png" alt="Data volume per loop iteration of L3 and memory controller for the `load` benchmark on Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz">
|
<img width="49%" src="https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/SKX_L3MEM_load.png" alt="Data volume per loop iteration of L3 and memory controller for the `load` benchmark on Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz">
</p>
|
|
|
|
|
| BDX | SKX |
|-----------|-----------|
| As soon as the data set almost completely fills the L2 cache, the bytes loaded from L3 (`L2_LINES_IN_ALL*64`) rise until they reach 64 Byte per iteration. | When the L2 cache is almost full, the data is loaded from L3 (`L2_LINES_IN_ALL*64`) until we read 64 Byte per iteration. Same as for BDX. |
| The benchmark does not evict any data, hence `L2_TRANS_L2_WB*64` stays zero. | `L2_TRANS_L2_WB*64` rises similarly to `L2_LINES_IN_ALL*64`. This is because the cache lines are evicted from L2, but they can either be evicted to L3, evicted to memory, or dropped. |
| When the data sizes come closer to the L3 size (11 MByte as Cluster-on-Die is enabled), the data fetched from memory (`SUM(CAS_COUNT_RD)*64`) increases until 64 Byte per iteration are reached. | When the data sizes come closer to the L3 size (the full 28 MByte are usable although SNC is enabled), the data fetched from memory (`SUM(CAS_COUNT_RD)*64`) increases until 64 Byte per iteration are reached. |
| `SUM(CAS_COUNT_WR)*64` stays zero. | `SUM(CAS_COUNT_WR)*64` stays zero. |
|
|
|
|
|
|
The problem already becomes visible here. Since the cache lines in L2 are not also contained in L3, the cache lines need to be moved from L2 to L3 on eviction, which (for the load benchmark) doubles the measured data volume. Moreover, because the event `L2_TRANS_L2_WB` counts everything that is written back by the L2, it is unclear what happens to the cache lines: eviction to L3, eviction to memory, or dropping.
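For reference, the `*64` expressions in the table are the usual conversion from cache-line counts to bytes; the L3 group metrics use the same factor. A small post-processing sketch (the shell variables holding the raw counts and the runtime are placeholders):

```bash
# Convert raw cache-line counts into data volume and bandwidth
# (one cache line = 64 Byte). The variables are placeholders that would be
# filled from the likwid-perfctr output.
awk -v lines_in="$L2_LINES_IN_ALL" -v wb="$L2_TRANS_L2_WB" -v t="$RUNTIME" 'BEGIN {
    printf "L3 load data volume:  %.3f GByte\n",   lines_in * 64 / 1e9
    printf "L3 load bandwidth:    %.1f MByte/s\n", lines_in * 64 / 1e6 / t
    printf "L3 evict data volume: %.3f GByte\n",   wb * 64 / 1e9
    printf "L3 evict bandwidth:   %.1f MByte/s\n", wb * 64 / 1e6 / t
}'
```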
|
|
|
|
|
|
### Comparison for store benchmark
|
|
|
<p align="center">
|
|
|
<img width="49%" src="https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/BDX_L3NEW_store.png" alt="Data volume per loop iteration of L3 and memory controller for the `store` benchmark on Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz">
|
|
|
<img width="49%" src="https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/SKX_L3MEM_store.png" alt="Data volume per loop iteration of L3 and memory controller for the `store` benchmark on Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz">
|
|
|
</p>
|
|
|
|
|
|
| BDX | SKX |
|-----------|-----------|
| Expected behavior: loads of 32B per iteration start as soon as the vector size reaches the L2 size. These 32B are the read-for-ownership traffic for the stores. | Expected behavior: loads of 32B per iteration start as soon as the vector size reaches the L2 size. These 32B are the read-for-ownership traffic for the stores. |
| 32B are evicted from L2 to the L3 when the vector size is larger than L2. | 32B are evicted from L2 when the vector size is larger than L2, but up to half of the L3 size it is unclear whether they go to L3, to memory, or are dropped. Above half of the L3 size, the cache lines are either evicted from L2 to L3 (which forwards some cache lines to memory) or to memory directly, so the measured writeback "traffic" cannot be attributed to a single destination. |
| The memory loads of 32B start shortly after the L3 size is reached, so data is streamed from memory through the L3 into L2. | Although the L3 size is not fully reached, some loads (RFO traffic) are already handled by the memory. On the other hand, it takes much larger sizes than L3 (100-200 MB) until the full 32B are read from memory. |
| The memory stores of 32B behave similarly to the memory loads, which is expected given the 1:1 ratio of reads and writes. | The load and store memory traffic behaves similarly. It also takes larger sizes than L3 until the full 32B are written back to memory. |
|
|
|
For completeness, here are the plots for the other benchmarks:
|
|
|
|
|
|
### Comparison for copy benchmark
|
|
|
<p align="center">
|
|
|
<img width="49%" src="https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/BDX_L3NEW_copy.png" alt="Data volume per loop iteration of L3 and memory controller for the `copy` benchmark on Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz">
|
|
|
<img width="49%" src="https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/SKX_L3MEM_copy.png" alt="Data volume per loop iteration of L3 and memory controller for the `copy` benchmark on Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz">
|
|
|
</p>
|
|
|
|
|
|
| BDX | SKX |
|-----------|-----------|
| When the L2 is completely used, the loads from L3 rise until they read 64B (32B application data + 32B read-for-ownership). | When the L2 is completely used, the loads from L3 rise, and at around 50% of the L3 cache size the loads come from memory (either directly or through the L3, as the LLC prefetcher is active on this machine). |
| BDX evicts | SKX evicts |
|
|
|
|
|
|
### Comparison for stream benchmark
|
|
|
<p align="center">
|
|
|
<img width="49%" src="https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/BDX_L3NEW_stream.png" alt="Data volume per loop iteration of L3 and memory controller for the `stream` benchmark on Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz">
|
|
|
<img width="49%" src="https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/SKX_L3MEM_stream.png" alt="Data volume per loop iteration of L3 and memory controller for the `stream` benchmark on Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz">
|
|
|
</p>
|
|
|
|
|
|
### Comparison for triad benchmark
|
|
|
<p align="center">
|
|
|
<img width="49%" src="https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/BDX_L3NEW_triad.png" alt="Data volume per loop iteration of L3 and memory controller for the `triad` benchmark on Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz">
|
|
|
<img width="49%" src="https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/SKX_L3MEM_triad.png" alt="Data volume per loop iteration of L3 and memory controller for the `triad` benchmark on Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz">
|
|
|
</p>
|
|
|
|
|
|
|
|
|
## Implications on the use of the L3 performance group for Intel Skylake
|
|
|
The L3 performance group for Intel Skylake SP still uses the two events mentioned above. So keep in mind that `L2_LINES_IN_ALL` contains loads from both L3 and memory, and `L2_TRANS_L2_WB` contains writebacks to both L3 and memory.
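If the distinction between L3 and memory traffic matters for an analysis, a pragmatic workaround is to cross-check the core-side L3 group against the traffic measured at the memory controllers in a separate run. A sketch (core and application are placeholders; the `MEM` group is the one shipped with LIKWID for this architecture):

```bash
# Run 1: L2 <-> L3/memory traffic as seen from the core (L3 group)
likwid-perfctr -C S0:0 -g L3  ./your_app
# Run 2: traffic at the memory controllers (MEM group, uncore counters)
likwid-perfctr -C S0:0 -g MEM ./your_app
```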
|