|
<img width="49%" src="https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/cache_layers_skx.png" alt="Cache layers of Intel Skylake SP processors">
|
|
<img width="49%" src="https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/cache_layers_skx.png" alt="Cache layers of Intel Skylake SP processors">
|
|
</p>
|
|
</p>
|
|
|
|
|
|
|
|
In order to show the difference at the hardware level, I took some measurements (LIKWID version 4.3.4) with the current `L3` group using a single core on an Intel Broadwell EP (Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz, CPU clock fixed to 2.30 GHz) and an Intel Skylake SP (Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz, CPU clock fixed to 2.40 GHz). As benchmark application I used `likwid-bench` with the benchmarks [`load`](https://github.com/RRZE-HPC/likwid/blob/4.3.0/bench/x86-64/load.ptt), [`store`](https://github.com/RRZE-HPC/likwid/blob/4.3.0/bench/x86-64/store.ptt), [`copy`](https://github.com/RRZE-HPC/likwid/blob/4.3.0/bench/x86-64/copy.ptt), [`stream`](https://github.com/RRZE-HPC/likwid/blob/4.3.0/bench/x86-64/stream.ptt) and [`triad`](https://github.com/RRZE-HPC/likwid/blob/4.3.0/bench/x86-64/triad.ptt). All benchmarks work on double-precision floating-point arrays using scalar operations.
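A minimal sketch of how such a single-core measurement can be reproduced (the working-set sizes below are illustrative; the actual runs swept many more sizes):

```sh
# Measure the L3 group for the load kernel on a single pinned core while
# sweeping the working-set size across the L2 and L3 capacities.
for size in 16kB 256kB 1MB 4MB 16MB 64MB; do
    likwid-perfctr -C S0:0 -g L3 likwid-bench -t load -w S0:${size}:1
done
```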
|
|
|
|
|
|
|
|
### Comparison for load benchmark
|
|
|
|
<p align="center">
|
|
|
|
<img width="49%" src="https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/BDX_L3NEW_load.png" alt="Data volume per loop iteration of L3 and memory controller for the `load` benchmark on Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz">
|
|
|
|
<img width="49%" src="https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/SKX_L3MEM_load.png" alt="Data volume per loop iteration of L3 and memory controller for the `load` benchmark on Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz">
|
|
|
|
</p>
|
|
|
|
|
|
|
|
For BDX, the behavior is as expected:

* As soon as the data almost fills the L2 cache, the volume of cache lines loaded from L3 (`L2_LINES_IN_ALL*64`) rises until it reaches 64 Byte per iteration.
* The benchmark does not evict any data, hence the `L2_TRANS_L2_WB*64` event stays zero.
* When the data size comes closer to the L3 size (11 MByte, as Cluster-on-Die is enabled), the data volume fetched from memory (`SUM(CAS_COUNT_RD)*64`) increases up to 64 Byte per iteration.
|
|
|
|
|
|
|
|
For SKX, the behavior is different:

* When the L2 cache is almost full, cache lines are loaded from L3 (`L2_LINES_IN_ALL*64`) until 64 Byte per iteration are read, same as for BDX.
* But `L2_TRANS_L2_WB*64` rises as well, to the same extent as `L2_LINES_IN_ALL*64`. This is because the cache lines are evicted to L3 and fetched again later: the SKX L2 cache marks them as likely to be reused shortly.
* No data is written back to memory, as `SUM(CAS_COUNT_WR)*64` stays zero.
* When the data size comes closer to the L3 size (the full 28 MByte are usable although Sub-NUMA Clustering (SNC) is enabled), the data volume fetched from memory (`SUM(CAS_COUNT_RD)*64`) increases up to 64 Byte per iteration.
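All of these events count 64-Byte cache lines, so the quantity plotted above follows by scaling the raw count with the line size and dividing by the loop iteration count reported by `likwid-bench`. A minimal sketch (the helper name is mine, not a LIKWID output field):

```sh
# bytes_per_iter <raw cache-line event count> <loop iterations>
bytes_per_iter() {
    echo "scale=2; $1 * 64 / $2" | bc
}
# Example: an L2_LINES_IN_ALL count equal to the iteration count
# corresponds to a full 64 Byte loaded from L3 per iteration.
bytes_per_iter 1200000 1200000   # -> 64.00
```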
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| BDX | SKX |
|-----------|-----------|
| As soon as the data almost fills the L2 cache, the volume of cache lines loaded from L3 (`L2_LINES_IN_ALL*64`) rises until it reaches 64 Byte per iteration. | When the L2 cache is almost full, cache lines are loaded from L3 (`L2_LINES_IN_ALL*64`) until 64 Byte per iteration are read, same as for BDX. |
| The benchmark does not evict any data, hence the `L2_TRANS_L2_WB*64` event stays zero. | `L2_TRANS_L2_WB*64` rises similarly to `L2_LINES_IN_ALL*64` because the cache lines are evicted to L3 and fetched again later: the SKX L2 cache marks them as likely to be reused shortly. |
| `SUM(CAS_COUNT_WR)*64` stays zero. | `SUM(CAS_COUNT_WR)*64` stays zero. |
| When the data size comes closer to the L3 size (11 MByte, as Cluster-on-Die is enabled), the data volume fetched from memory (`SUM(CAS_COUNT_RD)*64`) increases up to 64 Byte per iteration. | When the data size comes closer to the L3 size (the full 28 MByte are usable although SNC is enabled), the data volume fetched from memory (`SUM(CAS_COUNT_RD)*64`) increases up to 64 Byte per iteration. |
|
|
|
|
|
|
|
|
|
|
|
|
### Comparison for store benchmark
|
|
|
|
<p align="center">
|
|
|
|
<img width="49%" src="https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/BDX_L3NEW_store.png" alt="Data volume per loop iteration of L3 and memory controller for the `store` benchmark on Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz">
|
|
|
|
<img width="49%" src="https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/SKX_L3MEM_store.png" alt="Data volume per loop iteration of L3 and memory controller for the `store` benchmark on Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz">
|
|
|
|
</p>
|
|
|
|
|
|
|
|
### Comparison for copy benchmark
|
|
|
|
<p align="center">
|
|
|
|
<img width="49%" src="https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/BDX_L3NEW_copy.png" alt="Data volume per loop iteration of L3 and memory controller for the `copy` benchmark on Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz">
|
|
|
|
<img width="49%" src="https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/SKX_L3MEM_copy.png" alt="Data volume per loop iteration of L3 and memory controller for the `copy` benchmark on Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz">
|
|
|
|
</p>
|
|
|
|
|
|
|
|
### Comparison for stream benchmark
|
|
|
|
<p align="center">
|
|
|
|
<img width="49%" src="https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/BDX_L3NEW_stream.png" alt="Data volume per loop iteration of L3 and memory controller for the `stream` benchmark on Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz">
|
|
|
|
<img width="49%" src="https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/SKX_L3MEM_stream.png" alt="Data volume per loop iteration of L3 and memory controller for the `stream` benchmark on Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz">
|
|
|
|
</p>
|
|
|
|
|
|
|
|
### Comparison for triad benchmark
|
|
|
|
<p align="center">
|
|
|
|
<img width="49%" src="https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/BDX_L3NEW_triad.png" alt="Data volume per loop iteration of L3 and memory controller for the `triad` benchmark on Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz">
|
|
|
|
<img width="49%" src="https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/SKX_L3MEM_triad.png" alt="Data volume per loop iteration of L3 and memory controller for the `triad` benchmark on Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz">
|
|
|
|
</p>
|
|
|
|
|
|
## What is the current state?

I posted the question in the [Intel developer forums](https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/761848) and it is not only me/LIKWID having this problem. At the current state, the events for traffic into and out of the L2 cache do not allow differentiating the source (L3 or memory) of incoming lines or the destination of evicted lines.
|
|
|
|
In a meeting with Intel, we got a list of events:
|
* MEM_LOAD_MISC_RETIRED.UC
* MEM_LOAD_MISC_RETIRED.UNKNOWN_SOURCE (**)
|
|
|
|
|
All events marked with (**) are not published and consequently not usable by LIKWID. We tried the other events, but for some it was clear that they would not work. For example, the MEM_INST_RETIRED.ALL_* events count the number of loads or stores that are issued, executed and retired (completed) by the core, hence several units away from L2, L3 and memory. Moreover, there are cases where an instruction triggers data movement in the background (e.g. read-for-ownership for stores where the destination cache line is not present in the L1 cache).
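This background traffic can be observed directly. Measuring the pure `store` benchmark with LIKWID's standard `MEM` group should show memory read traffic (`CAS_COUNT.RD`) of roughly the same volume as the write traffic, even though the kernel only stores. A sketch:

```sh
# The store kernel only writes, yet CAS_COUNT.RD is expected to be
# non-zero: each destination cache line is first read into the caches
# (read-for-ownership) before it can be modified.
likwid-perfctr -C S0:0 -g MEM likwid-bench -t store -w S0:1GB:1
```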
|
|
|
|
|
|
|
|
|
|
|
|
|
|
## Implications on the use of the L3 performance group for Intel Skylake

The L3 performance group for Intel Skylake still uses the two events mentioned above. So keep in mind that L2_LINES_IN_ALL contains loads from both L3 and memory, and that L2_TRANS_L2_WB contains writebacks to L3 (and memory).
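To check which events and derived metrics the `L3` group resolves to on a given machine, the group documentation can be printed directly:

```sh
# Print the event set, metrics and formulas of the L3 performance group.
likwid-perfctr -g L3 -H
```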
|
|
|
|
|
|
## Changes with Cascadelake SP/AP

With the release of the Intel Cascadelake SP/AP chips, Intel published two new events: IDI_MISC.WB_UPGRADE (counts the number of cache lines that are allocated and written back to L3 with the intention that they are more likely to be reused shortly) and IDI_MISC.WB_DOWNGRADE (counts the number of cache lines that are dropped and not written back to L3 as they are deemed less likely to be reused shortly). The whole list of Cascadelake SP/AP events can be found [here](https://download.01.org/perfmon/CLX/cascadelakex_core_v1.04.json). One of the problems is that these events are already mentioned in the errata section of the [specification update document for Intel Cascadelake SP/AP](https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/2nd-gen-xeon-scalable-spec-update.pdf):
|
|
|
|
> CLX3. IDI_MISC Performance Monitoring Events May be Inaccurate<br>
> Problem: The IDI_MISC.WB_UPGRADE and IDI_MISC.WB_DOWNGRADE performance monitoring events (Event FEH; UMask 02H and 04H) count cache lines evicted from the L2 cache. Due to this erratum, the per logical processor count may be incorrect when both logical processors on the same physical core are active. The aggregate count of both logical processors is not affected by this erratum.<br>
> Implication: IDI_MISC performance monitoring events may be inaccurate.<br>
> Workaround: None identified.<br>
> Status: No fix.<br>
|
|
|
|
|
|
The Intel Cascadelake SP microarchitecture is the follow-up to Intel Skylake SP. As an experiment, I added the two events to the Intel Skylake SP event file and ran some benchmarks using a single core to get accurate results, with the same configuration as in the comparison between Intel Broadwell EP and Intel Skylake SP above.
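A sketch of such a run as a custom event set (the LIKWID-style event names are an assumption, derived from Intel's IDI_MISC.WB_UPGRADE/IDI_MISC.WB_DOWNGRADE naming; using a single core also sidesteps the per-logical-processor inaccuracy described in erratum CLX3):

```sh
# Count upgraded (written back to L3) and downgraded (dropped) cache
# lines directly; event names assumed to follow LIKWID's EVENT_UMASK scheme.
likwid-perfctr -C S0:0 -g IDI_MISC_WB_UPGRADE:PMC0,IDI_MISC_WB_DOWNGRADE:PMC1 \
    likwid-bench -t load -w S0:1GB:1
```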