So let's check full measurements to show the difference between BDX and SKX.
When the data sizes come closer to the L3 size (the full 28 MByte are usable although SNC is enabled), the data fetched from memory `SUM(CAS_COUNT_RD)` increases until 64 Byte.
`SUM(CAS_COUNT_WR)` stays zero.
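For reference, these measurements can be reproduced with `likwid-perfctr` by programming the iMC counters in the Uncore directly. A minimal sketch, assuming a load benchmark binary and only the first memory channel (extend the event set over all MBOX units to get the full `SUM(CAS_COUNT_*)`):

```
# Read/write CAS counts at the first iMC channel (Uncore, socket-wide).
# Add MBOX1C0/MBOX1C1, MBOX2C0/MBOX2C1, ... to sum over all channels.
likwid-perfctr -C S0:0 -g CAS_COUNT_RD:MBOX0C0,CAS_COUNT_WR:MBOX0C1 ./load_benchmark
```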
The problem gets visible here already. Since the L2 cache lines are not commonly contained in the L3, the cache lines need to be moved from L2 to L3, which increases the measured writeback data volume although no data is written by the benchmark. Moreover, because the event `L2_TRANS_L2_WB` counts anything that is written back passing the L2, it is unclear what happens with the cache lines: eviction to L3, eviction to memory, or dropping.
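The effect can be observed with a custom event set for the two L2 events. A minimal sketch; the core and the benchmark binary are placeholders:

```
# L2_LINES_IN_ALL: cache lines entering the L2 (from L3 or memory)
# L2_TRANS_L2_WB:  cache lines leaving the L2 (to L3, to memory, or dropped)
likwid-perfctr -C 0 -g L2_LINES_IN_ALL:PMC0,L2_TRANS_L2_WB:PMC1 ./load_benchmark
```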
For completeness, here are the plots for the other benchmarks:
| triad | [link](https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/BDX_L3NEW_triad.png) | [link](https://raw.githubusercontent.com/wiki/RRZE-HPC/likwid/images/skx_caches/SKX_L3MEM_triad.png) |
## Implications for the use of the L3 performance group for Intel Skylake
The L3 performance group for Intel Skylake (version 4.3.4) uses the two events mentioned above. So keep in mind that `L2_LINES_IN_ALL` contains loads from both L3 and memory, and that `L2_TRANS_L2_WB` contains writebacks to L3 (and memory) as well as dropped cache lines.
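For orientation, the derived L3 metrics of that group follow the usual 64 Byte per event model, roughly like this (a sketch; the exact formulas may differ between LIKWID versions):

```
L3 load data volume [GBytes] = 1.0E-09*L2_LINES_IN_ALL*64.0
L3 evict data volume [GBytes] = 1.0E-09*L2_TRANS_L2_WB*64.0
```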
## What was done to fix the problem?
I posted the question in the [Intel developer forums](https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/761848), and it is not only me/LIKWID having that problem. Others have also tried different events (also in the LLC units, the CBOXes). At the current state, the events for traffic into and out of the L2 do not allow differentiating the source or destination.
The memory traffic can be measured properly and with high accuracy, assuming 64 Byte for each read and write operation to memory. But the memory controllers are located in the Uncore part of the CPU, so the counts reflect the traffic to/from all cores of a socket (plus inter-socket traffic).
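In LIKWID, this is what the `MEM` performance group provides: it sums the CAS counters over all memory channels of a socket. A usage sketch:

```
# Socket-wide memory bandwidth; the counts include the traffic of all cores.
likwid-perfctr -C S0:0 -g MEM ./load_benchmark
```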
It is probably possible to use L3 events (part of the Uncore) to retrieve the counts for data flowing into the L3, data being loaded by the L2, and evictions to memory. But the Uncore is socket-specific and consequently does not allow the attribution of a single core's data consumption. Furthermore, there are quite a few LLC units (CBOXes; BDX has 24, SKX 28), and each unit, possibly with multiple counters per unit, has to be programmed and read.
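To illustrate the effort: every CBOX needs its own entry in the event set. A sketch for the first three of the 28 SKX LLC units; the event name is only an example from the LLC event lists, and the full set would need `CBOX0` through `CBOX27`:

```
# One counter per LLC unit; extend up to CBOX27C0 for a full-socket view.
likwid-perfctr -C S0:0 -g LLC_LOOKUP_DATA_READ:CBOX0C0,LLC_LOOKUP_DATA_READ:CBOX1C0,LLC_LOOKUP_DATA_READ:CBOX2C0 ./load_benchmark
```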
After a meeting with Intel, we got a list of events:
* `MEM_INST_RETIRED.ALL_LOADS` (r81d0)

These events don't provide any further insight. The counts rise for some benchmarks.
## Changes with Cascadelake SP/AP
When releasing the Intel Cascadelake SP/AP chips, Intel published two new events: `IDI_MISC.WB_UPGRADE` (description: counts number of cache lines that are allocated and written back to L3 with the intention that they are more likely to be reused shortly) and `IDI_MISC.WB_DOWNGRADE` (description: counts number of cache lines that are dropped and not written back to L3 as they are deemed to be less likely to be reused shortly). The whole list of Cascadelake SP/AP events is available [here](https://download.01.org/perfmon/CLX/cascadelakex_core_v1.04.json). One of the problems is that these events are already mentioned in the errata section of the [specification update document for Intel Cascadelake SP/AP](https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/2nd-gen-xeon-scalable-spec-update.pdf):
> CLX3. IDI_MISC Performance Monitoring Events May be Inaccurate<br>
> Problem: The IDI_MISC.WB_UPGRADE and IDI_MISC.WB_DOWNGRADE performance monitoring events (Event FEH; UMask 02H and 04H) counts cache lines evicted from the L2 cache. Due to this erratum, the per logical processor count may be incorrect when both logical processors on the same physical core are active. The aggregate count of both logical processors is not affected by this erratum.<br>
> Implication: IDI_MISC performance monitoring events may be inaccurate.<br>

In fact, Intel published these events for Intel Skylake SP already a long time ago.
The doubled data volume compared to Intel Broadwell EP is still there, but that is expected because the active copy to L3 is required to benefit from the L3 cache. The event `IDI_MISC_WB_UPGRADE` rises similarly to the `L2_TRANS_L2_WB` event when the sizes reach the full L2 size. At about 75% of the L3 size, the evictions with reuse hint to L3 (`IDI_MISC_WB_UPGRADE`) decrease and the cache line drops (`IDI_MISC_WB_DOWNGRADE`) rise. Along with the drops, the reads from memory increase because, after dropping, the L2 has to re-read the cache lines from memory.
So, for accurate measurements of the writeback path, you need to measure `IDI_MISC_WB_UPGRADE` and `IDI_MISC_WB_DOWNGRADE` in addition to `L2_TRANS_L2_WB`.
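A minimal sketch of such an event set (core and benchmark binary are placeholders):

```
# IDI_MISC_WB_UPGRADE + IDI_MISC_WB_DOWNGRADE roughly split L2_TRANS_L2_WB
# into writebacks to L3 (upgrades) and dropped cache lines (downgrades).
likwid-perfctr -C 0 -g L2_TRANS_L2_WB:PMC0,IDI_MISC_WB_UPGRADE:PMC1,IDI_MISC_WB_DOWNGRADE:PMC2 ./load_benchmark
```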
The results show:
2. The L2 writeback path can be characterized. No information about L3 writebacks.
3. No information about the load path (##)
So, for better results from the `L3` performance group, we should include both `IDI_MISC_WB*` events. In a remaining counter register we could measure the L3 hits to characterize the load path. Unfortunately, the `MEM_LOAD_L3_*` events are likely to be listed in the specification updates. This is not the case for [Intel Skylake SP](https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-scalable-spec-update.pdf), but it is for [Intel Skylake Desktop (SKL128)](https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/desktop-6th-gen-core-family-spec-update.pdf) and many previous CPU generations ([BDX](https://www.intel.com/content/www/us/en/processors/xeon/xeon-e5-v4-spec-update.html), [HSX](https://www.intel.com/content/www/us/en/processors/xeon/xeon-e5-family-spec-update.html), [SNB-EP](https://www.intel.com/content/www/us/en/processors/xeon/xeon-e5-family-spec-update.html)).
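A sketch of what such an extended event set could look like, with an L3-hit event in the fourth counter; `MEM_LOAD_RETIRED_L3_HIT` is assumed here as the LIKWID spelling of `MEM_LOAD_RETIRED.L3_HIT`:

```
# Loads into L2, split writebacks, and L3 hits in the remaining counter.
likwid-perfctr -C 0 -g L2_LINES_IN_ALL:PMC0,IDI_MISC_WB_UPGRADE:PMC1,IDI_MISC_WB_DOWNGRADE:PMC2,MEM_LOAD_RETIRED_L3_HIT:PMC3 ./load_benchmark
```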
(##) Commonly, all data should be loaded from memory directly into the L2, unless the LLC prefetcher is active (like in this case). One might assume that all cache lines evicted to L3 for re-use are also loaded again from L3, but that would mean that the heuristics always make the optimal decision.
## Results
With version 5.0.0, the L3 group for Intel Skylake SP and Intel Cascadelake SP was updated to include the `IDI_MISC_WB_*` events. Starting with this version, you get more information and derived metrics from the L3 performance group:
```
L3 evict bandwidth [MBytes/s] = 1.0E-06*IDI_MISC_WB_UPGRADE*64.0/time
L3 evict data volume [GBytes] = 1.0E-09*IDI_MISC_WB_UPGRADE*64.0
Dropped CLs bandwidth [MBytes/s] = 1.0E-06*IDI_MISC_WB_DOWNGRADE*64.0/time
Dropped CLs data volume [GBytes] = 1.0E-09*IDI_MISC_WB_DOWNGRADE*64.0
L3|MEM evict bandwidth [MBytes/s] = 1.0E-06*L2_TRANS_L2_WB*64.0/time
L3|MEM evict data volume [GBytes] = 1.0E-09*L2_TRANS_L2_WB*64.0
```
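With a LIKWID 5.0.0 installation, the updated group is selected like any other performance group:

```
likwid-perfctr -C S0:0 -g L3 ./load_benchmark
```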
`L3 load bandwidth`/`L3 load data volume` and the total `L3 bandwidth`/`L3 data volume` are untouched. We still have the problem that the source of loads (either memory or L3) cannot be determined.