Updated "L2 L3 MEM traffic on Intel Skylake SP CascadeLake SP" (markdown), authored by Thomas Gruber
@@ -177,4 +177,72 @@ The results show:
If we leave out `L2_TRANS_L2_WB` from the `L3` performance group, we can include both `IDI_MISC_WB*` events and still have one counter register left. In this counter we can measure the L3 hits to characterize the load path. Unfortunately, the `MEM_LOAD_L3_*` events are likely to be listed in the specification updates (errata). This is not the case for [Intel Skylake SP](https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-scalable-spec-update.pdf), but it is for [Intel Skylake Desktop (SKL128)](https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/desktop-6th-gen-core-family-spec-update.pdf).
The new `L3` performance group then becomes:
```
SHORT L3 cache bandwidth in MBytes/s
EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
PMC0 L2_LINES_IN_ALL
PMC1 IDI_MISC_WB_UPGRADE
PMC2 IDI_MISC_WB_DOWNGRADE
PMC3 MEM_LOAD_RETIRED_L3_HIT
MBOX0C0 CAS_COUNT_RD
MBOX0C1 CAS_COUNT_WR
MBOX1C0 CAS_COUNT_RD
MBOX1C1 CAS_COUNT_WR
MBOX2C0 CAS_COUNT_RD
MBOX2C1 CAS_COUNT_WR
MBOX3C0 CAS_COUNT_RD
MBOX3C1 CAS_COUNT_WR
MBOX4C0 CAS_COUNT_RD
MBOX4C1 CAS_COUNT_WR
MBOX5C0 CAS_COUNT_RD
MBOX5C1 CAS_COUNT_WR
METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
MEM->L2 load bandwidth [MBytes/s] 1.0E-06*(PMC0-PMC3)*64.0/time
MEM->L2 load data volume [GBytes] 1.0E-09*(PMC0-PMC3)*64.0
L3->L2 load bandwidth [MBytes/s] 1.0E-06*(PMC3)*64.0/time
L3->L2 load data volume [GBytes] 1.0E-09*(PMC3)*64.0
L2->L3 evict bandwidth [MBytes/s] 1.0E-06*PMC1*64.0/time
L2->L3 evict data volume [GBytes] 1.0E-09*PMC1*64.0
L2 dropped CLs data volume [GBytes] 1.0E-09*PMC2*64.0
L2<->L3|MEM bandwidth [MBytes/s] 1.0E-06*(PMC0+PMC1)*64.0/time
L2<->L3|MEM data volume [GBytes] 1.0E-09*(PMC0+PMC1)*64.0
Memory read bandwidth [MBytes/s] 1.0E-06*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0)*64.0/time
Memory read data volume [GBytes] 1.0E-09*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0)*64.0
Memory write bandwidth [MBytes/s] 1.0E-06*(MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1)*64.0/time
Memory write data volume [GBytes] 1.0E-09*(MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1)*64.0
Memory bandwidth [MBytes/s] 1.0E-06*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1)*64.0/time
Memory data volume [GBytes] 1.0E-09*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1)*64.0
LONG
Formulas:
MEM->L2 load bandwidth [MBytes/s] = 1.0E-06*(L2_LINES_IN_ALL-MEM_LOAD_RETIRED_L3_HIT)*64.0/time
MEM->L2 load data volume [GBytes] = 1.0E-09*(L2_LINES_IN_ALL-MEM_LOAD_RETIRED_L3_HIT)*64.0
L3->L2 load bandwidth [MBytes/s] = 1.0E-06*MEM_LOAD_RETIRED_L3_HIT*64.0/time
L3->L2 load data volume [GBytes] = 1.0E-09*MEM_LOAD_RETIRED_L3_HIT*64.0
L2->L3 evict bandwidth [MBytes/s] = 1.0E-06*IDI_MISC_WB_UPGRADE*64.0/time
L2->L3 evict data volume [GBytes] = 1.0E-09*IDI_MISC_WB_UPGRADE*64.0
L2 dropped CLs data volume [GBytes] = 1.0E-09*IDI_MISC_WB_DOWNGRADE*64.0
L2<->L3|MEM bandwidth [MBytes/s] = 1.0E-06*(L2_LINES_IN_ALL+IDI_MISC_WB_UPGRADE)*64.0/time
L2<->L3|MEM data volume [GBytes] = 1.0E-09*(L2_LINES_IN_ALL+IDI_MISC_WB_UPGRADE)*64.0
Memory read bandwidth [MBytes/s] = 1.0E-06*(SUM(CAS_COUNT_RD))*64.0/runtime
Memory read data volume [GBytes] = 1.0E-09*(SUM(CAS_COUNT_RD))*64.0
Memory write bandwidth [MBytes/s] = 1.0E-06*(SUM(CAS_COUNT_WR))*64.0/runtime
Memory write data volume [GBytes] = 1.0E-09*(SUM(CAS_COUNT_WR))*64.0
Memory bandwidth [MBytes/s] = 1.0E-06*(SUM(CAS_COUNT_RD)+SUM(CAS_COUNT_WR))*64.0/runtime
Memory data volume [GBytes] = 1.0E-09*(SUM(CAS_COUNT_RD)+SUM(CAS_COUNT_WR))*64.0
-
Profiling group to measure L2 - L3 - memory traffic. This group differs from previous CPU generations due to the
L3 victim cache. Since data can be loaded from L3 or memory, the memory controllers need to be measured as well.
```
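As a sanity check, the derived metrics can be reproduced from the raw event counts by hand. The following is a minimal sketch in Python; the counter values and runtime are made-up example numbers, not measurements, and the variable names simply mirror the events `PMC0`..`PMC3` from the event set above:

```python
# Sketch: recompute the derived L3-group metrics from raw event counts.
# All counter values below are hypothetical example numbers.

CL = 64.0  # cache line size in bytes

def mbytes_per_s(count, time):
    """Bandwidth in MBytes/s for `count` cache lines transferred in `time` seconds."""
    return 1.0e-06 * count * CL / time

def gbytes(count):
    """Data volume in GBytes for `count` cache lines."""
    return 1.0e-09 * count * CL

# Hypothetical raw counts for a 1-second measurement
l2_lines_in_all         = 50_000_000  # PMC0: all lines loaded into L2
idi_misc_wb_upgrade     = 10_000_000  # PMC1: L2 victims written back to L3
idi_misc_wb_downgrade   =  5_000_000  # PMC2: L2 victims dropped
mem_load_retired_l3_hit = 20_000_000  # PMC3: loads that hit in L3
time = 1.0                            # runtime in seconds

mem_l2_load_bw = mbytes_per_s(l2_lines_in_all - mem_load_retired_l3_hit, time)
l3_l2_load_bw  = mbytes_per_s(mem_load_retired_l3_hit, time)
l2_l3_evict_bw = mbytes_per_s(idi_misc_wb_upgrade, time)
total_bw       = mbytes_per_s(l2_lines_in_all + idi_misc_wb_upgrade, time)
dropped_vol    = gbytes(idi_misc_wb_downgrade)

print(f"MEM->L2 load bandwidth  [MBytes/s]: {mem_l2_load_bw:.1f}")
print(f"L3->L2 load bandwidth   [MBytes/s]: {l3_l2_load_bw:.1f}")
print(f"L2->L3 evict bandwidth  [MBytes/s]: {l2_l3_evict_bw:.1f}")
print(f"L2<->L3|MEM bandwidth   [MBytes/s]: {total_bw:.1f}")
print(f"L2 dropped CLs volume     [GBytes]: {dropped_vol:.3f}")
```

In practice the group is selected via `likwid-perfctr -g L3` (plus a core selection such as `-C 0`) around the application under test, and LIKWID evaluates these formulas automatically.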
(##) Commonly, all data should be loaded from memory directly into L2, unless the LLC prefetcher is active (as it is in this case). One might assume that all cache lines evicted to L3 for re-use are also loaded again from L3, but that would mean the heuristics always make the optimal decision.