Changes

Thomas Gruber · f8788569
--- a/LikwidMarkerAPIPitfalls.md
+++ b/LikwidMarkerAPIPitfalls.md
@@ -89,4 +89,54 @@ LIKWID_MARKER_INIT;
 }
 LIKWID_MARKER_CLOSE;
 ```
-This way, the benchmark can execute the array copy without influences from the MarkerAPI. The result of this loop is different to the one before but it performs the same operation. If you want reliable results, make sure the whole region is executed a reasonable amount of time (like above one second).
\ No newline at end of file
+This way, the benchmark can execute the array copy without influences from the MarkerAPI. The result of this loop is different to the one before but it performs the same operation. If you want reliable results, make sure the whole region runs for a reasonable amount of time (for example, more than one second). When you measure the region, you might be surprised that the bandwidths decrease with increasing thread counts, especially for memory counter measurements with the MEM group.
+
+# The measured times for multiple threads vary although all perform the same operation
+
+In some cases, you might see measurements like this:
+ADD STREAM TRIAD WITHOUT REGISTER
+
+Let's look at one output (just an excerpt):
+```
+```
+The problem with this code is that the first `LIKWID_MARKER_START` performs some operations that increase the runtime of the master thread (`Core 0`). This is especially visible with `ACCESSMODE=accessdaemon`, because each application thread requires its own instance of the access daemon to access the hardware registers simultaneously (the UNIX socket connection between the library and the access daemon is not thread-safe). Another such operation is the creation of hash table entries for the string `copy`. To fix this, we can tell the MarkerAPI to perform these operations beforehand in a separate part of the application using `LIKWID_MARKER_REGISTER()`:
+```
+LIKWID_MARKER_INIT;
+#pragma omp parallel
+{
+    LIKWID_MARKER_REGISTER("copy");
+}
+#pragma omp parallel
+{
+    LIKWID_MARKER_START("copy");
+    for (k=0; k<NTIMES; k++) 
+    {
+        // copy
+        #pragma omp for
+        for (j=0; j<STREAM_ARRAY_SIZE; j++)
+            c[j] = a[j];     
+    }
+    LIKWID_MARKER_STOP("copy");
+}
+LIKWID_MARKER_CLOSE;
+```
+Although `LIKWID_MARKER_REGISTER()` is optional, it is highly recommended that **all** threads register **all** regions beforehand. There should be a barrier, either implicit or explicit, between the calls of `LIKWID_MARKER_REGISTER()` and `LIKWID_MARKER_START()`; otherwise you get the same effect as not using `LIKWID_MARKER_REGISTER()` at all. The above code contains an implicit barrier, because an OpenMP parallel region ends with a barrier. Another method looks like this:
+
+```
+LIKWID_MARKER_INIT;
+#pragma omp parallel
+{
+    LIKWID_MARKER_REGISTER("copy");
+#pragma omp barrier
+    for (k=0; k<NTIMES; k++) 
+    {
+        LIKWID_MARKER_START("copy");
+        // copy
+        #pragma omp for
+        for (j=0; j<STREAM_ARRAY_SIZE; j++)
+            c[j] = a[j];
+        LIKWID_MARKER_STOP("copy");   
+    }
+}
+LIKWID_MARKER_CLOSE;
+```
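
The recommendation to register all regions in all threads can be sketched as a small compilable unit with two regions. This is only an illustration and not part of the commit above: the helper `run_kernels()` and the second region `scale` are hypothetical, the header name `likwid-marker.h` assumes LIKWID 5 or later (older releases provide the macros through `likwid.h`), and the local no-op fallback macros are an assumption that lets the file build on a machine without LIKWID.

```c
#include <assert.h>
#include <stddef.h>

/* Use the real MarkerAPI only when LIKWID_PERFMON is defined; otherwise
 * fall back to local no-op macros so the file builds without LIKWID. */
#ifdef LIKWID_PERFMON
#include <likwid-marker.h> /* assumption: LIKWID >= 5, older: <likwid.h> */
#else
#define LIKWID_MARKER_INIT
#define LIKWID_MARKER_REGISTER(regionTag)
#define LIKWID_MARKER_START(regionTag)
#define LIKWID_MARKER_STOP(regionTag)
#define LIKWID_MARKER_CLOSE
#endif

/* Hypothetical helper: runs a copy kernel and a scale kernel, each in its
 * own marker region, and returns the sum of b[] as a simple checksum. */
static double run_kernels(const double *a, double *b, double *c,
                          size_t n, int ntimes)
{
    assert(n > 0 && ntimes > 0);

    LIKWID_MARKER_INIT;

    /* Every thread registers every region once, before any measurement. */
    #pragma omp parallel
    {
        LIKWID_MARKER_REGISTER("copy");
        LIKWID_MARKER_REGISTER("scale");
    } /* the end of the parallel region acts as an implicit barrier */

    #pragma omp parallel
    {
        LIKWID_MARKER_START("copy");
        for (int k = 0; k < ntimes; k++)
        {
            #pragma omp for
            for (size_t j = 0; j < n; j++)
                c[j] = a[j];
        }
        LIKWID_MARKER_STOP("copy");

        LIKWID_MARKER_START("scale");
        for (int k = 0; k < ntimes; k++)
        {
            #pragma omp for
            for (size_t j = 0; j < n; j++)
                b[j] = 3.0 * c[j];
        }
        LIKWID_MARKER_STOP("scale");
    }

    LIKWID_MARKER_CLOSE;

    /* Serial checksum so a caller can verify that both kernels ran. */
    double sum = 0.0;
    for (size_t j = 0; j < n; j++)
        sum += b[j];
    return sum;
}
```

To collect measurements, one would typically build with OpenMP enabled and `-DLIKWID_PERFMON`, link against the LIKWID library, and run the binary under `likwid-perfctr` with the `-m` switch. Without these, the code still runs but the marker macros do nothing.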