Changes

Thomas Gruber · 63ec9d61
--- a/LikwidMarkerAPIPitfalls.md
+++ b/LikwidMarkerAPIPitfalls.md
@@ -4,7 +4,6 @@ Todos:
 * LIKWID_MARKER_SWITCH
 * LIKWID_MARKER_GET
 * LIKWID_MARKER_RESET
-* LIKWID_MARKER_THREADINIT

 # Motivation 

@@ -150,3 +149,88 @@ LIKWID_MARKER_INIT;
 }
 LIKWID_MARKER_CLOSE;
 ```
+
+# What about `LIKWID_MARKER_THREADINIT`?
+
+In LIKWID 3 and 4, application threads had to be registered with the MarkerAPI using `LIKWID_MARKER_THREADINIT`. Since version 5 this is no longer required because the MarkerAPI detects new threads by itself. The macro is still present and can be called, but it usually has no effect. There is **one** exception:
+If you use a threading environment that is **not** based on Pthreads and the application does not pin its threads to hardware threads itself, each thread has to call `LIKWID_MARKER_THREADINIT` to perform the pinning. The MarkerAPI measures only on the selected hardware threads; if an application thread runs on a different one, you get wrong results and possibly errors.
+
+# How to use `LIKWID_MARKER_SWITCH`
+
+With versions 4 and 5 of LIKWID, the user can specify multiple event sets and/or performance groups on the command line (or in the appropriate environment variable). If you don't use the MarkerAPI, LIKWID switches between the groups every X seconds (selectable with `-T Xs`) and presents the values at the end. With the MarkerAPI, the user has to add `LIKWID_MARKER_SWITCH` at the desired code location. `LIKWID_MARKER_SWITCH` has to be called in a serial region, and no application thread is allowed to access the hardware counters while `LIKWID_MARKER_SWITCH` is executing.
+
+Here is an example of a valid use of `LIKWID_MARKER_SWITCH`:
+```
+LIKWID_MARKER_INIT;
+#pragma omp parallel
+{
+    LIKWID_MARKER_REGISTER("copy");
+}
+#pragma omp parallel
+{
+    LIKWID_MARKER_START("copy");
+    for (k=0; k<NTIMES; k++) 
+    {
+        // copy
+        #pragma omp for
+        for (j=0; j<STREAM_ARRAY_SIZE; j++)
+            c[j] = a[j];     
+    }
+    LIKWID_MARKER_STOP("copy");
+}
+LIKWID_MARKER_SWITCH;
+#pragma omp parallel
+{
+    LIKWID_MARKER_START("triad");
+    for (k=0; k<NTIMES; k++) 
+    {
+        // triad
+        #pragma omp for
+        for (j=0; j<STREAM_ARRAY_SIZE; j++)
+            a[j] = b[j]+scalar*c[j];     
+    }
+    LIKWID_MARKER_STOP("triad");
+}
+LIKWID_MARKER_CLOSE;
+```
+
+The code is similar to the examples used before; we just duplicated the parallel region and switch groups between the two regions. The implicit barrier at the end of the parallel region ensures that no thread is still in `LIKWID_MARKER_STOP("copy")`. The code does not make much sense, though, because we measure the `copy` kernel only with one group and the `triad` kernel only with another group. If only a single event set/performance group is available, `LIKWID_MARKER_SWITCH` does nothing.
+
+Let's look at a different code:
+```
+LIKWID_MARKER_INIT;
+#pragma omp parallel
+{
+    LIKWID_MARKER_REGISTER("copy");
+#pragma omp barrier
+    for (k=0; k<NTIMES; k++) 
+    {
+        LIKWID_MARKER_START("copy");
+        // copy
+        #pragma omp for
+        for (j=0; j<STREAM_ARRAY_SIZE; j++)
+            c[j] = a[j];
+        LIKWID_MARKER_STOP("copy");   
+        if (k == NTIMES/2)
+        {
+            #pragma omp master // or single
+            LIKWID_MARKER_SWITCH;
+        }
+    }
+}
+LIKWID_MARKER_CLOSE;
+```
+
+At first glance this code seems to be fine, but it isn't when you think about multiple threads executing it simultaneously. The `master` or `single` construct only ensures that the master thread or a single thread executes `LIKWID_MARKER_SWITCH`, but another thread might still be accessing the hardware registers in `LIKWID_MARKER_STOP("copy")` or might even already be in the next `LIKWID_MARKER_START("copy")`. So we have to ensure that all threads are waiting before and after the `LIKWID_MARKER_SWITCH` call:
+
+```
+if (k == NTIMES/2)
+{
+    #pragma omp barrier
+    #pragma omp master // or single
+    LIKWID_MARKER_SWITCH;
+    #pragma omp barrier
+}
+```
+
+Now we can guarantee that all threads have finished their measurements and that no thread starts a measurement while the events are being switched.
\ No newline at end of file
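
To try the barrier-protected switch from the last snippet in a complete translation unit, here is a minimal sketch. It is not part of the commit above. The helper name `copy_with_switch` and the `switches` counter are hypothetical and exist only for illustration. The header name `likwid-marker.h` is assumed to match a LIKWID 5 installation; the empty fallback macros let the code build without LIKWID when `LIKWID_PERFMON` is not defined.

```c
#include <assert.h>
#include <stddef.h>

/* With -DLIKWID_PERFMON the real macros come from LIKWID (header name
 * assumed for LIKWID 5). Otherwise they expand to nothing, so the code
 * also builds on a machine without LIKWID. */
#ifdef LIKWID_PERFMON
#include <likwid-marker.h>
#else
#define LIKWID_MARKER_INIT
#define LIKWID_MARKER_REGISTER(tag)
#define LIKWID_MARKER_START(tag)
#define LIKWID_MARKER_STOP(tag)
#define LIKWID_MARKER_SWITCH
#define LIKWID_MARKER_CLOSE
#endif

/* Hypothetical helper: copies a into c ntimes and switches the event
 * group once, halfway through. Returns how often the switch was executed. */
int copy_with_switch(double *c, const double *a, size_t n, int ntimes)
{
    int switches = 0; /* shared, written by the master thread only */
    assert(c != NULL && a != NULL && ntimes > 0);

    LIKWID_MARKER_INIT;
#pragma omp parallel
    {
        LIKWID_MARKER_REGISTER("copy");
#pragma omp barrier
        for (int k = 0; k < ntimes; k++) /* k is private to each thread */
        {
            LIKWID_MARKER_START("copy");
#pragma omp for
            for (size_t j = 0; j < n; j++)
                c[j] = a[j];
            LIKWID_MARKER_STOP("copy");

            if (k == ntimes / 2)
            {
                /* All threads have left LIKWID_MARKER_STOP. */
#pragma omp barrier
#pragma omp master
                {
                    LIKWID_MARKER_SWITCH;
                    switches++;
                }
                /* No thread enters LIKWID_MARKER_START before the switch is done. */
#pragma omp barrier
            }
        }
    }
    LIKWID_MARKER_CLOSE;
    return switches;
}
```

Every thread evaluates `k == ntimes / 2` with its own private `k`, so all threads reach both barriers in the same iteration. The second barrier is kept because `master`, unlike `single`, has no implicit barrier at its end. A call such as `copy_with_switch(c, a, STREAM_ARRAY_SIZE, NTIMES)` would stand in for the parallel region of the example above.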