Changes

Thomas Gruber · a8927ba6
--- a/LikwidMarkerAPIPitfalls.md
+++ b/LikwidMarkerAPIPitfalls.md
@@ -178,6 +178,7 @@ If you use a threading enviroment which is **not** based on Pthreads and the app
 With Version 4 and 5 of LIKWID, the user is able to specify multiple event sets and/or performance groups on the command line (or in the approriate environment variable). If you don't use the MarkerAPI, LIKWID switches between the groups every X seconds (selectable with `-T Xs`) and presents the values in the end. In case of the MarkerAPI, the user has to add `LIKWID_MARKER_SWITCH` in the desired code location. `LIKWID_MARKER_SWITCH` has to be called in a serial region and no application thread is allowed to access the hardware counters while `LIKWID_MARKER_SWITCH`.

 Here is an example of a valid use of `LIKWID_MARKER_SWITCH`:
+
 ```
 LIKWID_MARKER_INIT;
 #pragma omp parallel
@@ -212,9 +213,10 @@ LIKWID_MARKER_SWITCH;
 LIKWID_MARKER_CLOSE;
 ```

-The code is similar to the already used examples, we just duplicated the parallel region and switch between them. The implicit barrier at the end of the parallel region causes that no thread is still in `LIKWID_MARKER_STOP("copy")`. The code does make too much sense because we measure the `copy` kernel only with one group and the `triad` kernel with another group. If there is only a single event set/performance group available, `LIKWID_MARKER_SWITCH` does nothing.
+The code is similar to the examples used earlier. The implicit barrier at the end of the parallel region ensures that no thread is still in `LIKWID_MARKER_STOP("copy")`. The code is not very meaningful because we measure the `copy` kernel with only one group and the `triad` kernel with another group. If there is only a single event set/performance group available, `LIKWID_MARKER_SWITCH` does nothing, so both regions would be measured with the same event set.

 Let's look a different code:
+
 ```
 LIKWID_MARKER_INIT;
 #pragma omp parallel
@@ -239,7 +241,7 @@ LIKWID_MARKER_INIT;
 LIKWID_MARKER_CLOSE;
 ```

-From the first read, this code seems to be fine but it isn't when you think about multiple entities executing the code simulaneously. The `master` or `single` keywords just cause that the master or a single thread executes `LIKWID_MARKER_SWITCH` but there still might be another thread that is still executing the hardware registers in `LIKWID_MARKER_STOP("copy")` or might even be already in the next `LIKWID_MARKER_START("copy")`. So we have to ensure that all threads are waiting before and after the `LIKWID_MARKER_SWITCH` call:
+At first glance, this code seems fine, but it isn't when you consider multiple threads executing the code simultaneously. The `master` or `single` constructs only ensure that the master or a single thread executes `LIKWID_MARKER_SWITCH`, but another thread might still be accessing the hardware registers in `LIKWID_MARKER_STOP("copy")` or might even already be in the next `LIKWID_MARKER_START("copy")`. So we have to ensure that all threads wait before and after the `LIKWID_MARKER_SWITCH` call:

 ```
 if (k == NTIMES/2)
@@ -252,3 +254,99 @@ if (k == NTIMES/2)
 ```

 Now we can guarantee that all threads are are finished with their measurements and that no one starts the measurement while switching the events.
+
+Generally, `LIKWID_MARKER_SWITCH` has quite a high overhead compared to the other MarkerAPI functions. The hardware registers are usually set up in `LIKWID_MARKER_INIT`, in a part of the application that is typically not performance-critical. `LIKWID_MARKER_SWITCH`, in contrast, performs three operations close to performance-critical code: stopping the old event set, setting up the new event set and starting it. The recommendation is to avoid `LIKWID_MARKER_SWITCH` and instead run the application once for each group.
+
+# How to get the measured values in my application
+
+If you want to steer the execution of your application with measurements from the MarkerAPI, you can get a thread's result by calling `LIKWID_MARKER_GET(regionTag, nevents, events, time, count)`. The function arguments are used as input and output, so here is a more detailed description (for C/C++):
+
+```
+LIKWID_MARKER_GET( const char  *regionTag,              // Region name (just input)
+                   int         *nr_events,              // Input: length of the events array;
+                                                        // output: number of entries filled in the events array
+                   double      *events,                 // Array for the event results. Must already be allocated with
+                                                        // at least nr_events entries (input/output)
+                   double      *time,                   // Runtime of the region (only output)
+                   int         *count)                  // Call count of the region (only output)
+```
+
+The functionality is quite simple: it looks up the region name in the thread's hash table and returns all results.
+
+Example code for the usage:
+
+```
+#define NUM_EVENTS 20
+LIKWID_MARKER_INIT;
+#pragma omp parallel
+{
+    LIKWID_MARKER_REGISTER("copy");
+}
+#pragma omp parallel
+{
+    double results[NUM_EVENTS];
+    int nr_events = NUM_EVENTS;
+    double time = 0.0;
+    int count = 0;
+    int tid = omp_get_thread_num();
+    LIKWID_MARKER_START("copy");
+    for (k=0; k<NTIMES; k++) 
+    {
+        // copy
+        #pragma omp for
+        for (j=0; j<STREAM_ARRAY_SIZE; j++)
+            c[j] = a[j];     
+    }
+    LIKWID_MARKER_STOP("copy");
+    // here nr_events = NUM_EVENTS
+    LIKWID_MARKER_GET("copy", &nr_events, (double*)results, &time, &count);
+    // here nr_events = events in the event set
+    printf("Thread %d: called region copy %d times, taking %f seconds\n", tid, count, time);
+    for (k = 0; k < nr_events; k++)
+        printf("Thread %d: Event %d: %f\n", tid, k, results[k]);
+}
+LIKWID_MARKER_CLOSE;
+```
+
+There is not much to think about when using `LIKWID_MARKER_GET`: just call it from the thread whose results you want. If you call it in a serial region, you get the values of the master thread only!
+
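Because `nr_events` is both input and output, it has to be set back to the array capacity before every call. The following is a minimal sketch of a wrapper that does this; `read_region` and `MAX_EVENTS` are hypothetical names that are not part of LIKWID, and the header name `likwid-marker.h` is an assumption that may differ between LIKWID versions. Without `-DLIKWID_PERFMON` the macro falls back to a no-op so the sketch still builds.

```c
#include <assert.h>
#include <stddef.h>

#ifdef LIKWID_PERFMON
#include <likwid-marker.h>  /* assumption: header name may differ between LIKWID versions */
#else
/* No-op fallback so the sketch also builds without LIKWID */
#define LIKWID_MARKER_GET(regionTag, nevents, events, time, count)
#endif

#define MAX_EVENTS 20  /* hypothetical capacity of the caller's result array */

/*
 * Hypothetical helper, not part of LIKWID.
 * Queries the calling thread's results for region `tag` and returns the
 * number of valid entries in `events`. Must be called by the thread whose
 * results are wanted.
 */
static int read_region(const char *tag, double *events, int capacity,
                       double *time, int *count)
{
    assert(events != NULL && time != NULL && count != NULL && capacity > 0);
    (void)tag;  /* unused when the no-op fallback is active */

    /* Input side: tell LIKWID how many entries fit into the array */
    int nr_events = capacity;
    *time = 0.0;
    *count = 0;
    for (int i = 0; i < capacity; i++)
        events[i] = 0.0;

    LIKWID_MARKER_GET(tag, &nr_events, events, time, count);

    /* Output side: nr_events now holds the number of filled entries.
     * Clamp defensively so callers never index past the array. */
    if (nr_events < 0)
        nr_events = 0;
    if (nr_events > capacity)
        nr_events = capacity;
    return nr_events;
}
```

Inside a parallel region, each thread would call `read_region("copy", results, MAX_EVENTS, &time, &count)` with its own `results` array after `LIKWID_MARKER_STOP("copy")`. With the no-op fallback nothing is measured, so the helper just reports the zero-initialized array.
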
+# Resetting the results of a region
+
+In some cases it might be required to reset the results of a region. Examples are changed runtime settings like blocking factors or CPU frequencies based on measurement results (`LIKWID_MARKER_GET`). For these cases, the MarkerAPI contains the `LIKWID_MARKER_RESET` macro.
+
+Here is an example (function code omitted):
+
+```
+LIKWID_MARKER_INIT;
+#pragma omp parallel
+{
+    LIKWID_MARKER_REGISTER("sum");
+}
+#pragma omp parallel
+{
+    LIKWID_MARKER_START("sum");
+    for (k=0; k<NTIMES; k++) 
+    {
+        vector_sum_normal(a, &sum);
+    }
+    LIKWID_MARKER_STOP("sum");
+}
+// Evaluate normal vector sum
+if (redo_with_kahan)
+{
+    #pragma omp parallel
+    {
+        LIKWID_MARKER_RESET("sum");
+        LIKWID_MARKER_START("sum");
+        for (k=0; k<NTIMES; k++) 
+        {
+            vector_sum_kahan(a, &sum);   
+        }
+        LIKWID_MARKER_STOP("sum");
+    }
+}
+LIKWID_MARKER_CLOSE;
+```
+
+This code performs a vector summation (`sum += A[i]`) in a separate function. If the evaluation indicates that [Kahan summation](https://en.wikipedia.org/wiki/Kahan_summation_algorithm) should be used instead, we reset the measurements and repeat the summation with the Kahan variant using the same region name. Of course, `likwid-perfctr` knows nothing about `LIKWID_MARKER_RESET`, so the "final" results printed by `likwid-perfctr` are those collected after the last `LIKWID_MARKER_RESET`.
+
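Since the function code is omitted above, here is a serial sketch of what the two summation routines and the reset step might look like. The signatures differ from the calls in the example (they take an explicit length and return the sum), and `measure_sum` and `run_sum_region` are hypothetical helpers, not part of LIKWID. The header name is an assumption, and the caller is expected to have executed `LIKWID_MARKER_INIT` before and `LIKWID_MARKER_CLOSE` after.

```c
#include <assert.h>
#include <stddef.h>

#ifdef LIKWID_PERFMON
#include <likwid-marker.h>  /* assumption: header name may differ between LIKWID versions */
#else
/* No-op fallbacks so the sketch also builds without LIKWID */
#define LIKWID_MARKER_START(regionTag)
#define LIKWID_MARKER_STOP(regionTag)
#define LIKWID_MARKER_RESET(regionTag)
#endif

/* Plain summation: small addends can be lost to rounding */
static double vector_sum_normal(const double *v, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += v[i];
    return sum;
}

/* Kahan (compensated) summation: carries the rounding error along.
 * Do not compile with value-unsafe options such as -ffast-math. */
static double vector_sum_kahan(const double *v, int n)
{
    double sum = 0.0;
    double comp = 0.0;              /* running compensation term */
    for (int i = 0; i < n; i++) {
        double y = v[i] - comp;     /* apply the correction from the last step */
        double t = sum + y;         /* low-order bits of y may be lost here */
        comp = (t - sum) - y;       /* recover what was lost */
        sum = t;
    }
    return sum;
}

typedef double (*sum_kernel)(const double *, int);

/* Hypothetical helper: measure one kernel in region "sum",
 * optionally discarding the results collected so far. */
static double measure_sum(sum_kernel kernel, const double *v, int n, int reset_first)
{
    double sum;
    assert(kernel != NULL && v != NULL && n >= 0);
    if (reset_first) {
        LIKWID_MARKER_RESET("sum");
    }
    LIKWID_MARKER_START("sum");
    sum = kernel(v, n);
    LIKWID_MARKER_STOP("sum");
    return sum;
}

/* Hypothetical driver: measure the plain sum first and, if requested,
 * reset the region and measure the Kahan variant under the same name. */
static double run_sum_region(const double *v, int n, int redo_with_kahan)
{
    double sum = measure_sum(vector_sum_normal, v, n, 0);
    if (redo_with_kahan)
        sum = measure_sum(vector_sum_kahan, v, n, 1);
    return sum;
}
```

For an array that starts with `1.0` followed by many entries of `1e-16`, the plain sum returns exactly `1.0`, while the Kahan variant keeps the small contributions. How `redo_with_kahan` is decided is left open here, as in the example above.
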