\chapter{Introduction}
The performance of an operating system is crucial, as any degradation can
significantly affect the performance of all applications running on top of it.
Moreover, when a~new version is released and it introduces a~performance
regression, it can break the stability of e.g. business applications, leading to
great financial losses. An important part of an operating system, and a~major
influence on its performance, is the implementation and strategy of the process
scheduler, which manages processes and their processor time.

In the Linux kernel, the scheduler used to be simple, but with the introduction of
multi-core CPUs, achieving stable performance required adopting a~separate
runqueue for each core and balancing these queues. This led to complex code, and
it took some time to eliminate the bugs and tune the scheduler's behavior. Another
change came with symmetric multiprocessing technology, which brought to market
machines with multiple CPU sockets and non-uniform memory organization and
access\;\cite{wasted-cores}.
Scheduling on such systems is still under active development and is therefore
even more prone to performance degradation bugs. In order to discover these bugs and
to keep the performance of the operating system stable, performance testing is essential.
For this reason, the performance of the scheduler has to be evaluated
before each release.

Compared to functional testing, performance cannot be evaluated with simple true
or false results. It is more challenging, as we have to (1) compare the results
relative to previous versions or measurements, and (2) choose a~suitable threshold
for reporting a~performance regression. Due to the complexity of the scheduler,
common tools for inspecting performance often yield unsatisfying results.
Moreover, the biggest performance regressions of the scheduler do not stem from
inefficient code, but rather from an inefficient assignment of processes to
CPU cores and their queues.

This thesis focuses on performance testing of the Red Hat Enterprise Linux
(RHEL) kernel scheduler, maintained by Red Hat, Inc.
The usual method of performance testing the scheduler is to simulate load
similar to real usage. Currently, there are many benchmarks targeting different
types of load, usually spawning many parallel processes, sometimes even
communicating with each other. The results of the performance measurements
must be stored systematically and efficiently for later
comparisons, and effective visualization is essential for interpreting the
collected results and their comparisons.

In this thesis, we propose a~new interpretation of long-term results of the
benchmarks using box plot graphs in order to enhance the process of data
analysis. This visualization should help with inspecting the measurement
stability of benchmarks and with finding versions where performance degradations
appeared or were fixed.
Moreover, to reduce the time one has to spend analyzing these comparisons,
we propose using machine learning methods that will
automatically check for possible degradations in the Linux kernel scheduler. We will
describe how to create the dataset for classification, compare different
classifiers, and evaluate their accuracy on the dataset.
This thesis is structured as follows. In Chapter \ref{ch:scheduler}, we describe
the Completely Fair Scheduler -- the current process scheduler of the Linux kernel --
the architecture of symmetric multiprocessing systems, and how the scheduler handles
them. In Chapter \ref{ch:measurement}, we describe the performance measurement of
the Linux process scheduler: first we introduce the benchmarks used in Red Hat for
load testing the scheduler's performance, and then we describe complementary tools
for analyzing the behavior of the scheduler and of the system. Ways of storing the
collected data and visualizing them for comparison are described in Chapter
\ref{ch:processing}.
In Chapter \ref{ch:timelines}, we propose a~new way of visualizing comparisons of
long-term results, called \emph{Timelines}. In Chapter
\ref{ch:ai}, we propose the utilization of machine learning methods for automatic
classification of a~comparison of two results in order to recognize a~performance
regression.
\chapter{Linux process scheduling} \label{ch:scheduler}
The process scheduler is the part of an operating system which assigns processor
time to tasks. Its main goal is to maximize the effectiveness of CPU usage and to
ensure fairness of the CPU time assigned to each task.
There are two opposing targets for a~scheduler: maximizing the throughput or
minimizing the latency. A~lower number of context switches leaves more CPU time for
tasks, but increases the response time to system events.
While users' workstations aim for a~low response time, computational servers
require high throughput. The scheduler can usually be tuned to fit the intended
purpose.

In this chapter, we describe the basic behavior of the Linux process scheduler.
Then we compare uniform and non-uniform memory access on multiprocessor
architectures and describe how the scheduler handles them.
\section{Completely Fair Scheduler}
The Completely Fair Scheduler (CFS) is the current Linux process scheduler; it was
merged into version 2.6.23 of the Linux kernel in 2007. Its author is Ingo
Molnár, who also wrote the previous \emph{O(1) scheduler}.

CFS queues tasks in a~red-black tree ordered by the time each task has spent
running on the CPU so far. A~red-black tree is a~binary search
tree with a~self-balancing mechanism based on marking nodes with either red or
black color. When the scheduler needs to choose the next task to run, it takes
the leftmost node, i.e. the task with the lowest execution time.

The time complexity of CFS scheduling is O(log N), where N is the number
of running tasks. Taking the leftmost node
with the next scheduled task can be done in constant time, but re-queuing the task
requires O(log N) operations to insert it back into the red-black tree. Even
with this higher scheduling complexity, CFS offers better fairness and
responsiveness than the previous O(1) scheduler, which used a~simple queue to
choose the next task.
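
To illustrate the idea in a~highly simplified form (ignoring task weights,
preemption, and the actual kernel data structures; this is only a~sketch, not
kernel code), the following Python model always picks the task that has run the
least and re-inserts it after its time slice:
\begin{verbatim}
import heapq

runqueue = []  # ordered by accumulated runtime ("vruntime")

def enqueue(task, vruntime=0.0):
    heapq.heappush(runqueue, (vruntime, task))   # O(log N) insertion

def pick_next():
    # The task with the lowest vruntime corresponds to the
    # leftmost node of the red-black tree in CFS.
    return heapq.heappop(runqueue)

for name in ("A", "B", "C"):
    enqueue(name)

for _ in range(6):
    vruntime, task = pick_next()
    vruntime += 1.0                              # pretend it ran for 1 ms
    enqueue(task, vruntime)
    print("ran", task, "vruntime:", vruntime)
\end{verbatim}
The tasks alternate fairly, because whichever task has accumulated the least
runtime is always selected next.
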
On multi-core systems, the scheduler uses a~separate queue for each core. In
order to use the processing power effectively, the scheduler must regularly
balance those queues by moving processes from the busiest cores to idle ones.
When moving processes between cores, the scheduler takes into account the topology
of the system. Losing cached data after a~migration can have a~bigger
impact on performance than leaving the process on the busy core.

CFS solves this problem by using scheduling
domains\;\cite{understanding-linux-kernel}. A~scheduling domain is a~set
of CPUs that should be balanced among themselves. CPUs in a~scheduling domain
are divided into groups. The scheduler then checks the load of whole groups to
decide whether processes need to be migrated between them.
There are multiple levels of scheduling domains with different parameters, such as
how often the load difference is checked or how big the load difference between
groups must be for tasks to be migrated to balance the queues.

The lowest level is between hyper-threaded logical cores, where almost no cached
data is lost and rebalancing can be done quite often.
A higher level is between physical processor cores, where cache losses have
a~bigger impact on the decision to migrate a~task.
Above those are processor sockets on machines with multiple physical
processors, which can have different access speeds to different memory sections.
Scheduling domains are regularly rebalanced by going up from the bottom of the
scheduling domain hierarchy (illustrated in Figure \ref{fig:sched-domains}) and
checking the balance of the groups on each level.
\begin{figure}
\centering
\includegraphics[width=4cm]{sched-domains}
\caption{Hierarchy of scheduling domains\;\cite{sched-groups-lwn}}
\label{fig:sched-domains}
\end{figure}
\section{Scheduling on SMP systems}
Symmetric multiprocessing (SMP) is an architecture of computers with multiple
physical processors that have a~single shared memory, shared access to all IO
devices, and that run on the same instance of the operating system. This allows the
machine to offer more processing power with only a~small overhead caused by memory
sharing.
Each processor still has its own high-speed cache, but due to memory sharing,
\emph{cache coherence} must be maintained -- the data shared between processors
in their caches must be consistent.

There are two ways of accessing the shared memory from multiple processors:
uniform memory access (UMA) and non-uniform memory access (NUMA). When used
correctly, NUMA technology offers a~higher performance capability than a~similarly
configured UMA system. In our experience, most paying customers of Red Hat use
NUMA technology, and we will therefore primarily focus on scheduler behavior on this
SMP architecture.
\subsection{Uniform memory access}
In the UMA architecture, all processors share a~single memory controller that
they use to access the shared memory. Therefore, each processor has the same
speed of memory access and the same latency. They share a~common access route to
the memory, which brings more simplicity at the cost of a~lower bandwidth and
speed.
In this architecture it is easier for the scheduler to balance processes
between the physical processors. The time to access the shared memory is the
same on all of the cores and therefore there is no need to move the memory
associated with a~process to any other place for faster access.
\subsection{Non-uniform memory access}
The NUMA architecture tries to solve the problem of low bandwidth.
It arranges physical processors or cores into nodes, where each node has its
own separate memory and bus for faster access. When used correctly, this
significantly improves the overall memory throughput of the system.

Nodes also have to be connected to each other so that they can access the memory
of other nodes. This is achieved using either interconnecting buses or controllers.
Each manufacturer has its own interconnection technology: Intel uses the
Ultra Path Interconnect, which replaced the QuickPath Interconnect found in older
machines, while AMD uses Infinity Fabric, a~superset of the older
HyperTransport.

On bigger machines with a~large number of processors, not every two processors
are necessarily connected directly. Instead, they may access the data through a~path
of connected nodes. This can be seen in Figure \ref{fig:proliant}, which shows an
8-node NUMA machine with an advanced structure of interconnect buses and
controllers.
%Best Practices When Deploying Linux on DL980 4AA3-6556ENW.pdf
\begin{figure}
\centering
\includegraphics[width=14cm]{proliant}
\caption{Architecture of NUMA communication on HP ProLiant
DL980\;\cite{proliant}. Each processor together with its own local memory
makes up a~NUMA node. Each pair of nodes has a~dedicated interconnect bus for
faster data transfer between them. Communication with other nodes is
realized through node controllers. The topology of the interconnect buses
shows that there will be 4 different access speeds depending on the distance
between the nodes. Local access is the fastest, followed by access to
the neighboring node and access through one controller; the slowest is
access through both controllers. The interconnect buses are doubled to avoid
overloading either of them.}
\label{fig:proliant}
\end{figure}
A~consequence of the interconnection between NUMA nodes is a~different latency
between each pair of nodes, which must be taken into account when balancing tasks
among them. The differences in access latency between pairs of NUMA nodes for the
machine from Figure \ref{fig:proliant} can be seen in the following part of the
output of the command \verb|numactl --hardware|:
\begin{minipage}{\linewidth}
\begin{verbatim}
node distances:
node 0 1 2 3 4 5 6 7
0: 10 12 17 17 19 19 19 19
1: 12 10 17 17 19 19 19 19
2: 17 17 10 12 19 19 19 19
3: 17 17 12 10 19 19 19 19
4: 19 19 19 19 10 12 17 17
5: 19 19 19 19 12 10 17 17
6: 19 19 19 19 17 17 10 12
7: 19 19 19 19 17 17 12 10
\end{verbatim}
\end{minipage}\\
The \texttt{numactl} utility also provides the possibility of pinning a~process to
a~specific NUMA node to ensure that the process and its memory stay on the
intended node. This allows the user to arrange processes manually in whatever way
is considered best for maximum performance.
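
For example, a~process together with its memory allocations can be bound to node 0
as follows (the binary name is only a~placeholder):

\begin{minipage}{\linewidth}
\begin{verbatim}
numactl --cpunodebind=0 --membind=0 ./benchmark_binary
\end{verbatim}
\end{minipage}
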
Balancing tasks between NUMA nodes is difficult for the scheduler, since it needs
to take into account the expensive movement of memory to, or access from, different
nodes. With a~wrong approach, the performance of a~NUMA system can drop even below the
performance of a~similar UMA
system\footnote{http://highscalability.com/blog/2013/5/30/google-finds-numa-up-to-20-slower-for-gmail-and-websearch.html}.
Balancing processes between NUMA nodes is still under active development, which
brings many changes. These usually improve performance; however, there are cases
when a~change causes a~performance regression. Therefore, it is essential to
carry out thorough performance testing of the scheduling.
\chapter{Performance measurement} \label{ch:measurement}
Performance testing is the examination of system behavior under a~workload and of the
efficiency of its resource usage. For many systems, the time in which they are able to
respond is crucial, and this property affects the usability of the system.
Compared to functional testing, performance testing does not produce an exact true
or false result. It produces a~set of numerical values which must be compared to
presumed values or to values from another version in order to make a~conclusion.
After the performance measurement, the next step is an inspection of the behavior of
the system to understand the measured values and to determine the causes of any
difference from the expected values.

In order to measure the performance of a~scheduler, we use benchmarks.
Benchmarks generate artificial load imitating the load in a~real environment.
While stress testing the system, they also measure its performance. A~benchmark
typically returns a~value representing the performance of the system, usually in
the form of the time that the task needed to finish or the number of operations
that the system could perform per unit of time.

In the previous chapter, we introduced the current state of task scheduling in
the Linux kernel.
The efficiency of scheduling and of task migration between processors affects
the number of tasks that the system can handle and the time that a~task
takes to finish.
Although the available benchmarks generate values that are suitable for comparison, they
do not provide any more detailed information about how the system achieved the
measured performance or where the possible bottlenecks could
be\;\cite{active-benchmarking}.
To get better insight into the behavior of the system, there are many tools for
collecting performance records. Useful information about the
scheduler includes the assignment of tasks to processor cores, the time that
tasks spent out of the CPU in queues, the load of each processor core, or the
location of the memory of processes on NUMA systems.
\section{Performance metrics} \label{sec:metrics}
We now present typical metrics used in performance measurement.
\paragraph{Throughput} This metric represents the number of operations per unit of
time. The operations can be, e.g., floating point operations or specific tasks
defined by a~benchmark (e.g. the SPEC Java benchmarks). The time unit is usually
a~second. Higher values mean better performance.

\paragraph{Run time} This metric is simply the real time that the benchmark needed
to finish its execution. It is mostly presented in seconds, but longer runs can be
presented in a~more human-friendly format, converting the time to hours, minutes,
and seconds. The lower the run time is, the better the performance it represents.

\paragraph{Time out of CPU} This metric represents the time that the benchmark spent
in the process queue waiting for execution. It can be calculated from the user,
kernel, and real time provided by the \emph{time} utility and from the number of
threads that the application used:
$$T_{out\_of\_cpu} = (T_{real} \times used\_threads) - (T_{user} + T_{kernel})$$
The lower this value is, the better the behavior of the scheduler it represents.
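
For illustration, the metric can be computed from the values reported by the time
utility with a~minimal Python sketch (not part of the measurement tooling; the
example numbers are only illustrative):
\begin{verbatim}
def time_out_of_cpu(real, user, kernel, used_threads):
    # real, user, kernel: seconds reported by /usr/bin/time
    # used_threads: number of threads the benchmark ran with
    return real * used_threads - (user + kernel)

# Illustrative values: a 31.0 s run with 2 threads that spent
# 58.9 s in user mode and 2.2 s in kernel mode in total.
print(time_out_of_cpu(31.0, 58.9, 2.2, 2))  # approximately 0.9 s out of CPU
\end{verbatim}
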
\section{Benchmarks}
In this section, we describe the benchmarks that are currently used
to evaluate the performance of the latest kernel versions. The benchmarks are
usually based on real applications used both in scientific and in business
environments and are meant to stress test the system.

The benchmarks run in many threads or processes and feature communication
between them in order to exercise communication between processors. A~bad
distribution of processes and threads by the scheduler increases the time they
spend waiting in the queue, and the performance of the system naturally goes down.
Moreover, on NUMA systems, the performance also depends on the placement of data
in memory.
\subsection{NAS Parallel Benchmarks}
The NAS Parallel Benchmarks\;\cite{nas-parallel} are a~set of programs focused on
the performance of highly parallel computations on supercomputers. In addition to
floating point computations, they target communication and data movement among
computation nodes. The implemented algorithms are derived from large-scale
computational fluid dynamics applications at the Numerical Aerodynamic Simulation
(NAS) Program based at the NASA Ames Research Center.

The benchmarks are written in Fortran-90 or in C, as these were the most
commonly used programming languages in the scientific parallel computing community
at the time the benchmarks were created. They can be compiled with
different problem-size classes to suit machines with different amounts of
memory and computational power.

The main output value of the benchmark is the throughput measured in units
called Mop/s (millions of operations per second), representing the number of
floating-point operations per unit of time.

The benchmark also offers a~few parameters that can be passed to it before
execution. One of them is the number of computation threads; a~lower number
prolongs the run time, but allows one to measure the behavior of
the system without full utilization. Figure \ref{fig:nas} shows an example of
throughput with different numbers of threads on a~machine with 24 physical cores
and hyper-threading.
The downside of this benchmark is that it can only run with a~fixed dataset, not
for a~fixed time period. This constraint makes the run time of the benchmark
with fewer threads longer.
\begin{figure}
\centering
\includegraphics[width=12cm]{nas}
\caption{Example of Scalar Penta-diagonal solver results from the NAS Parallel
benchmark with different numbers of computational threads. The size of the
boxes represents the measurement inaccuracy caused by noise and by the
non-deterministic behavior of the scheduler.}
\label{fig:nas}
\end{figure}
\subsection{SPECjbb2005}
The Java Business Benchmark\;\cite{jbb2005}, created by the Standard Performance
Evaluation Corporation (SPEC), behaves as a~server-side Java application. It is
primarily focused on measuring the performance of the core Java implementation, but it
also reflects the performance of the operating system and of the CPU itself. It models
the system of a~wholesale company as a~multitier application. The benchmark itself
creates the load, measures the throughput, and also generates a~simple report in
HTML and raw text formats.

The main output value is the \emph{throughput} measured in units called
\emph{SPECjbb2005 bops}\footnote{Business operations per second}. In case we
use multiple JVM\footnote{Java virtual machine} instances, there is a~second unit
called SPECjbb2005 bops/JVM representing the average throughput of a~single JVM
instance. Another collected metric is memory consumption, which is not that
important for scheduler performance monitoring.
\subsection{SPECjvm2008}
The Java Virtual Machine Benchmark\;\cite{jvm2008} is another benchmark from SPEC,
focused on the Java Virtual Machine and the Java Runtime Environment, but also
reflecting the behavior of the process scheduler and memory management.
It consists of separate operations using real-life applications (e.g. Sunflow
rendering, the Java compiler) or stressing a~specific part of the Java implementation
(e.g. cryptography algorithms, working with XML documents, the SciMark floating
point benchmark). The operations run in multiple threads sharing data both at the
application level and in a~common JVM instance.

The output of the benchmark uses the \emph{ops/m} unit describing the number of
executed operations per minute.
\subsection{LINPACK benchmark}
The LINPACK benchmark\;\cite{linpack} comes from the LINPACK package, which was used to
solve systems of linear equations. It is primarily
used to measure the floating point computation power of large machines at both
single and double precision. It is also used by the TOP500 project for building the
list of the 500 most powerful computers.

The main measured value of the benchmark is the number of floating point operations
per second (\emph{flops}), but the benchmark output provides more information, such as
run time, page faults, context switches, or the core or NUMA node on which it ran.
\subsection{STREAM benchmark}
The STREAM benchmark\;\cite{stream} is a~simple benchmark aimed at measuring memory
bandwidth and the computation speed of simple vector kernels. Its motivation is
the slow growth of memory performance compared to the CPU, which makes the speed of
this component also crucial for the performance of the whole system.

The output metric of the benchmark is the memory bandwidth in \emph{MB/s}, but it
provides the same statistics as the LINPACK benchmark: run time, page faults,
context switches, or the core or NUMA node on which it ran.
\section{Performance analysis tools}
Although benchmarks create a~workload on the tested system and collect some metric of
its performance, tools for performance analysis are needed to get better insight
into the behavior of the system. The interesting information includes the utilization
of processor cores, the migration of the benchmark process and its memory between NUMA
nodes, or the time the process spent in a~queue out of the CPU.

This section describes the utilities used to collect such data.
\subsection{time}
Time is a~simple command for measuring the time that an application spends
running. The most common numbers it reports are the total real time that the
application needed to finish, the time it spent in user mode, and the
time spent in kernel mode.

Many benchmarks report the execution time themselves, which can make this
utility unnecessary. However, the interesting metric is the time that the
application spent out of the CPU waiting in a~queue. This value is not provided
directly, but time provides the user, kernel, and real time from which the metric
can be calculated as described in Section \ref{sec:metrics}.

The command can be confused with the Bash builtin \texttt{time}, which provides
similar output, but the real binary can provide more verbose information with
the possibility of custom output formatting. It can usually be called as
\texttt{/usr/bin/time}.
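
For example, the real, user, and kernel times needed for the metric from Section
\ref{sec:metrics} can be printed with a~custom format (the benchmark binary is only
a~placeholder):

\begin{minipage}{\linewidth}
\begin{verbatim}
/usr/bin/time -f "real=%e user=%U kernel=%S" ./benchmark_binary
\end{verbatim}
\end{minipage}
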
\subsection{ps}
Ps is a~Linux command used to display information about active processes. Its name
stands for ``process status''. It can provide various information obtainable from
the virtual files in the \texttt{/proc} directory. The most common information
includes the process PID, the time spent on the processor, the state of the process,
the used memory, the associated terminal, the command that started the process, and
more.
Especially useful are the optional columns \texttt{PSR} and \texttt{NUMA}, which
show the number of the processor and the number of the NUMA node where the process is
currently running. Continuous monitoring of those values provides a~view of the
migration of the process during its run time.

The output of the command can be filtered in many ways. By default, it shows only
the processes of the current user and the current terminal, but it can list all
processes on the system. The listing can be filtered using parameters by
most of the columns of information it provides; it can be limited, for
instance, to a~specific terminal, an effective user, the children of a~specified
process, or by PID to a~single intended process.
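
A~simple way to observe migrations is, for example, to print the processor column
for a~given PID at regular intervals (the PID is a~placeholder; the NUMA column can
be added analogously where the installed \texttt{ps} supports it):

\begin{minipage}{\linewidth}
\begin{verbatim}
watch -n 1 'ps -o pid,psr,comm -p 12345'
\end{verbatim}
\end{minipage}
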
\subsection{mpstat}
This Linux command provides continuous information about the utilization of CPUs. It
can show the utilization of processor cores, of NUMA nodes, or of the whole system,
depending on the passed arguments. Some of the provided values are the utilization in
user space, the utilization in kernel space, the percentage of time spent waiting for
IO operations, the percentage of time spent handling interrupt requests, and the idle
percentage, when the system is idle and does not wait for any IO operation.

Mpstat can collect the data once when executed or at regular time intervals.
The regular collection of the utilization of CPU cores or NUMA nodes is
done throughout the run time of a~benchmark. With a~little processing of the data, it
is easy to see whether the distribution of load between the cores and the NUMA
nodes is even or not.
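
For instance, per-core and per-node utilization can be sampled every 5 seconds for
the duration of a~benchmark run (per-node reporting requires a~sysstat version that
supports the \texttt{-N} option):

\begin{minipage}{\linewidth}
\begin{verbatim}
mpstat -P ALL 5    # per-core statistics every 5 seconds
mpstat -N ALL 5    # per-NUMA-node statistics every 5 seconds
\end{verbatim}
\end{minipage}
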
\subsection{turbostat}
This tool measures hardware properties of CPUs with the x86
architecture. For each core, it reports its usage, frequency, temperature, and the
percentage of time spent in different idle states. For each socket, it reports its
power consumption.

There are two ways to run turbostat. It can be supplied with a~command to run,
in which case it returns the average values over the run time of the command. Without
a~command, it collects the statistics at regular time intervals.

Data from this tool can be used to analyze a~performance drop caused by the CPU
itself. This can happen due to a~frequency drop caused by overheating or by a~missing
workload. The power consumption data can be used to roughly compare the
power efficiency of both the scheduler and the physical CPUs, but the command
only provides the consumption of the CPUs and their RAM, not of the whole
machine.
\subsection{perf}
The Linux command \texttt{perf}, also known as \texttt{perf\_events}, is a~tool for
profiling using the performance counters subsystem of the Linux kernel. It provides
counting of \emph{hardware events} (e.g. CPU cycles, branch and cache misses),
\emph{software events} (e.g. context switches, page faults), or custom
\emph{tracepoints} (e.g. specific system calls, filesystem or network
operations).
It offers a~wide range of commands, of which the most used are:

\paragraph{\texttt{perf stat}} The command counts selected events during the
execution of a~process or during a~specified time period. It can observe events
belonging to the process or system-wide. The counted statistics are written out at
the end of the time interval or of the process execution.

\paragraph{\texttt{perf record}} The command records the events to the
\texttt{perf.data} file for later analysis.

\paragraph{\texttt{perf report}} The command reads the \texttt{perf.data} file
created by \texttt{perf record} and displays the collected statistics.

\paragraph{\texttt{perf top}} The command provides a~live analysis of the system by
showing all observed function calls ordered by the number of cycles spent in them.
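
As an example, scheduler-related software events can be counted for a~single
benchmark run (the binary name is a~placeholder):

\begin{minipage}{\linewidth}
\begin{verbatim}
perf stat -e context-switches,cpu-migrations,page-faults ./benchmark_binary
\end{verbatim}
\end{minipage}
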
\chapter{Processing of results} \label{ch:processing}
Getting the output of benchmarks and performance analysis tools is just
one part of performance regression testing. To compare two results, it is
essential to store the results in an efficient way that allows quick creation of
a~comparison report. The quality of the comparison report is also affected by the
right choice and use of visualization of the results and their comparison.

This chapter is structured as follows. Section \ref{sec:storage} describes
ways of storing the results from benchmarks, and Section \ref{sec:visualization}
describes different visualization methods suitable for the analysis of performance
changes.
\section{Storage of the results} \label{sec:storage}
Benchmarks sometimes generate long human-readable output in text or even HTML
format. This is useful when analyzing a~single report: the output contains details
of the test run itself, simple resource usage, or the success of result validation.
However, the amount of results starts to rise with repeated runs, with different
numbers of instances, and with new kernel versions.

For the comparison of performance results, the number representing the throughput or
the time of each benchmark run is usually enough. Those numbers can be preprocessed
from the benchmark output files into a~format more suitable for quick access to
the required data.
\subsection{XML files}
XML is a~markup language that can store heterogeneous data in a~tree structure.
The tree structure can effectively represent a~test scenario running each
benchmark operation with different parameters and multiple repetitions.
Another feature of the XML format is human readability, offering quick insight into
the stored data with any text editor.
This comes with the disadvantage of redundant data in the form of repeated names of
tags and attributes, which often take more space than the data itself. Parsing the
data also takes a~considerable amount of CPU time, prolonging the duration of the
analysis.

In our team, we use the XML format for storing result values from benchmark runs
and their aggregated statistics for easier generation of performance comparison
reports. Next to the results, the configuration of the benchmark run is also stored,
containing properties of the system that the benchmark ran on. The properties include
the hostname of the testing machine, the version of the kernel and of the operating
system, or the configuration of the environment.

An example of the results from one run of the NAS Parallel benchmark scenario stored
in the XML format is shown in Figure \ref{fig:xml_result}. An example with aggregated
data is in Figure \ref{fig:xml_sums}. An example of an XML file with the properties
of a~benchmark run is shown in Figure \ref{fig:xml_config}.
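
For illustration, a~minimal Python sketch (not the actual tooling) that extracts the
throughput values from a~result file in the format of Figure \ref{fig:xml_result}
could look as follows:
\begin{verbatim}
import xml.etree.ElementTree as ET

# The file name is a placeholder for a stored result file.
tree = ET.parse("beaker_run_result.xml")

for nas in tree.iter("nas_result"):
    benchmark = nas.get("benchmark_name")
    for threads in nas.iter("threads"):
        mops = [float(r.get("mops")) for r in threads.iter("result")]
        print(benchmark, threads.get("number"), "threads:", mops)
\end{verbatim}
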
\begin{figure}
\small
\begin{verbatim}
<?xml version="1.0"?>
<beaker_run_result>
<test_result>
<nas_result benchmark_name="mg_C_x">
<threads number="2">
<result mops="5560.4" real_time="31.0" out_of_cpu_time="0.9"/>
<result mops="5411.4" real_time="32.2" out_of_cpu_time="1.6"/>
<result mops="5499.3" real_time="31.4" out_of_cpu_time="0.6"/>
<result mops="5407.4" real_time="31.9" out_of_cpu_time="1.3"/>
<result mops="5376.1" real_time="32.0" out_of_cpu_time="0.7"/>
</threads>
<threads number="4">
<result mops="10254.9" real_time="16.8" out_of_cpu_time="1.2"/>
<result mops="10075.4" real_time="17.1" out_of_cpu_time="1.7"/>
<result mops="10226.6" real_time="16.8" out_of_cpu_time="1.4"/>
<result mops="10250.2" real_time="16.8" out_of_cpu_time="1.3"/>
<result mops="10227.7" real_time="17.0" out_of_cpu_time="2.5"/>
</threads>
...
\end{verbatim}
\normalsize
\caption{This example shows the beginning of an XML file with the important values
from one NAS Parallel benchmark run scenario. The XML file starts with the root
element \texttt{<beaker\_run\_result>} and the \texttt{<test\_result>} node, which
wrap the \texttt{<nas\_result>} nodes representing the results from each benchmark
operation of the NAS Parallel benchmark suite. Each benchmark operation is run with
different numbers of threads in a~few loops to reduce the measurement inaccuracy.
Result nodes with the same number of threads are wrapped in a~
\texttt{<threads>} node. All the values are stored as attributes of the
corresponding node.}
\label{fig:xml_result}
\end{figure}
\begin{figure}
\small
\begin{verbatim}
<?xml version="1.0"?>
<beaker_run_result>
<test_result>
<nas_result benchmark_name="mg_C_x">
<threads number="2">
<mops mean="5450.9" stdev="68.4" first_q="5407.4"
median="5411.4" third_q="5499.3" max="5560.4" min="5376.1"/>
<total_time mean="31.7" stdev="0.4" first_q="31.4"
median="31.9" third_q="32.0" max="32.2" min="31.0"/>
<out_of_cpu_time mean="1.0" stdev="0.4" first_q="0.7"
median="0.9" third_q="1.3" max="1.6" min="0.6"/>
</threads>
<threads number="4">
<mops mean="10207.0" stdev="66.8" first_q="10226.6"
median="10227.7" third_q="10250.2" max="10254.9" min="10075.4"/>
<total_time mean="16.9" stdev="0.1" first_q="16.8"
median="16.8" third_q="17.0" max="17.1" min="16.8"/>
<out_of_cpu_time mean="1.6" stdev="0.5" first_q="1.3"
median="1.4" third_q="1.7" max="2.5" min="1.2"/>
</threads>
...
\end{verbatim}
\normalsize
\caption{This beginning of an XML file shows another form of stored data. Instead
of all the values obtained from the benchmark run scenario, only aggregated
statistical values computed from the sets of collected values for each configuration
are stored. The aggregated values include the minimum, maximum, mean, median,
quartiles, and standard deviation of metrics like throughput, run time, or
time spent in the queue for the CPU. Those values are directly usable for plotting
a~comparison graph without any further manipulation, lowering the time needed for
the generation of reports.}
\label{fig:xml_sums}
\end{figure}
\begin{figure}
\small
\begin{verbatim}
<?xml version="1.0"?>
<beaker_run_result>
<settings>
<BenchmarkName value="NASParallel"/>
<Distribution value="RHEL-7.5"/>
<Kernel value="kernel-3.10.0-862.el7.x86_64"/>
<Architecture value="x86_64"/>
<TunedProfile value="throughput-performance"/>
<SELinux value="Enforcing"/>
...
\end{verbatim}
\normalsize
\caption{Example of an XML file with the properties of a~benchmark run. The
\texttt{<settings>} node contains a~flat structure of key-value pairs stored as
the node name and the attribute \texttt{value} holding the actual value of the
property. The information includes the benchmark name, the version of the kernel
and the operating system, the hostname and architecture of the testing machine,
and various system and environment parameters that could affect the performance
measurement results.}
\label{fig:xml_config}
\end{figure}
\subsection{Relational database}
A~relational database is a~type of database using the relational model. The relational
model stores data in tables, using rows for records and columns for their
attributes. Each row represents a~unique record with its attributes in columns,
and each column stores attribute values of the same data type. Records in different
tables can be connected through relationships.

There are many database management systems implementing the relational database
model, available under various licenses. Among the open-source ones we can name
PostgreSQL, SQLite, MySQL, or its fork MariaDB. The proprietary category includes
implementations from companies such as Oracle, Microsoft, or IBM.

Data in a~database are managed using SQL (Structured Query Language), which provides
commands for storing, manipulating, and retrieving data. With advanced joining of
tables and filtering, it offers wide possibilities of processing the data already at
the point of reading them.

A~database offers much faster access to data without complicated parsing of text
files. Searching through the data can be made much faster by indexing the
records by selected columns. Moreover, databases store data more
space-efficiently, directly in binary format, avoiding unnecessary conversions.

The efficiency of a~database comes with disadvantages.
Storing the data in binary form eliminates the possibility of quick insight into the
data, as with XML files. To access any data, it is necessary to write an SQL query
requesting the specific information from the storage.
More complications come with the design of the database tables. Different benchmarks
produce different types of results, and the requirements for the stored data can
change over time. This requires either building universal complex structures or
making occasional changes to the database model.

In our team, we are currently considering the option of storing benchmark results in
a~database, which would require a~lot of work to migrate from the
current storage in XML files.
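
As a~rough illustration of what such storage could look like (a~minimal sketch with
a~hypothetical schema and illustrative values, not a~design decision), benchmark
results could be stored and queried with SQLite as follows:
\begin{verbatim}
import sqlite3

# Hypothetical schema: one run of a benchmark operation per row.
conn = sqlite3.connect("results.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS result (
        id        INTEGER PRIMARY KEY,
        kernel    TEXT,
        benchmark TEXT,
        threads   INTEGER,
        mops      REAL
    )""")
conn.execute("INSERT INTO result (kernel, benchmark, threads, mops) "
             "VALUES (?, ?, ?, ?)",
             ("kernel-3.10.0-862.el7.x86_64", "mg_C_x", 2, 5560.4))
conn.commit()

# Read the stored values of one configuration back for comparison;
# aggregation (median, quartiles) is left to the application.
rows = conn.execute("SELECT mops FROM result "
                    "WHERE benchmark = ? AND threads = ?",
                    ("mg_C_x", 2)).fetchall()
print([r[0] for r in rows])
\end{verbatim}
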
\section{Visualization of the results} \label{sec:visualization}
Effective analysis of the results of a~performance measurement requires delivering
the comparison in a~form in which a~human can quickly see the differences in the
measured values and their severity.

Visualization offers this advantage over the raw text data collected from
benchmarks and performance analysis tools. It allows us to see important relations
within the collected data much faster, often using a~smaller display area
than the raw data.
The right visualization also allows us to deliver more easily understandable data
even to people who do not work with performance analysis on a~daily basis.
\subsection{Line graphs}
Line graphs are the simplest method of displaying the course of values of a~variable
dependent on a~parameter in a~two-dimensional space. They make it easy to spot the
nature of the plotted values -- either increasing, decreasing, or constant. The
data can be plotted as discrete values using points or as a~continuous function
using a~line. A~comparison of several variables from different datasets is done by
plotting multiple lines in the same graph, one for each variable. Figure
\ref{fig:linegraph} shows an example of line graphs displaying the CPU utilization
of the system's NUMA nodes and the ratio of accesses to the memory of remote nodes.

Although line graphs are quick and simple to create and use, they fail to scale
to a~larger number of lines in a~single graph; too many lines become
confusing and impossible to read. The graphs of the CPU usage of each NUMA node in
Figure \ref{fig:linegraph} are still readable, but this approach would be impossible
to use for the utilization of every CPU core.
\begin{figure}
\centering
\includegraphics[width=14cm]{numa_behavior}
\caption{An example of line graphs showing statistics collected using the numatop
utility. The graphs in the top row show the CPU utilization of each NUMA node
over time, and the graphs in the bottom row show the ratio of accesses to the local
memory of the node and to the memory of remote nodes. The graphs on the left show
the expected behavior of a~scheduler, resulting in uniform utilization of each node
and a~minimal number of accesses to the memory of remote nodes. The utilization
graphs use a~linear y axis and the memory access graphs use a~logarithmic y axis.
The plotted values were collected on a~4 NUMA node machine under a~workload from
the NAS Parallel benchmark running in 4 threads.}
\label{fig:linegraph}
\end{figure}
\subsection{Heat maps}
Heat maps are three-dimensional graphs which use color as the third
dimension for the values. This allows plotting the dependency of the values on two
parameters, whereas line graphs must use multiple lines to plot the same data.
Heat maps provide better scalability for larger data, where line graphs would become
confusing with too many lines. It is also much easier to spot correlations
with the additional dimension compared to line graphs.

Figure \ref{fig:heatmap} shows a~heat map of the utilization of all CPU cores over
time under a~workload. The data was collected by the mpstat utility and processed to
show the sum of the user and kernel space utilization of each core. Plotting this
data using a~line graph with a~line for every core would be confusing even for this
relatively small number of CPUs.
\begin{figure}
\centering
\includegraphics[width=14cm]{heatmap}
\caption{Example of a~heat map showing CPU utilization over time. The machine
with 24 logical CPUs is under a~workload from the NAS Parallel benchmark running
in 16 threads.}
\label{fig:heatmap}
\end{figure}
Another use of a~heat map is shown in Figure \ref{fig:numa_heatmap}. It does not
show the utilization of threads, but their location on NUMA nodes, collected by
the ps utility. This heat map shows the migration of threads between NUMA nodes, and
the desired result is not the highest value, but the minimum of color changes in
each line. The shown data comes from the NAS Parallel benchmark, which was run with
16 and 24 threads in 5 loops. The heat map shows better scheduler behavior in
the 16-thread run than in the 24-thread run.
\begin{figure}
\centering
\includegraphics[width=14cm]{numa_heatmap}
\caption{Example of a~heat map showing thread migration between NUMA nodes. The
desired result is not the highest value, but the minimum of color changes in
each line. The shown result comes from a~machine with 24 logical CPUs running
the NAS Parallel benchmark with 16 and 24 threads in 5 loops.}
\label{fig:numa_heatmap}
\end{figure}
\subsection{Box plots}
A~box plot is a~method for displaying the statistical properties of data from multiple
measurements. It extends the simple visualization of discrete values by adding
to the median also the minimum and maximum measured values and the
first and third quartiles of the measurement.
In some cases, the whiskers for the minimum and maximum represent the standard
deviation or the 2\textsuperscript{nd} and the 99\textsuperscript{th} percentile. Data
out of this range is displayed as standalone points above or below the whiskers.

Box plots are great for illustrating the dispersion and variation of data that do not
follow an exact normal distribution. All of the displayed marks show the accuracy and
reliability of the measurement. This insight helps to distinguish a~real performance
regression from noise caused by the unpredictable behavior of the scheduler and by
measurement error.

Figure \ref{fig:boxplot} shows an example of a~box plot displaying the throughput
measured by the NAS benchmark with different numbers of threads.
\begin{figure}
\centering
\includegraphics[width=12cm]{boxplot}
\caption{Example of a~box plot showing the throughput measured by the NAS
benchmark from multiple measurements with different numbers of threads. The
last two boxes and their whiskers show that the range of measured values
is the same despite the big difference in the medians. Therefore, the plotted
result is treated as passed (without any performance regression).}
\label{fig:boxplot}
\end{figure}
\chapter{Visualization using timelines} \label{ch:timelines}
A common way to analyze performance reports is to compare two results measured
for different versions or settings. Usually, these are called the baseline and
the target results or profiles. The comparison of two results allows one to
interpret the measurement and the changes between versions, usually providing clues
to the cause of possible performance changes, e.g. a~change in the value of some
parameter or new code functionality.

However, sometimes it is not enough to compare just two consecutive
versions. If, instead, we analyze a~larger number of results over a~longer period
of time, we gain a~whole new perspective. In that case, the difference between
a~deviation caused by measurement error and a~real performance change can be much
more visible. It is also easier to find the versions where
a~performance degradation occurred and where it was fixed.
With a~larger amount of data, we can also detect creeping performance drops, which
appear continuously over a~longer period of time and could not otherwise be detected,
because they stay within the tolerance given by the measurement deviation.

This chapter proposes a~new kind of report comparing multiple results of
performance measurements from the benchmarks used by the Red Hat Kernel Performance
Quality Engineering team. The visualization of performance results helps to see
the performance of the Linux kernel scheduler over a~wider range of time as well as
to determine the stability and precision of different benchmarks.
\section{Design}
We mainly focus on the performance of kernel versions on a~specific testing machine.
The timelines generator outputs HTML reports showing the benchmark results of
the desired kernel versions on a~single machine. We describe the mockup of the
produced HTML page in Figure \ref{fig:mockup}. The displayed results are specified
using rules for a~base kernel, serving as the first reference result, and for target
kernels forming the actual timeline.
\begin{figure}
\centering
\includegraphics[width=8cm]{mockup}
\caption{Mockup of the timelines report page. The report contains results from
a~single testing machine fulfilling the specified rules. Below the table with the
rules, we display graphs of the available benchmarks. Each
benchmark section has a~separate graph for each of its operations and
a~table below the graphs containing the featured values from each operation of the
benchmark, with links to the results in our result database.}
\label{fig:mockup}
\end{figure}
We propose that the most suitable type of graph for displaying results with
repeated runs is the box plot. Box plots show the important statistical values from
the runs: the median, the minimum, the maximum, and the first and third
quartiles. Those values can quickly reveal the stability of the plotted benchmark and
the noise in the measurement, which helps to distinguish true performance
degradations.

Each graph of a~benchmark operation contains results from all thread
configurations, but only one desired configuration is visible by default.
The desired number of threads is the point where performance regressions create
the biggest difference, which is in most cases the highest number of threads.
The graphs also contain horizontal lines in the background following the median of
the base result and its value increased and decreased by 5\%. Those lines allow
more effective recognition of significant performance changes without looking at
the absolute values of the measurements.

Under the graphs of all operations of a~benchmark, there is a~table containing the
medians of the featured thread runs for each benchmark operation of every displayed
kernel, allowing browsing of the absolute result values. Each record also works as
a~link to the corresponding result record in the database of benchmark results of
the Red Hat Kernel Performance QE team.
\section{Implementation}
This section describes selected aspects of the implementation of the timelines
report generator. The generator is implemented in Python 2 due to the earlier
origins of its implementation.
\subsection{Comparison rules}
For automatic report generation, it is essential to allow defining rules which
specify the results that can be used and in which role (i.e. whether they
correspond to the baseline or to a~target). We propose to use \emph{regular
expressions} to match the properties of results. Regular expressions offer broad
possibilities to describe the shape of a~kernel version or the value of any
environment configuration.
E.g., to filter all builds of kernel 4.18, we can use the simple pattern
\texttt{kernel-4.18\textbackslash ..*}.

We store the rules in an XML file with the same node naming as in the XML file
with the result properties. The first level of the XML document contains three nodes
representing the purposes of the rules, as illustrated by the sketch after the
following list.
\begin{itemize}
\item \textbf{Baseline rules} specify the first result in the plotted set.
These act as the main result that the others are compared to. In case of
multiple results fitting the rule, the newest one will be used.
\item \textbf{Target rules} define the results to be plotted.
\item \textbf{Starting rules} (optional) cover the case when the base result is
not from the set of target results and specifying the first target result with
a~regular expression would be hard.
\end{itemize}
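
The following Python sketch (simplified, with hypothetical property names and
values) illustrates how such rules can be matched against the properties of
a~result:
\begin{verbatim}
import re

# Hypothetical rules: one regular expression per result property.
baseline_rules = {"Kernel": r"kernel-4\.18\.0-80\..*"}
target_rules   = {"Kernel": r"kernel-4\.18\..*",
                  "Distribution": r"RHEL-8.*"}

def matches(rules, properties):
    # A result is selected only if every rule matches the
    # corresponding property of the result.
    return all(re.match(pattern, properties.get(name, ""))
               for name, pattern in rules.items())

result_properties = {"Kernel": "kernel-4.18.0-80.el8.x86_64",
                     "Distribution": "RHEL-8.0"}
print(matches(target_rules, result_properties))    # True
print(matches(baseline_rules, result_properties))  # True
\end{verbatim}
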
\subsection{Reading of results}
In our case, the benchmark results are stored in directories in the filesystem. The
generator has to go recursively through the directory with the results of benchmark
runs and find the files containing the desired data. For each result, it starts with
the file containing the properties of the given result. This file provides metadata
about the benchmark run, including the time, the machine hostname, the kernel and OS
versions, the benchmark name, and the configuration of the environment, which are
used for selecting the desired results using the comparison rules. An example of the
XML file with the properties is in Figure \ref{fig:xml_config}.

After applying the rules, the generator reads the files of the selected results with
preprocessed data that are ready to use for drawing the box plot graphs. An example
of a~file with preprocessed values is in Figure \ref{fig:xml_sums}. The data parsed
from this XML file is all the generator needs to start drawing the timeline graphs.
\subsection{Plotting of timelines}
For displaying the data in graphs using box plots, we use the Highcharts
JavaScript library\;\cite{highcharts} for generating interactive SVG-based
graphs for HTML pages. It offers a~free license for non-commercial use with
available source code, but also paid licenses for commercial use. It provides
fast and easy creation of various types of graphs, including box plots, with wide
possibilities of customization.

On the HTML page, the charts are rendered as SVG images by a~JavaScript function
from the Highcharts library into a~\texttt{div} element specified by its id
attribute. The rendering function takes the parameters of the graph in JSON format,
including the type of chart, the axis parameters, the horizontal lines for the
reference median, the interactive tooltips, and the data series.
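
For illustration, a~minimal sketch of how the generator could assemble such chart
options (illustrative values; a~Highcharts box plot series expects the five values
minimum, first quartile, median, third quartile, and maximum per box):
\begin{verbatim}
import json

# Aggregated values as read from the result files; the first box uses
# the numbers from the aggregated XML example, the second is made up.
kernels = ["baseline kernel", "target kernel"]
boxes = [[5376.1, 5407.4, 5411.4, 5499.3, 5560.4],  # min, q1, med, q3, max
         [5100.2, 5150.8, 5180.5, 5210.0, 5260.3]]
base_median = boxes[0][2]

options = {
    "chart": {"type": "boxplot"},
    "title": {"text": "mg_C_x, 2 threads"},
    "xAxis": {"categories": kernels},
    "yAxis": {
        "title": {"text": "Mop/s"},
        # Horizontal reference lines: base median and +/- 5 %.
        "plotLines": [{"value": v} for v in
                      (base_median, base_median * 1.05, base_median * 0.95)],
    },
    "series": [{"name": "throughput", "data": boxes}],
}

# The JSON string is embedded into the generated HTML page and passed
# to the Highcharts rendering call.
print(json.dumps(options, indent=2))
\end{verbatim}
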
\subsection{Generator parallelization}
With multiple sets of rules for timeline reports entered, the generator can
create multiple reports, e.g. separate reports for Red Hat Enterprise Linux 7
and 8. The procedure of parsing the results and generating the graphs is bound more
by CPU than by IO operations.

To reduce the time needed to generate all reports, multiprocessing is used. There is
no need for exclusive access to any shared resource, because all of them are
used only for reading, not writing. For the implementation, Python 2 provides the
\texttt{multiprocessing} library with the \texttt{Process} class, which creates
a~child process by forking, allowing truly parallel generation of reports. Using only
threads in Python does not lead to parallel computation, because Python's
\emph{global interpreter lock} prevents multiple threads from executing
code simultaneously.
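
A~minimal sketch of this approach (simplified; \texttt{generate\_report} and the
rule-set file names are placeholders for the actual generator code):
\begin{verbatim}
from multiprocessing import Process

def generate_report(rule_set):
    # Placeholder for parsing results and writing one HTML report.
    print("generating report for", rule_set)

rule_sets = ["rhel7-rules.xml", "rhel8-rules.xml"]  # hypothetical names

# One forked child process per report; the results are only read,
# so no locking of shared data is needed.
workers = [Process(target=generate_report, args=(r,)) for r in rule_sets]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()
\end{verbatim}
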
\section{Generated output}
The output of the timeline generator is an HTML report containing graphs of different
benchmark operations with summary tables, as described in the mockup in Figure
\ref{fig:mockup}.

The graphs use the box plot method, which shows the measurement accuracy. This
feature helped to eliminate unstable benchmark operations from the run scenarios.
One of the unstable operations is shown in Figure \ref{fig:timeline_unstable}.
Thanks to the wide range of displayed results, it is easy to find the version where
a~performance regression occurred or was fixed. Such a~case is captured in Figure
\ref{fig:timeline_jvm_table}, which also shows a~part of the summary table. The
cause of the degradation is a~change in the rules for memory migration between NUMA
nodes. However, this change was not a~pure regression, because another benchmark
showed a~significant gain, displayed in Figure \ref{fig:timeline_nas}.
\begin{figure}
\centering
\includegraphics[width=15cm]{timeline_unstable}
\caption{Example of a~very unstable benchmark which was removed from the run
scenario because of irrelevant results. The yellow and orange lines show the
median of the base result and the 5\% differences, which are the usual threshold
for a~suspected performance regression.}
\label{fig:timeline_unstable}
\end{figure}
\begin{figure}
\centering
\includegraphics[width=15cm]{timeline_jvm_table}
\caption{Generated box plot timeline graph of the \emph{mpegaudio} operation from
the SPECjvm2008 benchmark with a~part of the table showing the numeric results of
the SPECjvm2008 operations. The table shows the medians from benchmark runs with
the featured number of threads, and the kernel names are links to our team's
database of results. It displays the performance results of upstream kernels for
Red Hat Enterprise Linux 8. The performance degradation is a~consequence of
changes in memory placement and migration behavior.}
\label{fig:timeline_jvm_table}
\end{figure}
\begin{figure}
\centering
\includegraphics[width=15cm]{timeline_nas}
\caption{Timeline box plot from the same report as Figure
\ref{fig:timeline_jvm_table}, showing the results of the \emph{Embarrassingly
Parallel} operation from the NAS Parallel benchmark. This benchmark gained
performance in the same version in which the SPECjvm2008 benchmark registered
a~performance degradation. Fortunately, this performance gain in the NAS Parallel
benchmark operation did not disappear after the degradation in the
SPECjvm2008 benchmark was fixed.}
\label{fig:timeline_nas}
\end{figure}
\chapter{Automatic evaluation} \label{ch:ai}
With every new release of a~regular kernel and its testing, the amount of produced