\documentclass[11pt]{book}
\usepackage{fullpage}
\usepackage{amsmath}
\usepackage{mathtools}
\usepackage{graphicx}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{cite}
\usepackage{times}
\usepackage{url}
\usepackage{setspace}
\usepackage{fancyhdr}
\usepackage{ifthen}
\usepackage{listings}
\usepackage{color}
\usepackage[section]{placeins}
\usepackage{xtab}
\usepackage[vlined]{algorithm2e}
\usepackage{float}
\usepackage{multirow}
\usepackage{paralist}
\usepackage{interval}
\usepackage{wrapfig}
\usepackage{grffile}
\usepackage{tocloft}
\usepackage[
colorlinks=true,
linkcolor=blue,
urlcolor=cyan,
backref=page,
pdftitle={Doctoral Dissertation of Sounak Gupta}
]{hyperref}
\setcounter{topnumber}{8}
\setcounter{bottomnumber}{8}
\setcounter{totalnumber}{8}
\renewcommand{\topfraction}{0.5}
\renewcommand{\bottomfraction}{0.95}
\renewcommand{\textfraction}{0.1}
\renewcommand{\floatpagefraction}{0.7}
\setlength{\abovecaptionskip}{3pt}
\setlength{\belowcaptionskip}{3pt}
\pagestyle{fancy}
\setboolean{@twoside}{false}
\setlength{\headsep}{25pt}
\setlength{\headheight}{14pt}
\setlength{\cftfignumwidth}{3.15em}
\definecolor{dkgreen}{rgb}{0,0.6,0}
\definecolor{gray}{rgb}{0.5,0.5,0.5}
\definecolor{mauve}{rgb}{0.58,0,0.82}
\lstset{frame=tb,
language=C++,
columns=flexible,
basicstyle={\linespread{0.9}\small\ttfamily},
breaklines=true,
captionpos=b,
aboveskip=0.2in,
belowskip=0.2in,
numberstyle=\tiny\color{gray},
keywordstyle=\color{blue},
commentstyle=\color{dkgreen},
stringstyle=\color{mauve},
tabsize=4
}
\newcommand{\plotTrial}[3]{
\begin{figure}
\begin{subfigure}{\textwidth}
\centerline{\includegraphics[height=.4\textheight]{data/#1_Simulation_Runtime_(secs.)}}
\subcaption{Simulation Runtime (in seconds)}
\end{subfigure}
\end{figure}
\begin{figure}
\ContinuedFloat
\begin{subfigure}{\textwidth}
\centerline{\includegraphics[height=.4\textheight]{data/#1_Event_Commitment_Ratio}}
\subcaption{Event Commitment Ratio}
\end{subfigure}
\end{figure}
\begin{figure}
\ContinuedFloat
\begin{subfigure}{\textwidth}
\centerline{\includegraphics[height=.4\textheight]{data/#1_Event_Processing_Rate_(per_sec)}}
\subcaption{Event Processing Rate (per second)}
\end{subfigure}
\caption{#2}\label{#3}
\end{figure}
}
\newcommand{\plotOverallNUMA}[3]{
\begin{figure}
\begin{subfigure}{\textwidth}
\centerline{\includegraphics[height=.4\textheight]{data/#1/plots/top_performers}}
\subcaption{Top Performers}
\end{subfigure}
\end{figure}
\begin{figure}
\ContinuedFloat
\begin{subfigure}{\textwidth}
\centerline{\includegraphics[height=.4\textheight]{data/#1/plots/worst_performers}}
\subcaption{Worst Performers}
\end{subfigure}
\caption{#2}\label{#3}
\end{figure}
}
\newcommand{\plotOverallSMP}[3]{
\begin{figure}
\centerline{\includegraphics[height=.4\textheight]{data_smp/#1/plots/top_performers}}
\caption{#2}\label{#3}
\end{figure}
}
\newcommand{\plotTrialNoLabel}[2]{
\begin{figure}
\begin{subfigure}{.85\textwidth}
\centerline{\includegraphics[width=.85\textwidth]{data/#1_Simulation_Runtime_(secs.)}}
\subcaption{Simulation Runtime (in seconds)}
\end{subfigure}
\end{figure}
\begin{figure}
\ContinuedFloat
\begin{subfigure}{.85\textwidth}
\centerline{\includegraphics[width=.85\textwidth]{data/#1_Event_Commitment_Ratio}}
\subcaption{Event Commitment Ratio}
\end{subfigure}
\end{figure}
\begin{figure}
\ContinuedFloat
\begin{subfigure}{.85\textwidth}
\centerline{\includegraphics[width=.85\textwidth]{data/#1_Event_Processing_Rate_(per_sec)}}
\subcaption{Event Processing Rate (per second)}
\end{subfigure}
\caption{#2}
\end{figure}
\begin{figure}
\ContinuedFloat
\begin{subfigure}{.85\textwidth}
\centerline{\includegraphics[width=.85\textwidth]{data/#1_Speedup_w.r.t._Sequential_Simulation}}
\subcaption{Speedup w.r.t. Sequential Simulation}
\end{subfigure}
\end{figure}
\FloatBarrier
}
\begin{document}
\raggedbottom
\thispagestyle{empty}
\doublespacing
%% TITLE page format based on thesis guidelines mentioned in
%% https://grad.uc.edu/student-life/etd/page_order.html
\begin{center}
\textbf{
\LARGE Pending Event Set Management\\
in\\
Parallel Discrete Event Simulation\\
}
\vspace*{0.4in}
{\large A dissertation submitted to the\\[0.20in]
Graduate School\\
of the University of Cincinnati\\[0.20in]
in partial fulfillment of the\\
requirements for the degree of\\[0.20in]
\textbf{\Large Doctor of Philosophy}\\[0.20in]
in the Department of Electrical Engineering and Computer Science\\
of the College of Engineering and Applied Sciences\\[0.20in]
June 18, 2018\\[0.20in]
by\\[0.20in]
\textbf{Sounak Gupta}\\
B.E., Bengal Engineering and Science University, Shibpur\\
June 2009\\
}
\vspace{0.4in}
{\large Committee Chair: Philip A. Wilsey, Ph.D.}
\end{center}
\clearpage
\setcounter{page}{1}
\pagenumbering{roman}
\clearpage
\chapter*{Abstract}\label{chapter:abstract}
In Parallel Discrete Event Simulation (PDES), the \emph{pending event set} refers to the set of events
available for execution. These pending events are aggressively scheduled for execution in a Time Warp
synchronized parallel simulation without strict enforcement of the causal relationship between events
\cite{fujimoto-00,jefferson-85}. For most discrete event simulation models, event processing granularity is
generally quite small. On many-core and multi-core platforms, this fine granularity aggravates
contention for the shared data structures which store these pending events. As the number of cores increases,
a key challenge lies in providing effective, contention-free event list management for scheduling events.
\emph{Lock contention}, \emph{sorting}, and \emph{scheduling order} are the prime contributors to contention
for access to the pending events set.
Possible solutions to this problem include atomic read-write operations \cite{gupta-14}, hardware
transactional memory \cite{hay-15,santini-15}, or synchronization-friendly data structures
\cite{dickman-13,gupta-17}. The focus is on choosing efficient data structures for the pending event set and
optimization of scheduling techniques that can improve the performance of the Time Warp synchronized parallel
simulation. The following design concepts for optimizing the pending event set are explored in this
dissertation:
\begin{enumerate}
\item an exploration of a variety of different data structures that are commonly used in the management of
the pending event set. In addition, the management of the pending event set using a \emph{Ladder
Queue} \cite{tang-05} data structure is explored. The Ladder Queue forms a self-adjusting, hierarchically
partitioned priority queue, which makes it particularly attractive for managing the pending event set;
\item the elimination of sorting within the Ladder Queue partitions. Events are then scheduled from the
lowest partition without concern for their time order, and causal independence of these events is assumed;
\item atomic read-write access to the Ladder Queue partition that holds the smallest available events
is explored;
\item objects containing groups of events are organized in the pending event set. Each group is dequeued
with one access and the collection of events is scheduled. Several different options for the definition
of these groups and for how these groups are scheduled are explored.
\end{enumerate}
Experimental results show that the fully-sorted Ladder Queue is over 20\% faster than the STL MultiSet and the
Splay Tree. It was also observed that the Ladder Queue with an unsorted lowest partition is at least 25\%
faster than the fully-sorted Ladder Queue for all discrete event simulation models studied. Atomic read-write
access to this lowest partition of the Ladder Queue allows simulation models to run 25--150\% faster than the
fully-sorted Ladder Queue. Furthermore, studies indicate that scheduling events in groups provides up to a
100\% gain in performance for some simulation models. The experiments also show that the modularity-based
partitioning of LPs makes a simulation run as fast as the Ladder Queue with atomic read-write operations for
some kernel and model configurations. Overall, the Ladder Queue with atomic read-write access to its unsorted
lowest partition has the best performance among the scheduling data structures studied, and multiple schedule
queues are the preferred scheduling strategy. Combining the two results in the most effective scheduling
mechanism: for some models there is an over 100\% boost in performance when this combination is compared to
any of the other efficient configurations mentioned above.
Each of the aforementioned concepts addresses contention for the shared data structures that store
the pending event set. The work on \emph{group} and \emph{bag-centric} scheduling draws inspiration from the
study on profile-driven data collected from discrete event simulations \cite{wilsey-16}. This study helped
guide the design and optimization of scheduling strategies for the pending event set in a Time Warp
synchronized parallel simulation.
%% BLANK page inserted based on thesis guidelines mentioned in
%% https://grad.uc.edu/student-life/etd/page_order.html
\clearpage
\thispagestyle{empty}
\hfill
\clearpage
\clearpage
\chapter*{Acknowledgements}
First and foremost, I would like to express my sincere gratitude to my advisor Dr.\ Wilsey for his constant
guidance, encouragement, and support throughout my journey as a graduate student. His deep insights into
the dynamics of parallel simulation, data-driven approach to research, and exhaustive review of my drafts
have motivated me to improve as a researcher.
I would like to thank the rest of my thesis committee, namely Dr.\ Fred Beyette, Dr.\ Ali Minai, Dr.\ Carla
Purdy and Dr.\ Nael Abu-Ghazaleh, for their insightful comments and questions which have encouraged me to
analyze my research from a broader perspective.
I also take this opportunity to thank my fellow lab-mates and collaborators, especially Tom Dickman and
Doug Weber.
This acknowledgement would remain incomplete without mention of my parents. I am grateful to them
for giving me a happy childhood and a good education. I am also grateful to my fianc\'ee Anuja Roy for
patiently tolerating my chaotic working hours and geekish hobbies. I also thank my cousin, Kingshuk Mallik,
for those pointless intellectual debates and help with my graduate applications and formatting of this
thesis.
Last but not least, this journey would not have been memorable and enjoyable without friendly fellow
travellers. I would like to thank them, especially Saibal Ghosh, Souvik Sarkar, Abhrojyoti Mondal, Sauradeep
Bhowmick and Dr.\ Maitreya Dutta (in no particular order).
\tableofcontents \markright{ }
\listoffigures \markright{ }
\listoftables \markright{ }
\listofalgorithms \markright{ }
\clearpage
\pagenumbering{arabic} \setcounter{page}{2}
\chapter{Introduction}\label{chapter:introduction}
The principal focus of this dissertation is \emph{Pending Event Set Management} for Parallel Discrete Event
Simulation (PDES) running on stand-alone multi-core processors. A Discrete Event Simulation (DES) models the
behavior of a physical system whose state can change in discrete time intervals due to stimulus provided by
events generated within the system. Events in DES are sequentially processed on a single processor and this
unacceptably slow for large simulation models. Parallel Discrete Event Simulation (PDES)
\cite{fujimoto-90,fujimoto-00} is an alternative to sequential DES that uses parallel algorithms to
efficiently execute and synchronize a DES on a parallel compute platform. There are several possible
synchronization mechanisms for PDES; some with a centralized mechanism and others with a distributed
synchronization mechanism. This thesis explores the design of the pending event set for a PDES synchronized
using the Time Warp Mechanism \cite{jefferson-85,fujimoto-90}. Time Warp is a distributed synchronization
mechanism that can be used with a variety of parallel processing architectures such as shared-memory
multiprocessors, clusters, or other systems that support parallel execution \cite{culler-97,patterson-11}.
The rapid growth of multi-core and many-core processors in recent years is encouraging programmers to explore
the benefits of hardware support for software-driven parallelism. However, the scalability of massively
parallel software systems is limited by contention among threads for shared resources. A software system
designed for processors with a low number of cores will not necessarily scale up proportionately when executed
on a many-core platform. This contention issue becomes particularly acute during scaling up of Time Warp
synchronized \cite{jefferson-85, jefferson-87} PDES on many-core processors.
Time Warp is a checkpoint and rollback recovery mechanism that allows a discrete event simulation to process
events greedily without explicit synchronization. It operates under the assumption that events will mostly
arrive in time for the parallel simulator to process them in their intended order and, therefore, minimize the
frequency of rollbacks. The events in a simulation are causally ordered based on their \emph{timestamp}. For
any given architecture, the `critical path' is a reference to the time needed to execute all events using any
conservative simulation mechanism (\emph{i.e.} without any causal violations) \cite{jefferson-91}. This
`critical path' is, therefore, the lower bound on the time against which all parallel simulations can be
compared. Due to the relatively fine-grained computational nature of most discrete event simulation models,
efficient shared access to the \emph{pending event set} scheduling mechanisms is of paramount importance to
obtaining peak performance from the parallel simulation engine. In particular, the shared data structures
holding the pending event set data and the mechanisms of safe access to them can have significant impact on
the overall performance of the simulation system. This is especially critical in shared memory, many-core
processing systems.
In light of these developments, the objectives for peak performance are to minimize contention for the shared
data structures that hold the pending event set, while also allowing available events to be scheduled close to
the critical path in order to minimize rollbacks. Lock contention, sorting, and order of event execution are
the key aspects that amplify contention for the shared data structures of the pending event set. The focus
here is to explore the limits of intra-node distributed event processing so that a Time Warp synchronized
parallel simulation can deliver scalable speedups with minimal overall rollback.
The significant avenues explored in this dissertation are:
\begin{itemize}
\item \textbf{Ladder Queue}: The \emph{Ladder Queue} \cite{tang-05} is a variant of the well-known
\emph{Calendar Queue} \cite{brown-88}. It is a novel data structure that can self-select the partition size
for the months of the calendar. This sizing problem is one of the main challenges that makes traditional
Calendar Queues difficult to employ. Thus, the \emph{Ladder Queue} \cite{tang-05} potentially provides the
performance benefits of a \emph{Calendar Queue} without the complicated issues accompanying the sizing of
time interval boundaries for each \emph{month (or partition)}. Dickman \emph{et al} \cite{dickman-13}
proposed that the Ladder Queue can efficiently hold and schedule the smallest timestamped events from each
Logical Process (LP) assigned to the multi-processor node. The `ladder' structure also presents some
additional opportunities for design space optimization that have been outlined in the next two bullets. A
detailed description of these design space optimizations is presented in Section \ref{subsec:ladderq} of
this dissertation.
\item \textbf{Ladder Queue with Unsorted Bottom}: The Ladder Queue organizes stored events into partitions
that can be arranged similar to rungs on a ladder. Conventionally, the Ladder Queue does not sort events
within a rung partition until the events in a particular partition are needed for execution. Since these
rungs partition events within fairly small time intervals, Gupta and Wilsey \cite{gupta-14} proposed that
events in the lowest partition can be processed without sorting. The hypothesis is that pending events
contained within a rung are highly likely to be causally independent and they can therefore be scheduled for
execution without further sorting based on their timestamp \cite{gupta-14}. The details of this
implementation/variation of the Ladder Queue are presented in Section \ref{subsec:unsorted_bottom}.
\item \textbf{Ladder Queue with Lock-Free Access to Unsorted Bottom}: Partitioning the pending event set into
a \emph{Ladder Queue} data structure provides direct access to the lowest timestamped events from the LPs.
However, since the \emph{Ladder Queue} is a shared data structure, the locking costs and the fine-grained
nature of discrete event simulation can trigger considerable contention and negatively impact overall
performance. Gupta and Wilsey \cite{gupta-14} proposed that atomic operations can be used to avoid lock
contention for the lowest time window in a \emph{Ladder Queue}. The implementation details of this are
presented in Section \ref{subsec:lockfree} (a minimal sketch of the underlying idea appears after this list).
\item \textbf{Group Scheduling}: A traditional parallel simulation engine will schedule events one at a time
for execution. Results from a related project on profiling discrete event simulation (called
\emph{DESMetrics}) \cite{wilsey-16} suggest that there may be significant causal independence between the
first few events in each LP's input event queue. Gupta and Wilsey \cite{gupta-17} examined this concept
with a preliminary implementation and defined two different mechanisms for defining \emph{event groups} to
help reduce the frequency of accesses to the pending event set. This approach further extends \emph{causal
independence} among events at the head of the pending event set (which contains the lowest timestamped
events). The hypothesis is that scheduling groups of events at a time will help minimize contention to the
pending event set. Two different group scheduling strategies were explored, namely: \emph{block scheduling}
and \emph{chain scheduling}. The design details for event groups are presented in Sections
\ref{subsubsec:chain_scheduling} and \ref{subsec:block_scheduling}.
\item \textbf{Profile-driven Bag Scheduling}: In another study by the DESMetrics group, Crawford \emph{et al}
\cite{crawford-17} examined the modularity of LPs based on events communicated between the LPs in a Discrete
Event Simulation. Their results suggest that it is possible to partition the LPs into groups with higher
inter-LP event exchanges (which mirrors results from other empirical studies \cite{alt-14}). This tighter
coupling also suggests that it might be possible to expand the group scheduling of events as outlined in the
above bullet based on their modularity. This novel group scheduling practice organizes the lowest
timestamped input events from each LP into a bag for scheduling. A detailed study of this approach is
contained in Section \ref{subsec:bags}.
\end{itemize}
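\noindent
As a concrete (and deliberately simplified) illustration of the lock-free idea referenced above, the following
C++ fragment sketches how scheduler threads might claim events from an unsorted lowest partition using an
atomic counter instead of a lock. The \texttt{UnsortedBottom} class and its members are hypothetical
placeholders for illustration only; they are not the \textsc{warped2} implementation.
\begin{lstlisting}
#include <atomic>
#include <cstddef>
#include <vector>

struct Event;  // placeholder event type (payload omitted)

// Hypothetical unsorted "bottom" partition: events are appended during a
// refill phase and then claimed concurrently by scheduler threads.
class UnsortedBottom {
public:
    explicit UnsortedBottom(std::vector<Event*> events)
        : events_(std::move(events)), next_(0) {}

    // Each thread atomically claims the next unclaimed slot; no lock is
    // needed on the dequeue path.  Because the partition spans a small time
    // window, the claimed events are assumed to be causally independent and
    // are scheduled without further sorting.
    Event* claim() {
        std::size_t i = next_.fetch_add(1, std::memory_order_relaxed);
        return (i < events_.size()) ? events_[i] : nullptr;  // nullptr: empty
    }

private:
    std::vector<Event*> events_;     // unsorted events in the lowest time window
    std::atomic<std::size_t> next_;  // next slot to hand out
};
\end{lstlisting}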
\section{Thesis Overview}
\label{sec:overview}
The remainder of this dissertation is organized as follows:
\paragraph{Chapter \ref{chapter:background}} presents background information on parallel
simulation and parallel computing. This knowledge is needed to understand the core concepts and ideas put
forward in this dissertation.
\paragraph{Chapter \ref{chapter:related_work}} overviews several well-known parallel
discrete event simulation kernels that are based on the Time Warp synchronization
protocol. These implementations have significant features that have contributed to the overall design
of the \textsc{warped2} simulation kernel, which is the target platform used in this dissertation.
\paragraph{Chapter \ref{chapter:warped2_overview}} provides a detailed description of
\textsc{warped2}, the Time Warp synchronized Parallel Discrete Event Simulation kernel developed by
Dr.\ Philip A.\ Wilsey and his students at the University of Cincinnati. The software architecture is discussed
in detail; it serves as the experimental foundation for the ideas on organizing the \emph{pending event
set} that are explored in this research and described in Chapter \ref{chapter:event_scheduling}.
\paragraph{Chapter \ref{chapter:event_scheduling}} explains the core ideas and hypotheses examined in this
dissertation for implementing efficient data structures and algorithms within a Time Warp synchronized
parallel simulation kernel executing on shared memory, many-core processors.
\paragraph{Chapter \ref{chapter:experiments}} presents an overview of the experimental setup and
details of the simulation model benchmarks. Performance data from these experiments along with
detailed analysis is also included in Section \ref{sec:perf_analysis}.
\paragraph{Chapter \ref{chapter:comparison_to_sequential}} presents a consolidated overview of the most
effective scheduler configurations found from the quantitative study in Chapter \ref{chapter:experiments}.
\paragraph{Chapter \ref{chapter:conclude}} presents some final concluding remarks and suggestions
for future research.
\chapter{Background}\label{chapter:background}
An overview of Parallel Discrete Event Simulation (PDES) and its Time Warp synchronization mechanism is
contained in this chapter. In addition to the specific paper citations contained in this chapter, an
excellent survey and book on PDES are also available \cite{fujimoto-90,fujimoto-00}.
\section[\textsc{des}]{Discrete Event Simulation}\label{sec:des}
There are many types and modeling styles for simulation and Figure \ref{fig:des} illustrates the broad
categories of these. A simulation that models a physical system as a sequence of events created at discrete
time intervals is called a \emph{Discrete Event Simulation (DES)} \cite{law-00}. The work in this dissertation
will focus on the efficient implementation of the event processing engine in a parallel simulation kernel that
is designed to execute DES models.
\begin{figure}
\centering
\includegraphics[width=0.5\textwidth]{figures/des.pdf}
\caption{Discrete Event Simulation}\label{fig:des}
\end{figure}
A \emph{Discrete Event Simulation} model consists of three main data structures, namely:
\begin{description}
\item[Simulation clock:] that records the simulated time of the model under study. This clock can be used for
two primary purposes:
\begin{itemize}
\item measuring the progress of the simulation, and
\item determining the order of events to be processed.
\end{itemize}
\item[State variables:] The snapshot of the simulated system at any specific point in time (otherwise called
state) can be described accurately by this set of variables.
\item[Pending event set:] A set of events that are yet to be processed.
\end{description}
\noindent
A physical system can be described by a set of physical processes grouped together to form a \emph{Simulation
Model}. Each of these physical processes corresponds to a \emph{Logical Process} (LP). The LPs communicate
among themselves using timestamped events. The timestamp marks the simulation time used for ordering and
processing of events. The state of the system is updated only when an event is processed.
A \emph{Sequential Discrete Event Simulation} is a type of simulation where events are globally sorted and
processed sequentially one at a time. A list holds all the events that are waiting to be processed and sorts
them in increasing order of timestamp. The event with the lowest timestamp is processed first. The state of
the system is updated and the simulation clock advances every time an event is processed. It is also likely
that one or more new events are produced when an event is processed. This sequential mechanism is too slow
and inefficient for large simulations and can be improved with parallelization.
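To make the sequential mechanism concrete, the following minimal C++ sketch (not drawn from any particular
simulator) implements the loop described above with a \texttt{std::priority\_queue}; the \texttt{Event}
structure and the \texttt{process} callback are hypothetical placeholders.
\begin{lstlisting}
#include <cstdint>
#include <queue>
#include <vector>

// Hypothetical event: receiving LP and timestamp only (payload omitted).
struct Event {
    std::uint64_t receiver;
    double timestamp;
};

// Order events so that the smallest timestamp is dequeued first.
struct LaterTimestamp {
    bool operator()(const Event& a, const Event& b) const {
        return a.timestamp > b.timestamp;
    }
};

// Assumed model callback: updates LP state and returns newly created events.
std::vector<Event> process(const Event& e);

// Sequential DES loop: dequeue the smallest timestamped event, advance the
// simulation clock, process the event, and enqueue any new events.
void runSequential(std::priority_queue<Event, std::vector<Event>, LaterTimestamp>& pending) {
    double simulationClock = 0.0;
    while (!pending.empty()) {
        Event e = pending.top();
        pending.pop();
        simulationClock = e.timestamp;   // the clock only moves forward
        for (const Event& child : process(e))
            pending.push(child);
    }
}
\end{lstlisting}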
\section[\textsc{pdes}]{Parallel Discrete Event Simulation}\label{sec:pdes}
The method for running a discrete event simulation on a parallel computing platform is called Parallel
Discrete Event Simulation (PDES). The parallel computing platform can be a shared memory multiprocessor, a
distributed memory cluster, or a combination of both. A synchronization mechanism is used to coordinate
parallel processing of events from different LPs while preserving the causal order for the final result
\cite{lamport-78}. The \emph{Time Warp Mechanism} \cite{jefferson-85,fujimoto-90,fujimoto-00} is one such
synchronization mechanism that we will explore further in this dissertation. It is an \emph{optimistic
synchronization mechanism}: it does not strictly preserve the causal order of events at all times and
sometimes temporarily allows the simulation executive to violate it. The next subsection provides a brief
overview of the Time Warp mechanism.
\subsection{Time Warp}\label{subsec:timewarp}
The Time Warp mechanism is an optimistic method of simulation. It is based on the \emph{Virtual Time}
paradigm \cite{jefferson-85} which is a method for ordering events in distributed systems without requiring
any knowledge of real time. In parallel discrete event simulation, \emph{Virtual Time} serves as the
simulation time.
Events from different LPs can be processed independently in an optimistically synchronized mechanism such as
Time Warp. Events are processed without first checking whether a lower timestamped event might still arrive,
\emph{i.e.,} whether the causal order of events might be violated. Figure \ref{fig:causality_violation} depicts a
causality violation scenario by showing the timing diagram for events in three LPs. An event with
\emph{timestamp = 3} is processed by $LP_1$ and two events are generated as a result, one with \emph{timestamp
= 8} is sent to $LP_2$ and another with \emph{timestamp = 5} is sent to $LP_3$. $LP_3$, on processing this
received event, generates an event for $LP_2$ with \emph{timestamp = 6}. This event has a lower receive time
than the previous event (\emph{timestamp = 8}) received by $LP_2$. If this previously received event is
processed by $LP_2$ before the event with the lower receive time arrives, it would lead to an incorrect update
of the state, and new events could be sent to itself or to other LPs incorrectly.
\begin{figure}
\centering
\includegraphics[width=0.4\textwidth]{figures/causality.pdf}
\caption{Violation of Event Causality}\label{fig:causality_violation}
\end{figure}
\emph{Rollback} is the process of undoing the effects of incorrectly processed event(s) when a causality
violation is detected at an LP\@. A \emph{straggler event} is an incoming event with a timestamp lower than
the LP's simulation clock (the LP has advanced prematurely). The straggler event denotes a causality violation
and triggers a rollback. According to Jefferson \cite{jefferson-85}, three data structures per LP are
necessary to handle rollbacks: \begin{inparaenum} [(1)] \item Input Queue, \item Output Queue, and \item State
Queue\end{inparaenum}. Figure \ref{fig:lp_data_structures} shows that the Input Queue contains all past
and future events, the Output Queue contains all previously sent events, and the State Queue holds snapshots
of previous states.
\begin{figure}
\centering
\includegraphics[width=0.4\textwidth]{figures/lp_data_structures.pdf}
\caption{Data Structures of Logical Process in Time Warp}\label{fig:lp_data_structures}
\end{figure}
On rollback, the state of an LP is restored back to a snapshot that was saved at a timestamp prior to the
straggler event's receive time. The Output Queue keeps track of all events, with send time greater than the
straggler event, that were sent incorrectly. Those events are re-sent, but this time as \emph{anti-messages}
or \emph{negative events}. An anti-message is a copy of the positive event that has already been sent, but
with a crucial differentiator --- a bit within the event payload is set to indicate it
is a negative event. The receiving LP does not process an anti-message in the same way as it processes a
positive event. The anti-message's job is to cancel out the positive counterpart in the input queue. The
anti-message can also be a straggler if it causes a causality violation for the receiving LP. In that scenario,
a rollback is triggered in the manner mentioned above. This recursive process stops when the simulation is
rid of causality violations.
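The recovery steps described above can be summarized in the following hedged C++ sketch; the types and the
\texttt{send}/\texttt{restore} hooks are hypothetical and only illustrate the order of operations (restore the
newest state saved before the straggler, then cancel output sent from the rolled-back future by re-sending it
as anti-messages).
\begin{lstlisting}
#include <deque>

// Hypothetical per-LP bookkeeping for Time Warp rollback.
struct Event    { double timestamp; bool negative; /* payload omitted */ };
struct Snapshot { double timestamp; /* copy of LP state omitted */ };

struct LogicalProcess {
    double lvt = 0.0;                  // local virtual time
    std::deque<Snapshot> stateQueue;   // saved states, oldest first
    std::deque<Event>    outputQueue;  // sent events (timestamp = send time), oldest first

    void send(const Event& e);         // assumed kernel hook (declaration only)
    void restore(const Snapshot& s);   // assumed state copy-back (declaration only)

    // React to a straggler (positive or negative) event.
    void rollback(const Event& straggler) {
        // 1. Restore the latest snapshot saved strictly before the straggler.
        while (!stateQueue.empty() && stateQueue.back().timestamp >= straggler.timestamp)
            stateQueue.pop_back();
        if (!stateQueue.empty())
            restore(stateQueue.back());
        lvt = straggler.timestamp;

        // 2. Cancel incorrectly sent output by re-sending it as anti-messages.
        while (!outputQueue.empty() && outputQueue.back().timestamp > straggler.timestamp) {
            Event anti = outputQueue.back();
            outputQueue.pop_back();
            anti.negative = true;      // the flag marks this copy as negative
            send(anti);
        }
    }
};
\end{lstlisting}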
The minimum timestamp of all unprocessed events and anti-messages at any given point during the simulation is
called the \emph{Global Virtual Time} (GVT) of the simulation. Events that have been sent but not yet
received are also tracked for this metric \cite{fujimoto-94}. The timestamp of the smallest unprocessed event in
an LP is referred to as its \emph{Local Virtual Time} (LVT). Though irrelevant as a metric on its own,
\emph{LVT} is essential to determine the \emph{GVT} in a distributed environment. \emph{GVT} calculation is a
complex problem of estimation. Several algorithms for GVT estimation have been discussed in \cite{weber-16}.
Estimation of \emph{GVT} allows us to fix a lower bound on how far back in time a rollback can occur. This is
useful for selective elimination of events from the input and output queues and the states from the state
queue. Only those events and states that have timestamp lower than GVT are eliminated since those will never
be used again. This step is useful for memory reclamation as well as to commit I/O operations that cannot be
undone. The process is called \emph{fossil collection} and it does not necessarily have to be GVT-based.
There are several other methods of fossil collection as discussed in \cite{weber-16}.
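A minimal sketch of GVT-based fossil collection follows; the container names are hypothetical placeholders.
Everything with a timestamp below the GVT can never be needed by a future rollback and is reclaimed, while at
least one state snapshot at or before the GVT is retained so that the LP can still be restored to a time no
later than the GVT.
\begin{lstlisting}
#include <deque>

struct Event    { double timestamp; };
struct Snapshot { double timestamp; };

// Hypothetical per-LP fossil collection: discard history older than the GVT.
void fossilCollect(double gvt,
                   std::deque<Event>& processedEvents,  // oldest first
                   std::deque<Event>& outputQueue,      // oldest first
                   std::deque<Snapshot>& stateQueue) {  // oldest first
    while (!processedEvents.empty() && processedEvents.front().timestamp < gvt)
        processedEvents.pop_front();
    while (!outputQueue.empty() && outputQueue.front().timestamp < gvt)
        outputQueue.pop_front();
    // Keep the newest snapshot that is not after the GVT.
    while (stateQueue.size() > 1 && stateQueue[1].timestamp <= gvt)
        stateQueue.pop_front();
}
\end{lstlisting}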
\section[\textsc{architecture}]{Architecture of Parallel Processing Systems}\label{sec:parallel_processing_architecture}
Parallel processing systems \cite{culler-97,patterson-11} are generally characterized based on the manner
in which processors and memories are grouped together. A single machine which shares a common address
space for all processes is called a \emph{shared memory} system. These systems can either have a single
physical memory unit or multiple memory units. Multiple machines with separate process address spaces,
connected over an interconnection network, are called a \emph{clustered system}.
\subsection{Shared Memory Multiprocessors}\label{subsec:shared_memory_multiprocessors}
\paragraph{Symmetric Multiprocessor (SMP)} consists of a group of identical processing cores connected to a
single shared memory with full access to the input/output devices in the system. This arrangement allows
uniform time access to the shared memory from each core. In general, this architecture fails to scale up
linearly with increasing core counts due to the steep increase in memory contention. As a result, the number
of processor cores in these configurations is generally bounded by some small number (generally $\leq$ 16,
although this number continues to increase). Figure \ref{fig:smp} illustrates a typical \emph{SMP} system.
\begin{figure}
\centering
\includegraphics[width=0.45\textwidth]{figures/smp.pdf}
\caption{A Symmetric Multiprocessor}\label{fig:smp}
\end{figure}
\paragraph{Non-Uniform Memory Access (NUMA)} processor has multiple cores and multiple memories with variable
access times to different memory locations based on the transport path between the processing cores and the
memory module. In general, each core has a local memory module that provides fast access to its contents
while a non-local memory module (generally local to some other core) is accessed with a longer
access time. Due to the higher access time for remote memory units, the common programming practice is to
maximize local memory accesses for a processor. Compared to \emph{symmetric multiprocessors}, contention for a
single memory does not necessarily increase when the number of processors is increased in \emph{NUMA} systems. As
a result, the latter system is capable of larger scale-up than SMP systems. Figure \ref{fig:numa} illustrates
a typical NUMA system.
\begin{figure}
\centering
\includegraphics[width=0.75\textwidth]{figures/numa.pdf}
\caption{A Non-Uniform Memory Access System}\label{fig:numa}
\end{figure}
\subsection{Clustered Systems}\label{subsec:clustered_systems}
A \emph{Beowulf Cluster} (or simply cluster) is a loosely coupled set of machines (also called nodes)
connected together over a local network. It is designed to appear as a single machine to the user. All nodes
on the cluster execute the same program concurrently by launching multiple processes on each machine. A
cluster allows the currently executing program to communicate among its processes using some type of message
passing. This message passing is generally handled by a parallel communication software such as \emph{Message
Passing Interface (MPI)} \cite{gropp-94} or \emph{Parallel Virtual Machine (PVM)} \cite{geist-94}. Figure
\ref{fig:beowulf} illustrates a commonly used version of the \emph{Beowulf cluster}.
\begin{figure}
\centering
\includegraphics[width=0.75\textwidth]{figures/beowulf.pdf}
\caption{Beowulf Cluster}\label{fig:beowulf}
\end{figure}
\section[\textsc{Communication}]{Communication in Parallel Systems}\label{sec:parallel_sys_comm}
Any application capable of parallel and independent execution on multiple processes/threads may
need to exchange information between these processes/threads. Processes/threads can communicate
amongst themselves in either of these two ways:
\begin{itemize}
\item \emph{Message Passing}: explicitly pass messages to each other using well defined message formats, or
\item \emph{Shared Memory}: shared data structures accessible to all processes/threads.
\end{itemize}
\noindent
There are fundamental differences between these two communication methods and both have their strengths and
weaknesses. Sections \ref{subsec:msg_passing} and \ref{subsec:shared_memory} explore these two communication
models in further detail.
\subsection{Message Passing}\label{subsec:msg_passing}
Processes communicate only through serialized messages in a message passing system. These messages can be
passed either in \emph{synchronous mode} or in \emph{asynchronous mode}. To enable the sender and receiver to
communicate via serialized messages, the format of the messages must be pre-defined.
\paragraph{Synchronous message passing} requires a specific ordering of the send/receive operations
for each process so that the sender/receiver processes can operate together in synchronized fashion. Until a
message is received by the receiver, a sender's send operation will remain blocked. Similarly, a receiver's
receive operation will remain blocked until the sent message is fully received. These strict rules make every
process follow a predictable and synchronized communication pattern. The whole system might suffer from a
slowdown since processes are not allowed to execute other operations while communication operations are
ongoing.
\paragraph{Asynchronous message passing} allows processes to continue with routine executions
without blocking immediately after the start of send/receive operations. All pending operations are held in
temporary queues and can be processed at any time and in any order. As a result, these processes do not have
to follow a predictable communication pattern when communicating in asynchronous mode.
If the workload can be partitioned sensibly to keep remote communication at a minimum, the number of
cooperating processes can be scaled to essentially any size. This is the chief advantage of \emph{message
passing}. These processes can utilize separate address spaces on different machines while communicating
over the local network. However, it takes time to serialize, deserialize, and copy messages. This latency is
significantly higher when compared to the processing speed of modern processors. This latency is the chief
disadvantage of remote messaging. There might be further delays when processes use an interconnection network
because of extra communication latency. Fine-grained parallel applications need to send and receive many
small messages, a workload for which message passing is not ideally suited. Message passing is illustrated
in Figure \ref{fig:message_passing}.
\begin{figure}
\centering
\includegraphics[width=0.9\textwidth]{figures/message_passing.pdf}
\caption{Message Passing Communication Model}\label{fig:message_passing}
\end{figure}
\subsubsection{Message Passing Interface (MPI)}\label{subsec:mpi}
\emph{MPI} \cite{gropp-94} is an extensive message passing standard widely used for parallel
applications. It supports both synchronous and asynchronous forms of communication and is specified as a
standard for developers and other MPI users. Several implementations of the MPI library exist, the most
popular among them being \emph{MPICH} and \emph{OpenMPI}.
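The fragment below is a minimal sketch of asynchronous point-to-point communication with MPI
(\texttt{MPI\_Isend}/\texttt{MPI\_Irecv}); the buffers, tag, and pairing of ranks are arbitrary placeholders,
and an even number of ranks is assumed.
\begin{lstlisting}
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int TAG = 0;
    int sendBuf = rank, recvBuf = -1;
    MPI_Request requests[2];

    // Pair even ranks with the next odd rank (assumes an even world size).
    int peer = (rank % 2 == 0) ? rank + 1 : rank - 1;

    // Non-blocking send/receive: both calls return immediately, so the
    // process is free to perform other work while communication proceeds.
    MPI_Isend(&sendBuf, 1, MPI_INT, peer, TAG, MPI_COMM_WORLD, &requests[0]);
    MPI_Irecv(&recvBuf, 1, MPI_INT, peer, TAG, MPI_COMM_WORLD, &requests[1]);

    // ... unrelated computation could overlap with communication here ...

    // Block only when the results are actually needed.
    MPI_Waitall(2, requests, MPI_STATUSES_IGNORE);

    MPI_Finalize();
    return 0;
}
\end{lstlisting}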
\subsection{Shared Memory}\label{subsec:shared_memory}
In parallel applications, it is possible for processes to communicate via shared data structures and share a
common address space. A producer can be allowed to insert data directly into the data structures of a remote
machine and the consumers can then remove this data for usage. Compared to message passing schemes, data
transmission via this shared data structure scheme takes less time. Pointers can be also be used instead of
copying large datasets to further improve the speed. To prevent multiple processes from corrupting the stored
data by simultaneously performing read-modify-write operations on the same data, access to the shared data
structures is usually protected using lock synchronization mechanisms. Locks, such as mutexes or semaphores,
protect entire sections of code that are executed by different processes when accessing the same data
structures. If too many processes contend for a lock simultaneously, the performance of shared memory data
structures may be adversely affected. This makes it difficult to scale systems that use shared memory as the
only means of communication. A simple shared memory system is illustrated in Figure \ref{fig:shared_memory}.
\begin{figure}
\centering
\includegraphics[width=0.55\textwidth]{figures/shared_memory.pdf}
\caption{Shared Memory Communication Model}\label{fig:shared_memory}
\end{figure}
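As a small illustration of lock-protected shared data, the hypothetical C++ sketch below guards a
producer/consumer queue with a single \texttt{std::mutex}; under heavy contention this one lock becomes
exactly the kind of bottleneck described above.
\begin{lstlisting}
#include <mutex>
#include <optional>
#include <queue>

// Minimal lock-protected queue shared by producer and consumer threads.
template <typename T>
class SharedQueue {
public:
    void push(T value) {
        std::lock_guard<std::mutex> guard(mutex_);  // serialize writers
        queue_.push(std::move(value));
    }

    std::optional<T> pop() {
        std::lock_guard<std::mutex> guard(mutex_);  // serialize readers too
        if (queue_.empty()) return std::nullopt;
        T value = std::move(queue_.front());
        queue_.pop();
        return value;
    }

private:
    std::mutex mutex_;    // every access contends for this single lock
    std::queue<T> queue_;
};
\end{lstlisting}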
\chapter{Related Work}\label{chapter:related_work}
There are several popular implementations of Time Warp synchronized parallel simulation engines. An overview
of some of the most popular ones is presented in this chapter.
\section{Georgia Tech Time Warp (GTW)}
Georgia Tech Time Warp \cite{das-94} (GTW) is a general purpose Time Warp simulator that is no longer in use.
However, it is significant in that it was one of the most popular shared-memory PDES engines available. The
shared memory multiprocessors such as the SparcStation and SGI PowerChallenge that \emph{GTW} was designed for
are now obsolete. In spite of being outdated, \emph{GTW} set the template for most modern Time Warp
simulators in use today. A single multi-threaded process runs a simulation model in \emph{GTW}.
Communication between threads is handled via shared data structures bound to a single processor.
A thread processes events from a statically allocated LP group in a simulation model. This ensures that the
same processing core is used for processing events from an LP. Each thread has its own set of data structures
to manage the distributed pending event set for its assigned LP group. The pending event set for each
processor consists of three main data structures listed below \cite{das-94}:
\begin{enumerate}
\item \emph{Message Queue} is a linked list that holds positive messages for LPs. Each message is mapped to
the processor that handles that LP\@. Tasks running on any processor can access this shared data
structure. Access is usually synchronized via a lock.
\item \emph{Cancel Queue} is similar to \emph{Message Queue} but, instead of positive messages, it holds
negative messages (also known as anti-messages). This queue also requires a lock for synchronized access.
\item \emph{Event Queue} is a composite data structure that is used to directly schedule events for
processing. It holds both unprocessed and processed events. The event queue consists of two data
structures, one each for processed and unprocessed events. A doubly linked list holds the processed events
while a priority queue holds the unprocessed events. The priority queue can be configured by the user to be
either a calendar queue or a skew heap.
\end{enumerate}
\noindent
Positive messages sent between LPs are inserted directly into the \emph{Message Queue} while negative ones are
directly inserted into the \emph{Cancel Queue}. In order to process events, each thread first moves events
from the \emph{Message Queue} to the \emph{Event Queue} and then processes the rollbacks. Messages from the
\emph{Cancel Queue} are removed next for event cancellations and any associated rollbacks are processed. The
thread then processes one or more of the smallest events from the \emph{Event Queue} and adds those events to
the processed event list. The procedure mentioned above is repeated by all processors until the end of the
simulation. Figure \ref{gtw_processing} presents the pseudo-code for \emph{GTW's} main event processing loop.
\begin{algorithm}
\DontPrintSemicolon
\While{Event Queue is not empty} {
Transfer messages from Message Queue to Event Queue\;
Process any rollbacks\;
Remove anti-messages from Cancel Queue\;
Cancel events and process associated rollbacks\;
Remove one or more smallest events from unprocessed event pool in Event Queue\;
Process those events and move them to the processed event pool\;
}
\caption{Event Processing in GTW\cite{das-94,fujimoto-94}\label{gtw_processing}}
\end{algorithm}
Threads can remove multiple events from the \emph{Message Queues} and \emph{Cancel Queues} in order to avoid
frequent contention for access to these queues. This type of processing is called batch processing and the
number of events in any batch represents the batch interval. All events in a batch are processed serially
without consideration for rollbacks or cancellations.
In \emph{GTW}, there is no need to send anti-messages explicitly since all communication is over shared
memory. \emph{GTW} calls its cancellation method \emph{direct cancellation} because only a pointer to the
event is needed for event cancellation. GVT calculation also becomes faster on a shared memory platform since
shared data structures can be used more effectively instead of messages passed between processors. This
approach, however, limits the use of \emph{GTW} to a single multiprocessor machine, and moreover one optimized
for specific architectures.
The developer of a simulation model is responsible for partitioning the LPs among processors. This
partitioning must be done during initialization of the simulation model. This is an unreasonable expectation.
To effectively partition the LPs, the model developer would need to understand some of the features of the
underlying architecture such as the number of processors. Additionally, this static partitioning approach
does not allow dynamic run-time balancing as there are no separate input queues for each LP\@. All
unprocessed events for each processor are held within a single message queue.
\section{Clustered Time Warp (CTW)}
Clustered Time Warp \cite{avril-95} (CTW) employs a novel hybrid approach to processing events in
\emph{clusters}. Events within a LP group (or \emph{cluster}) are processed sequentially while
synchronization between different clusters is via the Time Warp mechanism. CTW was developed primarily to
support digital logic simulation. Digital logic simulation tends to have localized computation within LP
groups and is suited for the computational framework of \emph{CTW}\@. In addition, some simulation models
such as digital logic have events with fine computational granularity and lots of LPs. In a Time Warp
simulator, this can cause significant increase in rollback count and memory footprint. Although \emph{CTW}
supports shared memory multiprocessors, it only uses shared memory with a custom message passing system.
Thus, \emph{CTW} is best suited for NUMA architectures and is not recommended for execution on a Beowulf
Cluster.
Each LP cluster contains a timezone table, an output queue, and a set of LPs. Each LP has an input queue and
a state queue. Based on events received from different LP clusters, the timezones (and timestamps) are
recorded in the timezone table. A new timezone is added to the timezone table when an event is received from
a remote cluster. Since anti-messages can only be sent between clusters and not between LPs inside a cluster, a
single output queue per cluster is sufficient.
\emph{CTW}'s rollback scheme is called \emph{clustered rollback}. When a cluster receives a straggler event,
rollback will occur for all intra-cluster LPs that have processed events with timestamps greater than the
straggler event. \emph{Local rollback}, which is an alternative to \emph{clustered rollback}, allows the
straggler event to be inserted into the input queue of the receiver LP\@. The LP will trigger a rollback when
it detects this straggler event. Even though \emph{clustered rollbacks} may slow down computation by
triggering rollbacks unnecessarily in some LPs, this approach eliminates the need to store processed events. This
reduced memory footprint led \emph{CTW}'s designers to prefer \emph{clustered rollbacks}.
The timezone table in \emph{CTW} is used to determine the frequency of state savings. Here the timezone of
the last processed event is looked up before processing an event. The state is saved if this event belongs to
a different timezone. This infrequent state saving approach can be broadly classified into two categories, namely
\emph{local} and \emph{clustered} checkpointing. In \emph{local checkpointing}, all LPs save their state
every time an event is processed in a new timezone, even if the event was received from a remote cluster. In
\emph{clustered checkpointing}, only the LP that receives an event from a remote cluster saves its state. The
lower state saving frequency of the latter approach means more events must be saved for coast forwarding
during state restoration. This increase in rollback computation and higher memory consumption was the reason
why \emph{CTW}'s designers preferred the \emph{local checkpointing} approach.
\section[\textsc{ross}]{Rensselaer's Optimistic Simulation System (ROSS)}
Rensselaer's Optimistic Simulation System (ROSS) \cite{carothers-00} is a general purpose simulator that
started life as a re-implementation of \emph{GTW}. Its capabilities were enhanced steadily over the years and
now it can run both conservatively and optimistically synchronized parallel simulations as well as sequential
simulations. It is widely used as a Time Warp synchronized optimistic parallel simulator.
Although the event scheduling mechanism is similar to \emph{GTW}, \emph{ROSS} supports several priority queue
implementations. There are also choices for algorithms used in fossil collection, state saving, and GVT
calculation. A major difference between \emph{GTW} and \emph{ROSS} lies in the latter's use of processes
instead of threads. \emph{ROSS} uses MPI-based message passing instead of shared memory for inter-process
communication.
\emph{ROSS}, borrowing ideas from \emph{GTW}, maps every LP to a process. Each process contains its own
pending event set data structures and since these are not shared among processes, locks are unnecessary.
Although similar to the data structures in \emph{GTW}, \emph{ROSS} uses a different naming convention, as
described below:
\begin{enumerate}
\item \emph{Event Queue} for a process holds all the positive events for all LPs linked to that process. In
addition, all remote events, both positive and negative, are held in the \emph{Event Queue}. The structure
is implemented as a linked list and is analogous to the \emph{Message Queue} in \emph{GTW}.
\item \emph{Cancel Queue} is similar to the data structure used in \emph{GTW} but without the lock (which is
unnecessary here). It is a linked list that holds negative events for all LPs for the corresponding
process.
\item \emph{Priority Queue} holds events in increasing order of timestamp. \emph{ROSS} allows the user to
configure the type of implementation; the available options are a Calendar Queue \cite{brown-88}, a heap
\cite{williams-64}, a Splay Tree \cite{sleator-85}, or an AVL tree \cite{andelson-62}. The \emph{Priority Queue}
is analogous to the \emph{unprocessed event queue} in \emph{GTW}.
\end{enumerate}
\emph{ROSS} partitions the LPs on any process into groups called \emph{Kernel Processes (KPs)} in order to
reduce the time taken to fossil collect LPs. Similar to \emph{clustered rollback} in \emph{CTW}, all LPs in a
\emph{KP} undergo rollback and fossil collection together.
Instead of relying solely on traditional copy state saving to rollback LPs to a previous state,
\emph{ROSS} also uses \emph{reverse computation} \cite{carothers-99}.
\subsection{ROSS-MT}
ROSS-MT \cite{jagtap-12} is a multi-threaded version of \emph{ROSS}. Unlike the latter, \emph{ROSS-MT} is
optimized to use shared memory for inter-thread communication. As message passing using MPI is completely
absent, all events are directly inserted into the \emph{event queues}. \emph{Event Queues} are further
divided by possible senders in order to reduce the added contention on them. The memory management in
\emph{ROSS-MT} is tailored to NUMA architectures.
\section{\textsc{warped}}
\textsc{warped} \cite{martin-96,ramanan-98-iscope} is a general-purpose discrete event simulator. It follows
Jefferson's classic model unlike the Time Warp simulators that came before it. Here each LP has its own
input, output, and state queues. \textsc{warped} was initially a process-based solution with communication
via message passing only. With the development of multicore processors, in order to improve concurrent
processing of events, each process was extended into multiple threads. The complexity of \textsc{warped}
became unmaintainable after a few years because multiple researchers kept adding new features and algorithms to
it. Though configurable and modular in design, \textsc{warped} became too complex for new developers to learn
and enhance.
This has led to the development of \textsc{warped2}, which is based on the design and architecture
of \textsc{warped}. Chapter \ref{chapter:warped2_overview} provides a detailed overview of
\textsc{warped2}.
\section[\textsc{root-sim}]{The ROme OpTimistic Simulator (ROOT-Sim)}
The ROme OpTimistic Simulator \cite{pellegrini-11} (ROOT-Sim), a general purpose Time Warp simulator,
shares several common characteristics with \textsc{warped}. Both use MPI-based message passing and can be
classified as classic Time Warp implementations since each LP has its own \emph{input}, \emph{state} and
\emph{output} queues. ROOT-Sim is distinctly different from other Time Warp simulators when it comes to
internal instrumentation. Memory usage can be optimized using \emph{Dynamic Memory Logger and Restorer
(DyMeLoR)}. \emph{DyMeLoR} analyzes the performance of simulation models to understand which is a better
fit --- \emph{copy-state saving} or \emph{incremental state saving}. It can also transparently make a runtime
switch between these two state saving strategies.
Committed and Consistent Global Snapshot (CCGS) is a service that ROOT-Sim introduced for transparently
rebuilding the global snapshot of all LP states after each GVT calculation. During every GVT calculation,
each LP has access to its portion of the global snapshot. This service allows any simulation model to
implement its own custom global snapshot algorithm.
The \emph{ROme} simulator does not advocate the use of a shared memory event processing pool for the threads
on an SMP node \cite{vitali-12a,vitali-12b}. Instead, it relies heavily on a partitioned pending event structure
with dynamic thread count in each kernel instance. Depending on the current workload, the number of threads
can be scaled up or down.
In some of their recent papers \cite{marotta-16a,marotta-16b,ianni-17,marotta-17}, they have explored how all
threads can fully share the workload of events by loosely coupling simulation objects and threads. Though
this fully shared pending event pool will allow concurrent processing of any event, it is necessary to design
parallel ``insertion'' and ``dequeue'' operations that are efficient. They claim that this scalable lock-free
event pool, which is loosely based on the Calendar Queue \cite{brown-88}, is accessible concurrently with
\emph{O(1)} amortized time complexity for both ``insert'' and ``dequeue'' operations.
Speculative processing and rollback techniques are currently used for \emph{causality maintenance} by several
Time Warp-based PDES systems. Pellegrini \emph{et al} \cite{pellegrini-17} question the effectiveness
of this approach and propose an alternative preemptive approach which requires the CPU to be dynamically
reassigned to past unprocessed events (or operations such as \emph{rollback}). According to the results they
present, this approach allows the simulation to deviate less from the critical path by reacting promptly to
the causal violations. This reduces the overall number of \emph{causality violations} and is ideal for
multi-core \emph{x86-64} platforms.
\chapter[\textsc{warped2}]{The \textsc{warped2} Simulation Kernel}\label{chapter:warped2_overview}
\textsc{warped2} is a C++-based Time Warp synchronized parallel discrete event simulation kernel. It is
extensively used for research on parallel simulations on multi-core processors and clusters. The kernel also
provides a complete set of APIs that anyone can use to construct a simulation model. However, a fair amount
of knowledge about Time Warp concepts is necessary at the beginning due to the low-level nature of some of the
APIs (\emph{e.g.,} state definition and copy constructors for the state). The \textsc{warped2} kernel and
model git repositories are publicly available at \url{https://github.com/wilseypa/warped2} and
\url{https://github.com/wilseypa/warped2-models}. \textsc{warped2} supports two types of simulation, namely:
\begin{itemize}
\item sequential simulation, and
\item Time Warp synchronized parallel simulation.
\end{itemize}
\section{Conceptual Overview}\label{sec:conceptual_overview}
In order to make \textsc{warped2} easy to configure, extend and maintain, the simulation kernel uses a modular
design. At startup, all components are configured and created individually. The sub-components are accessed
through pointers. The kernel's event dispatcher supports both \emph{Sequential} and Time Warp
simulation. The sequential simulator is fairly simple software, as it contains only a single list of unprocessed
events and does not require integration of any other component. On the other hand, the Time Warp event
dispatcher requires the integration of several components. Each component provides specialized algorithms that
deal with one of the following:
\begin{itemize}
\item event scheduling
\item state saving
\item cancellation
\item GVT
\item termination
\item interprocess communication
\item statistics
\end{itemize}
\noindent
Figure \ref{fig:warped2_architecture} illustrates the \textsc{warped2} implementation of Time Warp.
\begin{figure}
\centering
\includegraphics[width=0.9\textwidth]{figures/warped2_architecture.pdf}
\caption{Time Warp Components in \textsc{warped2}}\label{fig:warped2_architecture}
\end{figure}
The components in Time Warp can be broadly classified as either \textbf{local} or
\textbf{global}. \emph{Local} components are responsible for controlling the node-specific
activities of all LPs on that node, namely event processing, rollbacks and fossil collection.
\emph{Global} components, on the other hand, are responsible for cluster-wide control
issues such as GVT, termination detection and calculation of statistics. Communication to all
processes in a cluster environment is essential for determination of the global state of the
system.
\subsection{Local Time Warp Components}\label{subsec:local_comporents}
The principal local components of \textsc{warped2} are:
\begin{itemize}
\item The \textbf{Event Set} contains all \emph{unprocessed} and \emph{processed} events for the LPs on a
node. The important data structures include:
\begin{itemize}
\item \emph{Input Queue} for each LP, and
\item \emph{Schedule Queue} for events waiting to be processed soon.
\end{itemize}
\noindent
The \emph{Event Set} is responsible for storage and scheduling of unprocessed events for execution. It also
processes rollbacks, and takes care of fossil collection for processed events. The \emph{Event Set} has
been designed for a multi-threaded environment and so the data structures that store pending events are
designed for thread-safe concurrent or serialized access.
\item The \textbf{Output Manager} is the module that holds all the \emph{Output Queues} necessary for storage
and tracking of outgoing events. \textsc{warped2} supports only \emph{aggressive cancellation}
\cite{fujimoto-90}. The \emph{Output Manager} allows the kernel to do the following:
\begin{itemize}
\item add newly generated events to its output queues,
\item dequeue events from output queues for processing a rollback, and
\item fossil collection of old output events to free up space on the output queues.
\end{itemize}
\item The \textbf{State Manager} is the module that holds all the saved \emph{LP states} inside its state
queues. \textsc{warped2} supports only \emph{periodic state saving} \cite{fleischmann-95}. The \emph{State
Manager} allows the kernel to do the following: