1. What is PerfAndPubTools?
2. Architecture and functions
2.1. Initialization
2.2. Time parsing functions (plugins)
2.3. Base functions
2.4. Speedups against one or more reference implementations
2.5. Pairwise speedups
2.6. Plotting
3. Default benchmark file format and alternative implementations
4. Tutorial - Performance analysis of sorting algorithms
4.1. Extract performance data from a file
4.2. Extract execution times from files in a folder
4.3. Average execution times and standard deviations
4.4. Compare multiple setups within the same implementation
4.5. Same as previous, with a linear plot
4.6. Compare different implementations
4.7. Speedup
4.8. Speedup for multiple algorithms and vector sizes
4.9. Custom speedup plots
4.10. Scalability of the different sorting algorithms for increasing vector sizes
4.11. Custom scalability plots
4.12. Produce a table instead of a plot
4.13. Pairwise speedups
5. A real world case - Performance analysis of a simulation model
5.1. Implementations and setups of the PPHPC agent-based model
5.2. Extract performance data from a file
5.3. Extract execution times from files in a folder
5.4. Average execution times and standard deviations
5.5. Compare multiple setups within the same implementation
5.6. Same as previous, with a log-log plot
5.7. Compare different implementations
5.8. Speedup
5.9. Speedup for multiple parallel implementations and sizes
5.10. Scalability of the different implementations for increasing model sizes
5.11. Scalability of parallel implementations for increasing number of threads
5.12. Performance of OD strategy for different values of b
5.13. Custom performance plot
5.14. Show a table instead of a plot
5.15. Complex tables
6. References
PerfAndPubTools consists of a set of MATLAB/Octave functions for analyzing software performance benchmark results and producing associated publication-quality materials. If you use this software, please cite reference [1].
PerfAndPubTools is implemented in a layered architecture using a procedural programming approach, as shown in the following figure:
Performance analysis in PerfAndPubTools takes place at two levels: implementation and setup. The implementation level is meant to be associated with specific software implementations for performing a given task, for example a particular sorting algorithm or a simulation model realized in a certain programming language. Within the context of each implementation, the software can be executed under different setups. These can be different computational sizes (e.g. vector lengths in a sorting algorithm) or distinct execution parameters (e.g. number of threads used).
The next sections describe package initialization and configuration, as well as the functionality offered by PerfAndPubTools.
PerfAndPubTools must be initialized before use by invoking the startup script. This script is automatically executed (and thus initialization is automatically performed) if MATLAB or Octave is launched from the PerfAndPubTools folder.
Initialization adds PerfAndPubTools functions to the MATLAB/Octave path and declares the following global variables:
- perfnpubtools_get_time_ - Specifies the function for parsing files with benchmarking information (default is get_time_gnu).
- perfnpubtools_version - Specifies the package version.
- perfnpubtools_remove_fastest - Specifies the number or percentage of fastest observations to remove (default is 0).
- perfnpubtools_remove_slowest - Specifies the number or percentage of slowest observations to remove (default is 0).
Regarding the last two variables, integer values >= 1 specify the number of observations to remove, while real values in the ]0, 1[ interval specify the percentage of observations to remove.
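For instance, to adjust one of these globals from within a function or script, it must first be redeclared with the global keyword (a minimal sketch; at the command prompt of a session initialized by startup, plain assignment is typically enough, as shown in later examples):

```matlab
% Link to the global declared by the startup script before changing it
global perfnpubtools_remove_fastest;
% Remove the two fastest observations in subsequent analyses
perfnpubtools_remove_fastest = 2;
```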
- get_time_gnu - Extracts the user time, system time and elapsed time (in seconds), as well as the percentage of CPU usage, from files containing the default output of the GNU time command.

Additional functions/plugins for parsing other types of benchmarking files can easily be implemented by the user, as described in the next section.

- gather_times - Loads execution times from files in a given folder. This function uses the parsing function defined in the perfnpubtools_get_time_ global variable.
- perfstats - Determines mean times and respective standard deviations of a computational experiment using folders of files containing benchmarking results, optionally plotting a scalability graph if different setups correspond to different computational work sizes.
- speedup - Determines the average, maximum and minimum speedups against one or more reference implementations across a number of setups. Can optionally generate a bar plot displaying the various speedups.
- times_table - Returns a matrix with contents useful for publication tables, namely times (in seconds), absolute standard deviations (seconds), relative standard deviations and speedups (vs the specified implementations).
- times_table_f - Returns a table with performance analysis results formatted in plain text or in LaTeX (the latter requires the siunitx, multirow and booktabs packages).
The pairwise speedup functions, pwspeedup, pwtimes_table and pwtimes_table_f, have similar goals to their non-pairwise counterparts. They are, however, able to compare multiple implementations and setups under two different contexts. Using the sorting algorithms example, these functions can evaluate how different algorithms scale for increasing vector sizes under, e.g., a) two different programming languages, b) serial or parallel execution, c) two parallelization backends (e.g. CUDA and OpenCL), or any other pair of contexts deemed relevant.
Although the perfstats, speedup and pwspeedup functions optionally create plots, these are mainly intended to provide visual feedback on the performance analysis being undertaken. Those needing more control over the final figures can customize the generated plots via the returned figure handles or create custom plots using the data returned by these functions. Either way, MATLAB/Octave plots can be used directly in publications, or converted to LaTeX using the excellent matlab2tikz script, as will be shown in some of the examples.
By default, PerfAndPubTools expects individual benchmarking results to be available as files containing the default output of the GNU time command, for example:
512.66user 2.17system 8:01.34elapsed 106%CPU (0avgtext+0avgdata 1271884maxresident)k
0inputs+2136outputs (0major+49345minor)pagefaults 0swaps
This default can be modified by specifying an alternative function in the perfnpubtools_get_time_ global variable. Alternative functions should respect the prototype defined by the default get_time_gnu function. More specifically, they should accept one argument specifying the full path of the file containing the profiling information, and should return a structure with (at least) the elapsed field, containing the elapsed time in seconds.
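As an illustration, the following hypothetical plugin parses files containing a single line of the form "elapsed 12.34" (both the function name and the file format are invented for this sketch):

```matlab
function timing = get_time_simple(filename)
% GET_TIME_SIMPLE Hypothetical parser plugin for files containing a
% single line of the form "elapsed <seconds>".
    fid = fopen(filename, 'r');
    tline = fgetl(fid);
    fclose(fid);
    % Return a structure with (at least) the 'elapsed' field, in seconds
    timing = struct('elapsed', sscanf(tline, 'elapsed %f'));
end
```

Saving this function as get_time_simple.m somewhere on the MATLAB/Octave path and setting it in the perfnpubtools_get_time_ global variable would make gather_times (and the functions built on it) use it instead of get_time_gnu.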
This tutorial demonstrates how to benchmark several sorting algorithms with the GNU time command and analyze results with PerfAndPubTools. Since the GNU time command is not available on Windows, the data produced by the benchmarks is included in the package. Perform the following steps before proceeding:
- Download and compile the sorttest.c program (instructions are available in the linked page).
- Download the sorttest.py program.
- (Optional) Confirm that the GNU time program is installed (instructions also available in sorttest.c).
- In MATLAB/Octave, create two variables, sortfolder_c and sortfolder_py, containing the full path where the C and Python benchmark output files will be placed, respectively. If the previous step was skipped, these variables can be set to the PerfAndPubTools data folder.
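For example (the paths below are hypothetical; adjust them to wherever the benchmark output files, or the bundled data, are located):

```matlab
% Hypothetical locations of the C and Python benchmark output files
sortfolder_c = 'benchmarks/c';
sortfolder_py = 'benchmarks/python';
```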
GNU time is usually invoked as /usr/bin/time, but this can vary between Linux distributions. On OSX it is invoked as gtime. The usual Linux invocation is used throughout the tutorial; replace it as appropriate.
Since the GNU time program does not seem to be available for Windows, the
given command-line instructions only run unmodified on Linux and OSX. On
Windows, benchmark the sorttest.c program using an alternative approach and
replace get_time_gnu
with a function which parses the produced output.
Otherwise, skip the command-line benchmarking instructions and directly use the
benchmarking data bundled with PerfAndPubTools.
First, check that the sorttest.c program is working by testing the Quicksort algorithm with a vector of 1,000,000 random integers:
$ ./sorttest quick 1000000 2362 yes
Sorting Ok!
The value 2362 is the seed for the random number generator, and the optional yes parameter asks the program to output a message confirming that the sorting was successful.
Now, create a benchmark file with GNU time:
$ /usr/bin/time ./sorttest quick 1000000 2362 2> out.txt
The 2> operator redirects the standard error stream, to which GNU time writes its measurements, to a file called out.txt.
This file can be parsed with the get_time_gnu function from MATLAB or
Octave:
p = get_time_gnu('out.txt')
The function returns a structure with several fields:
p =
user: 0.2000
sys: 0
elapsed: 0.2000
cpu: 99
The gather_times function extracts execution times from multiple files in a folder. This is useful for analyzing average run times over a number of runs. First, we need to perform these runs. From a terminal, run the following command, which performs 10 runs of the sorttest.c program:
$ for RUN in {1..10}; do /usr/bin/time ./sorttest quick 1000000 $RUN 2> time_c_quick_1000000_${RUN}.txt; done
Note that each run is performed with a different seed, so that a different random vector is sorted by Quicksort in each run. In MATLAB or Octave, use the gather_times function to extract execution times:
exec_time = gather_times('Quicksort', sortfolder_c, 'time_c_quick_1000000_*.txt');
The first parameter names the list of gathered times, and is used as metadata by
other functions. The second parameter specifies the folder where the GNU time
output files are located. The vector of execution times is in the elapsed
field of the returned structure, i.e. exec_time.elapsed
:
ans =
0.1300
0.1300
0.1300
0.1300
0.1300
0.1300
0.1400
0.1400
0.1400
0.1400
The gather_times function can automatically discard the fastest and/or slowest
elapsed times if the perfnpubtools_remove_fastest
and/or
perfnpubtools_remove_slowest
global variables are set accordingly. Although
set to zero by default (i.e., no observations are discarded), these variables
control the number of automatically discarded observations by gather_times and
all the functions that directly or indirectly invoke it. Using the previous
example:
% Remove the two fastest observations
perfnpubtools_remove_fastest = 2;
% Remove 30% of the slowest observations
perfnpubtools_remove_slowest = 0.3;
% Invoke gather_times
exec_time = gather_times('Quicksort', sortfolder_c, 'time_c_quick_1000000_*.txt');
If we now check the exec_time
variable, we notice that the 2 fastest
observations and 30% of the slowest ones (i.e., 3) are not present in
exec_time.elapsed
:
ans =
0.1300
0.1300
0.1300
0.1300
0.1400
It's good practice to reset the global variables to their defaults when not planning to keep this setup:
% Reset defaults
perfnpubtools_remove_fastest = 0;
perfnpubtools_remove_slowest = 0;
In its most basic usage, the perfstats function obtains performance statistics. In this example, average execution times and standard deviations are obtained from the runs performed in the previous example:
qs1M = struct('sname', 'Quicksort', 'folder', sortfolder_c, 'files', 'time_c_quick_1000000_*.txt');
[avg_time, std_time] = perfstats(0, 'QuickSort', {qs1M})
avg_time =
0.1340
std_time =
0.0052
The qs1M
variable specifies a setup. A setup is defined by the following
fields: a) sname
, the name of the setup; b) folder
, the folder where to load
benchmark files from; c) files
, the specific files to load (using wildcards);
and, d) csize
, an optional computational size for plotting purposes.
A more advanced use case for perfstats consists of comparing multiple setups associated with different computational sizes within the same implementation (e.g., the same sorting algorithm). A set of multiple setups is designated as an implementation spec, the basic object type accepted by the perfstats, speedup and times_table functions. An implementation spec defines one or more setups for a single implementation.
In this example we analyze how the performance of the Bubble sort algorithm varies for increasing vector sizes. First, perform a number of runs with sorttest.c using Bubble sort for vectors of size 10,000, 20,000 and 30,000:
$ for RUN in {1..10}; do for SIZE in 10000 20000 30000; do /usr/bin/time ./sorttest bubble $SIZE $RUN 2> time_c_bubble_${SIZE}_${RUN}.txt; done; done
Second, obtain the average times for the several vector sizes using perfstats:
% Specify the setups
bs10k = struct('sname', 'bs10k', 'folder', sortfolder_c, 'files', 'time_c_bubble_10000_*.txt');
bs20k = struct('sname', 'bs20k', 'folder', sortfolder_c, 'files', 'time_c_bubble_20000_*.txt');
bs30k = struct('sname', 'bs30k', 'folder', sortfolder_c, 'files', 'time_c_bubble_30000_*.txt');
% Specify the implementation spec
bs = {bs10k, bs20k, bs30k};
% Determine average time for each setup
avg_time = perfstats(0, 'bubble', bs)
avg_time =
0.3220 1.3370 3.1070
The perfstats function can also generate scalability plots. For this purpose,
the computational size, csize
, must be specified in each setup, and the first
parameter should be a value between 1 (linear plot) and 4 (log-log plot), as
shown in the following commands:
% Specify the setups
bs10k = struct('sname', 'bs10k', 'csize', 1e4, 'folder', sortfolder_c, 'files', 'time_c_bubble_10000_*.txt');
bs20k = struct('sname', 'bs20k', 'csize', 2e4, 'folder', sortfolder_c, 'files', 'time_c_bubble_20000_*.txt');
bs30k = struct('sname', 'bs30k', 'csize', 3e4, 'folder', sortfolder_c, 'files', 'time_c_bubble_30000_*.txt');
% Specify the implementation spec
bs = {bs10k, bs20k, bs30k};
% The first parameter defines the plot type: 1 is a linear plot
perfstats(1, 'bubble', bs);
Error bars, showing the standard deviation, can be activated by passing a negative value as the first parameter:
% The first parameter defines the plot type: -1 is a linear plot
% with error bars showing the standard deviation
perfstats(-1, 'bubble', bs);
Besides comparing multiple setups within the same implementation, the
perfstats function is also able to compare multiple setups from multiple
implementations. The requirement is that, from implementation to implementation,
the multiple setups are directly comparable, i.e., corresponding implementation
specs should have the same sname
and csize
parameters.
First, perform a number of runs with sorttest.c using Merge sort and Quicksort for vectors of size 1e5, 1e6, 1e7 and 1e8:
$ for RUN in {1..10}; do for IMPL in merge quick; do for SIZE in 100000 1000000 10000000 100000000; do /usr/bin/time ./sorttest $IMPL $SIZE $RUN 2> time_c_${IMPL}_${SIZE}_${RUN}.txt; done; done; done
Second, use perfstats to plot the respective scalability graph:
% Specify Merge sort implementation specs
ms1e5 = struct('sname', '1e5', 'csize', 1e5, 'folder', sortfolder_c, 'files', 'time_c_merge_100000_*.txt');
ms1e6 = struct('sname', '1e6', 'csize', 1e6, 'folder', sortfolder_c, 'files', 'time_c_merge_1000000_*.txt');
ms1e7 = struct('sname', '1e7', 'csize', 1e7, 'folder', sortfolder_c, 'files', 'time_c_merge_10000000_*.txt');
ms1e8 = struct('sname', '1e8', 'csize', 1e8, 'folder', sortfolder_c, 'files', 'time_c_merge_100000000_*.txt');
ms = {ms1e5, ms1e6, ms1e7, ms1e8};
% Specify Quicksort implementation specs
qs1e5 = struct('sname', '1e5', 'csize', 1e5, 'folder', sortfolder_c, 'files', 'time_c_quick_100000_*.txt');
qs1e6 = struct('sname', '1e6', 'csize', 1e6, 'folder', sortfolder_c, 'files', 'time_c_quick_1000000_*.txt');
qs1e7 = struct('sname', '1e7', 'csize', 1e7, 'folder', sortfolder_c, 'files', 'time_c_quick_10000000_*.txt');
qs1e8 = struct('sname', '1e8', 'csize', 1e8, 'folder', sortfolder_c, 'files', 'time_c_quick_100000000_*.txt');
qs = {qs1e5, qs1e6, qs1e7, qs1e8};
% Plot comparison with a log-log plot
perfstats(4, 'Merge sort', ms, 'Quicksort', qs);
Like in the previous example, error bars are displayed by passing a negative value as the first parameter to perfstats:
% Plot comparison with a log-log plot with error bars
perfstats(-4, 'Merge sort', ms, 'Quicksort', qs);
The speedup function is used to obtain relative speedups between different implementations. Using the variables defined in the previous example, the following instruction obtains the average, maximum and minimum speedups of Quicksort versus Merge sort for different vector sizes:
[s_avg, s_max, s_min] = speedup(0, 1, 'Merge sort', ms, 'Quicksort', qs);
Speedups can be obtained by getting the first element of the returned cell, i.e.
by invoking s_avg{1}
:
ans =
1.0000 1.0000 1.0000 1.0000
2.0000 1.7164 1.7520 1.6314
The second parameter indicates the reference implementation from which to calculate speedups. In this case, specifying 1 will return speedups against Merge sort. The first row of the previous matrix shows the speedup of Merge sort against itself, thus it is composed of ones. The second row shows the speedup of Quicksort versus Merge sort. If the second parameter is a vector, speedups against more than one implementation are returned.
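For example, following the description above, passing the vector [1 2] as the second parameter should yield speedups against both implementations (a sketch; the output shapes are assumed analogous to the single-reference case):

```matlab
% Speedups against Merge sort (reference 1) and Quicksort (reference 2);
% s_avg is assumed to contain one set of speedups per reference
[s_avg, s_max, s_min] = speedup(0, [1 2], 'Merge sort', ms, 'Quicksort', qs);
```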
Setting the first parameter to 1 will yield a bar plot displaying the average speedups:
speedup(1, 1, 'Merge sort', ms, 'Quicksort', qs);
Speedup bar plots also support error bars, but in this case error bars show the maximum and minimum speedups. Error bars are activated by passing a negative number as the first argument to speedup:
speedup(-1, 1, 'Merge sort', ms, 'Quicksort', qs);
The speedup function is also able to determine relative speedups between different implementations for multiple computational sizes. In this example we plot the average speedup of several sorting algorithms against Bubble sort and Selection sort for vector sizes 1e5, 2e5, 3e5 and 4e5.
First, perform a number of runs using the four sorting algorithms made available by sorttest.c for the specified vector sizes:
$ for RUN in {1..10}; do for IMPL in bubble selection merge quick; do for SIZE in 100000 200000 300000 400000; do /usr/bin/time ./sorttest $IMPL $SIZE $RUN 2> time_c_${IMPL}_${SIZE}_${RUN}.txt; done; done; done
Then, in MATLAB or Octave, specify the implementation specs for each sorting algorithm and setup combination, and use the speedup function to plot the respective speedup plot:
% Specify Bubble sort implementation specs
bs1e5 = struct('sname', '1e5', 'csize', 1e5, 'folder', sortfolder_c, 'files', 'time_c_bubble_100000_*.txt');
bs2e5 = struct('sname', '2e5', 'csize', 2e5, 'folder', sortfolder_c, 'files', 'time_c_bubble_200000_*.txt');
bs3e5 = struct('sname', '3e5', 'csize', 3e5, 'folder', sortfolder_c, 'files', 'time_c_bubble_300000_*.txt');
bs4e5 = struct('sname', '4e5', 'csize', 4e5, 'folder', sortfolder_c, 'files', 'time_c_bubble_400000_*.txt');
bs = {bs1e5, bs2e5, bs3e5, bs4e5};
% Specify Selection sort implementation specs
ss1e5 = struct('sname', '1e5', 'csize', 1e5, 'folder', sortfolder_c, 'files', 'time_c_selection_100000_*.txt');
ss2e5 = struct('sname', '2e5', 'csize', 2e5, 'folder', sortfolder_c, 'files', 'time_c_selection_200000_*.txt');
ss3e5 = struct('sname', '3e5', 'csize', 3e5, 'folder', sortfolder_c, 'files', 'time_c_selection_300000_*.txt');
ss4e5 = struct('sname', '4e5', 'csize', 4e5, 'folder', sortfolder_c, 'files', 'time_c_selection_400000_*.txt');
ss = {ss1e5, ss2e5, ss3e5, ss4e5};
% Specify Merge sort implementation specs
ms1e5 = struct('sname', '1e5', 'csize', 1e5, 'folder', sortfolder_c, 'files', 'time_c_merge_100000_*.txt');
ms2e5 = struct('sname', '2e5', 'csize', 2e5, 'folder', sortfolder_c, 'files', 'time_c_merge_200000_*.txt');
ms3e5 = struct('sname', '3e5', 'csize', 3e5, 'folder', sortfolder_c, 'files', 'time_c_merge_300000_*.txt');
ms4e5 = struct('sname', '4e5', 'csize', 4e5, 'folder', sortfolder_c, 'files', 'time_c_merge_400000_*.txt');
ms = {ms1e5, ms2e5, ms3e5, ms4e5};
% Specify Quicksort implementation specs
qs1e5 = struct('sname', '1e5', 'csize', 1e5, 'folder', sortfolder_c, 'files', 'time_c_quick_100000_*.txt');
qs2e5 = struct('sname', '2e5', 'csize', 2e5, 'folder', sortfolder_c, 'files', 'time_c_quick_200000_*.txt');
qs3e5 = struct('sname', '3e5', 'csize', 3e5, 'folder', sortfolder_c, 'files', 'time_c_quick_300000_*.txt');
qs4e5 = struct('sname', '4e5', 'csize', 4e5, 'folder', sortfolder_c, 'files', 'time_c_quick_400000_*.txt');
qs = {qs1e5, qs2e5, qs3e5, qs4e5};
% Plot speedup of multiple sorting algorithms against Bubble sort
% Setting the first parameter to 2 yields a log-scale bar plot
speedup(2, 1, 'Bubble', bs, 'Selection', ss, 'Merge', ms, 'Quick', qs);
% Place legend in a better position
legend(gca, 'Location', 'NorthWest');
% Plot speedup of multiple sorting algorithms against Selection sort
speedup(1, 1, 'Selection', ss, 'Merge', ms, 'Quick', qs);
% Place legend in a better position
legend(gca, 'Location', 'NorthWest');
If we require error bars, the first parameter should be a negative value:
% Same plot with error bars
speedup(-1, 1, 'Selection', ss, 'Merge', ms, 'Quick', qs);
% Place legend in a better position
l = legend(gca);
set(l, 'Location', 'NorthWest');
Generated plots can be customized using the MATLAB or Octave GUI, or programmatically. The following commands change some of the default properties of the previous plot:
% Get the current axes children objects
ch = get(gca, 'Children');
% Set the color of the '1e5' bars to white
set(ch(8), 'FaceColor', 'w');
% This is required in Octave for updating the legend
legend(gca);
% Change the default labels
ylabel('Average speedup over Selection sort');
xlabel('Algorithms');
For more control over the speedup plots, it may be preferable to use the data provided by speedup and build the plots from scratch. Continuing from the previous example, the following sequence of instructions generates a customized plot showing the speedup of the sorting algorithms against Bubble sort:
% Obtain speedup of multiple sorting algorithms against Bubble sort, no plot
s = speedup(0, 1, 'Bubble', bs, 'Selection', ss, 'Merge', ms, 'Quick', qs);
% Generate basic speedup bar plot (first element of s cell array and rows 2 to 4,
% to avoid displaying the speedup of Bubble sort against itself)
h = bar(s{1}(2:4, :), 'basevalue', 1);
% Customize plot
set(h(1), 'FaceColor', [0 0 0]);
set(h(2), 'FaceColor', [0.33 0.33 0.33]);
set(h(3), 'FaceColor', [0.66 0.66 0.66]);
set(h(4), 'FaceColor', [1 1 1]);
set(gca, 'YScale', 'log');
grid on;
grid minor;
legend({'1 \times 10^5', '2 \times 10^5', '3 \times 10^5', '4 \times 10^5'}, 'Location', 'NorthWest');
set(gca, 'XTickLabel', {'Selection', 'Merge', 'Quick'});
ylabel('Speedup');
While the figure may already be appropriate for publication purposes, it can also be converted to native LaTeX via the matlab2tikz script:
cleanfigure();
matlab2tikz('standalone', true, 'filename', 'image.tex');
Compiling the image.tex
file with a LaTeX engine yields the following figure:
Continuing from the previous example, we can use perfstats to determine and plot the scalability of the different sorting algorithms for increasing vector sizes:
p = perfstats(3, 'Bubble', bs, 'Selection', ss, 'Merge', ms, 'Quick', qs);
The values plotted are returned in variable p
:
p =
36.0040 144.8210 325.1730 577.8600
9.5270 38.0500 88.5130 153.6560
0.0200 0.0410 0.0600 0.0850
0.0100 0.0200 0.0300 0.0510
In a similar fashion to the speedup plots, finer control over the scalability plots is possible by directly using the data provided by perfstats. The following sequence of instructions customizes the figure in the previous example:
% Plot data from perfstats in y-axis in log-scale
h = semilogy(p', 'Color', 'k');
% Set different markers for the various lines
set(h(1), 'Marker', 'd', 'MarkerFaceColor', 'w');
set(h(2), 'Marker', 'o', 'MarkerFaceColor', 'k');
set(h(3), 'Marker', '*');
set(h(4), 'Marker', 's', 'MarkerFaceColor', [0.8 0.8 0.8]);
% Make space for legend and add legend
ylim([1e-2 3e3]);
legend({'Bubble', 'Selection', 'Merge', 'Quick'}, 'Location', 'NorthWest');
% Set horizontal ticks
set(gca, 'XTick', 1:4);
set(gca, 'XTickLabel', {'1e5', '2e5', '3e5', '4e5'});
% Add a grid
grid on;
% Add x and y labels
xlabel('Vector size');
ylabel('Time (s)');
We can further improve the figure, and convert it to LaTeX with matlab2tikz:
% Minor grids in LaTeX image are not great, so remove them
grid minor;
% Set horizontal ticks, LaTeX-style
set(gca, 'XTickLabel', {'$1 \times 10^5$', '$2 \times 10^5$', '$3 \times 10^5$', '$4 \times 10^5$'});
% Export figure to LaTeX
cleanfigure();
matlab2tikz('standalone', true, 'filename', 'image.tex');
Compiling the image.tex
file with a LaTeX engine yields the following figure:
The times_table and times_table_f functions can be used to create performance tables formatted in plain text or LaTeX. Using the data defined in the previous examples, the following commands produce a plain text table comparing the performance of the different sorting algorithms:
% Put data in table format
tdata = times_table(1, 'Bubble', bs, 'Selection', ss, 'Merge', ms, 'Quick', qs);
% Print a plain text table
times_table_f(0, 'vs Bubble', tdata)
-----------------------------------------------
| vs Bubble |
-----------------------------------------------------------------
| Imp. | Set. | t(s) | std | std% | x Bubble |
-----------------------------------------------------------------
| Bubble | 1e5 | 36 | 0.887 | 2.46 | 1 |
| | 2e5 | 145 | 2.92 | 2.02 | 1 |
| | 3e5 | 325 | 6.19 | 1.90 | 1 |
| | 4e5 | 578 | 6.38 | 1.10 | 1 |
-----------------------------------------------------------------
| Select | 1e5 | 9.53 | 0.069 | 0.72 | 3.78 |
| | 2e5 | 38 | 0.283 | 0.74 | 3.81 |
| | 3e5 | 88.5 | 3.7 | 4.18 | 3.67 |
| | 4e5 | 154 | 3.06 | 1.99 | 3.76 |
-----------------------------------------------------------------
| Merge | 1e5 | 0.02 | 3.66e-18 | 0.00 | 1.8e+03 |
| | 2e5 | 0.041 | 0.00316 | 7.71 | 3.53e+03 |
| | 3e5 | 0.06 | 1.46e-17 | 0.00 | 5.42e+03 |
| | 4e5 | 0.085 | 0.0127 | 14.93 | 6.8e+03 |
-----------------------------------------------------------------
| Quick | 1e5 | 0.01 | 1.83e-18 | 0.00 | 3.6e+03 |
| | 2e5 | 0.02 | 3.66e-18 | 0.00 | 7.24e+03 |
| | 3e5 | 0.03 | 7.31e-18 | 0.00 | 1.08e+04 |
| | 4e5 | 0.051 | 0.00316 | 6.20 | 1.13e+04 |
-----------------------------------------------------------------
In order to obtain the equivalent LaTeX table, we set the first parameter to 1 instead of 0:
% Print a LaTeX table
times_table_f(1, 'vs Bubble', tdata)
Pairwise comparisons make it possible to compare multiple implementations and setups under two different contexts. In this example, the pairwise comparison functionality of PerfAndPubTools is used to determine the speedup of C versus Python realizations (contexts) of the different sorting algorithms (implementations), for different vector sizes (setups).
As described in the previous examples, the implementation specs for the C
realizations/setups are given in the bs
, ss
, ms
and qs
variables for the
Bubble, Selection, Merge and Quick sort algorithms, respectively. We now need the corresponding benchmark data for the Python realizations, which can be generated with the following command (this data is also bundled in the data folder):
for RUN in {1..10}; do for IMPL in bubble selection merge quick; do for SIZE in 100000 200000 300000 400000; do /usr/bin/time python sorttest.py $IMPL $SIZE $RUN 2> time_py_${IMPL}_${SIZE}_${RUN}.txt; done; done; done
With the data generated by this computational experiment, it is now possible to specify the Python algorithm implementations and setups (remember to specify the folder containing the results in the sortfolder_py variable):
% Specify Bubble sort implementation specs
bs1e5py = struct('sname', '1e5', 'csize', 1e5, 'folder', sortfolder_py, 'files', 'time_py_bubble_100000_*.txt');
bs2e5py = struct('sname', '2e5', 'csize', 2e5, 'folder', sortfolder_py, 'files', 'time_py_bubble_200000_*.txt');
bs3e5py = struct('sname', '3e5', 'csize', 3e5, 'folder', sortfolder_py, 'files', 'time_py_bubble_300000_*.txt');
bs4e5py = struct('sname', '4e5', 'csize', 4e5, 'folder', sortfolder_py, 'files', 'time_py_bubble_400000_*.txt');
bspy = {bs1e5py, bs2e5py, bs3e5py, bs4e5py};
% Specify Selection sort implementation specs
ss1e5py = struct('sname', '1e5', 'csize', 1e5, 'folder', sortfolder_py, 'files', 'time_py_selection_100000_*.txt');
ss2e5py = struct('sname', '2e5', 'csize', 2e5, 'folder', sortfolder_py, 'files', 'time_py_selection_200000_*.txt');
ss3e5py = struct('sname', '3e5', 'csize', 3e5, 'folder', sortfolder_py, 'files', 'time_py_selection_300000_*.txt');
ss4e5py = struct('sname', '4e5', 'csize', 4e5, 'folder', sortfolder_py, 'files', 'time_py_selection_400000_*.txt');
sspy = {ss1e5py, ss2e5py, ss3e5py, ss4e5py};
% Specify Merge sort implementation specs
ms1e5py = struct('sname', '1e5', 'csize', 1e5, 'folder', sortfolder_py, 'files', 'time_py_merge_100000_*.txt');
ms2e5py = struct('sname', '2e5', 'csize', 2e5, 'folder', sortfolder_py, 'files', 'time_py_merge_200000_*.txt');
ms3e5py = struct('sname', '3e5', 'csize', 3e5, 'folder', sortfolder_py, 'files', 'time_py_merge_300000_*.txt');
ms4e5py = struct('sname', '4e5', 'csize', 4e5, 'folder', sortfolder_py, 'files', 'time_py_merge_400000_*.txt');
mspy = {ms1e5py, ms2e5py, ms3e5py, ms4e5py};
% Specify Quicksort implementation specs
qs1e5py = struct('sname', '1e5', 'csize', 1e5, 'folder', sortfolder_py, 'files', 'time_py_quick_100000_*.txt');
qs2e5py = struct('sname', '2e5', 'csize', 2e5, 'folder', sortfolder_py, 'files', 'time_py_quick_200000_*.txt');
qs3e5py = struct('sname', '3e5', 'csize', 3e5, 'folder', sortfolder_py, 'files', 'time_py_quick_300000_*.txt');
qs4e5py = struct('sname', '4e5', 'csize', 4e5, 'folder', sortfolder_py, 'files', 'time_py_quick_400000_*.txt');
qspy = {qs1e5py, qs2e5py, qs3e5py, qs4e5py};
The pairwise speedup comparison between C and Python realizations is performed with the pwspeedup function as follows:
c_py = pwspeedup(-1, {'C', 'Python'}, 'Bubble', bs, bspy, 'Selection', ss, sspy, 'Merge', ms, mspy, 'Quick', qs, qspy);
legend({'Bubble', 'Selection', 'Merge', 'Quick'}, 'Location', 'NorthOutside', 'Orientation', 'horizontal');
The exact speedups can be analyzed by looking at the c_py
variable:
c_py =
12.3946 22.9515 28.4500 39.0000
14.6980 31.5745 25.6585 40.7500
15.5717 35.9958 27.1667 43.7000
17.2447 40.9063 26.2588 35.2745
Note that, analogously to speedup, pwspeedup also returns the maximum and minimum speedups, mean and standard deviations of computational times, and so on.
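A sketch of capturing these additional outputs, assuming they follow the same order as in speedup:

```matlab
% Maximum and minimum pairwise speedups, assumed analogous to speedup
[s_avg, s_max, s_min] = pwspeedup(0, {'C', 'Python'}, ...
    'Bubble', bs, bspy, 'Selection', ss, sspy, ...
    'Merge', ms, mspy, 'Quick', qs, qspy);
```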
The pwtimes_table and pwtimes_table_f functions, like their non-pairwise counterparts, can be used to create performance tables formatted in plain text or LaTeX. Using the data defined in the previous examples, the following commands produce a plain text table comparing the performance of the C and Python programming languages when used for implementing various sorting algorithms:
% Put data in table format
tdata = pwtimes_table({'C', 'Python'}, 'Bubble', bs, bspy, 'Selection', ss, sspy, 'Merge', ms, mspy, 'Quick', qs, qspy);
% Print a plain text table
pwtimes_table_f(0, tdata)
----------------------------------------------------------------------
| C | Python |
----------------------------------------------------------------------------------------------------------------------------
| Imp. | Set. | t(s) | std | std% | t(s) | std | std% | Avg.Spdup | Max.Spdup | Min.Spdup |
----------------------------------------------------------------------------------------------------------------------------
| Bubble | 1e5 | 36 | 0.887 | 2.46 | 446 | 14 | 3.14 | 12.4 | 13.3 | 11.3 |
| | 2e5 | 145 | 2.92 | 2.02 | 2.13e+03 | 121 | 5.66 | 14.7 | 17 | 13.3 |
| | 3e5 | 325 | 6.19 | 1.90 | 5.06e+03 | 123 | 2.43 | 15.6 | 16.4 | 14.6 |
| | 4e5 | 578 | 6.38 | 1.10 | 9.96e+03 | 750 | 7.53 | 17.2 | 20.7 | 15.5 |
----------------------------------------------------------------------------------------------------------------------------
| Select | 1e5 | 9.53 | 0.069 | 0.72 | 219 | 6.99 | 3.20 | 23 | 24.6 | 21.9 |
| | 2e5 | 38 | 0.283 | 0.74 | 1.2e+03 | 38.3 | 3.18 | 31.6 | 33.2 | 29.2 |
| | 3e5 | 88.5 | 3.7 | 4.18 | 3.19e+03 | 89.4 | 2.80 | 36 | 39.7 | 32.6 |
| | 4e5 | 154 | 3.06 | 1.99 | 6.29e+03 | 303 | 4.82 | 40.9 | 45.1 | 35.9 |
----------------------------------------------------------------------------------------------------------------------------
| Merge | 1e5 | 0.02 | 3.66e-18 | 0.00 | 0.569 | 0.0582 | 10.23 | 28.5 | 35.5 | 25 |
| | 2e5 | 0.041 | 0.00316 | 7.71 | 1.05 | 0.00789 | 0.75 | 25.7 | 26.8 | 20.8 |
| | 3e5 | 0.06 | 1.46e-17 | 0.00 | 1.63 | 0.0149 | 0.91 | 27.2 | 27.7 | 26.8 |
| | 4e5 | 0.085 | 0.0127 | 14.93 | 2.23 | 0.022 | 0.99 | 26.3 | 28.5 | 18.4 |
----------------------------------------------------------------------------------------------------------------------------
| Quick | 1e5 | 0.01 | 1.83e-18 | 0.00 | 0.39 | 0.0105 | 2.70 | 39 | 41 | 38 |
| | 2e5 | 0.02 | 3.66e-18 | 0.00 | 0.815 | 0.0246 | 3.02 | 40.8 | 43.5 | 39.5 |
| | 3e5 | 0.03 | 7.31e-18 | 0.00 | 1.31 | 0.0647 | 4.94 | 43.7 | 48.3 | 41.3 |
| | 4e5 | 0.051 | 0.00316 | 6.20 | 1.8 | 0.052 | 2.89 | 35.3 | 38 | 28.7 |
----------------------------------------------------------------------------------------------------------------------------
In order to obtain the equivalent LaTeX table, we set the first parameter to 1 instead of 0:
% Print a LaTeX table
pwtimes_table_f(1, tdata)
Here we describe how PerfAndPubTools was used to analyze performance data of multiple implementations of a simulation model, replicating results presented in a peer-reviewed paper [2]. The initial benchmarking steps are skipped in these examples, but the produced data and the scripts used to generate it are also made available.
The examples in this section use the following dataset:
Unpack the dataset to any folder and specify the complete path to this folder in the datafolder variable, e.g.:
datafolder = 'path/to/dataset';
This dataset corresponds to the results presented in reference [2], which compares the performance of several implementations of the PPHPC agent-based model. Among several aspects of PerfAndPubTools, the following examples show how to replicate these results.
While most details about PPHPC and its various implementations are not important for this discussion, it is convenient to know which implementations and setups were experimented with in reference [2]. A total of six implementations of the PPHPC model were compared:
| Implementation | Description |
| --- | --- |
| NL | NetLogo implementation (no parallelization). |
| ST | Java single-thread implementation (no parallelization). |
| EQ | Java parallel implementation (equal work). |
| EX | Java parallel implementation (equal work, reproducible). |
| ER | Java parallel implementation (row-wise synchronization). |
| OD | Java parallel implementation (on-demand work). |
A number of setups are directly related to the model itself, namely model size and parameter set. Concerning model size, PPHPC was benchmarked with sizes 100, 200, 400, 800 and 1600. Each size corresponds to the size of the environment in which the agents act, e.g. size 200 corresponds to a 200 x 200 environment. Besides model size, PPHPC was also benchmarked with two parameter sets, simply designated as parameter set 1 and parameter set 2. The latter typically yields simulations with more agents.
Other setups are associated with computational aspects of model execution, more specifically number of threads (for parallel implementations) and value of the b parameter (for OD implementation only).
The dataset contains performance data (in the form of GNU time default output) for 10 runs of all setup combinations (i.e. model size, parameter set, number of threads and value of the b parameter, where applicable).
The get_time_gnu function extracts performance data from one file containing the default output of the GNU time command. For example:
p = get_time_gnu([datafolder '/times/NL/time100v1r1.txt'])
The function returns a structure with several fields:
p =
user: 17.6800
sys: 0.3200
elapsed: 16.5900
cpu: 108
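For reference, a file with the default output of GNU time (which get_time_gnu parses) looks approximately as follows. The resource figures shown are illustrative; only the user, system, elapsed and CPU values match the example above:

```
17.68user 0.32system 0:16.59elapsed 108%CPU (0avgtext+0avgdata 1465984maxresident)k
0inputs+0outputs (0major+98216minor)pagefaults 0swaps
```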
The gather_times function extracts execution times from multiple files in a folder, as shown in the following command:
exec_time = gather_times('NetLogo', [datafolder '/times/NL'], 'time100v1*.txt');
The vector of execution times is in the elapsed field of the returned structure:
exec_time.elapsed
The gather_times function uses get_time_gnu internally by default. However, other functions can be specified in the first line of the gather_times function body, allowing PerfAndPubTools to support benchmarking formats other than the output of GNU time. Alternatives to get_time_gnu are only required to return a struct with the elapsed field, indicating the duration (in seconds) of a program execution.
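As an illustration, a minimal alternative parser could look like the following sketch. The get_time_plain function and its single-number file format are hypothetical; the only actual requirement is that the returned struct contains the elapsed field:

```matlab
function timedata = get_time_plain(filename)
% Hypothetical parser for files containing a single number:
% the elapsed time of one run, in seconds.
    fid = fopen(filename, 'r');
    timedata = struct('elapsed', fscanf(fid, '%f', 1));
    fclose(fid);
end
```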
In its most basic usage, the perfstats function obtains performance statistics. In this example, average execution times and standard deviations are obtained from 10 replications of the Java single-threaded (ST) implementation of PPHPC for size 800, parameter set 2:
st800v2 = struct('sname', '800v2', 'folder', [datafolder '/times/ST'], 'files', 't*800v2*.txt');
[avg_time, std_time] = perfstats(0, 'ST', {st800v2})
avg_time =
699.5920
std_time =
3.6676
The perfstats function uses gather_times internally.
A more advanced use case for perfstats consists of comparing multiple setups, associated with different computational sizes, within the same implementation. For example, considering the Java ST implementation of the PPHPC model, the following instructions analyze how its performance varies for increasing model sizes:
% Specify implementations specs for each model size
st100v2 = struct('sname', '100v2', 'folder', [datafolder '/times/ST'], 'files', 't*100v2*.txt');
st200v2 = struct('sname', '200v2', 'folder', [datafolder '/times/ST'], 'files', 't*200v2*.txt');
st400v2 = struct('sname', '400v2', 'folder', [datafolder '/times/ST'], 'files', 't*400v2*.txt');
st800v2 = struct('sname', '800v2', 'folder', [datafolder '/times/ST'], 'files', 't*800v2*.txt');
st1600v2 = struct('sname', '1600v2', 'folder', [datafolder '/times/ST'], 'files', 't*1600v2*.txt');
% Obtain the average time for increasing model sizes
avg_time = perfstats(0, 'ST', {st100v2, st200v2, st400v2, st800v2, st1600v2})
avg_time =
1.0e+03 *
0.0053 0.0361 0.1589 0.6996 2.9572
The perfstats function can also be used to generate scalability plots. For this purpose, the computational size, csize, must be specified in each setup, and the first parameter of perfstats should be a value between 1 (linear plot) and 4 (log-log plot), as shown in the following code snippet:
% Specify implementations specs for each model size, indicating the csize key
st100v2 = struct('sname', '100v2', 'csize', 100, 'folder', [datafolder '/times/ST'], 'files', 't*100v2*.txt');
st200v2 = struct('sname', '200v2', 'csize', 200, 'folder', [datafolder '/times/ST'], 'files', 't*200v2*.txt');
st400v2 = struct('sname', '400v2', 'csize', 400, 'folder', [datafolder '/times/ST'], 'files', 't*400v2*.txt');
st800v2 = struct('sname', '800v2', 'csize', 800, 'folder', [datafolder '/times/ST'], 'files', 't*800v2*.txt');
st1600v2 = struct('sname', '1600v2', 'csize', 1600, 'folder', [datafolder '/times/ST'], 'files', 't*1600v2*.txt');
% The first parameter defines the plot type: 4 is a log-log plot
perfstats(4, 'ST', {st100v2, st200v2, st400v2, st800v2, st1600v2});
Error bars showing the standard deviation can be requested by passing a negative value as the first parameter to perfstats:
% The value -4 indicates a log-log plot with error bars
perfstats(-4, 'ST', {st100v2, st200v2, st400v2, st800v2, st1600v2});
Since the run time variability is very low, the error bars are not very informative in this case.
Besides comparing multiple setups within the same implementation, the perfstats function is also able to compare multiple setups from a number of implementations. The requirement is that, from implementation to implementation, the multiple setups are directly comparable, i.e., corresponding implementation specs should have the same sname and csize parameters, as shown in the following commands, where the NetLogo (NL) and Java single-thread (ST) PPHPC implementations are compared for sizes 100 to 1600, parameter set 1:
% Specify NetLogo implementation specs
nl100v1 = struct('sname', '100v1', 'csize', 100, 'folder', [datafolder '/times/NL'], 'files', 't*100v1*.txt');
nl200v1 = struct('sname', '200v1', 'csize', 200, 'folder', [datafolder '/times/NL'], 'files', 't*200v1*.txt');
nl400v1 = struct('sname', '400v1', 'csize', 400, 'folder', [datafolder '/times/NL'], 'files', 't*400v1*.txt');
nl800v1 = struct('sname', '800v1', 'csize', 800, 'folder', [datafolder '/times/NL'], 'files', 't*800v1*.txt');
nl1600v1 = struct('sname', '1600v1', 'csize', 1600, 'folder', [datafolder '/times/NL'], 'files', 't*1600v1*.txt');
nlv1 = {nl100v1, nl200v1, nl400v1, nl800v1, nl1600v1};
% Specify Java ST implementation specs
st100v1 = struct('sname', '100v1', 'csize', 100, 'folder', [datafolder '/times/ST'], 'files', 't*100v1*.txt');
st200v1 = struct('sname', '200v1', 'csize', 200, 'folder', [datafolder '/times/ST'], 'files', 't*200v1*.txt');
st400v1 = struct('sname', '400v1', 'csize', 400, 'folder', [datafolder '/times/ST'], 'files', 't*400v1*.txt');
st800v1 = struct('sname', '800v1', 'csize', 800, 'folder', [datafolder '/times/ST'], 'files', 't*800v1*.txt');
st1600v1 = struct('sname', '1600v1', 'csize', 1600, 'folder', [datafolder '/times/ST'], 'files', 't*1600v1*.txt');
stv1 = {st100v1, st200v1, st400v1, st800v1, st1600v1};
% Plot comparison
perfstats(4, 'NL', nlv1, 'ST', stv1);
The speedup function is used to obtain relative speedups between different implementations. Using the variables defined in the previous example, the average, maximum and minimum speedups of the Java ST version over the NetLogo implementation for different model sizes can be obtained with the following instruction:
[s_avg, s_max, s_min] = speedup(0, 1, 'NL', nlv1, 'ST', stv1);
The first element of the returned cell, i.e. s_avg{1}, contains the speedups:
ans =
1.0000 1.0000 1.0000 1.0000 1.0000
5.8513 8.2370 5.7070 5.4285 5.4331
The second parameter of the speedup function indicates the reference implementation from which to calculate speedups. In this case, specifying 1 will return speedups against the NetLogo implementation. The first row of the matrix in s_avg{1} shows the speedup of the NetLogo implementation against itself, and is thus composed of ones. The second row shows the speedup of the Java ST implementation versus the NetLogo implementation. If the second parameter of the speedup function is a vector, speedups against more than one implementation are returned in s_avg, s_max and s_min.
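For example, passing the vector [1 2] as the second parameter requests speedups against both the NetLogo and Java ST implementations. In this sketch we assume each element of the returned cells corresponds to one reference implementation, in the order given:

```matlab
% Speedups against NL (reference 1) and ST (reference 2)
[s_avg, s_max, s_min] = speedup(0, [1 2], 'NL', nlv1, 'ST', stv1);
% Assumed: s_avg{1} holds speedups vs. NL, s_avg{2} vs. ST
```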
Setting the first parameter of speedup to 1 will yield a bar plot displaying the relative speedups:
speedup(1, 1, 'NL', nlv1, 'ST', stv1);
Error bars representing the maximum and minimum speedups can be requested by passing a negative value as the first parameter:
speedup(-1, 1, 'NL', nlv1, 'ST', stv1);
The speedup function is also able to determine speedups between different implementations for multiple computational sizes. In this example we plot the speedup of several PPHPC parallel Java implementations against the NetLogo and Java single-thread implementations for multiple sizes. This example uses the nlv1 and stv1 variables defined in a previous example, and the plotted results are equivalent to figures 4a and 4b of reference [2]:
% Specify Java EQ implementation specs (runs with 12 threads)
eq100v1t12 = struct('sname', '100v1', 'csize', 100, 'folder', [datafolder '/times/EQ'], 'files', 't*100v1*t12r*.txt');
eq200v1t12 = struct('sname', '200v1', 'csize', 200, 'folder', [datafolder '/times/EQ'], 'files', 't*200v1*t12r*.txt');
eq400v1t12 = struct('sname', '400v1', 'csize', 400, 'folder', [datafolder '/times/EQ'], 'files', 't*400v1*t12r*.txt');
eq800v1t12 = struct('sname', '800v1', 'csize', 800, 'folder', [datafolder '/times/EQ'], 'files', 't*800v1*t12r*.txt');
eq1600v1t12 = struct('sname', '1600v1', 'csize', 1600, 'folder', [datafolder '/times/EQ'], 'files', 't*1600v1*t12r*.txt');
eqv1t12 = {eq100v1t12, eq200v1t12, eq400v1t12, eq800v1t12, eq1600v1t12};
% Specify Java EX implementation specs (runs with 12 threads)
ex100v1t12 = struct('sname', '100v1', 'csize', 100, 'folder', [datafolder '/times/EX'], 'files', 't*100v1*t12r*.txt');
ex200v1t12 = struct('sname', '200v1', 'csize', 200, 'folder', [datafolder '/times/EX'], 'files', 't*200v1*t12r*.txt');
ex400v1t12 = struct('sname', '400v1', 'csize', 400, 'folder', [datafolder '/times/EX'], 'files', 't*400v1*t12r*.txt');
ex800v1t12 = struct('sname', '800v1', 'csize', 800, 'folder', [datafolder '/times/EX'], 'files', 't*800v1*t12r*.txt');
ex1600v1t12 = struct('sname', '1600v1', 'csize', 1600, 'folder', [datafolder '/times/EX'], 'files', 't*1600v1*t12r*.txt');
exv1t12 = {ex100v1t12, ex200v1t12, ex400v1t12, ex800v1t12, ex1600v1t12};
% Specify Java ER implementation specs (runs with 12 threads)
er100v1t12 = struct('sname', '100v1', 'csize', 100, 'folder', [datafolder '/times/ER'], 'files', 't*100v1*t12r*.txt');
er200v1t12 = struct('sname', '200v1', 'csize', 200, 'folder', [datafolder '/times/ER'], 'files', 't*200v1*t12r*.txt');
er400v1t12 = struct('sname', '400v1', 'csize', 400, 'folder', [datafolder '/times/ER'], 'files', 't*400v1*t12r*.txt');
er800v1t12 = struct('sname', '800v1', 'csize', 800, 'folder', [datafolder '/times/ER'], 'files', 't*800v1*t12r*.txt');
er1600v1t12 = struct('sname', '1600v1', 'csize', 1600, 'folder', [datafolder '/times/ER'], 'files', 't*1600v1*t12r*.txt');
erv1t12 = {er100v1t12, er200v1t12, er400v1t12, er800v1t12, er1600v1t12};
% Specify Java OD implementation specs (runs with 12 threads, b = 500)
od100v1t12 = struct('sname', '100v1', 'csize', 100, 'folder', [datafolder '/times/OD'], 'files', 't*100v1*b500t12r*.txt');
od200v1t12 = struct('sname', '200v1', 'csize', 200, 'folder', [datafolder '/times/OD'], 'files', 't*200v1*b500t12r*.txt');
od400v1t12 = struct('sname', '400v1', 'csize', 400, 'folder', [datafolder '/times/OD'], 'files', 't*400v1*b500t12r*.txt');
od800v1t12 = struct('sname', '800v1', 'csize', 800, 'folder', [datafolder '/times/OD'], 'files', 't*800v1*b500t12r*.txt');
od1600v1t12 = struct('sname', '1600v1', 'csize', 1600, 'folder', [datafolder '/times/OD'], 'files', 't*1600v1*b500t12r*.txt');
odv1t12 = {od100v1t12, od200v1t12, od400v1t12, od800v1t12, od1600v1t12};
% Plot speedup of multiple parallel implementations against NetLogo implementation
% This plot is figure 4a of reference [2]
speedup(1, 1, 'NL', nlv1, 'ST', stv1, 'EQ', eqv1t12, 'EX', exv1t12, 'ER', erv1t12, 'OD', odv1t12);
% Place legend in a better position
legend(gca, 'Location', 'NorthWest');
% Plot speedup of multiple parallel implementations against Java ST implementation
% This plot is figure 4b of reference [2]
speedup(1, 1, 'ST', stv1, 'EQ', eqv1t12, 'EX', exv1t12, 'ER', erv1t12, 'OD', odv1t12);
% Place legend in a better position
legend(gca, 'Location', 'NorthOutside', 'Orientation', 'horizontal');
In a slightly more complex scenario than the one described in a previous example, here we use the perfstats function to plot the scalability of the different PPHPC implementations for increasing model sizes. Using the variables defined in the previous examples, the following command generates the equivalent to figure 5a of reference [2]:
perfstats(4, 'NL', nlv1, 'ST', stv1, 'EQ', eqv1t12, 'EX', exv1t12, 'ER', erv1t12, 'OD', odv1t12);
The 'computational size', i.e. the csize field, defined in the implementation specs passed to the perfstats function can be used in alternative contexts. In this example, we use the csize field to specify the number of threads used to perform a set of simulation runs or replications. The following commands will plot the scalability of the several PPHPC parallel implementations for an increasing number of threads. The plotted results are equivalent to figure 6d of reference [2]:
% Specify ST implementation specs. Note that the data is always the same,
% so in practice the scalability will be constant for ST. However, this is
% a convenient way to provide a comparison baseline in the plot.
st400v2t1 = struct('sname', '400v2', 'csize', 1, 'folder', [datafolder '/times/ST'], 'files', 't*400v2*.txt');
st400v2t2 = struct('sname', '400v2', 'csize', 2, 'folder', [datafolder '/times/ST'], 'files', 't*400v2*.txt');
st400v2t4 = struct('sname', '400v2', 'csize', 4, 'folder', [datafolder '/times/ST'], 'files', 't*400v2*.txt');
st400v2t6 = struct('sname', '400v2', 'csize', 6, 'folder', [datafolder '/times/ST'], 'files', 't*400v2*.txt');
st400v2t8 = struct('sname', '400v2', 'csize', 8, 'folder', [datafolder '/times/ST'], 'files', 't*400v2*.txt');
st400v2t12 = struct('sname', '400v2', 'csize', 12, 'folder', [datafolder '/times/ST'], 'files', 't*400v2*.txt');
st400v2t16 = struct('sname', '400v2', 'csize', 16, 'folder', [datafolder '/times/ST'], 'files', 't*400v2*.txt');
st400v2t24 = struct('sname', '400v2', 'csize', 24, 'folder', [datafolder '/times/ST'], 'files', 't*400v2*.txt');
stv2 = {st400v2t1, st400v2t2, st400v2t4, st400v2t6, st400v2t8, st400v2t12, st400v2t16, st400v2t24};
% Specify the EQ implementation specs for increasing number of threads
eq400v2t1 = struct('sname', '400v2', 'csize', 1, 'folder', [datafolder '/times/EQ'], 'files', 't*400v2*t1r*.txt');
eq400v2t2 = struct('sname', '400v2', 'csize', 2, 'folder', [datafolder '/times/EQ'], 'files', 't*400v2*t2r*.txt');
eq400v2t4 = struct('sname', '400v2', 'csize', 4, 'folder', [datafolder '/times/EQ'], 'files', 't*400v2*t4r*.txt');
eq400v2t6 = struct('sname', '400v2', 'csize', 6, 'folder', [datafolder '/times/EQ'], 'files', 't*400v2*t6r*.txt');
eq400v2t8 = struct('sname', '400v2', 'csize', 8, 'folder', [datafolder '/times/EQ'], 'files', 't*400v2*t8r*.txt');
eq400v2t12 = struct('sname', '400v2', 'csize', 12, 'folder', [datafolder '/times/EQ'], 'files', 't*400v2*t12r*.txt');
eq400v2t16 = struct('sname', '400v2', 'csize', 16, 'folder', [datafolder '/times/EQ'], 'files', 't*400v2*t16r*.txt');
eq400v2t24 = struct('sname', '400v2', 'csize', 24, 'folder', [datafolder '/times/EQ'], 'files', 't*400v2*t24r*.txt');
eqv2 = {eq400v2t1, eq400v2t2, eq400v2t4, eq400v2t6, eq400v2t8, eq400v2t12, eq400v2t16, eq400v2t24};
% Specify the EX implementation specs for increasing number of threads
ex400v2t1 = struct('sname', '400v2', 'csize', 1, 'folder', [datafolder '/times/EX'], 'files', 't*400v2*t1r*.txt');
ex400v2t2 = struct('sname', '400v2', 'csize', 2, 'folder', [datafolder '/times/EX'], 'files', 't*400v2*t2r*.txt');
ex400v2t4 = struct('sname', '400v2', 'csize', 4, 'folder', [datafolder '/times/EX'], 'files', 't*400v2*t4r*.txt');
ex400v2t6 = struct('sname', '400v2', 'csize', 6, 'folder', [datafolder '/times/EX'], 'files', 't*400v2*t6r*.txt');
ex400v2t8 = struct('sname', '400v2', 'csize', 8, 'folder', [datafolder '/times/EX'], 'files', 't*400v2*t8r*.txt');
ex400v2t12 = struct('sname', '400v2', 'csize', 12, 'folder', [datafolder '/times/EX'], 'files', 't*400v2*t12r*.txt');
ex400v2t16 = struct('sname', '400v2', 'csize', 16, 'folder', [datafolder '/times/EX'], 'files', 't*400v2*t16r*.txt');
ex400v2t24 = struct('sname', '400v2', 'csize', 24, 'folder', [datafolder '/times/EX'], 'files', 't*400v2*t24r*.txt');
exv2 = {ex400v2t1, ex400v2t2, ex400v2t4, ex400v2t6, ex400v2t8, ex400v2t12, ex400v2t16, ex400v2t24};
% Specify the ER implementation specs for increasing number of threads
er400v2t1 = struct('sname', '400v2', 'csize', 1, 'folder', [datafolder '/times/ER'], 'files', 't*400v2*t1r*.txt');
er400v2t2 = struct('sname', '400v2', 'csize', 2, 'folder', [datafolder '/times/ER'], 'files', 't*400v2*t2r*.txt');
er400v2t4 = struct('sname', '400v2', 'csize', 4, 'folder', [datafolder '/times/ER'], 'files', 't*400v2*t4r*.txt');
er400v2t6 = struct('sname', '400v2', 'csize', 6, 'folder', [datafolder '/times/ER'], 'files', 't*400v2*t6r*.txt');
er400v2t8 = struct('sname', '400v2', 'csize', 8, 'folder', [datafolder '/times/ER'], 'files', 't*400v2*t8r*.txt');
er400v2t12 = struct('sname', '400v2', 'csize', 12, 'folder', [datafolder '/times/ER'], 'files', 't*400v2*t12r*.txt');
er400v2t16 = struct('sname', '400v2', 'csize', 16, 'folder', [datafolder '/times/ER'], 'files', 't*400v2*t16r*.txt');
er400v2t24 = struct('sname', '400v2', 'csize', 24, 'folder', [datafolder '/times/ER'], 'files', 't*400v2*t24r*.txt');
erv2 = {er400v2t1, er400v2t2, er400v2t4, er400v2t6, er400v2t8, er400v2t12, er400v2t16, er400v2t24};
% Specify the OD implementation specs for increasing number of threads (b = 500)
od400v2t1 = struct('sname', '400v2', 'csize', 1, 'folder', [datafolder '/times/OD'], 'files', 't*400v2*b500t1r*.txt');
od400v2t2 = struct('sname', '400v2', 'csize', 2, 'folder', [datafolder '/times/OD'], 'files', 't*400v2*b500t2r*.txt');
od400v2t4 = struct('sname', '400v2', 'csize', 4, 'folder', [datafolder '/times/OD'], 'files', 't*400v2*b500t4r*.txt');
od400v2t6 = struct('sname', '400v2', 'csize', 6, 'folder', [datafolder '/times/OD'], 'files', 't*400v2*b500t6r*.txt');
od400v2t8 = struct('sname', '400v2', 'csize', 8, 'folder', [datafolder '/times/OD'], 'files', 't*400v2*b500t8r*.txt');
od400v2t12 = struct('sname', '400v2', 'csize', 12, 'folder', [datafolder '/times/OD'], 'files', 't*400v2*b500t12r*.txt');
od400v2t16 = struct('sname', '400v2', 'csize', 16, 'folder', [datafolder '/times/OD'], 'files', 't*400v2*b500t16r*.txt');
od400v2t24 = struct('sname', '400v2', 'csize', 24, 'folder', [datafolder '/times/OD'], 'files', 't*400v2*b500t24r*.txt');
odv2 = {od400v2t1, od400v2t2, od400v2t4, od400v2t6, od400v2t8, od400v2t12, od400v2t16, od400v2t24};
% Use a linear plot (first parameter = 1)
perfstats(1, 'ST', stv2, 'EQ', eqv2, 'EX', exv2, 'ER', erv2, 'OD', odv2);
% Move legend to a better position
legend(gca, 'Location', 'northeast');
For this example, in yet another possible use of the perfstats function, we use the csize field to specify the value of the b parameter of the Java OD variant of the PPHPC model. This allows us to analyze the performance of the OD parallelization strategy for different values of b. The plot created by the following commands is equivalent to figure 7b of reference [2]:
% Specify the OD implementation specs for size 100 and increasing values of b
od100v2b20 = struct('sname', 'b=20', 'csize', 20, 'folder', [datafolder '/times/OD'], 'files', 't*100v2*b20t12r*.txt');
od100v2b50 = struct('sname', 'b=50', 'csize', 50, 'folder', [datafolder '/times/OD'], 'files', 't*100v2*b50t12r*.txt');
od100v2b100 = struct('sname', 'b=100', 'csize', 100, 'folder', [datafolder '/times/OD'], 'files', 't*100v2*b100t12r*.txt');
od100v2b200 = struct('sname', 'b=200', 'csize', 200, 'folder', [datafolder '/times/OD'], 'files', 't*100v2*b200t12r*.txt');
od100v2b500 = struct('sname', 'b=500', 'csize', 500, 'folder', [datafolder '/times/OD'], 'files', 't*100v2*b500t12r*.txt');
od100v2b1000 = struct('sname', 'b=1000', 'csize', 1000, 'folder', [datafolder '/times/OD'], 'files', 't*100v2*b1000t12r*.txt');
od100v2b2000 = struct('sname', 'b=2000', 'csize', 2000, 'folder', [datafolder '/times/OD'], 'files', 't*100v2*b2000t12r*.txt');
od100v2b5000 = struct('sname', 'b=5000', 'csize', 5000, 'folder', [datafolder '/times/OD'], 'files', 't*100v2*b5000t12r*.txt');
od100v2 = {od100v2b20, od100v2b50, od100v2b100, od100v2b200, od100v2b500, od100v2b1000, od100v2b2000, od100v2b5000};
% Specify the OD implementation specs for size 200 and increasing values of b
od200v2b20 = struct('sname', 'b=20', 'csize', 20, 'folder', [datafolder '/times/OD'], 'files', 't*200v2*b20t12r*.txt');
od200v2b50 = struct('sname', 'b=50', 'csize', 50, 'folder', [datafolder '/times/OD'], 'files', 't*200v2*b50t12r*.txt');
od200v2b100 = struct('sname', 'b=100', 'csize', 100, 'folder', [datafolder '/times/OD'], 'files', 't*200v2*b100t12r*.txt');
od200v2b200 = struct('sname', 'b=200', 'csize', 200, 'folder', [datafolder '/times/OD'], 'files', 't*200v2*b200t12r*.txt');
od200v2b500 = struct('sname', 'b=500', 'csize', 500, 'folder', [datafolder '/times/OD'], 'files', 't*200v2*b500t12r*.txt');
od200v2b1000 = struct('sname', 'b=1000', 'csize', 1000, 'folder', [datafolder '/times/OD'], 'files', 't*200v2*b1000t12r*.txt');
od200v2b2000 = struct('sname', 'b=2000', 'csize', 2000, 'folder', [datafolder '/times/OD'], 'files', 't*200v2*b2000t12r*.txt');
od200v2b5000 = struct('sname', 'b=5000', 'csize', 5000, 'folder', [datafolder '/times/OD'], 'files', 't*200v2*b5000t12r*.txt');
od200v2 = {od200v2b20, od200v2b50, od200v2b100, od200v2b200, od200v2b500, od200v2b1000, od200v2b2000, od200v2b5000};
% Specify the OD implementation specs for size 400 and increasing values of b
od400v2b20 = struct('sname', 'b=20', 'csize', 20, 'folder', [datafolder '/times/OD'], 'files', 't*400v2*b20t12r*.txt');
od400v2b50 = struct('sname', 'b=50', 'csize', 50, 'folder', [datafolder '/times/OD'], 'files', 't*400v2*b50t12r*.txt');
od400v2b100 = struct('sname', 'b=100', 'csize', 100, 'folder', [datafolder '/times/OD'], 'files', 't*400v2*b100t12r*.txt');
od400v2b200 = struct('sname', 'b=200', 'csize', 200, 'folder', [datafolder '/times/OD'], 'files', 't*400v2*b200t12r*.txt');
od400v2b500 = struct('sname', 'b=500', 'csize', 500, 'folder', [datafolder '/times/OD'], 'files', 't*400v2*b500t12r*.txt');
od400v2b1000 = struct('sname', 'b=1000', 'csize', 1000, 'folder', [datafolder '/times/OD'], 'files', 't*400v2*b1000t12r*.txt');
od400v2b2000 = struct('sname', 'b=2000', 'csize', 2000, 'folder', [datafolder '/times/OD'], 'files', 't*400v2*b2000t12r*.txt');
od400v2b5000 = struct('sname', 'b=5000', 'csize', 5000, 'folder', [datafolder '/times/OD'], 'files', 't*400v2*b5000t12r*.txt');
od400v2 = {od400v2b20, od400v2b50, od400v2b100, od400v2b200, od400v2b500, od400v2b1000, od400v2b2000, od400v2b5000};
% Specify the OD implementation specs for size 800 and increasing values of b
od800v2b20 = struct('sname', 'b=20', 'csize', 20, 'folder', [datafolder '/times/OD'], 'files', 't*800v2*b20t12r*.txt');
od800v2b50 = struct('sname', 'b=50', 'csize', 50, 'folder', [datafolder '/times/OD'], 'files', 't*800v2*b50t12r*.txt');
od800v2b100 = struct('sname', 'b=100', 'csize', 100, 'folder', [datafolder '/times/OD'], 'files', 't*800v2*b100t12r*.txt');
od800v2b200 = struct('sname', 'b=200', 'csize', 200, 'folder', [datafolder '/times/OD'], 'files', 't*800v2*b200t12r*.txt');
od800v2b500 = struct('sname', 'b=500', 'csize', 500, 'folder', [datafolder '/times/OD'], 'files', 't*800v2*b500t12r*.txt');
od800v2b1000 = struct('sname', 'b=1000', 'csize', 1000, 'folder', [datafolder '/times/OD'], 'files', 't*800v2*b1000t12r*.txt');
od800v2b2000 = struct('sname', 'b=2000', 'csize', 2000, 'folder', [datafolder '/times/OD'], 'files', 't*800v2*b2000t12r*.txt');
od800v2b5000 = struct('sname', 'b=5000', 'csize', 5000, 'folder', [datafolder '/times/OD'], 'files', 't*800v2*b5000t12r*.txt');
od800v2 = {od800v2b20, od800v2b50, od800v2b100, od800v2b200, od800v2b500, od800v2b1000, od800v2b2000, od800v2b5000};
% Specify the OD implementation specs for size 1600 and increasing values of b
od1600v2b20 = struct('sname', 'b=20', 'csize', 20, 'folder', [datafolder '/times/OD'], 'files', 't*1600v2*b20t12r*.txt');
od1600v2b50 = struct('sname', 'b=50', 'csize', 50, 'folder', [datafolder '/times/OD'], 'files', 't*1600v2*b50t12r*.txt');
od1600v2b100 = struct('sname', 'b=100', 'csize', 100, 'folder', [datafolder '/times/OD'], 'files', 't*1600v2*b100t12r*.txt');
od1600v2b200 = struct('sname', 'b=200', 'csize', 200, 'folder', [datafolder '/times/OD'], 'files', 't*1600v2*b200t12r*.txt');
od1600v2b500 = struct('sname', 'b=500', 'csize', 500, 'folder', [datafolder '/times/OD'], 'files', 't*1600v2*b500t12r*.txt');
od1600v2b1000 = struct('sname', 'b=1000', 'csize', 1000, 'folder', [datafolder '/times/OD'], 'files', 't*1600v2*b1000t12r*.txt');
od1600v2b2000 = struct('sname', 'b=2000', 'csize', 2000, 'folder', [datafolder '/times/OD'], 'files', 't*1600v2*b2000t12r*.txt');
od1600v2b5000 = struct('sname', 'b=5000', 'csize', 5000, 'folder', [datafolder '/times/OD'], 'files', 't*1600v2*b5000t12r*.txt');
od1600v2 = {od1600v2b20, od1600v2b50, od1600v2b100, od1600v2b200, od1600v2b500, od1600v2b1000, od1600v2b2000, od1600v2b5000};
% Show plot
perfstats(4, '100', od100v2, '200', od200v2, '400', od400v2, '800', od800v2, '1600', od1600v2);
% Place legend in a better position
legend(gca, 'Location', 'NorthOutside', 'Orientation', 'horizontal');
As previously discussed, it is possible to generate custom plots using the data returned by perfstats and speedup. The following code snippet produces a customized version of the plot generated in the previous example. The resulting image is a publication quality equivalent of figure 7b in reference [2]:
% Get data from perfstats function
p = perfstats(0, '100', od100v2, '200', od200v2, '400', od400v2, '800', od800v2, '1600', od1600v2);
% Values of the b parameter
bvals = [20 50 100 200 500 1000 2000 5000];
% Generate basic plot with black lines
h = loglog(bvals, p', 'k');
set(gca, 'XTick', bvals);
% Set marker styles
set(h(1), 'Marker', 'o', 'MarkerFaceColor', [0.7 0.7 0.7]);
set(h(2), 'Marker', 's', 'MarkerFaceColor', [0.7 0.7 0.7]);
set(h(3), 'Marker', 'o', 'MarkerFaceColor', 'w');
set(h(4), 'Marker', '^', 'MarkerFaceColor', 'k');
set(h(5), 'Marker', 'd', 'MarkerFaceColor', [0.7 0.7 0.7]);
% Draw bold circles indicating best times for each size/b combination
grid on;
hold on;
[my, mi] = min(p, [], 2);
plot(bvals(mi), my, 'ok', 'MarkerSize', 10, 'LineWidth', 2);
% Set limits and add labels
xlim([min(bvals) max(bvals)]);
xlabel('Block size, {\itb}');
ylabel('Time ({\its})');
% Set legend
legend({'100', '200', '400', '800', '1600'}, 'Location', 'NorthOutside', 'Orientation', 'horizontal');
Although the figure looks appropriate for publication purposes, it can still be improved by converting it to native LaTeX via the matlab2tikz script:
% Small adjustments so that figure looks better when converted
grid minor;
set(gca, 'XTickLabel', bvals);
% Convert figure to LaTeX
cleanfigure();
matlab2tikz('standalone', true, 'filename', 'image.tex');
Compiling the image.tex file with a LaTeX engine yields the following figure:
The times_table and times_table_f functions can be used to create performance tables formatted in plain text or LaTeX. Using the data defined in a previous example, the following commands produce a plain text table comparing the NetLogo (NL) and Java single-thread (ST) PPHPC implementations for sizes 100 to 1600, parameter set 1:
% Put data in table format
tdata = times_table(1, 'NL', nlv1, 'ST', stv1);
% Print a plain text table
times_table_f(0, 'NL vs ST', tdata)
-----------------------------------------------
| NL vs ST |
-----------------------------------------------------------------
| Imp. | Set. | t(s) | std | std% | x NL |
-----------------------------------------------------------------
| NL | 100v1 | 15.9 | 0.359 | 2.26 | 1 |
| | 200v1 | 100 | 1.25 | 1.25 | 1 |
| | 400v1 | 481 | 6.02 | 1.25 | 1 |
| | 800v1 | 2.08e+03 | 9.75 | 0.47 | 1 |
| | 1600v1 | 9.12e+03 | 94.1 | 1.03 | 1 |
-----------------------------------------------------------------
| ST | 100v1 | 2.71 | 0.0223 | 0.82 | 5.85 |
| | 200v1 | 12.2 | 0.219 | 1.80 | 8.24 |
| | 400v1 | 84.4 | 2.83 | 3.35 | 5.71 |
| | 800v1 | 383 | 5.04 | 1.32 | 5.43 |
| | 1600v1 | 1.68e+03 | 78.4 | 4.67 | 5.43 |
-----------------------------------------------------------------
In order to produce the equivalent LaTeX table, we set the first parameter to 1 instead of 0:
% Print a LaTeX table
times_table_f(1, 'NL vs ST', tdata)
The times_table and times_table_f functions are capable of producing more complex tables. In this example, we show how to reproduce table 7 of reference [2], containing times and speedups for multiple model implementations, different sizes and both parameter sets, showing speedups of all implementations versus the NetLogo and Java ST versions.
The first step consists of defining the implementation specs:
% %%%%%%%%%%%%%%%%%%%%%%%%% %
% Specs for parameter set 1 %
% %%%%%%%%%%%%%%%%%%%%%%%%% %
% Define NetLogo implementation specs, parameter set 1
nl100v1 = struct('sname', '100', 'csize', 100, 'folder', [datafolder '/times/NL'], 'files', 't*100v1*.txt');
nl200v1 = struct('sname', '200', 'csize', 200, 'folder', [datafolder '/times/NL'], 'files', 't*200v1*.txt');
nl400v1 = struct('sname', '400', 'csize', 400, 'folder', [datafolder '/times/NL'], 'files', 't*400v1*.txt');
nl800v1 = struct('sname', '800', 'csize', 800, 'folder', [datafolder '/times/NL'], 'files', 't*800v1*.txt');
nl1600v1 = struct('sname', '1600', 'csize', 1600, 'folder', [datafolder '/times/NL'], 'files', 't*1600v1*.txt');
nlv1 = {nl100v1, nl200v1, nl400v1, nl800v1, nl1600v1};
% Define Java ST implementation specs, parameter set 1
st100v1 = struct('sname', '100', 'csize', 100, 'folder', [datafolder '/times/ST'], 'files', 't*100v1*.txt');
st200v1 = struct('sname', '200', 'csize', 200, 'folder', [datafolder '/times/ST'], 'files', 't*200v1*.txt');
st400v1 = struct('sname', '400', 'csize', 400, 'folder', [datafolder '/times/ST'], 'files', 't*400v1*.txt');
st800v1 = struct('sname', '800', 'csize', 800, 'folder', [datafolder '/times/ST'], 'files', 't*800v1*.txt');
st1600v1 = struct('sname', '1600', 'csize', 1600, 'folder', [datafolder '/times/ST'], 'files', 't*1600v1*.txt');
stv1 = {st100v1, st200v1, st400v1, st800v1, st1600v1};
% Define Java EQ implementation specs (runs with 12 threads), parameter set 1
eq100v1t12 = struct('sname', '100', 'csize', 100, 'folder', [datafolder '/times/EQ'], 'files', 't*100v1*t12r*.txt');
eq200v1t12 = struct('sname', '200', 'csize', 200, 'folder', [datafolder '/times/EQ'], 'files', 't*200v1*t12r*.txt');
eq400v1t12 = struct('sname', '400', 'csize', 400, 'folder', [datafolder '/times/EQ'], 'files', 't*400v1*t12r*.txt');
eq800v1t12 = struct('sname', '800', 'csize', 800, 'folder', [datafolder '/times/EQ'], 'files', 't*800v1*t12r*.txt');
eq1600v1t12 = struct('sname', '1600', 'csize', 1600, 'folder', [datafolder '/times/EQ'], 'files', 't*1600v1*t12r*.txt');
eqv1t12 = {eq100v1t12, eq200v1t12, eq400v1t12, eq800v1t12, eq1600v1t12};
% Define Java EX implementation specs (runs with 12 threads), parameter set 1
ex100v1t12 = struct('sname', '100', 'csize', 100, 'folder', [datafolder '/times/EX'], 'files', 't*100v1*t12r*.txt');
ex200v1t12 = struct('sname', '200', 'csize', 200, 'folder', [datafolder '/times/EX'], 'files', 't*200v1*t12r*.txt');
ex400v1t12 = struct('sname', '400', 'csize', 400, 'folder', [datafolder '/times/EX'], 'files', 't*400v1*t12r*.txt');
ex800v1t12 = struct('sname', '800', 'csize', 800, 'folder', [datafolder '/times/EX'], 'files', 't*800v1*t12r*.txt');
ex1600v1t12 = struct('sname', '1600', 'csize', 1600, 'folder', [datafolder '/times/EX'], 'files', 't*1600v1*t12r*.txt');
exv1t12 = {ex100v1t12, ex200v1t12, ex400v1t12, ex800v1t12, ex1600v1t12};
% Define Java ER implementation specs (runs with 12 threads), parameter set 1
er100v1t12 = struct('sname', '100', 'csize', 100, 'folder', [datafolder '/times/ER'], 'files', 't*100v1*t12r*.txt');
er200v1t12 = struct('sname', '200', 'csize', 200, 'folder', [datafolder '/times/ER'], 'files', 't*200v1*t12r*.txt');
er400v1t12 = struct('sname', '400', 'csize', 400, 'folder', [datafolder '/times/ER'], 'files', 't*400v1*t12r*.txt');
er800v1t12 = struct('sname', '800', 'csize', 800, 'folder', [datafolder '/times/ER'], 'files', 't*800v1*t12r*.txt');
er1600v1t12 = struct('sname', '1600', 'csize', 1600, 'folder', [datafolder '/times/ER'], 'files', 't*1600v1*t12r*.txt');
erv1t12 = {er100v1t12, er200v1t12, er400v1t12, er800v1t12, er1600v1t12};
% Define Java OD implementation specs (runs with 12 threads, b = 500), parameter set 1
od100v1t12 = struct('sname', '100', 'csize', 100, 'folder', [datafolder '/times/OD'], 'files', 't*100v1*b500t12r*.txt');
od200v1t12 = struct('sname', '200', 'csize', 200, 'folder', [datafolder '/times/OD'], 'files', 't*200v1*b500t12r*.txt');
od400v1t12 = struct('sname', '400', 'csize', 400, 'folder', [datafolder '/times/OD'], 'files', 't*400v1*b500t12r*.txt');
od800v1t12 = struct('sname', '800', 'csize', 800, 'folder', [datafolder '/times/OD'], 'files', 't*800v1*b500t12r*.txt');
od1600v1t12 = struct('sname', '1600', 'csize', 1600, 'folder', [datafolder '/times/OD'], 'files', 't*1600v1*b500t12r*.txt');
odv1t12 = {od100v1t12, od200v1t12, od400v1t12, od800v1t12, od1600v1t12};
% %%%%%%%%%%%%%%%%%%%%%%%%% %
% Specs for parameter set 2 %
% %%%%%%%%%%%%%%%%%%%%%%%%% %
% Define NetLogo implementation specs, parameter set 2
nl100v2 = struct('sname', '100', 'csize', 100, 'folder', [datafolder '/times/NL'], 'files', 't*100v2*.txt');
nl200v2 = struct('sname', '200', 'csize', 200, 'folder', [datafolder '/times/NL'], 'files', 't*200v2*.txt');
nl400v2 = struct('sname', '400', 'csize', 400, 'folder', [datafolder '/times/NL'], 'files', 't*400v2*.txt');
nl800v2 = struct('sname', '800', 'csize', 800, 'folder', [datafolder '/times/NL'], 'files', 't*800v2*.txt');
nl1600v2 = struct('sname', '1600', 'csize', 1600, 'folder', [datafolder '/times/NL'], 'files', 't*1600v2*.txt');
nlv2 = {nl100v2, nl200v2, nl400v2, nl800v2, nl1600v2};
% Define Java ST implementation specs, parameter set 2
st100v2 = struct('sname', '100', 'csize', 100, 'folder', [datafolder '/times/ST'], 'files', 't*100v2*.txt');
st200v2 = struct('sname', '200', 'csize', 200, 'folder', [datafolder '/times/ST'], 'files', 't*200v2*.txt');
st400v2 = struct('sname', '400', 'csize', 400, 'folder', [datafolder '/times/ST'], 'files', 't*400v2*.txt');
st800v2 = struct('sname', '800', 'csize', 800, 'folder', [datafolder '/times/ST'], 'files', 't*800v2*.txt');
st1600v2 = struct('sname', '1600', 'csize', 1600, 'folder', [datafolder '/times/ST'], 'files', 't*1600v2*.txt');
stv2 = {st100v2, st200v2, st400v2, st800v2, st1600v2};
% Define Java EQ implementation specs (runs with 12 threads), parameter set 2
eq100v2t12 = struct('sname', '100', 'csize', 100, 'folder', [datafolder '/times/EQ'], 'files', 't*100v2*t12r*.txt');
eq200v2t12 = struct('sname', '200', 'csize', 200, 'folder', [datafolder '/times/EQ'], 'files', 't*200v2*t12r*.txt');
eq400v2t12 = struct('sname', '400', 'csize', 400, 'folder', [datafolder '/times/EQ'], 'files', 't*400v2*t12r*.txt');
eq800v2t12 = struct('sname', '800', 'csize', 800, 'folder', [datafolder '/times/EQ'], 'files', 't*800v2*t12r*.txt');
eq1600v2t12 = struct('sname', '1600', 'csize', 1600, 'folder', [datafolder '/times/EQ'], 'files', 't*1600v2*t12r*.txt');
eqv2t12 = {eq100v2t12, eq200v2t12, eq400v2t12, eq800v2t12, eq1600v2t12};
% Define Java EX implementation specs (runs with 12 threads), parameter set 2
ex100v2t12 = struct('sname', '100', 'csize', 100, 'folder', [datafolder '/times/EX'], 'files', 't*100v2*t12r*.txt');
ex200v2t12 = struct('sname', '200', 'csize', 200, 'folder', [datafolder '/times/EX'], 'files', 't*200v2*t12r*.txt');
ex400v2t12 = struct('sname', '400', 'csize', 400, 'folder', [datafolder '/times/EX'], 'files', 't*400v2*t12r*.txt');
ex800v2t12 = struct('sname', '800', 'csize', 800, 'folder', [datafolder '/times/EX'], 'files', 't*800v2*t12r*.txt');
ex1600v2t12 = struct('sname', '1600', 'csize', 1600, 'folder', [datafolder '/times/EX'], 'files', 't*1600v2*t12r*.txt');
exv2t12 = {ex100v2t12, ex200v2t12, ex400v2t12, ex800v2t12, ex1600v2t12};
% Define Java ER implementation specs (runs with 12 threads), parameter set 2
er100v2t12 = struct('sname', '100', 'csize', 100, 'folder', [datafolder '/times/ER'], 'files', 't*100v2*t12r*.txt');
er200v2t12 = struct('sname', '200', 'csize', 200, 'folder', [datafolder '/times/ER'], 'files', 't*200v2*t12r*.txt');
er400v2t12 = struct('sname', '400', 'csize', 400, 'folder', [datafolder '/times/ER'], 'files', 't*400v2*t12r*.txt');
er800v2t12 = struct('sname', '800', 'csize', 800, 'folder', [datafolder '/times/ER'], 'files', 't*800v2*t12r*.txt');
er1600v2t12 = struct('sname', '1600', 'csize', 1600, 'folder', [datafolder '/times/ER'], 'files', 't*1600v2*t12r*.txt');
erv2t12 = {er100v2t12, er200v2t12, er400v2t12, er800v2t12, er1600v2t12};
% Define Java OD implementation specs (runs with 12 threads, b = 500), parameter set 2
od100v2t12 = struct('sname', '100', 'csize', 100, 'folder', [datafolder '/times/OD'], 'files', 't*100v2*b500t12r*.txt');
od200v2t12 = struct('sname', '200', 'csize', 200, 'folder', [datafolder '/times/OD'], 'files', 't*200v2*b500t12r*.txt');
od400v2t12 = struct('sname', '400', 'csize', 400, 'folder', [datafolder '/times/OD'], 'files', 't*400v2*b500t12r*.txt');
od800v2t12 = struct('sname', '800', 'csize', 800, 'folder', [datafolder '/times/OD'], 'files', 't*800v2*b500t12r*.txt');
od1600v2t12 = struct('sname', '1600', 'csize', 1600, 'folder', [datafolder '/times/OD'], 'files', 't*1600v2*b500t12r*.txt');
odv2t12 = {od100v2t12, od200v2t12, od400v2t12, od800v2t12, od1600v2t12};
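The spec definitions above are highly repetitive: only the size, folder and file pattern change between them. A small helper function can generate each cell array of specs in a loop. The following is a minimal sketch, not part of PerfAndPubTools; the `make_specs` name and its `sprintf`-style pattern argument are hypothetical. Save it as make_specs.m on the MATLAB/Octave path:

```matlab
function specs = make_specs(folder, pattern)
% MAKE_SPECS Build a cell array of implementation specs for one
% implementation folder, one spec per model size.
%
% folder  - folder containing the time files
% pattern - sprintf-style file pattern with one %d placeholder for the
%           size, e.g. 't*%dv1*.txt'

    % Model sizes used throughout this example
    sizes = [100 200 400 800 1600];

    specs = cell(1, numel(sizes));
    for i = 1:numel(sizes)
        s = sizes(i);
        specs{i} = struct( ...
            'sname', num2str(s), ...
            'csize', s, ...
            'folder', folder, ...
            'files', sprintf(pattern, s));
    end

end
```

With this helper, a definition such as `nlv1 = make_specs([datafolder '/times/NL'], 't*%dv1*.txt');` should reproduce the cell array built explicitly above, and likewise for the other implementations and parameter sets.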
After the implementation specs are defined, we create two intermediate tables:
% %%%%%%%%%%%%%%%%%%% %
% Intermediate tables %
% %%%%%%%%%%%%%%%%%%% %
% Parameter set 1
data_v1 = times_table([1 2], 'NL', nlv1, 'ST', stv1, 'EQ', eqv1t12, 'EX', exv1t12, 'ER', erv1t12, 'OD', odv1t12);
% Parameter set 2
data_v2 = times_table([1 2], 'NL', nlv2, 'ST', stv2, 'EQ', eqv2t12, 'EX', exv2t12, 'ER', erv2t12, 'OD', odv2t12);
We first print a plain text table to check how the information is organized:
% %%%%%%%%%%%% %
% Print tables %
% %%%%%%%%%%%% %
% Plain text table
times_table_f(0, 'Param. set 1', data_v1, 'Param. set 2', data_v2)
---------------------------------------------------------------------------------------------------------------------
| Param. set 1 | Param. set 2 |
---------------------------------------------------------------------------------------------------------------------------------------
| Imp. | Set. | t(s) | std | std% | x NL | x ST | t(s) | std | std% | x NL | x ST |
---------------------------------------------------------------------------------------------------------------------------------------
| NL | 100 | 15.9 | 0.359 | 2.26 | 1 | 0.171 | 32.2 | 0.686 | 2.13 | 1 | 0.166 |
| | 200 | 100 | 1.25 | 1.25 | 1 | 0.121 | 245 | 1.5 | 0.61 | 1 | 0.147 |
| | 400 | 481 | 6.02 | 1.25 | 1 | 0.175 | 1.07e+03 | 3.63 | 0.34 | 1 | 0.148 |
| | 800 | 2.08e+03 | 9.75 | 0.47 | 1 | 0.184 | 4.54e+03 | 23.2 | 0.51 | 1 | 0.154 |
| | 1600 | 9.12e+03 | 94.1 | 1.03 | 1 | 0.184 | 1.96e+04 | 90.9 | 0.46 | 1 | 0.151 |
---------------------------------------------------------------------------------------------------------------------------------------
| ST | 100 | 2.71 | 0.0223 | 0.82 | 5.85 | 1 | 5.34 | 0.051 | 0.96 | 6.03 | 1 |
| | 200 | 12.2 | 0.219 | 1.80 | 8.24 | 1 | 36.1 | 0.178 | 0.49 | 6.79 | 1 |
| | 400 | 84.4 | 2.83 | 3.35 | 5.71 | 1 | 159 | 0.474 | 0.30 | 6.76 | 1 |
| | 800 | 383 | 5.04 | 1.32 | 5.43 | 1 | 700 | 3.67 | 0.52 | 6.49 | 1 |
| | 1600 | 1.68e+03 | 78.4 | 4.67 | 5.43 | 1 | 2.96e+03 | 123 | 4.15 | 6.61 | 1 |
---------------------------------------------------------------------------------------------------------------------------------------
| EQ | 100 | 1.55 | 0.0251 | 1.62 | 10.2 | 1.75 | 1.87 | 0.0287 | 1.53 | 17.2 | 2.85 |
| | 200 | 2.81 | 0.113 | 4.01 | 35.6 | 4.32 | 7.08 | 0.126 | 1.78 | 34.6 | 5.1 |
| | 400 | 19.5 | 0.214 | 1.10 | 24.7 | 4.34 | 31.2 | 0.207 | 0.66 | 34.5 | 5.1 |
| | 800 | 86.1 | 4.26 | 4.95 | 24.1 | 4.45 | 125 | 4.15 | 3.32 | 36.2 | 5.58 |
| | 1600 | 279 | 4.04 | 1.45 | 32.6 | 6.01 | 487 | 8.48 | 1.74 | 40.1 | 6.07 |
---------------------------------------------------------------------------------------------------------------------------------------
| EX | 100 | 1.53 | 0.0291 | 1.90 | 10.4 | 1.78 | 2.14 | 0.0587 | 2.75 | 15.1 | 2.5 |
| | 200 | 2.91 | 0.107 | 3.69 | 34.4 | 4.18 | 8.08 | 0.141 | 1.74 | 30.4 | 4.47 |
| | 400 | 19.6 | 0.302 | 1.54 | 24.6 | 4.31 | 34.2 | 0.527 | 1.54 | 31.4 | 4.65 |
| | 800 | 86.5 | 5.46 | 6.31 | 24 | 4.42 | 139 | 5.96 | 4.29 | 32.6 | 5.03 |
| | 1600 | 282 | 5.49 | 1.95 | 32.4 | 5.96 | 532 | 5.24 | 0.99 | 36.8 | 5.56 |
---------------------------------------------------------------------------------------------------------------------------------------
| ER | 100 | 7.29 | 0.325 | 4.46 | 2.18 | 0.372 | 8.39 | 0.148 | 1.76 | 3.83 | 0.636 |
| | 200 | 16.4 | 0.77 | 4.68 | 6.1 | 0.74 | 17.9 | 0.252 | 1.41 | 13.7 | 2.02 |
| | 400 | 37.2 | 0.204 | 0.55 | 13 | 2.27 | 45.9 | 0.285 | 0.62 | 23.4 | 3.46 |
| | 800 | 111 | 3.37 | 3.02 | 18.6 | 3.43 | 159 | 3.21 | 2.02 | 28.5 | 4.39 |
| | 1600 | 332 | 3.5 | 1.06 | 27.5 | 5.06 | 553 | 8.03 | 1.45 | 35.3 | 5.34 |
---------------------------------------------------------------------------------------------------------------------------------------
| OD | 100 | 1.36 | 0.0158 | 1.16 | 11.7 | 2 | 2 | 0.0331 | 1.66 | 16.1 | 2.68 |
| | 200 | 2.68 | 0.07 | 2.61 | 37.4 | 4.54 | 6.64 | 0.109 | 1.64 | 37 | 5.44 |
| | 400 | 19.2 | 0.199 | 1.04 | 25.1 | 4.4 | 29.1 | 0.122 | 0.42 | 36.9 | 5.46 |
| | 800 | 82.9 | 2.27 | 2.73 | 25 | 4.61 | 118 | 3 | 2.55 | 38.6 | 5.95 |
| | 1600 | 292 | 8.51 | 2.91 | 31.2 | 5.74 | 479 | 9.32 | 1.95 | 40.8 | 6.18 |
---------------------------------------------------------------------------------------------------------------------------------------
Finally, we produce a LaTeX table, as shown in reference [2]:
% LaTeX table
times_table_f(1, 'Param. set 1', data_v1, 'Param. set 2', data_v2)
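To include the generated table directly in a manuscript, the printed output can be captured into a .tex file. One generic approach, assuming times_table_f writes the table to the console as in the plain text example above, is MATLAB/Octave's diary mechanism (the table.tex filename is arbitrary):

```matlab
% Capture console output into table.tex
diary('table.tex');
times_table_f(1, 'Param. set 1', data_v1, 'Param. set 2', data_v2);
diary off;
```

The resulting file can then be pulled into a LaTeX document with \input{table}, possibly after trimming any extra lines the session may have echoed.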
[1] Fachada N, Lopes VV, Martins RC, Rosa AC (2016). PerfAndPubTools – Tools for Software Performance Analysis and Publishing of Results. Journal of Open Research Software. 4(1), p.e18. http://doi.org/10.5334/jors.115
[2] Fachada N, Lopes VV, Martins RC, Rosa AC (2017). Parallelization strategies for spatial agent-based models. International Journal of Parallel Programming. 45(3):449–481. http://dx.doi.org/10.1007/s10766-015-0399-9 (arXiv preprint)