How to structure and design experimental scripts #18

scotts opened this issue Jul 10, 2020 · 7 comments
scotts commented Jul 10, 2020

The experimental scripts were written with the C++ implementations in mind. Comparing across language implementations would also be interesting, but this means that:

  1. The scripts need to have a concept of language implementation.
  2. Each language will need a benchmark driver that performs equivalent experiments, ideally exposing the same command-line interface.
scotts added the question label on Jul 10, 2020
segeljakt commented Jul 12, 2020

In Rust, it should be very easy to set up a CLI with clap. Should input to the benchmarks be read from standard input or generated inside the program?
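A minimal sketch of what that could look like with clap's derive API (clap 4 syntax; the field names are just placeholders mirroring the positional arguments of the C++ driver discussed below):

```rust
// Sketch only: assumes clap with the "derive" feature enabled.
use clap::Parser;

/// Hypothetical benchmark driver arguments.
#[derive(Parser, Debug)]
#[command(name = "benchmark_driver")]
struct Args {
    /// Which SWAG to run, e.g. "two_stacks"
    swag: String,
    /// Window size in elements
    window_size: usize,
    /// Number of iterations
    iterations: u64,
}

fn main() {
    let args = Args::parse();
    println!(
        "running {} with window {} for {} iterations",
        args.swag, args.window_size, args.iterations
    );
}
```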

I think some algorithms might also have optimisations that are best tuned at compile time. For example, FiBA can be compiled with different MAX_ARITY values. In Rust this can be handled through environment variables.
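One way that could work is a small build script that turns the environment variable into a constant (a sketch only; the variable name follows the C++ MAX_ARITY, and the generated file name is made up):

```rust
// build.rs -- sketch: bake MAX_ARITY into the crate at build time,
// e.g. `MAX_ARITY=8 cargo build --release`.
use std::{env, fs, path::Path};

fn main() {
    // Fall back to 4 when the variable is not set.
    let arity = env::var("MAX_ARITY").unwrap_or_else(|_| "4".to_string());
    let out_dir = env::var("OUT_DIR").unwrap();
    let dest = Path::new(&out_dir).join("arity.rs");
    fs::write(&dest, format!("pub const MAX_ARITY: usize = {};", arity)).unwrap();
    // Recompile when the variable changes.
    println!("cargo:rerun-if-env-changed=MAX_ARITY");
}
```

The crate would then pull the constant in with `include!(concat!(env!("OUT_DIR"), "/arity.rs"));`.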

I'm also wondering, would it be worth bridging the implementations with for example https://github.com/dtolnay/cxx?

scotts commented Jul 15, 2020

Using a library for the CLI is better than my current ad-hoc approach. :) I should probably pull in some Boost library rather than the hackish way I'm currently doing it.

Should input to the benchmarks be read from standard input or generated inside the program?

I think that depends on the kind of benchmark. For some of what we have implemented, the data generation is synthetic. In that case, only the parameters of the benchmark are read from the command line (but not standard in). For the data-driven benchmark, we have so far read a filename from the command line, then opened that file during the run instead of reading from standard in.

We have also had to treat the data-driven benchmarks differently. For the synthetic benchmarks, it's okay to issue a call such as:

$ benchmark_driver two_stacks 1024 10000000

That is, it's doing a single run of the Two-Stacks algorithm with a window of size 1024 for 10000000 iterations. The benchmark then self-reports the core running time (eliding warmup phase). A command-line-call-per-data-point is fine with synthetic benchmarks. But for data-driven benchmarks, reading the data file alone can take 5 minutes. For those benchmarks, in order to manage the amount of time it takes to run all experiments, we had to read the file once, cache it, then re-use that for many runs. (See DataGenerators.h for our implementation of the file reading and caching.) But that then means when we call the benchmark, we must also tell it to run many different experiments, which we do through a separate input file. (See the data_benchmark for what this looks like.)
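Roughly, that read-once / run-many pattern could look like the sketch below (this is not the DataGenerators.h logic; the file name, the Experiment struct, and run_experiment are placeholders):

```rust
// Sketch: parse the expensive data file once, then reuse it for many runs.
use std::fs;

struct Experiment {
    swag: String,
    window_size: usize,
}

fn run_experiment(swag: &str, window_size: usize, records: &[&str]) {
    // Placeholder: the real driver would time the core loop here.
    println!("{} window={} over {} records", swag, window_size, records.len());
}

fn main() -> std::io::Result<()> {
    // Expensive step, done exactly once.
    let raw = fs::read_to_string("data/input.txt")?;
    let records: Vec<&str> = raw.lines().collect();

    // Cheap step: in the real driver these would come from the experiment file.
    let experiments = vec![
        Experiment { swag: "two_stacks".into(), window_size: 1024 },
        Experiment { swag: "fiba4".into(), window_size: 1024 },
    ];
    for e in &experiments {
        run_experiment(&e.swag, e.window_size, &records);
    }
    Ok(())
}
```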

I think some algorithms might also have optimisations that are best tuned at compile time.

We solved this problem by giving them different names. That is, fiba is not actually a name recognized by our benchmarks, but it does recognize fiba4, fiba8, etc. (See, for example, where we determine which SWAG to use.) I never liked how that turned out, as I had to write dozens of helper functions because of all of the different ways the SWAGs were template-parameterized. I have some hope that your clean interfaces, and Rust's macro support, can allow a much cleaner implementation of a similar idea.
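For what it's worth, here is a rough idea of how a macro could keep that name-to-instantiation table in one place (a sketch, assuming a hypothetical Fiba type with a const-generic arity; not the actual implementation):

```rust
// Sketch: dispatch a runtime name like "fiba8" to a compile-time instantiation.
struct Fiba<const ARITY: usize>;

impl<const ARITY: usize> Fiba<ARITY> {
    fn run(window_size: usize, iterations: u64) {
        println!("FiBA arity={} window={} iters={}", ARITY, window_size, iterations);
    }
}

macro_rules! dispatch_swag {
    ($name:expr; $( $key:literal => $body:expr ),+ $(,)?) => {
        match $name {
            $( $key => $body, )+
            other => panic!("unknown SWAG: {}", other),
        }
    };
}

fn main() {
    let (name, window, iters) = ("fiba8", 1024usize, 10_000_000u64);
    dispatch_swag!(name;
        "fiba2" => Fiba::<2>::run(window, iters),
        "fiba4" => Fiba::<4>::run(window, iters),
        "fiba8" => Fiba::<8>::run(window, iters),
    );
}
```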

I'm also wondering, would it be worth bridging the implementations with for example https://github.com/dtolnay/cxx?

Interesting - that could allow one binary driver for both C++ and Rust. I hadn't even thought of going in that direction.

segeljakt commented
I will try to set up a benchmark in criterion. Do you have a link for downloading the datasets? Also, is it possible to generate the synthetic data from C++ and use it as input to Rust, or should I generate it directly in Rust?
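Something like this skeleton (benches/swag.rs) is what I have in mind; the workload below is just a stand-in, not one of the actual SWAGs:

```rust
// Sketch: a criterion benchmark with a placeholder workload.
use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Stand-in for the real SWAG loop: fold an incrementing counter.
fn synthetic_sum(n: u64) -> u64 {
    (0..n).fold(0u64, |acc, x| acc.wrapping_add(x))
}

fn bench_synthetic(c: &mut Criterion) {
    c.bench_function("synthetic_sum_10k", |b| {
        b.iter(|| synthetic_sum(black_box(10_000)))
    });
}

criterion_group!(benches, bench_synthetic);
criterion_main!(benches);
```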

segeljakt commented
Aah, never mind, I executed the benchmark and it downloaded everything.

scotts commented Jul 16, 2020

For the synthetic data, we just increment an integer (I think a uint64_t), so there's no need to generate it in one language and feed it to the other. That's probably what you ran; I don't think we automatically download the real datasets, since they are so large.
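In Rust terms, that generator is nothing more than (just for illustration):

```rust
// Sketch: the synthetic input is just an incrementing 64-bit counter.
fn synthetic_stream(n: u64) -> impl Iterator<Item = u64> {
    0..n
}

fn main() {
    for value in synthetic_stream(5) {
        println!("{}", value);
    }
}
```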

We got the DEBS 2012 Grand Challenge data from https://debs.org/grand-challenges/2012/. Since we used this dataset in FIFO experiments, we ran a pre-processing script which removes a small number of out-of-order elements. I'll look into getting this script added to the repo.

The NYC Citi Bike data (used in the FiBA paper) is from https://www.citibikenyc.com/system-data. We used August through December 2018.

We'll want some standardized way of getting these datasets. Probably a script of some kind.

segeljakt commented Jul 16, 2020

Ok, cool 👍 I have some questions:

  • Is it in general ok to assume aggregation operators have a constant execution time?
  • How does C++ deal with overflows/underflows with aggregation operators like sum?
  • Are the benchmarks here 1:1 with the ones in the papers?

scotts commented Jul 17, 2020

Is it in general ok to assume aggregation operators have a constant execution time?

No, but it's also not your "responsibility" as the implementer of the SWAG. What I mean is that the complexity guarantees of a SWAG always count invocations of the aggregation operation. So when we say, for example, that a SWAG's operations are O(1), it means that we only call the aggregation operation a constant number of times for each operation. That does not preclude the aggregation operation itself from being non-constant. The classic example of a non-constant aggregation operation is collect, which returns a list of all elements in the window. That is inherently an O(n) operation.
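To make the collect example concrete, its binary operator might look like this (illustrative types only, not from the repo):

```rust
// Sketch: a collect-style aggregation whose combine step is O(n),
// even if the SWAG only calls it a constant number of times per operation.
#[derive(Clone, Default)]
struct Collect(Vec<u64>);

impl Collect {
    // Lift a single element into a partial aggregate.
    fn lift(x: u64) -> Self {
        Collect(vec![x])
    }

    // The binary aggregation operator: O(n) in the size of the partial results.
    fn combine(mut self, mut other: Self) -> Self {
        self.0.append(&mut other.0);
        self
    }
}

fn main() {
    let window = Collect::lift(1).combine(Collect::lift(2)).combine(Collect::lift(3));
    assert_eq!(window.0, vec![1, 2, 3]);
}
```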

How does C++ deal with overflows/underflows with aggregation operators like sum?

It doesn't. :) In C/C++, overflows can silently happen. I believe we prevent them by limiting the range of our input data.
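For contrast, a Rust port could make the overflow policy for sum explicit; this is just an illustration, not what the benchmarks currently do:

```rust
// Sketch: the overflow policies Rust lets you pick between explicitly.
fn main() {
    let a: u64 = u64::MAX;
    let b: u64 = 1;

    // Wrap around, like unsigned overflow in C++.
    assert_eq!(a.wrapping_add(b), 0);

    // Detect the overflow instead of letting it happen silently.
    assert_eq!(a.checked_add(b), None);

    // Clamp at the maximum value.
    assert_eq!(a.saturating_add(b), u64::MAX);
}
```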

Are the benchmarks here 1:1 with the ones in the papers?

No, but all of the benchmarks from the DEBS 2017 and VLDB 2019 papers are here. We don't have the experiments from VLDB 2015. The C++ benchmarks are:

The rest of that header file consists of the helper functions to call them from the drivers. Because some of our SWAGs had slightly different template parameters, there was a combinatorial explosion in helpers.
