Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

qs file format as cache file format #304

Merged
merged 15 commits into from
Dec 8, 2020
Merged

qs file format as cache file format #304

merged 15 commits into from
Dec 8, 2020

Conversation

gisler
Copy link
Contributor

@gisler gisler commented Nov 23, 2020

Types of change

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Other (documentation etc)

Pull request checklist

  • Add functionality
  • Add tests
  • Update documentation in man
  • Update website documentation

Proposed changes

As shortly discussed in #225, I implemented "qs" as a possible cache file format (https://CRAN.R-project.org/package=qs). The default stays the "RData" file format for now though.

Design considerations:

  • The cache function doesn't require a loaded project to work. I tried to keep it like this. This, however, has the drawback that using the cache function without a loaded project might not consider possible config overrides.
  • Switching cache file format is possible without affecting the other cache file format at all. So if .RData files already exist and the file format is changed to .qs, new .qs files are created upon loading the project. This includes the .hash files. The following list shows the file extensions of the different cache file formats:
    • RData: .RData and .hash
    • qs: .qs and .qs.hash

The latter also implies that clear.cache only deletes the files of the format specified in the config and leaves the other files untouched (might need a test there, but first want to know your opinion about this).

@KentonWhite
Copy link
Owner

Thanks @gisler ! I'll have time over the weekend to review in detail. Please don't think I'm ignoring this...

@gisler
Copy link
Contributor Author

gisler commented Nov 24, 2020

No worries, we all have our duties, but I appreciate your message nonetheless.

@KentonWhite
Copy link
Owner

Thanks for this @gisler, And thanks for writing clear tests as well!

I noticed that you are using ifelse statements for implementing the cashing strategy. This works if we never are going to support other caching formats other than qs. This might not be the case. What do you think about implementing a strategy pattern with a little bit of meta-programming? What I mean is using a pattern like is used for loading different data types. This is a simple list with the cache type prefix followed by a function that implements the caching strategy.

For example, in load,.project.R lines 228-232 could use a data structure like:

cache.stratagies <- list(
  "qs" = "assign(variable, qs::qread(cache.file), .TargetEnv)",
  "R"   = "load(cache.file, envir = .TargetEnv)"
)

Then the code could read:

eval(parse(cache.stratagies[config$cache_file_format]))

Of course I mean this a pseudo-code and it is not guaranteed to work, etc....

What do you think?

@gisler
Copy link
Contributor Author

gisler commented Nov 29, 2020

Hi, you're welcome, glad you like it, @KentonWhite. And yes, I did include tests, although I must say that I more or less only copied and adapted existing tests.

Anyway, while I very much like metaprogramming (I use it quite often in my DTSg package), I don't think it is necessary here. One could either use an if ... else if ... else statement to include more formats in the future or probably more elegant the rather unknown switch() function: https://stat.ethz.ch/R-manual/R-devel/library/base/html/switch.html

It achives about what you suggest, but without the parsing of strings. If you like, I can change the statements so that they make use of this function for future reference.

@KentonWhite
Copy link
Owner

KentonWhite commented Nov 29, 2020 via email

@gisler
Copy link
Contributor Author

gisler commented Nov 29, 2020

Ah, I see. Well, there is one for saving, one for loading and one defining the file extensions. If the config equals the file extension (which it actually does as it is now), we can easily erase the need to intervene there. This is a good idea of yours.

Erasing the rest is a bit more complicated though, as there are two expressions involved and for future formats usually also the requirement to load a package. It should be possible nonetheless using expression(). I'll look into it next week.

@gisler
Copy link
Contributor Author

gisler commented Dec 3, 2020

Hi @KentonWhite,

I committed a rough implementation of your idea (at least I hope so) yesterday. It still requires some polishing and explanatory comments though:

ProjectTemplate/R/cache.R

Lines 171 to 184 in 88343b8

.cache.formats <- list(
RData = expression(
package = "base",
file_ext = "RData",
load_expr = load(cache.file, envir = .TargetEnv),
save_expr = save(list = variable, envir = .TargetEnv, file = cache_filename$obj, ...)
),
qs = expression(
package = "qs",
file_ext = "qs",
load_expr = assign(variable, qs::qread(cache.file), .TargetEnv),
save_expr = qs::qsave(get(variable, envir = .TargetEnv), file = cache_filename$obj, ...)
)
)

So with this, cache file formats could in future be specified by a named list of named expressions. One just has to keep in mind to add potentially required packages to the suggested packages list. One drawback though is that the expressions used for saving and loading depend on objects living in the respective functions only. This requires careful explanation.

@KentonWhite
Copy link
Owner

Thanks @gisler! I'll take a more detailed look over the weekend.

@gisler
Copy link
Contributor Author

gisler commented Dec 3, 2020

Perfect, I finished the polishing and explanations in the meantime.

@gisler
Copy link
Contributor Author

gisler commented Dec 4, 2020

Sorry, had to "kick" in a final commit to this PR. The code should be "cleaner" now.

@gisler
Copy link
Contributor Author

gisler commented Dec 4, 2020

@KentonWhite Oh, before I forget it and since you are preparing a release: https://github.com/KentonWhite/ProjectTemplate/blob/f0f07ec386c70674647b55773cf0974905dd7603/R/project.config.R is missing underscore_variables in its documentation.

I don't want to include a fix in this PR and it's kinda not worth to open a separate Issue or PR for it, but I can do so in case you prefer.

@KentonWhite
Copy link
Owner

Thanks @gisler Looks good! One question and 1 comment before I merge in:

Have you checked that migrate project works as expected? It should and is always nice to confirm 👍

Since I needed to bump the version to 0.9.3 to fix an error with the new R release, can you change the version to 0.10.0 please.

@gisler
Copy link
Contributor Author

gisler commented Dec 8, 2020

Okay, no problem, @KentonWhite. I tested migrating the project and it works, although I wonder why it prints FALSE:

> migrate.project()
Migrating project configuration
[1] FALSE
Your existing project configuration in globals.dcf does not contain up to
date configuration settings in this version 0.10.0 of ProjectTemplate.  They will
be added automatically during migration, but you should review afterward.
Your config.dcf file was missing entries and defaults
have been used.  The missing entries are:
 cache_file_format

And it made me add a suppressWarnings() to .load.config() leading to cleaner messages for users.

@gisler
Copy link
Contributor Author

gisler commented Dec 8, 2020

By the way, I just did a few non-representative benchmarks on my machine:

library(arrow)
library(qs)
library(microbenchmark)
library(data.table)

data(diamonds, package = "ggplot2")
dim(diamonds)

[1] 53940 10

microbenchmark(
  write_parquet(diamonds, "test.parquet", version = "2.0"),
  write_feather(diamonds, "test.feather"),
  qsave(diamonds, "test.qs"),
  save(diamonds, file = "test.RData")
)
Unit: milliseconds
                                                     expr       min        lq      mean    median        uq       max neval
 write_parquet(diamonds, "test.parquet", version = "2.0")  15.31879  15.87638  18.95168  16.46482  18.46235  60.54349   100
                  write_feather(diamonds, "test.feather")  11.07797  12.84449  24.74501  22.12534  30.57151  84.52629   100
                               qsave(diamonds, "test.qs")  25.79260  27.25619  33.19539  27.97628  34.34300  98.84864   100
                      save(diamonds, file = "test.RData") 119.54659 120.99080 122.60779 122.17242 122.56934 190.21427   100
microbenchmark(
  read_parquet("test.parquet"),
  read_feather("test.feather"),
  qread("test.qs"),
  load("test.RData")
)
Unit: milliseconds
                         expr      min       lq      mean   median       uq       max neval
 read_parquet("test.parquet") 3.683474 3.829881  5.178889 3.917530 4.064670  63.03108   100
 read_feather("test.feather") 3.230476 3.459291  4.808098 3.613256 3.767588  62.98305   100
             qread("test.qs") 3.571077 3.724800  9.885440 3.804038 4.074666 247.98589   100
           load("test.RData") 8.170787 8.422276 13.941885 8.486032 8.703510  72.46114   100
n <- 1e6
DT <- data.table(
  integer = 1:n,
  numeric = rnorm(n),
  character = stringi::stri_rand_strings(n, 12)
)

microbenchmark(
  write_parquet(DT, "test.parquet", version = "2.0"),
  write_feather(DT, "test.feather"),
  qsave(DT, "test.qs"),
  save(DT, file = "test.RData"),
  times = 12
)
Unit: milliseconds
                                               expr       min        lq      mean    median        uq       max neval
 write_parquet(DT, "test.parquet", version = "2.0")  699.6421  731.7620  746.7013  751.3195  765.9379  777.2242    12
                  write_feather(DT, "test.feather")  689.9112  719.1069  761.8496  770.7676  798.0216  818.9127    12
                               qsave(DT, "test.qs")  384.6411  392.1583  404.9736  398.8102  420.8910  429.9794    12
                      save(DT, file = "test.RData") 1612.9240 1634.8561 1653.2237 1656.8916 1665.7625 1699.2989    12
microbenchmark(
  read_parquet("test.parquet"),
  read_feather("test.feather"),
  qread("test.qs"),
  load("test.RData"),
  times = 12
)
Unit: milliseconds
                         expr      min       lq     mean   median       uq      max neval
 read_parquet("test.parquet") 240.2281 245.9147 258.9374 247.9435 250.3795 324.1385    12
 read_feather("test.feather") 222.1977 235.2760 325.8992 303.2075 425.3964 487.9860    12
             qread("test.qs") 161.7858 167.5878 213.3876 170.3831 238.4205 426.0841    12
           load("test.RData") 398.0150 401.0278 433.4332 409.4679 475.0042 481.9437    12

I generally used default settings. File sizes are in the case of DT:

  • Parquet: 25.9 MB
  • Feather: 26.7 MB
  • qs: 15.9 MB
  • RData: 19.2 MB

Scaling DT to 1e7 rows looks like this with three repetitions for writing

Unit: seconds
                                               expr       min        lq      mean    median        uq       max neval
 write_parquet(DT, "test.parquet", version = "2.0")  6.813148  6.888617  6.963231  6.964086  7.038272  7.112459     3
                  write_feather(DT, "test.feather")  7.336708  7.373148  7.465723  7.409587  7.530231  7.650874     3
                               qsave(DT, "test.qs")  4.061056  4.065581  4.068474  4.070105  4.072182  4.074259     3
                      save(DT, file = "test.RData") 15.569769 15.626296 15.741904 15.682823 15.827971 15.973119     3

and like this for reading

Unit: seconds
                         expr      min       lq     mean   median       uq      max neval
 read_parquet("test.parquet") 2.521096 2.525684 2.548139 2.530273 2.561661 2.593049     3
 read_feather("test.feather") 2.329032 2.332010 2.342300 2.334988 2.348934 2.362880     3
             qread("test.qs") 1.609163 1.615720 1.838861 1.622277 1.953710 2.285143     3
           load("test.RData") 3.944234 3.982078 4.316626 4.019922 4.502822 4.985722     3

File sizes scaled approximately linearly.

Leaving one column out at a time, changes the timing, but not the ranking and adding a POSIXct column also doesn't change much:

n <- 1e7
DT <- data.table(
  integer = 1:n,
  numeric = rnorm(n),
  character = stringi::stri_rand_strings(n, 12),
  timestamp = as.POSIXct(1:n, origin = "1970-01-01")
)

microbenchmark(
  write_parquet(DT, "test.parquet", version = "2.0"),
  write_feather(DT, "test.feather"),
  qsave(DT, "test.qs"),
  save(DT, file = "test.RData"),
  times = 3
)
Unit: seconds
                                               expr       min        lq      mean    median        uq       max neval
 write_parquet(DT, "test.parquet", version = "2.0")  7.837889  8.659331  8.979386  9.480773  9.550135  9.619496     3
                  write_feather(DT, "test.feather")  8.493490  8.580802  8.629774  8.668114  8.697916  8.727718     3
                               qsave(DT, "test.qs")  4.232069  4.236450  4.296098  4.240831  4.328113  4.415395     3
                      save(DT, file = "test.RData") 21.064341 21.112579 21.158563 21.160817 21.205675 21.250533     3

Maybe it's different with less random data and non-defaults. I only tried out rbinding the diamonds dataset 200 times resulting in more than ten million rows with ten variables:

Unit: seconds
                                                     expr       min        lq      mean    median        uq       max neval
 write_parquet(diamonds, "test.parquet", version = "2.0")  2.989435  3.051021  3.072574  3.112606  3.114143  3.115681     3
                  write_feather(diamonds, "test.feather")  5.718497  5.860217  6.133362  6.001938  6.340794  6.679651     3
                               qsave(diamonds, "test.qs")  1.916572  1.928476  1.961913  1.940380  1.984583  2.028787     3
                      save(diamonds, file = "test.RData") 24.133917 24.273026 24.345237 24.412135 24.450898 24.489660     3

Anyway, I think qs definitely is a very, very good cache file format. Especially since Parquet and Feather are limited to tabular data only as far as I can see.

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 14393)

Matrix products: default

locale:
[1] LC_COLLATE=German_Austria.1252  LC_CTYPE=German_Austria.1252    LC_MONETARY=German_Austria.1252 LC_NUMERIC=C                   
[5] LC_TIME=German_Austria.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] microbenchmark_1.4-7 qs_0.23.4            data.table_1.13.0    arrow_2.0.0         

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5          digest_0.6.25       assertthat_0.2.1    R6_2.4.1            formatR_1.7         magrittr_1.5        RcppParallel_5.0.2 
 [8] stringi_1.4.6       rlang_0.4.7         vctrs_0.3.2         stringfish_0.14.2   RApiSerialize_0.1.0 tools_4.0.2         bit64_4.0.2        
[15] glue_1.4.1          purrr_0.3.4         bit_4.0.4           compiler_4.0.2      tidyselect_1.1.0   

P.S. I know, it's strange to do such tests once the work is done and not before it, but then, I knew qs would beat RData. Beating Parquet never was the goal.

EDIT: P.P.S. Did some more experimenting: The read and write functions from the feather package outperform those from the arrow package and together with the fst format show about the same performance as qs. But this is a very rough estimate.

@KentonWhite
Copy link
Owner

The migrate test looks bizarre. I will try it on an existing project and see if I get the same issues.

@KentonWhite
Copy link
Owner

It looks good! merging after tests have passed.

@KentonWhite KentonWhite merged commit 6cd3015 into KentonWhite:master Dec 8, 2020
@gisler
Copy link
Contributor Author

gisler commented Dec 8, 2020

The migrate test looks bizarre. I will try it on an existing project and see if I get the same issues.

Seems like the function is printing if the cache directory is not existing, which was FALSE in my case, as it was there.

And cool, looking forward to make use of it in production. Hopefully, there are no bugs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants