Add vroom blogpost #308

Merged
merged 11 commits into from
May 7, 2019

Conversation

jimhester
Contributor

No description provided.

@jimhester
Contributor Author

Looking at the output, I think I should do something about how much output is generated for each code block. Either:

  1. Always assign the results to a variable rather than letting it auto-print.
  2. Provide a specification or use `message = FALSE` to suppress the column type message.

Or maybe even both... thoughts?

@jimhester
Contributor Author

@hadley, @jennybc or @batpigandme it would be great if one or more of you could review this!

@batpigandme
Collaborator

On it!

Member

@hadley hadley left a comment

Overall looks good — I don't need to re-review before you publish.

I'm excited to announce that [vroom 1.0.0](http://vroom.r-lib.org) is now
available on CRAN!

vroom reads rectangular data, such as comma separated
Member

I think you can combine these sentences into a paragraph.

file lazily; you only pay for the data you use. This lazy access is done
automatically, so no changes to your R data manipulation code are needed.

vroom also provides efficient multi-threaded writing that is multiple times
Member

I think we need a couple of sentences about vroom vs. readr somewhere, i.e., we're not entirely sure yet, but we'll probably let them evolve separately for a little while, and we plan to unify them in the future. The major downside of vroom is that its laziness means you can't get all problems up front, so unification will require some thought.

Contributor Author

Yeah, you are right; I forgot to include something like this.


Compared to readr, the first difference you may note is you use only one
function to read the files, `vroom()`. This is because `vroom()` guesses the
delimiter of the file automatically (based on the first few lines). This works
Member

Mention this is inspired by data.table

vroom.

```{r}
# Split the flights data by carrier
Member

I'd hide this code, and instead say something like: "Imagine we have a directory containing ..."


It can even read gzipped files from the internet (although currently not the other compressed formats).

## Reading and writing from pipe connections
Member

Personally, I don't think this is important enough to include in the announcement, and including live code in data import makes me nervous.


## Column types

Like readr, vroom guesses the data types of columns as they are read, however sometimes it
Member

Mention improved heuristic (i.e. looks at data throughout file, not just first n rows)


vroom is fast, but how fast?
We benchmarked vroom using a real world dataset of taxi trip data, with
14.7 million rows, 11 columns. It contains a mix of numeric and textual data and has a
Member

Suggested change
- 14.7 million rows, 11 columns. It contains a mix of numeric and textual data and has a
+ 14.7 million rows, 11 columns. It contains a mix of numeric and text data and has a

- Filtering for "UNK" payment, this is 6434 rows (0.0435% of total).
- Aggregation of mean fare amount per payment type.

<style>
Member

Add this to #307 ?


Some things to note in the results. The initial reading is much faster in vroom
than any other method, and most of the manipulations, such as `print()`,
`head()`, `tail()` and `sample()` are equally fast. However because the
Member

So fast you can't see them in the plot

`head()`, `tail()` and `sample()` are equally fast. However because the
character data is read lazily operations such as `filter()` and `aggregrate()`
which need character values require additional time.
However this cost will only occur once, after the values have been read they
Member

So why are both "aggregate" and "filter" quite wide?

(You might also rename "aggregate" to "summarise" because I keep thinking you mean the base function)

jimhester added 3 commits May 7, 2019 09:32
The colored output is kind of beside the point for this, and it makes
some things worse, like the separate blocks and comment highlighting.
Collaborator

@batpigandme batpigandme left a comment

Nicely done! I made a few suggestions, most of which are minor (commas, etc). Let me know if you have any questions, or if you're cool with these changes and want me to just go ahead and make them.

(csv), tab separated (tsv) or fixed width files (fwf) into R.

It performs the
same function as packages like [readr](http://readr.r-lib.org),
Collaborator

It should either be functions (with an s), and packages, or "function like readr::read_csv(), data.table::fread()..." (so the list should either be packages or functions). Other option would be to sub out "function" for "role" in this first instance, since you specifically state that read.csv() is a function.

Collaborator

Also, maybe "similar to" instead of "same as"? (I'm just thinking about the scope of data.table).

Contributor Author

yeah totally right, will change


The main reason vroom can be faster is because character data is read from the
file lazily; you only pay for the data you use. This lazy access is done
automatically, so no changes to your R data manipulation code are needed.
Collaborator

hyphenate data-manipulation here (technically also R, R-data-manipulation, but I think that looks weird)

file lazily; you only pay for the data you use. This lazy access is done
automatically, so no changes to your R data manipulation code are needed.

vroom also provides efficient multi-threaded writing that is multiple times
Collaborator

comma between efficient and multi-threaded

automatically, so no changes to your R data manipulation code are needed.

vroom also provides efficient multi-threaded writing that is multiple times
faster on most inputs than the readr writer.
Collaborator

Maybe readr::write_*() functions?

```

The summary message after reading also differs from readr. We hope this output
gives a more informative indication if the types of your columns are being guessed
Collaborator

"indication as to whether the types"?

## Speed

vroom is fast, but how fast?
We benchmarked vroom using a real world dataset of taxi trip data, with
Collaborator

real-world gets hyphenated here, since it's modifying dataset (technically taxi-trip, too, but dealer's choice there)


vroom is fast, but how fast?
We benchmarked vroom using a real world dataset of taxi trip data, with
14.7 million rows, 11 columns. It contains a mix of numeric and textual data and has a
Collaborator

I'd put a comma after "a mix of numeric and textual data", and has... since you've got a list w/in list situation

Some things to note in the results. The initial reading is much faster in vroom
than any other method, and most of the manipulations, such as `print()`,
`head()`, `tail()` and `sample()` are equally fast. However because the
character data is read lazily operations such as `filter()` and `aggregrate()`
Collaborator

offset "which need character values" here with commas

Collaborator

Also, comma after lazily

`head()`, `tail()` and `sample()` are equally fast. However because the
character data is read lazily operations such as `filter()` and `aggregrate()`
which need character values require additional time.
However this cost will only occur once, after the values have been read they
Collaborator

For this sentence I'd suggest:
However, this cost will only occur once. After the values have been read, they will be stored in memory, and subsequent accesses will be equivalent to other packages.

content/articles/2019-05-vroom-1-0-0.Rmarkdown (Outdated)

vroom reads rectangular data, such as comma separated
(csv), tab separated (tsv) or fixed width files (fwf) into R. It performs
similar roles to functions like [readr::read_csv()](http://readr.r-lib.org),
Collaborator

Whoops — need backticks around readr::read_csv() and data.table::fread()

@jimhester jimhester merged commit 114f179 into tidyverse:master May 7, 2019

3 participants