Add vroom blogpost #308
Looking at the output I think I should do something about how much output is generated for each code block. Either
Or maybe even both... thoughts?
I think it is too cluttered
@hadley, @jennybc or @batpigandme it would be great if one or more of you could review this!
On it!
Overall looks good — I don't need to re-review before you publish.
I'm excited to announce that [vroom 1.0.0](http://vroom.r-lib.org) is now | ||
available on CRAN! | ||
|
||
vroom reads rectangular data, such as comma separated |
I think you can combine these sentences into a paragraph.
file lazily; you only pay for the data you use. This lazy access is done | ||
automatically, so no changes to your R data manipulation code are needed. | ||
|
||
vroom also provides efficient multi-threaded writing that is multiple times |
I think we need a couple of sentences about vroom vs readr somewhere. i.e. we're not entirely sure yet, but we'll probably let them evolve separately for a little bit, but we plan to unite in the future. The major downside of vroom is that the laziness means that you can't get all problems up front, so unification will require some thought.
Yeah you are right, I forgot to include something like this.
|
||
Compared to readr, the first difference you may note is you use only one | ||
function to read the files, `vroom()`. This is because `vroom()` guesses the | ||
delimiter of the file automatically (based on the first few lines). This works |
Mention this is inspired by data.table
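To illustrate the delimiter guessing described in this hunk, here is a minimal sketch. It assumes vroom is installed; the temporary files and the toy data frame are made up for the example and are not part of the blog post.

```r
library(vroom)

# Hypothetical toy data, written once with commas and once with tabs.
df <- data.frame(
  id = 1:3,
  name = c("a", "b", "c"),
  stringsAsFactors = FALSE
)
csv_path <- tempfile(fileext = ".csv")
tsv_path <- tempfile(fileext = ".tsv")
vroom_write(df, csv_path, delim = ",")
vroom_write(df, tsv_path, delim = "\t")

# No delim argument is given: vroom() guesses it from the first lines,
# so the same call handles both files.
from_csv <- vroom(csv_path)
from_tsv <- vroom(tsv_path)
```

Both calls should return a tibble with the same two columns, which is why a single `vroom()` function can stand in for readr's family of `read_*()` functions.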
vroom. | ||
|
||
```{r} | ||
# Split the flights data by carrier |
I'd hide this code, and instead say something like: "Imagine we have a directory containing ..."
|
||
It can even read gzipped files from the internet (although currently not the other compressed formats). | ||
|
||
## Reading and writing from pipe connections |
Personally, I don't think this is important enough to include in the announcement, and including live code in data import makes me nervous.
|
||
## Column types | ||
|
||
Like readr, vroom guesses the data types of columns as they are read, however sometimes it |
Mention improved heuristic (i.e. looks at data throughout file, not just first n rows)
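For readers of this thread, a minimal sketch of what overriding a guessed column type can look like. It assumes vroom is installed; the file contents and column names are invented for illustration.

```r
library(vroom)

# Hypothetical file: `id` looks numeric, but we want to treat it as a label.
path <- tempfile(fileext = ".csv")
writeLines(c("id,fare", "1001,12.5", "1002,7.25"), path)

# Default: types are guessed, so `id` comes back as a numeric column.
guessed <- vroom(path, delim = ",")

# Explicit: name only the columns whose guess you want to override;
# the remaining columns are still guessed.
explicit <- vroom(
  path,
  delim = ",",
  col_types = cols(id = col_character())
)
```

Naming only the columns you want to override keeps the call short while still guarding the columns you care about.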
|
||
vroom is fast, but how fast? | ||
We benchmarked vroom using a real world dataset of taxi trip data, with | ||
14.7 million rows, 11 columns. It contains a mix of numeric and textual data and has a |
Suggested change:
- 14.7 million rows, 11 columns. It contains a mix of numeric and textual data and has a
+ 14.7 million rows, 11 columns. It contains a mix of numeric and text data and has a
- Filtering for "UNK" payment, this is 6434 rows (0.0435% of total). | ||
- Aggregation of mean fare amount per payment type. | ||
|
||
<style> |
Add this to #307 ?
|
||
Some things to note in the results. The initial reading is much faster in vroom | ||
than any other method, and most of the manipulations, such as `print()`, | ||
`head()`, `tail()` and `sample()` are equally fast. However because the |
So fast you can't see them in the plot
`head()`, `tail()` and `sample()` are equally fast. However because the | ||
character data is read lazily operations such as `filter()` and `aggregrate()` | ||
which need character values require additional time. | ||
However this cost will only occur once, after the values have been read they |
So why are both "aggregate" and "filter" quite wide?
(You might also rename "aggregate" to "summarise" because I keep thinking you mean the base function)
They obscure the point I think
The colored output is kind of beside the point for this, and it makes some things worse, like the separate blocks and comment highlighting.
Nicely done! I made a few suggestions, most of which are minor (commas, etc). Let me know if you have any questions, or if you're cool with these changes and want me to just go ahead and make them.
(csv), tab separated (tsv) or fixed width files (fwf) into R. | ||
|
||
It performs the | ||
same function as packages like [readr](http://readr.r-lib.org), |
It should either be functions (with an s), and packages, or "function like `readr::read_csv()`, `data.table::fread()`..." (so the list should either be packages or functions). Other option would be to sub out "function" for "role" in this first instance, since you specifically state that `read.csv()` is a function.
Also, maybe "similar to" instead of "same as"? (I'm just thinking about the scope of data.table).
yeah totally right, will change
|
||
The main reason vroom can be faster is because character data is read from the | ||
file lazily; you only pay for the data you use. This lazy access is done | ||
automatically, so no changes to your R data manipulation code are needed. |
hyphenate data-manipulation here (technically also R, R-data-manipulation, but I think that looks weird)
file lazily; you only pay for the data you use. This lazy access is done | ||
automatically, so no changes to your R data manipulation code are needed. | ||
|
||
vroom also provides efficient multi-threaded writing that is multiple times |
comma between efficient and multi-threaded
automatically, so no changes to your R data manipulation code are needed. | ||
|
||
vroom also provides efficient multi-threaded writing that is multiple times | ||
faster on most inputs than the readr writer. |
Maybe `readr::write_*()` functions?
``` | ||
|
||
The summary message after reading also differs from readr. We hope this output | ||
gives a more informative indication if the types of your columns are being guessed |
"indication as to whether the types"?
## Speed | ||
|
||
vroom is fast, but how fast? | ||
We benchmarked vroom using a real world dataset of taxi trip data, with |
real-world gets hyphenated here, since it's modifying dataset (technically taxi-trip, too, but dealer's choice there)
|
||
vroom is fast, but how fast? | ||
We benchmarked vroom using a real world dataset of taxi trip data, with | ||
14.7 million rows, 11 columns. It contains a mix of numeric and textual data and has a |
I'd put a comma after "a mix of numeric and textual data", and has... since you've got a list w/in list situation
Some things to note in the results. The initial reading is much faster in vroom | ||
than any other method, and most of the manipulations, such as `print()`, | ||
`head()`, `tail()` and `sample()` are equally fast. However because the | ||
character data is read lazily operations such as `filter()` and `aggregrate()` |
offset "which need character values" here with commas
Also, comma after lazily
`head()`, `tail()` and `sample()` are equally fast. However because the | ||
character data is read lazily operations such as `filter()` and `aggregrate()` | ||
which need character values require additional time. | ||
However this cost will only occur once, after the values have been read they |
For this sentence I'd suggest:
However, this cost will only occur once. After the values have been read, they will be stored in memory, and subsequent accesses will be equivalent to other packages.
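One way a reader could check the "cost only occurs once" claim locally is sketched below. It assumes vroom is installed and uses made-up payment data rather than the taxi dataset from the benchmark. Timings depend on the machine, file size and R version, so no particular result is asserted here.

```r
library(vroom)

# Hypothetical stand-in for the benchmark data: one numeric column and
# one character column.
set.seed(1)
n <- 1e5
path <- tempfile(fileext = ".csv")
vroom_write(
  data.frame(
    id = seq_len(n),
    payment = sample(c("CSH", "CRD", "UNK"), n, replace = TRUE),
    stringsAsFactors = FALSE
  ),
  path,
  delim = ","
)

trips <- vroom(
  path,
  delim = ",",
  col_types = cols(id = col_double(), payment = col_character())
)

# Time the same character-column operation twice and compare.
first_access <- system.time(n_unk_first <- sum(trips$payment == "UNK"))
second_access <- system.time(n_unk_second <- sum(trips$payment == "UNK"))

print(first_access)
print(second_access)
```

Comparing the two `system.time()` results on a larger file gives a rough feel for how much of the cost is paid on first access to the character data.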
|
||
vroom reads rectangular data, such as comma separated | ||
(csv), tab separated (tsv) or fixed width files (fwf) into R. It performs | ||
similar roles to functions like [readr::read_csv()](http://readr.r-lib.org), |
Whoops, need backticks around `readr::read_csv()` and `data.table::fread()`
No description provided.