-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathtibbles.Rmd
197 lines (160 loc) · 6.66 KB
/
tibbles.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
---
title: "The Trouble with Tibbles"
author: "Theo Roe"
date: "2 October 2017"
output: html_document
---
<style>
<!--body .main-container {
max-width: 650px;
} -->
.column-left{
float: left;
width: 50%;
text-align: left;
}
.column-center{
display: inline-block;
width: 33%;
text-align: center;
}
.column-right{
float: right;
width: 50%;
text-align: left;
}
.col2 {
columns: 2 200px; /* number of columns and width in pixels*/
-webkit-columns: 2 200px; /* chrome, safari */
-moz-columns: 2 200px; /* firefox */
}
.col3 {
columns: 3 100px;
-webkit-columns: 3 100px;
-moz-columns: 3 100px;
}
.pull-left {
float: left;
width: 20%;
}
.pull-right {
float: right;
width: 20%;
}
</style>
```{r setup,include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
Let's get something straight, there isn't really any trouble with tibbles. I'm hoping you've noticed this is a play on 1967 Star Trek episode, "The Trouble with Tribbles". I've recently got myself a job as a Data Scientist, here, at [Jumping Rivers](https://www.jumpingrivers.com/). Having never come across tibbles until this point, I now find myself using them in nearly every R script I compose. Be that your timeless standard R script, your friendly Shiny app or an analytical Markdown document.
### _What are tibbles?_
Presumably this is why you came here, right?
Tibbles are a modern take on data frames, but crucially they _are_ still data frames. Well, what's the difference then? There's a quote I found somewhere on the internet that decribes the difference quite well;
_"keeping what time has proven to be effective, and throwing out what is not"_.
Basically, some clever people took the classic `data.frame`, shook it til the bad parts fell out, then added some better new features.
### _Precursors_
```{r eval = FALSE}
#Tibbles are a part of the tidyverse package, so the easiest way to get access is to install the whole `tidyverse` package
install.packages("tidyverse")
#alternatively, install only tibble
install.packages("tibble")
```
```{r eval = FALSE}
library("tidyverse")
library("tibble")
```
```{r include = FALSE}
library("tidyverse")
library("tibble")
```
### _Tribblemaking_
There are 3 ways to form a tibble. It pretty much acts as your friendly old pal `data.frame` does. Just like standard data frames, we can create tibbles, coerce objects into tibbles and import data sets into `R` as a tibble. Below is a table of the traditional `data.frame` commands and their respective `tidyverse` commands.
_Formation Type_| _Data Frame Commands_ | *Tibbles Commands*
--------------- | -------------------------------| ---------------------------------------------
_Creation_ | `data.frame` | `data_frame` `tibble` `tribble`
_Coercion_ | `as.data.frame` | `as_data_frame` `as_tibble`
_Importing_ | `read.` family | `read_delim` `read_csv` `read_csv2` `read_tsv`
Examples of the new commands are avaiable when you [click here.](examples.html)
### _Tibbles vs Data Frames_
All talk, no walk so far right? "Where are the tibbles?" I hear you cry. Well, here we go.
```{r}
tibble(a = 1:26, b = letters)
```
The first thing you should notice is the pretty print process. The class of each column is now displayed above it and the dimensions of the tibble are shown at the top. The default print option within tibbles mean they will only display 10 rows if the data frame is larger has more than 20 rows. Neat. Along side that we now only view columns that will fit on the screen. This is already looking quite the part. The row settings can be changed via
```{r}
options(tibble.print_max = 3, tibble.print_min = 1)
```
So now if there is more than 3 rows, we print only 1 row. Tibbles of length 3 and 4 would now print as so.
<!-- default is 20, 10-->
<div class = "col2">
```{r}
tibble(1:3)
```
```{r}
tibble(1:4)
```
</div>
Yes, OK, you could do this with the traditional data frame. But it would be a lot more work, _right_?
As well as the fancy printing, tibbles don't drop the variable type, don't partial match and they allow non-syntactic column names when importing data in. Let's say we have some data in a [csv file](x.csv). It has 3 non-syntactic column names and one column of characters. Reading this is as a tibble and a data frame we get
```{r, include=FALSE}
tib = read_csv("x.csv")
options(tibble.print_max = 10, tibble.print_min = 5)
```
<div class = "col2">
```{r, eval = FALSE}
tib = read_csv("x.csv")
```
```{r}
tib
```
***
```{r}
df = read.csv("x.csv")
```
```{r}
df
```
We see already that in the `read.csv` process we've lost the column names.
</div>
Let's try some partial matching...
<div class = "col2">
```{r}
tib$n
```
```{r}
df$n
```
</div>
With the tibble we get an error, yet with the data frame it leads us straight to our `name` variable. To read more about why partial matching is bad, check out [this thread](http://r.789695.n4.nabble.com/Deprecating-partial-matching-in-data-frame-td4661898.html).
***
What about subsetting? Let's try it out using the data from our csv file.
<div class = "col2">
```{r}
tib[,1]
df[,1]
```
We've returned a 4x1 tibble but only a length 4 vector. Using tibbles, single square brackets, `[`, will always return another tibble. Much neater.
Now for double brackets.
</div>
***
<div class = "col2">
```{r}
tib[[1]]
tib$name
df[[1]]
df$name
```
Double square brackets, `[[`, and the traditional dollar, `$` are ways to access individual columns as vectors. Now, with tibbles, we have seperate operations for data frame operations and single column operations. Now we don't have to use that pesky `drop = FALSE`. Note, these are actually quicker than the `[[` and `$` of the
</div>
`data.frame`, as shown in the [documentation for the tibble package](https://cran.r-project.org/web/packages/tibble/tibble.pdf).
***
At last, no more strings as factors! Upon reading the data in, tibbles recognise _strings as strings_, not factors. For example, with the name column in our data set.
<div class = "col2">
```{r}
class(df$name)
class(tib$name)
```
</div>
I quite like this, it's much easier to turn a vector of characters into factors than vice versa, so why not give me everything as strings? Now I can choose whether or not to convert to factors.
***
So, the majority of the advantages of tibbles can still be done with `data.frame`, but it's bit of a pain. Simple things like checking the dimensions of your data or converting not converting strings to factors are small jobs. Small jobs that take time. With tibbles they take no time. Tibbles force you to look at your data earlier; confront the problems earlier. Ultimately leading to cleaner code.
Thanks for chatting!