---
output:
  html_notebook:
    toc: no
    toc_depth: 3
    toc_float: yes
  html_document:
    toc: no
    toc_depth: 3
    toc_float:
      collapsed: no
---
# Basic Data Structures {.tabset}
<!--
- Drive rest following DSR, supplemented by datacamp (incl "writing functions"), mixing in old notes and xrefing other materials where desired
- Get next files: C:\Users\jhoule\Google Drive\R Programming
- Subsetting will be its own notebook to handle base/frames, dplyr/tibbles, data.tables, matrices and perhaps some SQL or other frameworks
- Revisit Advanced R for this
- Strings - all from tidyverse (later in DSR)
- tibble (later in DSR)
- Factors/forcats (later in DSR)
- Lubridate (later in DSR)
- data.table (later...)
-->
These notes cover some of the basic data structures in R and the tidyverse, augmented over time as my experience has grown. Please see credits for more information.
R uses the following basic structures for its data. Note that R has no 0-dimensional, or scalar types: individual numbers or strings are actually vectors of length one.
| | Homogeneous | Heterogeneous |
|----|---------------|---------------|
| 1d | Atomic vector | List |
| 2d | Matrix | Data frame |
| nd | Array | |
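As a quick check of the "no scalars" point, the sketch below treats a single number as a vector of length one. The variable name is illustrative only.

```{r}
single_number <- 42
length(single_number)             # 1: a "scalar" is a length-one vector
is.vector(single_number)          # TRUE
single_number[1]                  # indexing works as for any vector
identical(single_number, c(42))   # TRUE
```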
## Atomic Vectors {.tabset}
### Basic Types {.tabset .tabset-pills}
#### Logical
Logical operators return a logical result when applied to numeric, integer, complex, or logical values.
```{r Logical}
str(TRUE)
str(F)
str(4 & TRUE)
str(5L | 0)
str(!0L)
typeof(TRUE)
```
#### Integers
Integer literals require the suffix `L`; without it, a number is stored as a double.
```{r Integer}
str(9L)
typeof(9L)
```
#### Doubles
Real numbers. Note that complex numbers are also an atomic type, but aren't covered here.
```{r Double}
str(0.5)
typeof(0.5)
```
#### Characters
Character vectors have both type and class of `character`.
```{r Character}
str("a")
typeof("a")
```
#### (Numerics)
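Numeric is not a *type*. It is the *mode* shared by integers and doubles, which is why `is.numeric()` is `TRUE` for both; it is also the class that `class()` reports for doubles (integers report class `integer`). Somewhat confusingly, `str` on a double returns `num`.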
```{r Numeric}
typeof(0.5)
str(0.5)
class(0.5)
class(9L)
is.numeric(9L)
is.double(9L)
```
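To make the distinction concrete, this small sketch puts `typeof()`, `class()`, and `mode()` side by side for an integer and a double. The variable names are made up for the example.

```{r}
int_val <- 9L
dbl_val <- 0.5
# typeof() reports the storage type; class() and mode() are coarser views
c(typeof = typeof(int_val), class = class(int_val), mode = mode(int_val))
c(typeof = typeof(dbl_val), class = class(dbl_val), mode = mode(dbl_val))
is.numeric(int_val) && is.numeric(dbl_val)   # TRUE: both count as numeric
```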
#### Coercion
Coercion forces every element of a vector to the same type. This often happens automatically with mathematical or logical operators, which try to coerce to an appropriate type. It may also happen behind the scenes when concatenating vectors, in which case the resulting vector takes the most flexible type present. Types from least to most flexible are (as above): logical, integer, double, and character.
```{r Automatic coercion}
str(1 | F)
str(1L + T)
str(1 + T)
str(c("1",seq(0,1,len=4)))
```
Explicit coercion is possible using `as.*`
```{r Explicit coercion}
x <- 0:6
class(x)
as.double(x)
as.logical(x)
as.character(x)
```
If a nonsensical explicit coercion is attempted, `NA` is returned. The `NA` is of the coerced type (since atomic vectors are all of the same type.)
```{r NAs from coercion}
x <- c("a","b","c")
(y <- as.double(x))
typeof(y)
as.logical(x)
as.character(x)
```
`as.numeric()` will coerce to a double.
```{r as.numeric()}
x <- c(FALSE, FALSE, TRUE)
(y <- as.numeric(x))
typeof(y)
```
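One practical use of logical-to-numeric coercion is counting: since `TRUE` becomes 1 and `FALSE` becomes 0, `sum()` and `mean()` give counts and proportions. A minimal sketch with an illustrative vector:

```{r}
passed <- c(TRUE, FALSE, TRUE, TRUE)
sum(passed)    # number of TRUE values: 3
mean(passed)   # proportion of TRUE values: 0.75
```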
#### Missing Values
R uses a few different symbols for missing values. Perhaps the most common is `NA`. This is a true 'missing value' indicator: some value belongs here, but for some reason it is not present. Its type is `logical`, but it can be coerced to any other atomic vector type.
```{r}
x1 <- NA
is.na(x1)
is.nan(x1) #FALSE - NA is not NaN
is.null(x1)
class(x1) #NA values have a class
```
`NaN` ('not a number') is the result of an undefined numeric operation such as `0/0`. (Dividing a nonzero number by zero gives `Inf` or `-Inf` instead.) Notably, `NaN` also meets the definition of `NA`.
```{r}
x2 <- 0/0
typeof(x2)
is.na(x2) #TRUE - NaN is NA
is.nan(x2)
is.null(x2)
```
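`NULL` represents an empty object of length zero, with its own type and class (`NULL`). Because it has no elements, many functions have nothing to evaluate on it and return an empty vector (typed, and not itself `NULL`); when concatenated with `c()`, it simply disappears.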
```{r}
x3 <- NULL
class(x3)
is.na(x3)
is.nan(x3)
is.null(x3)
c(1:3,x3,5:7)
```
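The sketch below illustrates how `NA` takes on the type of its vector, using base R's typed constants `NA_integer_` and `NA_character_`, and how missing values propagate through a calculation unless removed. The `readings` vector is invented for the example.

```{r}
typeof(NA)              # "logical"
typeof(NA_integer_)     # "integer"
typeof(NA_character_)   # "character"
typeof(c(1.5, NA))      # "double": NA is coerced to match the vector
readings <- c(4, NA, 6)
mean(readings)                 # NA: missing values propagate
mean(readings, na.rm = TRUE)   # 5
```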
### Attributes {.tabset .tabset-pills}
#### In general
Most attributes are lost when modifying a vector; for more coverage, see [the Advanced R section on attributes](http://adv-r.had.co.nz/Data-structures.html#attributes).
Three attributes are retained, and these have special accessor functions to get and set values:
- Names, `names()`
- Dimensions, `dim()`
- Class, `class()`
`attributes()` can be used to view all of a vector's attributes (as a list). `attr()` can also take an argument of the attribute name to return the attribute itself. This method can (but should not) be used to set the attribute value as well.
```{r attr()}
x <- c(foo = 1, bar =2)
attributes(x)
attr(x, "names")
attr(x, "dim")
attr(x, "class")
```
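As a small illustration of attributes being lost, the sketch below attaches a custom attribute and then subsets and summarizes the vector. The attribute name `"units"` is arbitrary and chosen only for this example.

```{r}
tagged <- c(first = 1, second = 2)
attr(tagged, "units") <- "cm"   # "units" is a made-up attribute name
attributes(tagged)              # names and units
attributes(tagged[1])           # subsetting keeps names but drops units
attributes(sum(tagged))         # NULL: nothing survives the summary
```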
#### `names()`
You can name a vector in three ways:
- When creating it: `x <- c(a = 1, b = 2, c = 3)`. (Note that quotes are not needed for single words, but must be used for any names with spaces included.)
- By modifying an existing vector in place using `names()`: `x <- 1:3; names(x) <- c("a", "b", "c")`.
- By creating a modified copy of a vector: `x <- setNames(1:3, c("a", "b", "c"))`.
Names should be unique to best serve their purposes, but this is not required by the language.
Not all elements of a vector need to have a name. If some names are missing, `names()` will return an empty string for those elements. If all names are missing, `names()` will return `NULL`.
```{r names()}
y <- c(a = 1, 2, 3)
names(y)
z <- c(1, 2, 3)
names(z)
```
You can create a new vector without names using `unname(x)`, or remove names in place with `names(x) <- NULL`.
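A short sketch of the naming helpers in use, with invented data: `setNames()` builds a named copy, names allow lookup by label, and `unname()` returns a bare copy without touching the original.

```{r}
prices <- setNames(c(1.5, 0.25, 3), c("apple", "banana", "cherry"))
prices["banana"]         # look up an element by name
names(prices)
bare <- unname(prices)   # copy without names; prices itself is unchanged
names(bare)              # NULL
```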
#### `dim()`
Adding a `dim` attribute to an atomic vector (using `dim()`) modifies it to behave like a two-dimensional matrix or multi-dimensional array.
```{r dim()}
x <- 1:8
dim(x) <- c(2,4)
print(x)
dim(x) <- c(2,2,2)
print(x)
```
The `dim` vector specifies rows, then columns, and so on into higher dimensions. This is also the order in which values are filled in. Note that the product of the new dimensions **must** equal the number of vector elements exactly.
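The fill order and the length requirement can be checked directly. This is a minimal sketch with illustrative names; `tryCatch()` is used only so the deliberate error does not stop the notebook.

```{r}
grid <- 1:6
dim(grid) <- c(2, 3)
grid[, 1]   # 1 2: the first column is filled first (column-major order)
grid[1, ]   # 1 3 5
# A dim whose product differs from the length is rejected with an error
bad_dim <- tryCatch({ dim(grid) <- c(4, 2); "ok" }, error = function(e) "error")
bad_dim     # "error"
```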
#### `class()`
This is used to distinguish multi-dimensional types, since matrices and arrays are built on top of vectors, and data.frames are built on top of lists.
The actual implementation is rather complicated and not important outside of an OO context.
### Creating Vectors {.tabset .tabset-pills}
#### Vector creation basics
`vector()` creates an 'empty' vector using the empty value of whatever type is supplied.
```{r Creating empty vectors}
vector("numeric", length=10)
vector("character", length=10)
```
The `c()` command creates a vector from objects of the same class, or concatenates other vectors together:
```{r c() command}
# Character vector
c("a","b","c")
c(x, 0, x)
```
The `:` command can be used as a shortcut to create sequences (subject to order of operations).
```{r Vector creation shortcuts}
29:23
21:29-1
21:(29-1)
```
#### `seq()`
`seq()` and `rep()` can also be used to quickly create long vectors that can follow increasingly complex patterns. `seq()` is typically used as one of the following:
`seq(from, to)` generates a simple sequence, identical to `from:to`
```{r}
seq(8,3)
```
`seq(from, to, by = )` generates the same, but provides the step (note that the step sign must be correct). Because of this, the endpoint may not equal (and if so will fall short of) `to`.
```{r}
seq(8,3,-2)
```
`seq(from, to, length.out = )` matches the `from` and `to` values and fills in equally-spaced values to create a vector of length `length.out`. The argument name can be shortened to `len` or `length`.
```{r}
seq(8,3,length.out = 5)
```
`seq(along.with = )`, `seq(from)`, and `seq(length.out = )` will all create a sequence from 1 to the number (or length of vector) supplied.
```{r}
seq(along.with = 8:3)
seq(8:3)
seq(length.out = 8)
```
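One caveat worth a sketch: `seq(x)` and `1:length(x)` can surprise when `x` is empty or has length one. Base R's `seq_along()` and `seq_len()` are not covered above, but they are often suggested for this reason. The variable name is illustrative.

```{r}
empty <- numeric(0)
1:length(empty)    # 1 0: counts down, usually not what a loop wants
seq_along(empty)   # integer(0)
seq_len(0)         # integer(0)
seq(5)             # 1 2 3 4 5: a single number is treated as an endpoint
seq_along(5)       # 1: always one index per element
```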
`seq(from, by =, along.with =)` can be used to create a vector `along` another vector that starts at `from` and counts `by` a step value. (Note that using `by` as a vector is possible but introduces very confusing element-wise computations.)
```{r}
seq(10, by = 3, along = 1:4)
```
#### `rep()`
`rep()` effectively takes the following arguments:
`times` is the default second argument but is effective *after* `each` is applied. If `times` is a single number (note that computed values are rounded down), the sequence will be repeated that number of times. If `times` is a vector (it must be the same length as `x` *after* applying `each`, if any), each element is replicated elementwise per the `times` vector.
```{r rep()}
rep(1:3, 2)
rep(1:3, c(2,1,5))
```
`each` can be used to repeat each element in place a certain number of times (before `times` is applied).
```{r}
rep(1:3, each = 2)
rep(3:4, 2, each = 2)
```
`length.out` defines the final length and truncates or extends the repetition accordingly.
```{r}
rep(1:3, length.out = 8)
rep(1:3, each = 2, length.out = 5)
```
#### Vectorization
Many R operations are vectorized: given vectors as input, they return a vector of element-by-element results.
```{r Vectorized operations}
x <- 1:4; y <- 6:9
x+y
x>2
x>=2
y==8
y!=8
x*y
x/y
```
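Be careful to distinguish the logical operators. The single operators (`&`, `|`) work elementwise and return a vector. The double operators (`&&`, `||`) are meant for single conditions, as in `if` statements: they expect length-one operands and short-circuit. Passing longer vectors to them is an error as of R 4.3.0 (older versions silently used only the first elements). To summarize a whole logical vector, use `all()` or `any()`.
```{r Logical operations on vectors}
x <- c(T,F,F,T); y <- c(F,T,F,T)
x & y
x[1] && y[1]   # length-one operands only
x | y
x[1] || y[1]
all(x | y)     # summarize an elementwise result
```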
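When a single `TRUE`/`FALSE` is needed from a logical vector, for instance in an `if()` condition, `any()` and `all()` collapse it to length one. A minimal sketch with made-up names:

```{r}
flags <- c(TRUE, FALSE, TRUE)
any(flags)   # TRUE: at least one element is TRUE
all(flags)   # FALSE: not every element is TRUE
# Both operands of && are length one here, so this is a valid condition
flag_state <- if (any(flags) && !all(flags)) "mixed" else "uniform"
flag_state   # "mixed"
```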
If vectors of different lengths are passed, the shorter vector(s) are recycled (repeated from the start) to match the longest.
```{r Looping vectors}
x <- paste(c("X","Y"),1:4,sep="."); x
y <- paste(c("X","Y"),1:4,6:10,sep="."); y
```
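The same recycling applies to arithmetic and comparisons. In the sketch below, `suppressWarnings()` is used only to keep the output clean; without it, the uneven case emits a warning that the longer length is not a multiple of the shorter.

```{r}
1:6 + c(10, 20)   # 11 22 13 24 15 26: the shorter vector is reused three times
1:6 > 3           # the single value 3 is recycled against every element
# Uneven lengths are still recycled, but R warns about it
uneven <- suppressWarnings(1:5 + c(10, 20))
uneven            # 11 22 13 24 15
```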
### More On Strings {.tabset .tabset-pills}
**This is a work in progress. Need to build out stringr for tidyverse.**
Start [here](http://r4ds.had.co.nz/strings.html), and check against [stringr](http://stringr.tidyverse.org/) for depth if needed.
#### Base R
The base package includes functions for string manipulations. `paste()` converts vectors to characters and concatenates them element-wise, each separated by a `sep` sequence, defaulting to `' '`. The output vector itself can be concatenated together by specifying a separating `collapse` character. (`paste0()` does the same but with `sep=''`). `strsplit()` does the opposite and will split a vector into a list of vectors on a specified split character.
```{r paste(), tolower(), strsplit()}
x <- paste(c("X","Y"),1:4,sep=".")
tolower(x)
paste(x,collapse="|")
strsplit(x,"\\.")
```
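Other useful base functions include `tolower()` and `chartr()` for case folding and character translation, `substr()` for extracting substrings, and `print()` and `cat()` for output.
See `?chartr` for further base functions on character translation and casefolding.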
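A brief sketch of a few of these base functions on one illustrative string (`toupper()` is the counterpart of `tolower()`):

```{r}
label <- "Hello World"
toupper(label)              # "HELLO WORLD"
substr(label, 1, 5)         # "Hello": characters 1 through 5
chartr("lo", "01", label)   # "He001 W1r0d": l becomes 0, o becomes 1
cat(label, "\n")            # prints without quotes or an index prefix
```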
#### Regular Expressions
Regular expressions (regex) are sequences of characters used for search patterns for matching strings. Basics are covered here, [more information on regular expressions is here](./regex.nb.html).
#### Regex patterns
**Put these in a table**
Basic regex patterns are:
- `^a`: Starts with `a`
- `a$`: Ends with `a`
- `a|i`: `a` or `i`
- `[a-z]`: Any lowercase letter
- `[A-Za-z]`: Any uppercase or lowercase letter
- `[0-9]`: Any digit
Wildcards and repeats
- `.`: wildcard for any character
- `.*`: repeats the wildcard zero or more times
Escapes
- `\\.`: Escapes and matches '.'
- `\\s`: Matches any whitespace character (space, tab, newline)
Groups
- `([a-z])`: Captures the text matched inside the parentheses as a group
- `\\1`: 'Backreference' used to later refer to the captured group, in order
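The patterns listed above can be tried out with `grepl()` and `sub()`. This is a small sketch on an invented vector of strings, not an exhaustive test of each pattern.

```{r}
words <- c("apple", "banana", "kiwi", "Fig 2", "a.b")
grepl("^a", words)      # starts with "a":        TRUE FALSE FALSE FALSE TRUE
grepl("a$", words)      # ends with "a":          FALSE TRUE FALSE FALSE FALSE
grepl("[0-9]", words)   # contains a digit:       FALSE FALSE FALSE TRUE FALSE
grepl("\\.", words)     # contains a literal dot: FALSE FALSE FALSE FALSE TRUE
grepl("\\s", words)     # contains whitespace:    FALSE FALSE FALSE TRUE FALSE
# Capture one lowercase letter, then require the same letter again via \\1
sub("([a-z])\\1", "<\\1\\1>", words)   # only "apple" changes, to "a<pp>le"
```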
#### Search and replacement
`grep` comes from Unix and stands for **g**lobal search **r**egular **e**xpression **p**rint. `grepl` produces a Boolean for matching a regex pattern in a character vector. `grep` returns the indices.
```{r grep() and grepl()}
kids <- c("Greg", "Jan", "Cindy", "Marcia", "Bobby", "Peter")
grepl("a",kids)
grep("e",kids)
```
`sub` and `gsub` are used for replacement. A regex `pattern` is replaced with a `replacement` value in `x`, a vector of strings to check against. `sub` only replaces the first instance of `pattern` found, while `gsub` works on all instances.
```{r sub() and gsub()}
kids <- c("Greg", "Jan", "Cindy", "Marcia", "Bobby", "Peter")
sub(patt="a|e|i|o|u|y",repl="a",x=kids)
gsub(patt="a|e|i|o|u|y",repl="i",x=kids)
rpt <- c("Writing 5 reports tonight.",
"9 news is running a report...",
"We'll report on 5 findings.",
"3 reports issued earlier.")
sub(".*([0-9]+)\\sreport.*$", "\\1", rpt)
```
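As a further sketch of capture groups in replacements, the example below swaps "Surname, Given" into "Given Surname" and contrasts `sub()` with `gsub()` on a single string. The names are invented for illustration.

```{r}
full_names <- c("Brady, Mike", "Nelson, Alice")
# Two capture groups: \\1 is the surname, \\2 the given name
swapped <- sub("^([A-Za-z]+), ([A-Za-z]+)$", "\\2 \\1", full_names)
swapped                    # "Mike Brady" "Alice Nelson"
# sub() stops after the first match; gsub() replaces every match
sub("a", "_", "banana")    # "b_nana"
gsub("a", "_", "banana")   # "b_n_n_"
```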
#### String Distance: Fuzzy matching
Use `adist()` to calculate approximate string distances for fuzzy matching (many other parameters can be fed to `adist` to customize its evaluations):
```{r adist()}
codes <- c("male","female")
gender <- c("M","male","Female","fem.")
D <- adist(gender, codes)
colnames(D) <- codes; rownames(D) <- gender; D
# Carry out the best match
i <- apply(D, 1, which.min) # Applies which.min row-wise on D
data.frame(rawtext=gender,coded=codes[i])
```
Alternatively, `agrep()` allows one to use distance in search. Compare the distance between a pattern and a vector to be searched for matches. (In computing distance, the vector elements will be searched for **substring** matches, not whole elements - this makes it unsuitable for e.g. the task above.) The search can be customized by providing costs or other parameters, often in a list to the argument `max.distance`. Other options like `value` and `ignore.case` help control the search and output value. Finally, `agrepl()` performs the same search, but returns a Boolean instead.
```{r agrep()}
agrep("lasy", "1 lazy 2")
agrep("lasy", c(" 1 lazy 2", "1 lasy 2"), max = list(sub = 0))
agrep("laysy", c("1 lazy", "1", "1 LAZY"), max = 2)
agrep("laysy", c("1 lazy", "1", "1 LAZY"), max = 2, value = TRUE)
agrep("laysy", c("1 lazy", "1", "1 LAZY"), max = 2, ignore.case = TRUE)
agrepl("laysy", c("1 lazy", "1", "1 LAZY"), max = 2, ignore.case = TRUE)
```
See also the `stringdist` package for more string distance metrics.
#### Other string packages
**Really need to build out stringr here: tidyverse**
Other string packages are also useful for basic string manipulation. More in-depth use is explored elsewhere.
```{r Basic stringr use}
library(stringr)
str_trim(" hello world ")
str_trim(" hello world ", side = "left")
str_trim(" hello world ", side = "right")
str_pad(112, width = 6, side = "left", pad = "0") # pad is given as a single-character string
```
### Factors {.tabset .tabset-pills}
#### Overview
**Work in progress - need to build out forcats. How to summarize - count, table? Get from DSR**
Hadley addresses factors both in [Advanced R](http://adv-r.had.co.nz/Data-structures.html#attributes) as well as using `forcats` in [R for Data Science](http://r4ds.had.co.nz/factors.html).
Factors (and how they are set up) are important for many built-in statistics methods and functions. These are essentially *categories used to label observations* in a set of data. Factors are actually built on integer vectors, with the special attribute `levels` that assigns a string to each distinct category (integer).
Factors can be accidentally introduced in data loading. In R versions before 4.0.0, character data was read in as factors by default; this is controlled by the argument `stringsAsFactors`, which defaulted to `TRUE` then and defaults to `FALSE` since R 4.0.0. Sometimes numeric data will also be read in as a factor (or as character), notably if there is a non-numeric value in the column, such as `-` to indicate missing values; this can often be controlled using an argument such as `na.strings` to the reading function.
**Warning:**
While factors look like character vectors, they are actually integers underneath. Different functions will handle them differently - either coercing to strings, throwing errors, or handling them as integers. **It’s usually best to explicitly convert factors to character vectors if you need string-like behaviour.**
```{r factor summary}
x <- c(rep(c("lo","hi"),each=4),"mid")
table(x)
```
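The warning above matters most for numbers stored as factors. A minimal sketch, using invented data, of what goes wrong and the usual workaround of converting through character first:

```{r}
readings_fac <- factor(c("10", "20", "10", "5"))
levels(readings_fac)                     # "10" "20" "5": sorted as strings
as.integer(readings_fac)                 # 1 2 1 3: the underlying codes, not the values
as.numeric(as.character(readings_fac))   # 10 20 10 5: the original values
```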
#### Factor Creation
`factor()` can encode a vector as a factor variable. The `levels` argument allows specification of the factors and their order; if this is not supplied, levels are assigned that cover all occurring values, ordered by the system (usually alphabetically). (Select values can be excluded by providing a vector to `exclude=`, and the factor levels can be re-ordered based on the vector using `forcats::fct_inorder`.) If levels are supplied, any values in the vector but not in the levels will become `<NA>` (which is excluded as a factor level by default; this can be changed, but should rarely be done). Factors can also be sorted on the order of their levels.
```{r factor() part 1}
x <- c(rep(c("lo","hi"),each=2),"mid")
table(x)
(f <- factor(x))
sort(f)
library(forcats)
fct_inorder(f)
# Note definition of "med" and not "mid"
(y <- factor(x, levels = c("hi","med","lo")))
# This operation is the same as factor()
y[, drop = TRUE]
# Combining factors: R >= 4.1.0 returns a factor with the union of the levels;
# earlier versions returned only the underlying integer codes
c(f,y)
typeof(c(f,y))
```
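The interaction between supplied levels and observed values can be seen in a small sketch with made-up survey answers: a value outside the levels becomes `NA`, while a level that never occurs is kept with a zero count until it is dropped.

```{r}
answers <- c("yes", "no", "yes", "maybe")
answers_fac <- factor(answers, levels = c("no", "yes", "never"))
answers_fac                          # "maybe" is not a level, so it becomes <NA>
table(answers_fac)                   # "never" is kept as a level with a count of 0
levels(answers_fac[, drop = TRUE])   # the unused level "never" is dropped
```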
The `labels` argument allows for relabeling the levels when creating a factor, using either a character vector (in the same order as the levels) or a single value, to which a sequence number is appended based on the level order.
```{r factor() part 2}
(factor(x, labels = c("high","low","medium")))
(factor(x, labels = "temp"))
```
Finally, for creating simple patterned factors, `gl()` may be used. Its arguments are:
- `n`, the number of levels
- `k`, the number of replications of each level
- `length`, the final length of the result (can be used to cause the output to loop)
- `labels`, optional labels to be applied (default is sequential integers)
- `ordered`, whether the factor is ordered
```{r gl() for patterned factors}
gl(2, 4, labels = c("Control", "Treat")) # First control, then treatment:
gl(2, 1, 8) # 8 alternating 1s and 2s
gl(2, 2, 8) # alternating pairs of 1s and 2s
```
#### Manipulating Levels
A factor has attributes `class = "factor"` and `levels`, a character vector. `levels()` can be used to retrieve this vector or to relabel level values, either for the whole set (either with a vector ordered as the existing levels, or with a named list) or for a defined subset of levels.
```{r levels() part 1}
attributes(f)
levels(f)
attr(f, "levels") # Same as previous
(levels(f) <- c("high", "low", "medium"))
(levels(f)[3] <- "middle")
(levels(f) <- list("C" = "low", "A" = "high", "B" = "middle"))
```
Additional (empty) levels can also be defined with this method, and in a similar manner as relabeling, factor levels can be combined (and the total number of levels reduced). Use of lists for relabeling and/or combining factor levels will also reorder the new set of labels based on the list order.
Finally, `nlevels()` produces a count of the number of levels. `relevel(x, ref)` can be used to move a particular `ref` level to the first position, and moves the other levels down in order, which can be useful for statistical methods.
```{r levels() part 2}
(levels(f) <- c("C","A","B","D"))
(levels(f) <- list("vowel" = "A",
"consonant" = c("B","C","D")))
nlevels(f)
relevel(f,"consonant")
```
#### Factor Order
Ordered factors differ from factors only in their class, but methods and the model-fitting functions treat the two classes quite differently. Providing the argument `ordered = TRUE` to `factor()` makes a factor ordered, and this can be checked with `is.ordered()` or coerced on a non-ordered factor with `as.ordered()`. Ordered factor level values may be compared with inequalities, where unordered factor levels may not.
```{r ordered()}
(fac <- factor(x))
(ord <- factor(x, ordered = TRUE))
class(fac)
class(ord)
is.ordered(fac)
is.ordered(ord)
as.ordered(fac)
fac[1] > fac[3]
ord[1] > ord[3]
```
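With the default alphabetical levels, the ordering rarely reflects the intended ranking. The sketch below supplies the levels explicitly (the data are illustrative) so that comparisons, `max()`, and `sort()` follow the intended order.

```{r}
sizes <- factor(c("lo", "hi", "mid", "lo"),
                levels = c("lo", "mid", "hi"), ordered = TRUE)
sizes          # Levels: lo < mid < hi
sizes < "hi"   # TRUE FALSE TRUE TRUE
max(sizes)     # hi
sort(sizes)    # lo lo mid hi
```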
`reorder(x, X, FUN)` can reorder a factor `x` (ordered or not) based on some function `FUN` that takes a vector and returns a scalar. Based on the levels of `x`, `reorder` takes each subset of `X` and then returns `x` but with the levels ordered in increasing value of the scalar returned by `FUN` on that level's subset in `X`. This can be helpful for data visualization. Modify `median` in the code below to choose different statistics for ordering the data:
```{r reorder()}
bymedian <- with(InsectSprays, reorder(spray, count, median))
boxplot(count ~ bymedian, data = InsectSprays,
xlab = "Type of spray", ylab = "Insect count",
main = "InsectSprays data", varwidth = TRUE,
col = "lightgray")
```
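To see what `reorder()` does without a plot, this sketch uses a tiny invented dataset where the group medians are easy to compute by hand (a = 6, b = 1.5, c = 3.5).

```{r}
trial_group <- factor(c("a", "a", "b", "b", "c", "c"))
trial_value <- c(5, 7, 1, 2, 3, 4)
tapply(trial_value, trial_group, median)   # a = 6, b = 1.5, c = 3.5
by_median <- reorder(trial_group, trial_value, median)
levels(by_median)                          # "b" "c" "a": increasing median
```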
### Times and Dates?
**Work in progress - really need to build out lubridate here: tidyverse**
**zoo and xts tutorial on DataCamp**
Basic date control is covered below. Three other packages to review for creating and manipulating dates and times:
- [`lubridate`](https://cran.r-project.org/web/packages/lubridate/vignettes/lubridate.html) is part of the tidyverse and simplifies parsing and extracting information from date and time objects, as well as manipulating time intervals.
- [`zoo`](https://cran.r-project.org/web/packages/zoo/index.html) is focused on manipulating time series
- [`xts`](https://cran.r-project.org/web/packages/xts/index.html) extends `zoo` for more uniform handling
Date (`Date`) and time (`POSIXct`) objects can be instantiated from the internal system time.
```{r System date and time}
today <- Sys.Date()
class(today)
now <- Sys.time()
class(now)
```
Date objects may also be coerced using `as.Date()`, using a standard format such as `%Y-%m-%d`, or by setting the `format` explicitly. Similarly, time objects may be coerced using `as.POSIXct()`. Date or time objects may be displayed as strings in a particular format using `format()`.
- `%Y`: 4-digit year (1982)
- `%y`: 2-digit year (82)
- `%m`: 2-digit month (01)
- `%d`: 2-digit day of the month (13)
- `%A`: weekday (Wednesday)
- `%a`: abbreviated weekday (Wed)
- `%B`: month (January)
- `%b`: abbreviated month (Jan)
- `%H`: hours as a decimal number (00-23)
- `%I`: hours as a decimal number (01-12)
- `%M`: minutes as a decimal number
- `%S`: seconds as a decimal number
- `%T`: shorthand notation for the typical format %H:%M:%S
- `%p`: AM/PM indicator
```{r Entering dates and times}
(my_date <- as.Date("1999-12-31"))
class(my_date)
(sec_date <- as.Date("1971-14-05", format = "%Y-%d-%m"))
(my_time <- as.POSIXct("1971-05-14 11:25:15"))
format(my_time, "%b %d, %Y")
format(my_time, "%I:%M %p")
```
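A short round-trip sketch of the format codes: parsing a non-standard date string, formatting it back out, and what happens when the string does not match the format. Only numeric codes are used here so the output does not depend on the locale.

```{r}
parsed <- as.Date("31/12/1999", format = "%d/%m/%Y")
parsed                       # "1999-12-31"
format(parsed, "%Y")         # "1999"
format(parsed, "%d.%m.%y")   # "31.12.99"
# A string that does not fit the format yields NA rather than an error
as.Date("1999-12-31", format = "%d/%m/%Y")
```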
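Dates and times support arithmetic. Adding a number to a date (in days) or to a time (in seconds) returns a new date or time; subtracting two dates or two times returns a `difftime` object.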
```{r Manipulating dates and times}
(my_date + 1)
my_date - sec_date
(my_time + 1)
now - my_time
```
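The classes involved in date arithmetic can be checked on fixed dates, which keeps the results reproducible. This is a small sketch with illustrative names; `difftime()` is the base function that lets the units be chosen explicitly.

```{r}
start_day <- as.Date("2000-01-01")
end_day   <- as.Date("2000-01-10")
class(start_day + 7)                  # "Date": adding a number shifts the date
elapsed <- end_day - start_day
class(elapsed)                        # "difftime"
as.numeric(elapsed, units = "days")   # 9
as.numeric(difftime(end_day, start_day, units = "hours"))   # 216
```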
## Lists {.tabset}
Unlike atomic vectors, lists can contain elements of different types, including vectors or other lists.
```{r list()}
z <- list(a = 1:3,b = "a",TRUE,1+4i,g = list(a = 1:4,b = "5"))
str(z)
length(z)
typeof(z)
```
Lists can be concatenated into a single list with `c()`. (Conversely, `list()` will keep each entry as its own list element in a new higher-level list.) If given a combination of atomic vectors and lists, `c()` will coerce the vectors to lists before concatenating them. Compare the results of `list()` and `c()`:
```{r concatenating lists}
x <- list(list(1, 2), c(3, 4))
y <- c(list(1, 2), c(3, 4))
str(x)
str(y)
```
Lists are used to build other data structures, such as data frames and many function returns. Therefore `is.list()` is more expansive than may be desired. Use `as.list()` to coerce to a list, or `unlist()` to turn a list into a vector, with the same coercion rules as `c()`. Names are also translated as depicted.
```{r unlist()}
is.list(mtcars)
str(as.list(1:3))
unlist(z) #From above
```
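A smaller sketch of `unlist()` makes the coercion and naming rules easier to follow than the nested example. The list here is invented for illustration.

```{r}
mixed <- list(a = 1:2, b = "x", c = list(d = TRUE))
flat <- unlist(mixed)
flat           # every element is coerced to character
names(flat)    # "a1" "a2" "b" "c.d"
typeof(flat)   # "character"
```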
## Matrices and Arrays {.tabset}
### Creation
If needed, more thorough treatment is available in the [R intro manual chapter on arrays and matrices](https://cran.r-project.org/doc/manuals/r-release/R-intro.html#Arrays-and-matrices).
Matrices and arrays are vectors with a `dim` attribute (an integer vector); matrices have two dimensions (`dim` length 2), while arrays may have any number. `matrix()` and `array()` may be used to construct them. (Vectors and lists can also be turned into matrices or arrays by setting `dim()`.)
```{r}
(m <- matrix(1:6, nrow = 2, ncol =3)) # Don't need to specify both of these
attributes(m)
(n <- array(1:16,c(2,4,2)))
```
`as.matrix()` and `as.array()` also make it easy to turn an existing vector into a one-column matrix or a 1d array.
```{r}
v <- 1:3
as.matrix(v)
as.array(v)
```
A `data.frame` can be converted to a matrix using `data.matrix()`.
```{r}
df <- data.frame(a=1:4,b=8:5)
str(df)
str(data.matrix(df))
```
As seen, matrices are constructed column-wise, unless specified using `byrow = TRUE`.
```{r}
(n <- matrix(1:6, nc = 3, byrow = T))
```
`c()` generalises to `cbind()` and `rbind()` for matrices, and to `abind()` (provided by the `abind` package) for arrays.
```{r}
x <- 1:3
y <- 10:12
cbind(x,y)
rbind(x,y)
```
You can transpose a matrix with `t()`; the generalised equivalent for arrays is `aperm()`.
```{r}
t(n)
```
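Since `aperm()` is mentioned without an example, here is a minimal sketch of how it permutes dimensions, using an illustrative 2 x 3 x 4 array.

```{r}
cube <- array(1:24, dim = c(2, 3, 4))
dim(aperm(cube))               # 4 3 2: the default reverses the dimensions
dim(aperm(cube, c(2, 1, 3)))   # 3 2 4: swap only the first two
# For a matrix, the default permutation gives the same values as t()
small <- matrix(1:6, nrow = 2)
all(aperm(small) == t(small))  # TRUE
```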
### Basic Properties
You can test if an object is a matrix or array using `is.matrix()` and `is.array()`, or by looking at the length of the `dim()`.
```{r}
# Same matrices as used on other tabs
m <- matrix(1:6, nrow = 2, ncol =3, dimnames = list(c("A","B"),c("a","b","c")))
n <- array(1:16, dim = c(2,4,2))
is.matrix(m)
is.array(m)
is.array(n)
is.matrix(n)
```
As with vectors, arithmetic on matrices is vectorized element by element.
```{r}
x <- matrix(1:4, 2, 2); y <- matrix(rep(10,4),2,2)
x*y
x/y
```
True matrix multiplication requires `%*%`.
```{r}
x%*%y
```
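The difference between `*` and `%*%` is easier to see with a matrix whose effect is known in advance. In this sketch (names are illustrative), `right` is a permutation matrix, so the product swaps the columns of `left`.

```{r}
left  <- matrix(1:4, nrow = 2)             # columns are (1, 2) and (3, 4)
right <- matrix(c(0, 1, 1, 0), nrow = 2)   # permutation matrix
left * right      # elementwise: rows are (0, 3) and (2, 0)
left %*% right    # matrix product: rows are (3, 1) and (4, 2)
# Entry [1, 2] of the product is row 1 of left times column 2 of right
sum(left[1, ] * right[, 2])   # 1
```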
### Attributes
Attributes can be set by `matrix()` or `array()` at creation, or assigned to by name afterward.
```{r}
# Same matrices as used on other tabs
m <- matrix(1:6, nrow = 2, ncol =3, dimnames = list(c("A","B"),c("a","b","c")))
n <- array(1:16, dim = c(2,4,2))
```
`length()` generalizes to `nrow()` and `ncol()` for matrices, and simply `dim()` for arrays.
```{r}
length(m)
nrow(m)
ncol(m)
dim(n)
```
`names()` generalizes to `rownames()` and `colnames()` for matrices, and to `dimnames()`, a list of character vectors, for arrays. (Note that `rownames` and `colnames` are **not** arguments to `matrix()`.)
```{r}
colnames(m)
dimnames(n) <- list(c("one", "two"), c("a", "b", "c","d"), c("A", "B"))
n
```
## Data Frames {.tabset}
### data.frame {.tabset .tabset-pills}
#### Intro
A data frame is a special type of list in which each element is a vector of equal length. Thus each element (vector) is a column (each with its own data type), and the common vector length is the number of rows.
Notably, this means that the 'primary' dimension of a data.frame is its columns, which are the list elements. Therefore, `names()` is the same as `colnames()` and, perhaps counterintuitively, `length()` is `ncol()`.
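These list-like properties can be verified on a small data frame. The data below are invented for the sketch.

```{r}
pets <- data.frame(name = c("Rex", "Tom"), legs = c(4, 4), indoor = c(FALSE, TRUE))
is.list(pets)                            # TRUE
length(pets) == ncol(pets)               # TRUE: three list elements, three columns
identical(names(pets), colnames(pets))   # TRUE
sapply(pets, typeof)                     # each column keeps its own type
pets[["legs"]]                           # list-style access returns the column vector
```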
#### Creation
`data.frame` takes named vectors as input. (Vector length is checked, but names will be imputed based on values if missing.)
```{r}
(x <- data.frame(1:4, bar=c(T,T,F,F), cat=c("a","b","c","d")))
nrow(x)
ncol(x)
```
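Note that in R versions before 4.0.0 the default behavior was to coerce strings to factors, and `stringsAsFactors = FALSE` was needed to suppress it. Since R 4.0.0, `stringsAsFactors` defaults to `FALSE`, so strings stay as character vectors unless `stringsAsFactors = TRUE` is set.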
```{r}
(x <- data.frame(1:4, bar=c(T,T,F,F), cat=c("a","b","c","d"), stringsAsFactors = FALSE))
```
A data frame may be coerced using `as.data.frame()`.
- A vector will create a one-column data frame.
- A list will create one column for each element; it’s an error if they’re not all the same length.
- A matrix will create a data frame with the same number of columns and rows as the matrix.
Combine data frames column-wise using `cbind()`. Number of rows must match, but row names are ignored. (Note that one argument to `cbind()` must already be a `data.frame`, or else a matrix will be returned; if binding vectors or lists, simply use `data.frame` directly.)
```{r}
cbind(x,x=1:4)
```
Combine data frames row-wise using `rbind()`. For this, both the number and names (order) of columns must match. This can present some difficulties so look for packages to help facilitate this type of combination.
```{r}
rbind(x,1:3)
```
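When both inputs are data frames, `rbind()` matches columns by name rather than by position. A minimal sketch with two invented frames whose columns are in different orders:

```{r}
top    <- data.frame(id = 1:2, label = c("a", "b"))
bottom <- data.frame(label = "c", id = 3L)   # same columns, different order
combined <- rbind(top, bottom)
combined          # columns follow the order of the first data frame
names(combined)   # "id" "label"
```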
#### Basic Properties
A `data.frame` is a class, where `list` is the type:
```{r}
df <- data.frame()
typeof(df)
class(df)
is.data.frame(df)
```
#### Attributes
As rectangular data, a `data.frame` has `colnames()` and `rownames()` as well as `ncol()` and `nrow()`. Names may be set at creation (either by naming the column vectors or by providing `row.names=`) or after creation.
```{r}
x <- data.frame(value = seq(1,10,len=4), large = c(T,T,F,F), row.names = 1:4)
nrow(x)
colnames(x) <- c("A","B")
x
```
### tibble {.tabset .tabset-pills}
**Work in progress, will revisit with tidyverse**
### data.table {.tabset .tabset-pills}
**Work in progress, will revisit later**
```{r}
library(data.table)
DF <- data.frame(x=rnorm(9),
y=rep(c("a","b","c"),each=3),
z=rnorm(9))
head(DF,3)
DT <- data.table(x=rnorm(9),
y=rep(c("a","b","c"),each=3),
z=rnorm(9))
head(DT,3)
tables()
```
Subsetting rows - similar to data frames
```{r}
DT[2,]
DT[DT$y=="a",]
DT[c(2,3)]
```
Subsetting columns - different
```{r}
DT[,c(2,3)] # data.table >= 1.9.8 returns columns 2 and 3; older versions returned the vector c(2, 3)
```
Leading comma allows one to pass 'expressions'
```{r}
DT[,list(mean(x),sum(z))]
DT[,table(y)]
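DT[,w:=z^2]      # := adds column w by reference (no copy of DT is made)
DT2 <- DT        # DT2 is another name for the same table, not a copy
DT[,y:=2]        # so this change to DT...
head(DT,n=3)
head(DT2,n=3)    # ...also shows up in DT2; use copy(DT) for an independent table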
```
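The chunk above relies on data.table's reference semantics. The sketch below contrasts a plain assignment with `copy()`, which data.table provides for making an independent table. The table and column names are illustrative, and the `requireNamespace()` guard is only there so the chunk is skipped if the package is not installed.

```{r}
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)
  original <- data.table(id = 1:3)
  alias    <- original         # same object, no copy made
  snapshot <- copy(original)   # independent copy
  original[, flag := TRUE]     # modifies the table by reference
  print(names(alias))          # "id" "flag": the alias sees the new column
  print(names(snapshot))       # "id": the copy does not
}
```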
Multiple operations
```{r}
DT[,m:= {tmp <- (x+z); log2(tmp+5)}]
DT[,a:=x>0]
DT[,b:=mean(x+w),by=a] # computed within each group of the logical column a
```
Special variable: `.N` is an integer of length 1 containing the number of rows in each group
```{r}
set.seed(123)
DT <- data.table(x=sample(letters[1:3], 1E5, TRUE))
DT[, .N, by=x]
```
Keys
```{r}
DT <- data.table(x=rep(c("a","b","c"),each=100),
y=rnorm(300))
setkey(DT,x)
DT['a']
```
Joins (inner join here)
```{r}
DT1 <- data.table(x=c('a','a','b','dt1'),y=1:4)
DT2 <- data.table(x=c('a','b','dt2'),z=5:7)
setkey(DT1,x)
setkey(DT2,x)
merge(DT1,DT2)
```
Fast reading
```{r}
big_df <- data.frame(x=rnorm(1E6),y=rnorm(1E6))
file <- tempfile()
write.table(big_df,
file=file,
row.names=FALSE,
col.names=TRUE,
sep="\t",
quote=FALSE)
system.time(fread(file)) # ~8x faster
system.time(read.table(file,header=TRUE,sep="\t"))
```
## Credits
These notes are accumulated from many sources, but are particularly indebted to [Advanced R](http://adv-r.had.co.nz/Data-structures.html).