-
Notifications
You must be signed in to change notification settings - Fork 58
/
Copy pathWorkData.rmd
1418 lines (1013 loc) · 42.8 KB
/
WorkData.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Working with Data"
author: "Joshua F. Wiley"
date: "`r Sys.Date()`"
output:
tufte::tufte_html:
toc: true
number_sections: true
---
```{r, include=FALSE, echo=FALSE}
library(tufte)
```
# Intro
Monash has asked that we watch
[this video](https://www.youtube.com/watch?v=pIWEMSoDUZY&feature=youtu.be).
`r margin_note("Core ideas and code are in the main document. Extra information that may help understanding but is not the big picture I will go in margin notes, like this. Remember that if you use GitHub desktop, you should already have the Rmarkdown file on your computer (see Week 1).")`
To start off the lecture, please make sure that you:
Have downloaded the `R`markdown file for the lectuere here
[https://jwiley.github.io/MonashHonoursStatistics/WorkData.rmd](https://jwiley.github.io/MonashHonoursStatistics/WorkData.rmd)
and that you have RStudio open and have opened the project you are using for
this class and have the `R`markdown file for this lecture open on
your own laptop (or are following along with someone else).
# Basic `R` Use
As a brief refresher, we'll start just by using `R` as a basic calculator.
```{r calculator}
3 + 4 # addition
2 * 6 # multiplication
4 / 2 # division
3^2 # powers [2 squared]
```
`r margin_note("Objects in R can be almost anything. They can be a single number, a vector or set of numbers, the results from a regression, a graph, a dataset. You can have many objects at the same time in your worksapce, which can be a useful way to save results or store something for later use.")`
In `R` many things you do with an individual number can be done with a
vector, a set of numbers. Here's a quick refresher on creating
vectors, using `c()` and on storing them in an object, here I used
`x`.
Once you have created an object, it should show up in RStudion in the
"Environment". In addition, we can view or operate on all the elements
of an object by referring to it by name.
```{r vectors}
c(1, 5, 4) ## create a vector
x <- c(1, 3, 5) ## asign vector to x
x ## view x
x + 2 ## add 2 to every element of x
```
`r margin_note("Computers do not work quite like people or
regular math. Usually, numbers are represented using
floating-point arithmetic (https://en.wikipedia.org/wiki/Floating-point_arithmetic)
often a binary representation is used as much in the computing world
is based on bits stored in 0s and 1s. For example, with a binary
based, 1/5 cannot be represented exactly. An implication of this is
that sometimes numbers are not represented in R quite like you might
think. For example: 1/5 == 0.2000000000000001 is evaluated as false
but 1/5 == 0.20000000000000001 is evaluated as true. The default
precision for floating points in R cannot distinguish those two
numbers.")`
Finally, here are a few quick tips on formatting `R` output.
Let's say we have a number with lots of decimal points, we might not
always need all that precision.
You can use `R` to round for you, using the `round()`
function. You can also use the `options()` function to
control the default options for how `R` prints output. This is helpful
if you want everything printed to your screen to be rounded more. One
difference is that `options()` actually uses
[significant figures](https://en.wikipedia.org/wiki/Significant_figures)
so you will not get a fixed number of decimal points, rather you get a
fixed number of significant figures.
`r margin_note("The round function has two main arguments (i.e., pieces
of input it needs from you). The first is the number or vector to be
rounded. It can be a single number or a whole vector of numbers. The
second is how many digits you want R to use in the rounding.
The results may not always print the number of digits you requested,
which may be confusing. This can happen when one of the digits rounds
to 0 (e.g., 1.199 rounds to 1.20 and the 0 would be dropped so
1.2).")`
```{r format}
1.214294254
round(1.214294254, digits = 2)
options(digits = 2) ## set default for the R session
1.214294254
```
# Working with Data
To continue on, we're going to open a few of the packages that you
should have already installed from Week 1.
```{r packages}
library(data.table)
library(reshape2)
library(JWileymisc)
library(extraoperators)
```
Next, we are going to load some sample data. This is simulated data
based on a 12-day study where people reported a few measures at
baseline along with stress and affect three times per day.
After loading it, we convert it to a data.table format, using the
`as.data.table()` function so that we can use many of the useful tools
from the `data.table` package for data management.
```{r data}
data(aces_daily)
d <- as.data.table(aces_daily)
```
## Operators
`r margin_note("In software, logical values and logical operators are
often used to refer to things that return a boolean value. Booleans
are one of two values, conventionally: TRUE or FALSE. By
convention if you do any arithmetic on them, TRUE is treated as 1
and FALSE is treated as 0.")`
A lot of data management involves using logical operators. Operators
take data on the left hand side and a few options arguments on the
right hand side. Logical operators return `TRUE` or
`FALSE`.
We use logical operators when working with data for lots of
reasons. You might want to find outliers, values that are greater than
or less than a specific score, check if values fall within the
expected range, recode continuous variables into categorical variables
(low, medium, high, etc.). Much of this work is underpinned by logical
operators. Here is a list of some operators from `R` and expanded by
the `extraoperators` package.
| Operator | What it does |
|------------------|---------------------------------------------------|
| `<-` | assign whatever is on right to the left in R |
| `:=` | assign right hand side to the left in data.table |
| `==` | Are values / vectors equal |
| `!=` | Are values / vector NOT equal |
| `<` OR `%l%` | Less than |
| `<=` OR `%le%` | Less than or equal |
| `>` OR `%g%` | Greater than |
| `>=` OR `%ge%` | Greater than or equal |
| `%gl%` | Greater than AND less than |
| `%gel%` | Greater than or equal AND less than |
| `%gle%` | Greater than AND less than or equal |
| `%gele%` | Greater than or equal AND less than or equal |
| `%in%` | In |
| `%!in%`OR `%nin%`| Not in |
| `%c%` | Chain operations on the RHS together |
| `%e%` | Set operator, to use set notation |
## Subsetting Data
A common task in analyses is to subset the data. There are many
reasons for this, from ecluding outliers to selecting only
participants who meet some criteria you want to use.
Sometimes, the order you choose to subset matters too.
Most often, we subset data by using logical operators to pick specific
rows of a dataset or specific values from a single variable. Below, we
select cases/observations where Age = 18, this is done using the
logical operator `==`. `R` tests if the variable "Age" is equal to 18
and if it is returns TRUE and otherwise FALSE. Rows where the test
returns TRUE are returned. So we don't see a giant dataset, we just
pick two variables: UserID and Age.
```{r}
d[Age == 18, .(UserID, Age)]
```
Suppose we decide that under 20y or above 25y are extreme ages for our
data. We might want to run analyses in data excluding people outside
that range. We could create a copy of the dataset, `d.noout`, with
"outliers" removed where we only include cases that are between
20-50y. Here we use `%gele%` to capture people greater than or equal
to 20y and less than or equal to 25.
```{r}
d.noout <- d[Age %gele% c(20, 25), .(UserID, Age)]
## frequency table for ages
table(d.noout$Age)
## view a few rows of the data
head(d.noout)
```
Sometimes, a single variable is not sufficient.
You can chain conditions together using the operators:
`&` and `|` which represent "and" and "or", respectively.
```{r}
## select only women who are 18y
d[Age == 18 & Female == 1, .(UserID, Age)]
## select anyone who is either a woman OR 18y
d[Age == 18 | Female == 1, .(UserID, Age)]
```
If you want to select several values, the `%in%` operator is
helpful. It tests if the variable/data on the left is in the set on
the right hand side. This is much faster than writing many "ors"
```{r}
## select anyone whose age is in 18, 19, or 20
d[Age %in% c(18, 19, 20), .(UserID, Age)]
```
You can chain even more logical requirements together by using
parentheses. Operations occur within parentheses first.
```{r}
## 19 year old women or 18 year old men
d[(Age == 19 & Female == 1) | (Age == 18 & Female == 0),
.(UserID, Age)]
```
Other common operators are greater than or less than, `>` and `<` or
greater than or equal to and less than or equal to, `>=` and `<=`.
```{r}
## anyone under age 20
d[Age < 20, .(UserID, Age)]
## anyone age 20 or under
d[Age <= 20, .(UserID, Age)]
## anyone who is not age 20
d[Age != 20, .(UserID, Age)]
## anyone who is not age 18, 19, or 20
d[Age %!in% c(18, 19, 20), .(UserID, Age)]
```
Finally, let us see why the order of operations may matter. Suppose
that we want to look at people who have at least 30 non-missing stress
values and look at just observations where STRESS > 3.
First, we'll remove cases that are missing on `STRESS`. We do this
using the `is.na()` function, which returns `TRUE` if a value is NA
(NA in `R` means not available, missing) and `FALSE` if a value is not
missing. Then we negate that using `!` to get "not missing".
```{r}
## remove missing stress observations
d2 <- d[!is.na(STRESS)]
```
Next, we can create a new variable in the dataset, which we'll call
"Count" and we'll assign this the number of observations that occur
by ID.
In `R` `data.table`s have three main parts
`DT i, j, by` the DT represents the name of the data table, the `i`
represents which row(s) we want to select. If we leave it blank, that
means select all rows. After the first comma, the `j` represents
columns/variables. That could be selecting only certain
columns/variables to display *or* creating / modifying a column or
variable. The last part, the `by` represents a grouping variable or
some way or organising our operations. For example, we might want to
perform the same operation for each ID in the dataset. Lastly, if
there are NO commas, that means just give me the
rows/cases that match my criteria, but give me all columns/variables.
Here we take all cases (we aren't using any subsetting yet) and then
create a new variable Count, we use the assignment operator in
`data.table`, `:=` which means, assign to the thing on the left, in
this case a new variable, whatever the value on the right is. Here the
value on the right is `.N` that means the "n" or count / sample
size. Finally, we have `data.table` do this by ID, so that we don't
get the sample size of the entire dataset, we get the sample size, the
number of surveys, completed by each ID or person.
```{r}
d2[, Count := .N, by = UserID]
```
Now we can take people who have at least 30 non-missing stress values,
and look at high stress scores.
```{r}
d2 <- d2[Count >= 30]
d2 <- d2[STRESS > 3]
d2[, .(UserID, STRESS)] ## IDs and number of obs
```
This whole sequence could be collapsed or chained together to write
less code. In data tables in `R` you can chain operations by just
adding more brackets as below. This accomplishes the same as the above.
```{r}
d2 <- d[!is.na(STRESS)][, Count := .N, by = UserID][
Count >= 30][STRESS > 3]
d2[, .(UserID, STRESS)] ## IDs and number of obs
```
`r margin_note("Here, the order is: first, remove missing stress
observations, second take only surveys/rows where stress scores > 3,
third, count how many observations are not missing by ID.")`
Now look what happens if we first take stress values greater than 3.
That is we change the order of the operations. We end up with much
fewer people.
```{r}
d2 <- d[!is.na(STRESS)][STRESS > 3][,
Count := .N, by = UserID][Count >= 30]
d2[, .(UserID, STRESS)] ## IDs and number of obs
```
In this new format, we are only taking people with 30
non missing stress values > 3 whereas before we took people with 30
non missing stress values (any value) and then only observations where
stress > 3.
## You Try It - Subsetting
Select observations where:
- positive affect, in the dataset, called `PosAff` is not missing
- participants who have **at least 20** non missing positive affect (any value)
- where positive affect scores are greater than or equal to 3
Use the template below filling in values where appropriate.
You should end up with 2,455 observations.
```{r trysubset, echo = TRUE, eval = FALSE}
d3 <- d[!is.na( )][, Count := .N, by = UserID][ ][ ]
d3[, .(UserID, PosAff)] ## IDs and number of obs
```
## Finding Specific Cases or Outliers
Suppose that based on graphs or other information, we had decided that
negative affect scores above 4 were outliers and that participant 56
was an outlier overall.
We can exclude ID 56 and select only observations with negative affect
at or below 4 as below.
```{r}
d[UserID != 56 & NegAff <= 4,
.(UserID, NegAff)]
```
# Data Types
There are many types of data. In `R`, you can find what type of data a
particular variable is by asking for its `class()`.
```{r}
class(d$Age)
class(d$BornAUS)
class(names(d))
```
- *Logical*: Used for logical data, which are either TRUE or FALSE. If data are logical, it is a very efficient format and useful for many cases. Logical variables can be compared. Arithmetic can be used for logical variables, in which case TRUE is treated as 1 and FALSE as 0.
- *Integer*: Used for integer type data, that is whole numbers like 0, 1, 2. For variables that are only whole numbers, integer format is more efficient than real numbers or numeric data (e.g., 1.4).
- *Numbers/real/numeric*: Used for real numbers, such as 1.1, 4.8. It also can be used for integer data (i.e., whole numbers only) but is a less efficient format. In `R` these are represented by the class `numeric` abbreviated `num`.
- *Text/character/string*: Used for text type data, such as names, qualitative data, etc. Also, any numbers can be stored as strings. In `R` these are represented by the class `character` abbreviated `chr`. Character data do not work with arithmetic operators, but can be sorted (e.g., alphabetically).
Although these are the basic building blocks of data, there also are special formats that can be built off of these, often off of numeric type data.
- *Date*: A special representation of numeric data when the data are dates. In `R` represented by multiple classes, including `Date`, `POSIX` and `chron`. The standard approach is to count the days since some origin. In `R`, the default origin in 1970-01-01. So 1970-01-01 is 0. 2 would indicate 1970-01-03. You can do some arthmetic on dates. For example adding or subtracting numbers, calculating the differece between dates.
- *Date and time*: A special representation of numeric data when the data are dates and times. In `R` represented by multiple classes, including `POSIX` and `chron`. The standard approach is to count the seconds since some origin. In `R`, the default origin in 1970-01-01 00:00:00. So 1970-01-01 00:00:00 is 0. 2 would indicate 1970-01-01 00:00:02. You can do some arthmetic on datetime objects. For example adding or subtracting numbers, calculating the differece between dates and times.
- *time*: A special representation of numeric data when the data are times. In `R` represented by class `chron`. 24 hours are formated so that they go from 0 to 1. For example, 3:31pm would be converted to 24 hour time: 15:30 and then to decimals for hours: 15.5 then converted ito range from 0 to 1: 15.5/24.
- *factor*: A special representation of numeric data when the data are fundamentally discrete. For example, if study condition is coded as 0 = control, 1 = medication, 2 = psychotherapy. The data could be stored numerically, or as text. If stored as a factor, R will do some convenient things automatically, such as dummy coding in regression models.
# Date Variables
Often when dates are read into R, they are character data.
We will work with some different character formats and
convert them to date class in `R`.
```{r, error = TRUE}
## sample character data
x1 <- c("2019-03-12", "2017-09-15")
x2 <- c("2019-Mar-12", "2017-Sep-15")
x3 <- c("12/3/2019", "15/9/2017")
x4 <- c("12/3/19", "15/9/17")
# these are character strings
class(x1)
## convert character strings to Date classes
as.Date(x1, format = "%Y-%m-%d")
as.Date(x2, format = "%Y-%b-%d")
as.Date(x3, format = "%d/%m/%Y")
as.Date(x4, format = "%d/%m/%y")
## save results
d1 <- as.Date(x1, format = "%Y-%m-%d")
## look at the class once converted
class(d1)
## arithmtic on dats
d1 + 5 ## addition
d1 - 7 ## subtract a week
d1 * 5 ## multiplication does not work: error
## extract just one date
d1[1]
d1[2]
## compare two dates
d1[1] - d1[2]
## compare two dates and convert to a number
as.numeric(d1[1] - d1[2])
## compare two dates and convert to a number
## and then to years
as.numeric(d1[1] - d1[2])/365.25
```
To convert date into a date class in `R`, we need to tell it the
format the dates are written in. This is done by specifying a
"template".
- "%Y" means 4 digit year [upper case]
- "%y" means 2 digit year [lower case]
- "%m" means digit month
- "%b" means abbreviated month (e.g., "Jan")
- "%d" means digit day
These pieces can be re-ordered an combined and use any separator. For
example "%Y-%m-%d" for YYYY-mm-dd or "%Y/%m/%d" for YYYY/mm/dd or "%Y
%m %d" for YYYY mm dd or "%d %m-%Y" for dd mm-YYYY. This flexibility
is needed as different people and countries commonly write dates in
many different formats.
## You Try It - Dates
Now you try to convert the following character strings into date class
objects in `R` by fillng in the correct template in quotes for the
`format = ` argument. If you do it incorrectly, you will get errors or
`NA` indicating the text could not be converted to a date uing the
template you supplied.
```{r trydates, echo = TRUE, eval = FALSE}
y1 <- c("5/28/2018", "6/30/2019")
as.Date(y1, format = "") ## you complete
y2 <- c("28 01 2018", "15 06 2019")
dy2 <- as.Date(y2, format = "") ## you complete
dy2 ## show results
y3 <- c("28-1-18", "15-6-19")
as.Date(y3, format = "") ## you complete
## how many days are there between
## 28 January 2018 and 15 June 2019?
## use the object dy2 to complete like
## we did above
```
## Date Helper Functions
Dates have a variety of helpful functions that let you operate on or
learn moe about them. We can use the `weekdays()`, `months()`, and
`quarters()` functions to find out the day of week, month or quarter a
particular date falls in. They return character strings.
```{r}
weekdays(d1)
months(d1)
quarters(d1)
```
Dates also can be compared to determine if one is greater than or less
than another. This can be useful for selecting observations that meet
specific criteria. For example, suppose we wanted to exclude any dates
that occurred before the semester started. Not everyone in the daily
diary study is a student, but if they were, we might expect positive
affect to be different on break than during semester. Suppose semester
started 1 March 2017, we might want to only include observations that
fell on or after that date.
Note here we do not need to specify a template because `R` will assume
YYYY-mm-dd when written this way.
```{r}
## summary of dates
summary(d$SurveyDay)
## select surveys where the survey date
## is greater than or equal to 2017-03-01
d2 <- d[SurveyDay >= as.Date("2017-03-01")]
## summary of dates
summary(d2$SurveyDay)
d2 <- d[weekdays(SurveyDay) %in% c("Monday")]
## frequency table of days of week
table(weekdays(d2$SurveyDay))
```
## You Try It - Subset Dates
Now you try to create a dataset that excludes any weekends.
```{r trysubsetdates, echo = TRUE, eval = FALSE}
## complete to make a dataset that
## excludes any weekend days
d3 <- d[ ]
## there should now be no surveys on weekends
table(weekdays(d3$SurveyDay))
```
# Factor Variables
Factor variables can be made from numeric or
integer or character data, but are converted into
the factor format to tell `R` that they are discrete
variables. This is done using the `factor()` function.
```{r}
## some sample data
x <- c(0, 1, 2, 1, 1, 2, 0)
## convert to a factor in R
## adding specific labels
y <- factor(x,
levels = c(1, 0, 2),
labels =c("Medication", "Control", "Psychotherapy"))
y
## if you do not specify levels or labels
## it will assume you want the levels
## ordered numerically / alphabetically
## and the labels should be the same as the levels
## factor with labels, sorted according to
## the level order (1, 0, 2)
table(y)
## default factor() approach, labels based on
## levels and sorted numerically: 0, 1, 2
table(factor(x)) ## automatic conversion to factor
```
## You Try It - Factors
Now try using factors for dummy coding. In the daily study,
there is a varible, SurveyInteger. 1 = morning, 2 = afternoon,
3 = evening. Make it a factor and then create a frequency table
using `table()`.
```{r tryfactors, eval = FALSE, echo = TRUE}
d[, SurveyIntegerF := factor(
SurveyInteger,
levels = c( ),
labels = c( ))]
table(d$SurveyIntegerF)
```
# Merging Data
Often data come from multiple sources and are in separate files. For
example, a baseline and post intervention questionnaire may be
collected separately, or you may have questionnaire data and sleep
data from a FitBit.
Data can be combined by merging, and merging in `R` is typically done
using the `merge()` function.
`r margin_note("Merging in R is the same as joins which is the more
common term for databases, such as SQL.")`
There are four main types of merges in `R` using the `merge()`
function. Assume we are merging two datasets: `x` and
`y`.
`r margin_note("Merging or joins always involves two datasets. Even if you need
to merge more than two datasets (e.g., you have baseline, post, and
follow-up data) you always merge / join them two at a time. For
example, first joining baseline and post into one bp dataset and
then joining bp and your follow up data. In R the x dataset is
always the left and the y dataset is always the right.")`
The type of merge performed is controlled by arguments to
the `merge()` function.
- **Natural join**: resulting data has only rows present in
**both** `x` and `y`. Argument: `all = FALSE` .
- **Full outer join**: the data has all rows in `x` and all rows
in `y`. Argument: `all = TRUE` .
- **Left outer join**: resulting data has all rows in `x`. Argument: `all.x
= TRUE` .
- **Right outer join**: resulting data has all rows in `y`. Argument: `all.y
= TRUE` .
`r margin_note("This small, hypothetical study collected data via surveys and
actigraphy. The surveys have peoples ages and the actigraphy has
their sleep duration, which we wish to join together for analysis.")`
We will explore joins using two, small datasets. `ID` is used
to indicate participant ID.
```{r}
surveys <- data.table(
ID = c(1, 2, 3),
Age = c(19, 18, 20))
acti <- data.table(
ID = c(2, 3, 4),
Sleep = c(8, 6, 7))
## view the two small datasets
print(surveys)
print(acti)
```
## Natural Join in `R`
`r margin_note("Natural joins have only the rows / observations that
are present in both datasets. In this example, the surveys and the
actigraphy. You can pick whichever dataset you want to be x and
and y by which you list first.")`
**Only rows present in both datasets**.
`r margin_note("When we write by = ID what is meant is that it should be
by the variable called ID. The word ID is not special, its just
that we happened to call the variable containing our IDs, ID. If
we had called the variable containings IDs Email or Name then
we would write: by = Email or by = Name. In other words,
match the text in quotes to an actual variable name in the data.")`
Joining data requires specifying how `R` should know which rows match
each other. This is typically done by an ID variable using the code:
`by = "ID"`. The argument, `all = FALSE` is what tells `R` this should
be a natural join.
```{r}
merge(
x = surveys,
y = acti,
by = "ID",
all = FALSE)
```
## Full Outer Join
`r margin_note("Full outer joins include cases that are present in either dataset. In this case, in either the surveys
or the actigraphy. You can pick whichever dataset you want to be x
and y by which you list first, although with a full outer join, it
does not make much difference, other than which columns/variables come
first and which come second. As before, we must specify what variable(s) to
merge by so R knows how to match rows together, accomplished using
the code: by = ID.
Note that R fills in missing values for rows /
observations and variables that are present in one dataset but not the
other. So if two datasets do not have identical rows / observations, a
full outer join will result in some missing values. These are shown in
the print out as NA")`
**All rows present in either dataset**.
The argument, `all = TRUE` is what tells `R` this should be a full
outer join. Full outer joins include all rows or observations.
```{r}
merge(
x = surveys,
y = acti,
by = "ID",
all = TRUE)
```
## Left Outer Join
`r margin_note("As before, we must specify what variable(s) to
merge by so R knows how to match rows together, accomplished using
the code: by = ID. Also as before, R fills in missing values for
the variables. In this case, missing values for the rows /
observations that are present in surveys, but not in acti (i.e., the
variable Sleep and ID = 1).")`
**All rows / observations that are present in left dataset
(surveys)**.
The argument, `all.x = TRUE` is what tells `R` this should be a left
outer join.
```{r}
merge(
x = surveys,
y = acti,
by = "ID",
all.x = TRUE)
```
## Right Outer Join
`r margin_note("As before, we must specify what variable(s) to merge by
so R knows how to match rows together, accomplished using the code:
by = ID. Note that R fills in missing values for the variables
in surveys that are not in acti and the rows / observations that are
present in acti, but not in surveys
(i.e., the variable Age and ID = 4).")`
**All rows / observations that are present in right dataset
(acti)**. The argument, `all.y = TRUE` is what tells `R` this should be a right
outer join.
```{r}
merge(
x = surveys,
y = acti,
by = "ID",
all.y = TRUE)
```
## Rows and missing after joining
`r margin_note("Knowing how many rows you expect gives you a quick way to
check whether the join is working as you want or something
unexpected may be happening. Practical experience is that joins
can easily have some errors resulting in unintended consequences,
like missing data or repetitions of rows that were not wanted.")`
It is helpful to know how many rows you expect in the
results.
`r margin_note("This behaviour is sometimes referred to as cross or
cartesian joining. When there are multiple matches for a row/ID in
both datasets, all possible combinations are returned. In this
case, rows 2 and 3 are both ID 2 in surveys2 and rows 1 and 2 are
both ID 2 in acti2, so those are crossed so you have all possible
pairings of rows from each dataset associated with ID 2.")`
In each of these two new datasets, ID 2 is repeated twice.
One might anticipate that a full outer join would return *5*
rows, (IDs 1, 2, 2, 3, 4). `R` actually returns **7** rows, with
four rows for ID 2, because all rows of ID 2 in `surveys2` are
repeated by all rows of ID 2 in `acti2`.
**Merging with any duplicates can behave in unexpected ways**.
In a small dataset, it is easy to see what has
happened. In a large dataset, checking whether the merge worked as
expected requires using summaries or other checks, such as evaluating
whether the number of rows matched what you wanted or needed.
```{r}
surveys2 <- data.table(
ID = c(1, 2, 2, 3),
Age = c(19, 18, 18, 20))
acti2 <- data.table(
ID = c(2, 2, 3, 4),
Sleep = c(8, 7, 6, 7))
## view small datasets
print(surveys2)
print(acti2)
merge(surveys2, acti2, by = "ID", all = TRUE)
```
## You Try It - Merging
Here are those two datasets shown as tables.
```{r datamerge}
ds1 <- data.table(
ID = c("A1", "B2", "C3"),
Stress = c(4, 5, 6))
ds2 <- data.table(
ID = c("A1", "B2", "D4", "E5"),
Steps = c(9524, 15303, 7432, 4035))
```
```{r, echo = FALSE, results = "asis"}
knitr::kable(ds1, caption = "Dataset: ds1")
```
```{r, echo = FALSE, results = "asis"}
knitr::kable(ds2, caption = "Dataset: ds2")
```
Before you try joining those datasets, try to answer these
questions, which are similar to what might be on the exam. The next
section has an answer guide.
For these questions, `ds1` is the "left"
dataset and `ds2` is the "right" dataset and you are joining by ID.
- How many rows would occur in a **natural join** of `ds1` and
`ds2`? How many missing values total across all
variables?
- How many rows would occur in a **full outer join** of `ds1` and
`ds2`? How many missing values total across all
variables?
- How many rows would occur in a **left outer join** of `ds1` and
`ds2`? How many missing values total across all variables?
- How many rows would occur in a **right outer join** of `ds1` and
`ds2`? How many missing values total across all variables?
```{r trymerge, error = TRUE}
## using the datasets ds1 and ds2, do a natural join
merge( )
## using the datasets ds1 and ds2, do a full outer join
merge( )
## using the datasets ds1 and ds2, do a left outer join
merge( )
## using the datasets ds1 and ds2, do a right outer join
merge( )
```
#### Answer Key
These are the sort of questions that would be on the
exam to assess understanding of joining / merging.
The key to these is to know what each join does and be
able to predict expected behaviour. A natural join will be only
those rows IDs present in both datasets, in this case: A1 and
B2. Because a natural join only includes rows present in all
datasets, you will not have any missing values on any variable,
unless there was missing values to begin with. So in this case,
you will have 2 rows and 0 missing values.
A full outer join will include all rows from either
dataset. That means you will have IDs A1, B2, C3, D4, and E5 (5
rows). Stress values are only present for A1, B2, and C3, so you will
have two missing stress values. Steps values are present for A1, B2,
D4, and E5, so there will be one missing steps value. Together you
will have 5 rows, and 3 missing values.
## Merging Data by Different IDs
Often data are under our control, but sometimes, they are not or even
if they are ID variables are not named consistently across datasets.
As long as the actual ID values match, the variable names do not need
to be the same. For example, in Qualtrics, if you create custom links,
IDs may be called "RecipientFirstName" while on the actigraphy data,
you may have called IDs, "ID". `R` can handle this easily by using the
arguments, `by.x` and `by.y` to give the variable names in the `x` and
`y` datasets, respectively.
```{r}
surveys2 <- data.table(
FirstName = c(1, 2, 2, 3),
Age = c(19, 18, 18, 20))
acti2 <- data.table(
ID = c(2, 2, 3, 4),
Sleep = c(8, 7, 6, 5))
merge(
x = surveys2,
y = acti2,
by.x = "FirstName",
by.y = "ID",
all = TRUE)
```
## Checking and Preparing Data for Merging
Because merging data can have unintended consequences if matched
incorrectly, often it is a good practice to check the data in advance
to make sure it is what you expect.
Generally, one-to-one merges, where each row of one dataset is matched
to one and only one row of another dataset is easiest. You can have
duplicates and these merge well in two cases: (1) the left, `x`
dataset has repeated measures but the right, `y` dataset only has one
row of data per ID or (2) the left, `x` dataset only has one row of
data per ID but the right, `y`, dataset has repeated measures per ID.
Many to many merges, involves *repeated IDs* in both datasets being
merged often is not what is intended. Note that repeated IDs does not
mean repeated measures. You can merge two repeated measures datasets
but not have repeated IDs by using two variables as your ID variable
or by creating unique IDs that combine ID and time, for example.
To check data, there are a few helpful functions in `R`.
The `anyDuplicated()` function returns 0 if no values are duplicated
and the position where the first duplicate occurs if any values are
duplicated.
```{r}
## there are duplicate IDs, starting in position 2