-
Notifications
You must be signed in to change notification settings - Fork 22
/
Copy pathBio720_AMorningofStrings_short.Rmd
224 lines (148 loc) · 7.62 KB
/
Bio720_AMorningofStrings_short.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
---
title: "BIO720_morning_activity_strings"
author: "Ian Dworkin"
date: "`r format(Sys.time(),'%d %b %Y')`"
output:
html_document:
keep_md: yes
toc: yes
editor_options:
chunk_output_type: console
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, error =T)
```
# A lovely morning with strings
Ths morning we are going to work a bit with the string manipulation functions in R. R actually has many string manipulation functions, but they do not have a particularly standardized naming convention, nor do they have a common way of calling arguments (i.e. string first VS object you want to search etc...). So for some people this can be frustrating (it is actually not that bad). Some of this tutorial is based on an example from [here](https://francoismichonneau.net/2017/04/tidytext-origins-of-species/)
Given that, there are probably two very important R libraries to know about for string manipulation, *stringi* and *stringr*. There have been important co-developments of both these packages, but the two main things to know are that *stringr* just provides wrappers for functions in *stringi* (to make common uses a bit easier), and *stringr* has relatively limited functionality. They both also are consistent with *tidyverse* usage, as is one more library worth knowing about *tidytext*. There are a few others, like *rebus* which as a library uses tidyverse piping to make regular expressions easier to understand. However, I don't recommend this as much, as it is important to get used to writing regular expressions which are common among many programming languages.
All functions in *stringi* begin with `stri_..`, where ".." is the name of the function. For *stringr* all functions begin with `str_`.
## Install and load libraries you will need
You will want to install *stringi*, *stringr* and *tidytext*
```{r}
#install.packages("tidytext")
```
```{r}
library(stringr)
library(stringi)
```
## printing strings to the standard output (the console)
It turns out there are some subtleties that you want to know about with respect to printing things to the console. First thing to keep in mind is that "\" is the escape character so if you want to print the quote " This is "me"" to the console how would you do it (i.e. double quotes for everything). Compare your results using `print` vs `writeLines`
```{r}
print("this is \"me\"")
```
```{r}
writeLines("this is \"me\"")
```
also worth noting some of the differences in behaviours for vectors of strings.
```{r}
fly_seq <- c(A = "ACTGGCCA", B = "ACTGGCCT", C = "ACTGTCCA" )
print(fly_seq)
```
```{r}
writeLines(fly_seq, sep = " ")
```
```{r}
cat(fly_seq, sep ="\n")
cat(fly_seq)
```
How might you get `writeLines` to print each string on a newline?
```{r}
writeLines(fly_seq, sep ="\n")
```
## Joining strings together
We will use `paste()` and `paste0`. Default argument for separator between strings is a *single space*, but can be altered with `sep = ""`. The collapse argument is particularly useful for collapsing everything to a single string.
We could use this to generate a set of strings of variable names
```{r}
variable_names <- paste("variable", 1:5, sep="_")
print(variable_names)
length(variable_names)
```
We can do this to help automate writing out certain syntax (like for a model).Make a single string called "predictors" that looks like "x1 + x2 + x3 + x4 + x5" using `paste` in base `R`, or for `stringi` it is `stri_join()` and `str_c` in *stringr*. Before checking, what is the length of your new variables "predictors"?
```{r}
paste("x", 1:5, sep="", collapse = " + ")
```
### How long are your strings?
We are often given DNA (or RNA) fragments of varying lengths, and we want to look at the distribution of how long each sequence is. Below I am simulating some DNA sequences.
Check how many sequences we have
```{r, echo = F}
# generating how many sequences
seq_gen <- function() {
x <- sample(c("A", "C", "T", "G"),
size = rbinom(1, 100, 0.8),
replace = TRUE)
paste0(x, collapse="")
}
number_sequences <- rpois(1, 2000) # generating how many sequences in our sample
seqs <- replicate(number_sequences, seq_gen())
```
I have created an object of DNA sequences called "seqs" how long is each sequence? How many sequences are there? `nchar()` in base, and `str_length()`. Can you plot the distribution of the sequence lengths? Can you plot the distribution of %GC of each sequence as well?
```{r}
length(seqs)
nchar(seqs)
str_length(seqs)
stri_length(seqs)
```
histogram of lengths of the DNA sequences
```{r}
hist(nchar(seqs),
xlab = "length of DNA sequence",
ylab = "number of sequences")
```
#### CG% and plots of it.
We can use the `str_count` function
```{r}
C_count <- str_count(string = seqs, pattern = "C")
G_count <- str_count(string = seqs, pattern = "G")
CG_perc <- (C_count + G_count)/nchar(seqs)
CG_perc
```
```{r}
hist(CG_perc)
```
or from our function from last week (the one at the bottom of the script)
```{r}
BaseFrequencies <- function(x) {
# if it is a single string still
if (length(x) == 1 & mode(x) == "character") {
x <- strsplit(x, split = "", fixed = T)
x <- as.character(unlist(x))
}
if (mode(x) == "list") {
tab <- table(x)/lengths(x)}
else {
tab <- table(x)/length(x)
}
return(tab)
}
```
and then use `sapply` to run through all of the sequences. Note I had to use `t()` which is the transpose function so the matrix had 4 columns, and many rows.
```{r}
basefreq <- t(sapply(seqs, BaseFrequencies, USE.NAMES = F))
head(basefreq)
CG_perc2 <- rowSums(basefreq[,2:3])
```
## **regular expressions in R**
In general in R remember that since a backslash, '\'' has a special meaning when in a string, if you need to include it as part of a regular expression you need to escape it so to use it as part of a regular expression use "\\" so instead of "\d" for a digit you would use two consecutive backslashes followed by a d, i.e. "\\d". Same for other special characters that are part of regular expressions. i.e. "\\." or "\\^" or "\\$".
## Turning file names into variables (splitting strings)
While we already did this, Brian said I should remind you how important this is!!!!!
In our lab we use this a lot with our naming conventions for images that have names like ID_Experiment_Sex_Treatment_Vial_Individual.tiff. We use str_split or strsplit, to change these names to list which we can then use to generate the various factor levels for each variable.
`str_split()` in stringr, `strsplit()` in base R are useful. What pattern do you want to split the string on. Output is a list with vectors with new elements being strings split on the pattern for each element of the original input vector.
```{r}
file_labels <- c("ID_GeneKnockdown2018_ds-RNAi_M_1_1",
"ID_GeneKnockdown2018_ds-RNAi_F_1_1",
"ID_GeneKnockdown2018_control_M_1_1",
"ID_GeneKnockdown2018_control_F_1_1")
str_split(file_labels, pattern = "_")
```
If you know that each string going in will be split into the same number of elements (like the above example) you can set `simplify = TRUE` for `str_split()`, or if you know how columns you are going to break it into you can use `n = ` which is really useful for our purposes. This returns a matrix, which is most helpful.
```{r, eval = FALSE}
str_split(string, pattern = , simplify = T)
# or
str_split(string, pattern = , n = )
```
```{r, eval = FALSE}
# so we can
covariates_mat # your variable name for the string split
colnames(covariates_mat) <- c("lab_peep", "experiment", "treatment", "sex", "vial", "individual")
```
often you will need `pattern = fixed("YourPattern")` for this, just in case.