R_exercises/Bio720_AMorningofStrings_short.Rmd

---
title: "BIO720_morning_activity_strings"
author: "Ian Dworkin"
date: "`r format(Sys.time(),'%d %b %Y')`"
output: 
  html_document: 
    keep_md: yes
    toc: yes
editor_options: 
  chunk_output_type: console
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, error =T)
```

# A lovely morning with strings

Ths morning we are going to work a bit with the string manipulation functions in R. R actually has many string manipulation functions, but they do not have a particularly standardized naming convention, nor do they have a common way of calling arguments (i.e. string first VS object you want to search etc...). So for some people this can be frustrating (it is actually not that bad). Some of this tutorial is based on an example from [here](https://francoismichonneau.net/2017/04/tidytext-origins-of-species/)

Given that, there are probably two very important R libraries to know about for string manipulation, *stringi* and *stringr*. There have been important co-developments of both these packages, but the two main things to know are that *stringr* just provides wrappers for functions in *stringi* (to make common uses a bit easier), and *stringr* has relatively limited functionality. They both also are consistent with *tidyverse* usage, as is one more library worth knowing about *tidytext*. There are a few others, like *rebus* which as a library uses tidyverse piping to make regular expressions easier to understand. However, I don't recommend this as much, as it is important to get used to writing regular expressions which are common among many programming languages.

All functions in *stringi* begin with `stri_..`, where ".." is the name of the function. For *stringr* all functions begin with `str_`.

## Install and load libraries you will need

You will want to install *stringi*, *stringr* and *tidytext*

```{r}
#install.packages("tidytext")
```

```{r}
library(stringr)
library(stringi)
```


## printing strings to the standard output (the console)

It turns out there are some subtleties that you want to know about with respect to printing things to the console. First thing to keep in mind is that "\" is the escape character so if you want to print the quote " This is "me"" to the console how would you do it (i.e. double quotes for everything). Compare your results using `print` vs `writeLines`

```{r}
print("this is \"me\"")
```


```{r}
writeLines("this is \"me\"")
```

also worth noting some of the differences in behaviours for vectors of strings.

```{r}
fly_seq <- c(A = "ACTGGCCA", B = "ACTGGCCT", C = "ACTGTCCA" )

print(fly_seq)
```


```{r}
writeLines(fly_seq, sep = " ")
```


```{r}
cat(fly_seq, sep ="\n")
cat(fly_seq)
```

How might you get `writeLines` to print each string on a newline?

```{r}
writeLines(fly_seq, sep ="\n")
```

## Joining strings together

We will use `paste()` and `paste0`. Default argument for separator between strings is a *single space*, but can be altered with `sep = ""`. The collapse argument is particularly useful for collapsing everything to a single string.

We could use this to generate a set of strings of variable names

```{r}
variable_names <- paste("variable", 1:5, sep="_")
print(variable_names)
length(variable_names)
```

We can do this to help automate writing out certain syntax (like for a model).Make a single string called "predictors" that looks like "x1 + x2 + x3 + x4 + x5" using `paste` in base `R`, or for  `stringi` it is `stri_join()` and `str_c` in *stringr*. Before checking, what is the length of your new variables "predictors"?

```{r}
paste("x", 1:5, sep="", collapse = " + ")
```

### How long are your strings?

We are often given DNA (or RNA) fragments of varying lengths, and we want to look at the distribution of how long each sequence is. Below I am simulating some DNA sequences. 

Check how many sequences we have 

```{r, echo = F}
# generating how many sequences
seq_gen <- function() {
   x <- sample(c("A", "C", "T", "G"), 
           size = rbinom(1, 100, 0.8),
           replace = TRUE)
   paste0(x, collapse="")    
}

number_sequences <- rpois(1, 2000) # generating how many sequences in our sample

seqs <- replicate(number_sequences, seq_gen())
```

I have created an object of DNA sequences called "seqs" how long is each sequence? How many sequences are there? `nchar()` in base, and `str_length()`. Can you plot the distribution of the sequence lengths? Can you plot the distribution of %GC of each sequence as well?

```{r}
length(seqs)
nchar(seqs)
str_length(seqs)
stri_length(seqs)
```

histogram of lengths of the DNA sequences
```{r}
hist(nchar(seqs),
     xlab = "length of DNA sequence",
     ylab = "number of sequences")
```


#### CG% and plots of it.

We can use the `str_count` function

```{r}
C_count <- str_count(string = seqs, pattern = "C")
G_count <- str_count(string = seqs, pattern = "G")

CG_perc <- (C_count + G_count)/nchar(seqs)

CG_perc
```


```{r}
hist(CG_perc)
```


or from our function from last week (the one at the bottom of the script)

```{r}
BaseFrequencies <- function(x) {
    
    # if it is a single string still
    if (length(x) == 1 & mode(x) == "character") {
    	x <- strsplit(x, split = "", fixed = T) 
        x <- as.character(unlist(x))
    }     
    
    if (mode(x) == "list") {
    	tab <- table(x)/lengths(x)}
    
    else {
    	tab <- table(x)/length(x)
    }	
   return(tab)
}
```

and then use `sapply` to run through all of the sequences. Note I had to use `t()` which is the transpose function so the matrix had 4 columns, and many rows.


```{r}
basefreq <- t(sapply(seqs, BaseFrequencies, USE.NAMES = F))

head(basefreq)

CG_perc2 <- rowSums(basefreq[,2:3])
```

## **regular expressions in R**

In general in R remember that since a backslash, '\'' has a special meaning when in a string, if you need to include it as part of a regular expression you need to escape it so to use it as part of a regular expression use "\\" so instead of "\d" for a digit you would use two consecutive backslashes followed by a d, i.e. "\\d". Same for other special characters that are part of regular expressions. i.e. "\\." or "\\^" or "\\$".  


## Turning file names into variables (splitting strings)

While we already did this, Brian said I should remind you how important this is!!!!!

In our lab we use this a lot with our naming conventions for images that have names like ID_Experiment_Sex_Treatment_Vial_Individual.tiff. We use str_split or strsplit, to change these names to list which we can then use to generate the various factor levels for each variable.

`str_split()` in stringr, `strsplit()` in base R are useful. What pattern do you want to split the string on. Output is a list with vectors with new elements being strings split on the pattern for each element of the original input vector.

```{r}
file_labels <- c("ID_GeneKnockdown2018_ds-RNAi_M_1_1", 
                 "ID_GeneKnockdown2018_ds-RNAi_F_1_1",
                 "ID_GeneKnockdown2018_control_M_1_1",
                 "ID_GeneKnockdown2018_control_F_1_1")

str_split(file_labels, pattern = "_")
```

If you know that each string going in will be split into the same number of elements (like the above example) you can set `simplify = TRUE` for `str_split()`, or if you know how columns you are going to break it into you can use `n = ` which is really useful for our purposes. This returns a matrix, which is most helpful. 

```{r, eval = FALSE}
str_split(string, pattern = , simplify = T)

# or
str_split(string, pattern = , n = )
```


```{r, eval = FALSE}
# so we can
covariates_mat # your variable name for the string split

colnames(covariates_mat) <- c("lab_peep", "experiment", "treatment", "sex", "vial", "individual")
```

often you will need `pattern = fixed("YourPattern")` for this, just in case.