talk08.Rmd

---
title: "R for bioinformatics, data iteration & parallel computing"
subtitle: "HUST Bioinformatics course series"
author: "Wei-Hua Chen (CC BY-NC 4.0)"
institute: "HUST, China"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output: 
  beamer_presentation:
    theme: AnnArbor
    colortheme: beaver
    fonttheme: structurebold
    highlight: tango
    includes:
      in_header: mystyle.sty
---

```{r include=FALSE}
color_block = function(color) {
  function(x, options) sprintf('\\color{%s}\\begin{verbatim}%s\\end{verbatim}',
                               color, x)
}

## 将错误信息用红色字体显示
knitr::knit_hooks$set(error = color_block('red'))
```


# section 1: TOC

## 前情提要

``` stringr ```, ``` stringi ``` and other string packages ...

1. basics
 * length
 * uppercase, lowercase
 * unite, separate
 * string comparisons, sub string
2. regular expression

## 本次提要

* for loop
* ``` apply ``` functions 
* ``` dplyr ``` 的本质是 遍历 
* ``` map ``` functions in ```purrr ``` package 
* 遍历 与 并行计算

# section 2: iteration basics 

## for loop , get data ready

\FontSmall

```{r message=FALSE, warning=FALSE}
library(tidyverse);
## create a tibble 
df <- tibble( a = rnorm(100), b = rnorm(100), c = rnorm(100), d = rnorm(100) );

head(df, n = 3);
```


## see for loop in action 

\FontSmall 

```{r message=FALSE, warning=FALSE}
## 计算 row means 
res1 <- vector( "double", nrow(df) );
for(  row_idx in 1:nrow( df ) ){
  res1[row_idx] <- mean( as.numeric( df[row_idx , ] ) );
}

res2 <- c();
for(  row_idx in 1:nrow( df ) ){
  res2[ length(res2) + 1 ] <- mean( as.numeric( df[row_idx , ] ) );
}

## 计算 column means 
res2 <- vector( "double", ncol(df) );
for(  col_idx in 1:ncol( df ) ){
  res2[col_idx] <- mean( df[[col_idx]] );
}
```

## for loop 的替代

由于运行效率可能比较低，尽量使用 for loop 的替代

\FontSmall

```{r}
rowMeans( df );
colMeans( df );
```

\FontNormal 

类似的函数还有：

\FontSmall

```{r eval=FALSE}
rowSums( df );
colSums( df );
```

## ```apply``` 相关函数

\FontSmall

Usage: 

``` apply(X, MARGIN, FUN, ...) ```; 

MARGIN : 1 = 行， 2 = 列； c(1,2) = 行&列

FUN : 函数，可以是系统自带，也可以自己写

```{r}
df %>% apply( ., 1, median ); ## 取行的 median 
df %>% apply( ., 2, median ); ## 取列的 median 
```

\FontNormal

**问题** ``` df %>% apply( ., c(1,2), median ); ``` ## 取both的 median  会发生什么？？

## ```apply``` 与自定义函数配合

\FontSmall

```{r}
df %>% apply( ., 2, function(x) { 
    return( c( n = length(x), mean = mean(x), median = median(x) ) );
  } ); ## 列的一些统计结果
```

\FontNormal 

**注意** 行操作大部分可以被 ``` dplyr ``` 代替

## ``` tapply ``` 的使用

以行为基础的操作，用法：

``` tapply(X, INDEX, FUN = NULL, ..., default = NA, simplify = TRUE) ```

用 **index**   将   **x**   分组后，用    **fun**  进行计算 -> 
用 **姓名**    将 **成绩**  分组后，计算  **平均值** 
用 **汽缸数**  将 **油耗**  分组后，计算  **平均值** 

\FontSmall 

```{r warning=FALSE, message=FALSE}
library(magrittr);
## 注意 pipe 操作符的使用
mtcars %$% tapply( mpg, cyl, mean ); ## 汽缸数 与 每加仑汽油行驶里程 的关系
```

## ``` tapply ``` versus ``` dplyr ```

然而，使用``` dplyr ``` 思路会更清晰

\FontSmall 

```{r warning=FALSE, message=FALSE, eval=FALSE}
mtcars %>% group_by( cyl ) %>% summarise( mean = mean( mpg ) );
```

\FontNormal 

**注意** ``` tapply ``` 和 ``` dplyr ``` 都是基于行的操作！！ 

## ``` lapply ``` 和 ``` sapply ``` 

基于**列**的操作

输入：

* vector ： 每次取一个 element 
* data.frame, tibble, matrix : 每次取一列
* list ： 每次取一个成员 

## ``` lapply ``` 和 ``` sapply ``` , cont. 

输入是 tibble 

\FontSmall 

```{r}
df %>% lapply( mean );
df %>% sapply( mean );
```

##  ``` lapply ``` 和 ``` sapply ``` , cont. 

输入是 list ，使用自定义函数

\FontSmall
```{r}
list( a = 1:10, b = letters[1:5], c = LETTERS[1:8] ) %>% 
  sapply( function(x) { length(x) } );
```

\FontNormal

**强调** 

* ``` lapply ``` 是针对列的操作 
* 输入是 tibble, matrix, data.frame 时，功能与 ``` apply( x, 2, FUN ) ``` 类似 ... 

# section 3: iteration 进阶： the ```purrr``` package  

##  ``` map ``` ， RStudio 提供的 ``` lappy ``` 替代

![来自 ``` purrr ``` package](images/talk08/purrr.png){height=20%}

* part of ```tidyverse ```

## ```purrr``` 的基本函数

\FontSmall 

``` map( FUN ) ``` : 
1. 遍历每列（tibble）或 slot （list），
2. 运行 FUN 函数，
3. 将计算结果返回至 list 

对应： ``` lapply ``` 

```{r}
df %>% map( summary );
```

## 对应 ``` sapply ``` 的 ``` map_ ``` 函数

* ```map_lgl()``` makes a logical vector.
* ```map_int()``` makes an integer vector.
* ```map_dbl()``` makes a double vector.
* ```map_chr()``` makes a character vector.

\FontSmall 

```{r}
df %>% map_dbl( mean ); ## 注：返回值只能是单个 double 值
```

**？？** 以下代码运行结果会是什么？？？ 
```{r eval=FALSE}
df %>% map_dbl( summary ); 
df %>% sapply( summary ); 
```

## ``` map ``` 的高阶应用 

为每一个汽缸分类计算： 燃油效率与吨位的关系

\FontSmall 

```{r fig.height=4, fig.width=10}
plt1 <- 
  mtcars %>% 
  ggplot( aes( mpg, wt ) ) + 
  geom_point(  ) + facet_wrap( ~ cyl );
```

## 取得线性关联关系

\FontSmall 

```{r fig.width=10, fig.height=4}
plt1;
mtcars %>% split( .$cyl ) %>% map( ~ cor.test( .$wt, .$mpg ) ) %>% map_dbl( ~.$estimate );
```


## 命令详解

\FontSmall

```{r eval=FALSE}
mtcars %>% split( .$cyl ) %>% map( ~ cor.test( .$wt, .$mpg ) ) %>% map_dbl( ~.$estimate );
```

\FontNormal

1. ```split( .$cyl )``` : 由 ``` purrr```  提供的函数，将mtcars 按 cyl 列分为三个 tibble，返回值存入 list 

注意： ```. ``` 在 pipe 中代表从上游传递而来的数据；在某些函数中，比如 ```cor.test()``` ，必须指定输入数据，可以用 ```. ``` 代替。

请测试以下代码，查看 ``` split ``` 与  ``` group_by ``` 的区别

\FontSmall 
```{r eval=FALSE}
mtcars %>% split( .$cyl );
mtcars %>% group_by( cyl );
```

## 命令详解, cont. 

\FontSmall

```{r eval=FALSE}
mtcars %>% split( .$cyl ) %>% map( ~ cor.test( .$wt, .$mpg ) ) %>% map_dbl( ~.$estimate );
```

\FontNormal 

``` map ``` ： 遍历上游传来的数据 （list），对每个成分（ list 或 列）运行函数： ```  ~ cor.test( .$wt, .$mpg )  ``` 

注意

1. 这里的 ``` cor.test ``` 应该有两种写法：

\FontNormal

```{r eval=FALSE}
## 正规写法：
map( function(df) { cor.test( df$wt, df$mpg ) } )
## 简写：
map( ~ cor.test( .$wt, .$mpg ) )
```

\FontNormal

2. ``` ~ ```  的用法 : 用于取代 ``` function(df) ```

## 命令详解, cont. 

``` map ``` 也可以进行数值提取操作： ``` map_dbl( ~.$estimate ) ``` 

上述命令同样有两种写法：

\FontSmall

```{r eval=FALSE}
## 完整版
map_dbl( function(eq) { eq$estimate} );
## 简写版
map_dbl( ~.$estimate )
```

## more to read & exercise 

* `map`: apply a function to **each element** of **a list**, return **a list**

* `map2`: apply a function to **a pair of elements**, return **a list**

* `pmap`: apply a function to groups of elements from **a list of lists or vectors**, return  **a list**

* `imap`: ... 

* more to read and exercise about iterations: https://r4ds.had.co.nz/iteration.html
  * filter
  * index
  * Modify
  * reshape
  * combine 
  * reduce
  
* find more exercise at the end of the slides

## `reduce`

\FontSmall

```{r}
dfs <- list(
  age = tibble(name = "John", age = 30),
  sex = tibble(name = c("John", "Mary"), sex = c("M", "F")),
  trt = tibble(name = "Mary", treatment = "A")
)

dfs %>% reduce(full_join)
```

## `reduce`, cont. 

\FontSmall

```{r}
vs <- list(
  c(1, 3, 5, 6, 10),
  c(1, 2, 3, 7, 8, 10),
  c(1, 2, 3, 4, 8, 9, 10)
)

vs %>% reduce(intersect)
```

## `accumulate`

\FontSmall

```{r}
( x <- sample(10) );
x %>% accumulate(`+`);
```

# section 4: 并行计算 

## 并行计算介绍

并行计算一般需要3个步骤：

1. 分解并发放任务
2. 分别计算
3. 回收结果并保存

## 相关的包

```parallel``` 包：检测CPU数量；

```doParallel```包：将全部或部分 分配给任务

``` foreach ``` 包： 提供 ``` %do% ``` 和 ``` %dopar% ``` 操作符，以提交任务，进行顺序或并行计算 

辅助包：

``` iterators ``` 包： 将 data.frame, tibble, matrix 分割为行/列 用于提交并行任务。

**注意** 任务完成后，要回收分配的 CPU core。

首先安装相关包（一次完成）。

\FontSmall

```{r eval=FALSE}
install.packages("doParallel");
install.packages( "foreach" ); ## 会自动安装 iterators 
```

## 简单示例

\FontSmall
```{r}
library(doParallel); ## 
library(foreach);
library(iterators);

## 检测有多少个 CPU --
( cpus <- parallel::detectCores() );

registerDoParallel( cpus - 1 );

## Basic loop
data(iris)
x <- iris[which(iris[,5] != "setosa"), c(1,5)]
trials <- 10000

start <- proc.time()
outdf <- data.frame()
r <- for (i in 1:trials){
  ind <- sample(100, 100, replace=TRUE)
  result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
  outdf <- rbind(outdf,coefficients(result1))
}

base_loop <- proc.time()-start

```

## 简单示例，cont.

\FontSmall

`%do%` loop - foreach notation, but not parallel

```{r}
start <- proc.time()
r <- foreach(icount(trials), .combine=rbind) %do% {
  ind <- sample(100, 100, replace=TRUE)
  result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
  coefficients(result1)
}
do_loop <- proc.time()-start
```

## 简单示例，cont.

\FontSmall

`%dopar%` adds parallelization

```{r}
start <- proc.time()
r <- foreach(icount(trials), .combine=rbind) %dopar% {
  ind <- sample(100, 100, replace=TRUE)
  result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
  coefficients(result1)
}
dopar_loop <- proc.time()-start
```

## 简单示例，cont.

\FontSmall

结果比较：

```{r}
print(rbind(base_loop,do_loop,dopar_loop)[,1:3])
```


## 命令详解 

``` .combine = 'c' ``` 参数的可能值：

* 'c' : 将返回值合并为 vector ；当返回值是单个数字或字符串的时候使用
* 'cbind' : 将返回值按列合并
* 'rbind' : 将返回值按行合并
* 默认情况下返回 ``` list ``` 


## 数据分发练习

将下面的计算转为并行计算

\FontSmall 

```{r eval=FALSE}
mtcars %>% split( .$cyl ) %>% map( ~ cor.test( .$wt, .$mpg ) ) %>% map_dbl( ~.$estimate );
``` 

```{r eval=FALSE}
## make a cluster --
cl2 <- makeCluster( cpus - 1 );
registerDoParallel(cl2);

## 分配任务 ... 
res2 <- foreach( df = iter(  mtcars %>% split( .$cyl )  ), .combine = 'rbind' ) %dopar% {
  cor.res <- cor.test( df$wt, df$mpg );
  return ( c( cor = cor.res$estimate, p = cor.res$p.value )  ); ## 注意这里的返回值是 
}

res3 <- foreach( df = iter(  mtcars %>% split( .$cyl )  ), .packages = c("ggplot2") ) %dopar% {
  p <- ggplot(df, aes( x = wt, y = mpg )) + geom_point();
  return ( p ); ## 注意这里的返回值是
}

## 注意在最后关闭创建的 cluster 
stopCluster( cl2 );

res2;
```

## 练习详解

1. ``` df = iter(  mtcars %>% split( .$cyl )  ) ``` ： mtcars 按汽缸数分割为3个 list，依次赋予 df ；
2. ``` cor.res <- cor.test( df$wt, df$mpg ); ``` ：计算每个df 中 wt 与 mpg 的关联，将结果保存在 ``` cor.res ``` 变量中；
3. ``` .combine = 'rbind' ``` ： 由于返回值是 vector ， 用此命令按行合并；

## ``` foreach ``` 的其它参数

``` .packages=NULL ``` ： 将需要的包传递给任务。如果每个任务需要提前装入某些包，可以此方法。比如：

\FontSmall

```{r eval=FALSE}
.packages=c("tidyverse") 
```

## 嵌套 (nested) foreach 

有些情况下需要用到嵌套循环，使用以下语法：

\FontSmall

```{r eval=FALSE}
foreach( ... ) %:% {
  foreach( ... ) %dopar% {
    
  }
}

```

\FontNormal

即：外层的循环部分用 %:% 操作符

## 其它并行计算函数

``` parallel ``` 包本身也提供了 ``` lapply ``` 等函数的并行计算版本，包括：

* ``` parLapply ```
* ``` parSapply ```
* ``` parRapply ```
* ``` parCapply ```

## ``` parLapply ``` 举例

任务： 计算 2 的 N 次方：

\FontSmall

```{r eval=FALSE}
cl<-makeCluster(3);
parLapply(cl, 
          2:4, 
          function(exponent) 
            2^exponent);
stopCluster(cl);
```

\FontNormal

其它的函数这里就不一一介绍了

# section 5: 小结及作业！

## 本次小结

### iterations 与 并行计算

* for loop
* ``` apply ``` functions 
* ``` dplyr ``` 的本质是 遍历 
* ``` map ``` functions in ```purrr ``` package 
* 遍历 与 并行计算

### 相关包

* purrr
* parallel
* foreach
* iterators

## 下次预告

data visualizations

* basic plot functions
* basic ggplot2 
* special letters
* equations 

## 作业

-   ```Exercises and homework``` 目录下 ```talk08-homework.Rmd``` 文件；

- 完成时间：见钉群的要求