2206905_file1.Rmd

---
title: "iPRG 2016 Study"
author: "The ABRF iPRG Research Group"
date: "August 16, 2016"
output: html_document
---
***
### Introduction

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see <http://rmarkdown.rstudio.com>.

The following is a template and step-by-step example solution of the 2016 [iPRG](https://abrf.org/research-group/proteome-informatics-research-group-iprg) study as well as an example for how to submit and share such solutions in an R Markdown document.

Participants are free to analyze the data in any way they see fit, using either or both MS1 and MS2 data, as long as the results are presented in a data matrix of the format, row and column names as defined below. In this example, we use the [Trans-Proteomic Pipeline](http://tools.proteomecenter.org/wiki/index.php?title=Software:TPP) and assume a standard installation of this pipeline is available. This is not an endorsement or recommendation to use this particular software. Partipants are also allowed to submit their results in a spreadsheet, as in previous studies, or as a [Python](https://www.python.org/) script.

In iPRG studies, it is important all participants start from the same data and use the sequence databases provided by the iPRG. The raw data files are included in the study package that can be downloaded from [iprg2016.org](http://iprg2016.org).

***

### Methods

#### 1. Load package 'XML' and specify the input

Here we load the required R package(s), set the working directory and specifiy which sequence database and search engine to use.

```{r, message=FALSE}
require(XML)
setwd('C:/Inetpub/wwwroot/ISB/data/iPRG2016')             # where you want to download the data
SEARCH_ENGINE <- 'comet'                                  # your locally installed search engine
SEARCH_ENGINE_PARAMETERS <- 'comet_5ppm.params'           # your search parameters
FASTA_FILE <- 'iPRG2016.fasta'                            # the (downloaded) sequence database
```

The particular parameter file used above can be downloaded [here](http://iprg2016.org). This should not be seen as an endorsement of a particlular search engine or settings, in this study or otherwise.

***

#### 2. Convert raw data to mzML

Although study participants are allowed to use any software they wish to from this point forward, we suggest that if converting the raw files to mzML, this is done by calling msconvert. The parameters can be modified depending on the search engine and whether MS1 data is used.

```{r, eval=FALSE}
raw_files <- list.files(pattern = 'raw$')
for(x in raw_files) system(paste('msconvert', x, '--32 --zlib --filter \"peakPicking true 2-\" --filter \"msLevel 2\"'))
```
***

#### 3. Perform database search

Here we simply call the defined search engine with the defined parameters on each data file separately.

```{r, eval=FALSE}
mzML_files <- list.files(pattern = 'mzML$')
for(x in mzML_files) system(paste(SEARCH_ENGINE, ' -P', SEARCH_ENGINE_PARAMETERS, ' -N', x, ' -D', FASTA_FILE, ' ', x, sep = ''))

```
***

#### 4. Validate search results with PeptideProphet and ProteinProphet

Here we run PeptideProphet and ProteinProphet with one command, using the -Op flag. Again, we do this on each file separately.

```{r, eval=FALSE}
pepXML_files <- list.files(pattern = '[^to].pep.xml$')    # list only original pepXML files
for(x in pepXML_files) system(paste('xinteract -N', x, '.interact.pep.xml -p0.05 -l7 -PPM -Op ', x, sep = ''))
```
***

#### 5. Collect all ProteinProphet results

Next, we read in all the results from the command line calls into R using the [XML](https://cran.r-project.org/web/packages/XML/index.html) package. Another possibility would be to use the [mzR](http://bioconductor.org/packages/release/bioc/html/mzR.html) parser from the [Bioconductor](http://bioconductor.org/) package. 

```{r}
ProteinProphet_files <- list.files(pattern='interact.prot.xml$')
protein <- c()
for(x in ProteinProphet_files) 
  {
   doc <- xmlToList(x)
   for (i in seq(3,length(doc)-1)) protein <- c(protein,doc[i]$protein_group$protein$.attrs['protein_name'])
 }
protein <- unique(protein)                                # protein is now a vector of all proteoforms
probability <- list()
for(x in ProteinProphet_files) 
  {
  doc <- xmlToList(x)
  for (p in protein) probability[[x]][[p]] <- 0
  for (i in seq(3,length(doc)-1)) probability[[x]][[doc[i]$protein_group$protein$.attrs['protein_name']]] <- doc[i]$protein_group$protein$.attrs['probability']
 }
```
***

### Results

#### 6. Export results for study submission

We place the results in an R probability matrix with the sequences (proteoforms) as rows and samples as columns. The probabilities are the local false discovery rates calculated by ProteinProphet, applying parsimony, also known as ["Occam's razor"](https://en.wikipedia.org/wiki/Occam's_razor).

```{r}
DF <- as.data.frame(sapply(probability, rbind), rownames <- protein)
DM <- as.matrix(DF[,1:12])
class(DM)='numeric' 
colnames(DM) <- unlist(strsplit(colnames(DF),".mzML"))[seq(1,23,2)]
```


Finally, we extract and export the PrEST results to a tab-delimted file, ready for submission on [iprg2016.org](http://iprg2016.org):

```{r}
PrESTs<-DM[grep("HPRR", rownames(DM)),]
PrESTs<-PrESTs[order(rownames(PrESTs)),]
write.table(data.frame("PEP"=rownames(PrESTs),PrESTs), file='my_iPRG2016_results.txt', row.names=FALSE, sep='\t')
```

***

### Summary

This example solution (1) shows how the iPRG 2016 challenge can be addressed and (2) demonstrates how an entire solution can be integrated into an R Markdown document. The example is not meant to convey the *best* solution and should only be used as an example or a template for study submissions.