Reuseable Wiener.Rmd

---
title: "Assignment 2 MSB1015 Making Multivariate Statistics Reuseable"
author: "Caroline Collins 6192527"
date: "3 October 2019"
output: html_document
---
##Note 
As a prerequisite, this notebook needs a certain Wikidata query R package which you can find here: https://github.com/bearloga/WikidataQueryServiceR

##Project Synopsis

Before running a statistical analysis, we began with some WikiData data that needed to be completed, referenced, inspected, processed and cleaned.
After this came the analysis and modelling, and, finally, visualisation of the
results. 
The project process taught us that there are many choices to be made about methods, parameters and variables.
I aimed to document this process in the right level of detail to render the project reproducible. This includes justifying many of the modelling choices. 
The following interactive notebook is an explanation of the steps I performed.

##Computational Chemistry Background

In 1947 Harry Wiener showed a correlation between the physicochemical properties of organic compounds and their chemical structure.
He made a correlation model that linked 
structural features of compounds
with boiling points. 


##Brief Description of the Assignment

In this assignment we use **SPARQL** to query physicochemical properties from Wikidata for a series of chemical properties,
then make a (PLS) Partial Least Squares model 
in an R Markdown notebook with RStudio. 
In order to correctly use the model for prediction it is necessary to split
the dataset into a training and a test set.
(using 10% or 20% as test set and the remaining for training) 
and also to use cross-validation 
(e.g. Leave-one-Out or Leave-10%-Out) 
An elbow plot will help to estimate the optimal number of latent variables to use in the predictive model. 
A scatterplot of the model predictions for the test set of the dependent variable - boiling point (the modelled property) versus the experimental (real, observed) value will serve as a visualisation of the accuracy of the model predictions, quantified as an RMSEP in units Kelvin.
This assignment is part of weeks 4 and 5 of Scientific Programming on the Master's Systems Biology at Maastricht University.  [https://www.maastrichtuniversity.nl/education/master/master-systems-biology ]

##Useful sources

- Pharmacology: Rang & Dale, ISBN
9780702053627, in that edition Chapter 7 en 60 are of interest
- For general knowledge about QSAR, e.g.:
Varnek & Tropsha, Chemoinformatics approaches to virtual screening
- Ch. 6 of Kahnâs Recent Trends on QSAR in the Pharmaeutical Perceptions
(ebook)
- Wiener H. Structural Determination of Paraffin Boiling Points. Journal of the
American Chemical Society. 1947 Jan;69(1):17-20.
- R Markdown: [ https://rmarkdown.rstudio.com/ ]
- PLS package for R:[ https://cran.r-project.org/web/packages/pls/]
- Vignette: [ https://cran.r-project.org/web/packages/pls/vignettes/pls-manual.pdf]
- rcdk package for R Vignette/tutorial :[ https://cran.r-project.org/web/packages/rcdk/vignettes/using-rcdk.html]

- The WikiData SPARQL endpoint is [https://query.wikidata.org/bigdata/namespace/wdq/sparql ]

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

##STEP 0 SETUP
Set up libraries required for the assignment

```{r}
library("rcdk")
library(pls)
library(WikidataQueryServiceR)
```

##STEP 1 COLLECT and PREPARE DATA
Using the R package WikidataQueryServiceR. 
Maintaining the SPARQL query syntax.
The first query checks for any entries still waiting for their  boiling point statements to be entered in WikiData.

###WikiData objects and properties used here
- P31 is an instance of
- P279 is is a subclass of
- Q41581 an alkane is an acyclic saturated hydrocarbon
- P2102 is the boiling point
- P233 is the canonical SMILES

###Notes on SMILES
is a text string that represents the compound structure in 2D. ie. it specifies elements, connectivity and bond order, but not the 3D position of each atom.
Examples of SMILES:
-ethanol CCO; 
-ethylene C=C; 
-acetylene C#C; 
-2-propanol CC(O)C
Query WikiData to find if there are many molecules with missing boiling point data.
```{r}
sparql_query_missingbp <- 'SELECT DISTINCT ?alkane ?alkaneLabel ?SMILES ?boilingpoint
WHERE {
    ?alkane wdt:P31/wdt:P279* wd:Q41581 .
    ?alkane wdt:P233 ?SMILES .
    MINUS {?alkane wdt:P2102 ?boilingpoint .}
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en"}
}'
results_missingbp = query_wikidata(sparql_query_missingbp)
```
During the period of working on the assignment, this number of molecules missing their boiling point data in WikiData fell from 11 to 8.
```{r}
sparql_query <- 'SELECT DISTINCT ?alkane ?alkaneLabel ?boilingpoint 
WHERE {
    ?alkane wdt:P31/wdt:P279* wd:Q41581 .
    
    ?alkane wdt:P2102 ?boilingpoint .
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en"}
}'
results_first = query_wikidata(sparql_query)
```
first eyeball of the data : 131 unique / distinct observations are returned (140 non-unique) Update: 2 October. there are now 134. 
```{r}
range(results_first$boilingpoint)
```
In the first week of the project this returned a range -44.00 to 994.15 - units unknown.
That range has a suspiciously low value. In Kelvin expect all values non-negative!
Let's check out the units of the boiling points and 
convert anything non-Kelvin to Kelvin.

###Convert units
The following query is altered from the one provided by Egon  "set up to deal with the complex structure of statements with units in Wikidata." ie it reaches right down into the hierarchical structure of the WikiData  data to extract the various units entered by different WikiData users, and also a human readable label for those units.
```{r}
units_query <- 'SELECT DISTINCT ?alkane ?alkaneLabel ?SMILES ?boilingpoint ?bpUnit ?bpUnitLabel WHERE {
  ?alkane wdt:P31/wdt:P279* wd:Q41581 .
  ?alkane wdt:P233 ?SMILES ;
        p:P2102 [
          ps:P2102 ?boilingpoint ;
          psv:P2102/wikibase:quantityUnit  ?bpUnit
        ] .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}'
results_bp = query_wikidata(units_query)
possible_units <- unique(results_bp$bpUnitLabel)
possible_units
```

getting the boiling points out of the WikiData DB
the problem is that some of them are in Celsius and others in Fahrenheit.
Convert units and change unitlabels across the dataset.
```{r}

results_bp$boilingpoint[results_bp$bpUnitLabel == "degree Celsius"] <- results_bp$boilingpoint[results_bp$bpUnitLabel == "degree Celsius"] + 273.15
results_bp$bpUnitLabel[results_bp$bpUnitLabel == "degree Celsius"] <- "kelvin"
results_bp$boilingpoint[results_bp$bpUnitLabel == "degree Fahrenheit"] <- (results_bp$boilingpoint[results_bp$bpUnitLabel == "degree Fahrenheit"]-32)*5/9 + 273.15
results_bp$bpUnitLabel[results_bp$bpUnitLabel == "degree Fahrenheit"] <- "kelvin"
```
quick check
```{r}
unique(results_bp$bpUnitLabel)
```
```{r}
head(results_bp$boilingpoint)
```

##Eyeball the data
```{r}
range(results_bp$boilingpoint)
```
```{r}
plot.default(results_bp$boilingpoint)
```
  
  
  Check the distribution of the input data
```{r}
hist(results_bp$boilingpoint)
```
  
  
  The boiling points are not normally distributed.
It will be important to check the distribution of the training set later to verify that it is not biased or skewed.


Chemistry beginner's question:
Does the length of the SMILES string correlate with the boiling point?
```{r}
plot(nchar(results_bp$SMILES),results_bp$boilingpoint)
```
 
 
  The correlation between the boiling point and the string length of the SMILES looks good!!
```{r}
cor(nchar(results_bp$SMILES),results_bp$boilingpoint)
```

##Collect variables which we expect to be informative of boiling point
Translate the data that we have extracted, in the form of SMILES, into what we need out of the CDK database in order to do the regression. For this use R package rcdk.
This will give a collection of descriptors
which can be used as variables,
following that we will then generate latent variables with PLS.

##Get Molecule Descriptors with rcdk
A key requirement for the predictive modeling of molecular properties and activities are molecular descriptors - numerical charaterizations of the molecular structure. 
The CDK implements a variety of molecular descriptors, categorized.
rcdk  allows us to access the cheminformatics functionality of the CDK from within R.
There is a list of descriptors here https://cdk.github.io/cdk/1.5/docs/api/org/openscience/cdk/qsar/descriptors/molecular/package-tree.html
I can also learn about descriptors in the rcdk documentation.
https://cran.r-project.org/web/packages/rcdk/vignettes/using-rcdk.html

##Explore rcdk
The molecule objects (that result from load.molecules) are of class jobjRef (provided by the rJava package). As a result,they are pretty opaque to the user and are really meant to be processed using methods from the rcdk or rJava packages.

Another common way to obtain molecule objects is by parsing SMILES strings. The simplest way to do this is eg
```{r}
smile <- 'c1ccccc1CC(=O)C(N)CC1CCCCOC1'
mol <- parse.smiles(smile)[[1]]
```
Usage is more efficient when multiple SMILE are supplied, since then a single SMILES parser object is used to parse all the supplied SMILES. (from the rcdk tutorial -see useful sources)
It is possible to evaluate all available descriptors at one go, or evaluate individual descriptors.
Each descriptor name (eg "org.openscience.cdk.qsar.descriptors.molecular.WHIMDescriptor") is  a fully qualified Java class name for the corresponding descriptor. 
These names can be supplied to eval.desc to evaluate a single or multiple descriptors for one or more molecules.

Explore the descriptor categories
```{r}
dc <- get.desc.categories()
dc
```

###Choosing suitable variables for the predictive model
Here there are two opposing approaches to building the model:
purely data-driven throw the kitchen sink at it, uses all the available descriptors from CDK (how many datapoints is that? - 219 after cleaning), followed by dimension reduction and then predicts based on the optimum number of latent variables. (bp_model_2)
vs a more knowledge-based approach (bp_model_1). By eg. searching the descriptor list on "Weiner", "boiling point", "complexity" and handpicking likely important variables (this is the approach used in the rcdk vignette). The r pls package PLS method still employs principal component analysis, so our prediction will still run on latent variables anyway.

###Generate a set of descriptors for a very simple model###
The list of descriptor names used here comes from the rcdk tutorial (https://cran.r-project.org/web/packages/rcdk/vignettes/using-rcdk.html) the example in which the are also interested in predicting boiling points from organic molecules.
```{r}
mols <- parse.smiles(results_bp$SMILES)
descNames_limited <- c(
 'org.openscience.cdk.qsar.descriptors.molecular.KierHallSmartsDescriptor',
 'org.openscience.cdk.qsar.descriptors.molecular.APolDescriptor',
 'org.openscience.cdk.qsar.descriptors.molecular.HBondDonorCountDescriptor')
descriptors_limited <- eval.desc(mols, descNames_limited)
```
Note: this gives 81 variables across the molecules list. Some will be NA, nevertheless 81 variables from 3 descriptors is a lot more than I expected. Prompting the question how many variables would there be from the complete set of around 50 descriptors? (answer: more than 200)
```{r}
anyNA(descriptors_limited)
```
Confirms that , as mentioned in the tutorial, these descriptors have been chosen so as to not return NAs.
Nevertheless, this code, preparatory to building a predictive model from the descriptor matrix, is set up to remove columns with NAs and 
any constant columns from the matrix.
```{r}
#remove NAs
descriptors_limited <- descriptors_limited[, !apply(descriptors_limited, 2, function(x) any(is.na(x)) )]
#remove constant columns
descriptors_limited <- descriptors_limited[, !apply( descriptors_limited, 2, function(x) length(unique(x)) == 1 )] #no. of variables drops from 81 to 5 on running this

```


Add boiling point to this descriptor matrix, this will enable us to use the formula syntax in plsr to build our model.
```{r}
descriptor_matrix_1 <- cbind(descriptors_limited, results_bp$boilingpoint)
library(plyr)
descriptor_matrix_1 <- rename(descriptor_matrix_1,c("results_bp$boilingpoint" = "boilingpoint"))       #renames awkward column name (requires dplyr package)
```


##STEP 2 MULTIVARIATE STATISTICS Partial Least Squares (PLS) with plsr

###Split dataset into completely independent, randomly sampled Training and Test datasets
make a random 80/20 split
```{r}
number <- nrow(descriptor_matrix_1)
indexes <- c(1:number)
sample_indexes <- sample(indexes, size = (number*0.2))
bp_train <- descriptor_matrix_1[-sample_indexes,]
bp_test <- descriptor_matrix_1[sample_indexes,]
```

###Verify that the ditribution of the training data is not skewed.

```{r}
hist(bp_train$boilingpoint)
```


Check the distribution of the boiling points of the training dataset on a histogram.
Satisfactory: the training set has a similar distribution to the full set of alkanes.

###Make first model

There are 5 descriptor variables
Cross validation and PLS results, displayed on an elbow plot, suggest that 1 latent variable is optimum for this model.
```{r}


bp_model_1 <- plsr(boilingpoint ~. ,ncomp = 3, data = bp_train, validation = "CV")
summary(bp_model_1)
```
```{r}
plot(RMSEP(bp_model_1), legendpos = "topright")
```


###What's the predictive power of this very basic model with a single latent variable?
Predict on the test set
```{r}
pred_bp_1 <- predict(bp_model_1, ncomp = 1, newdata = bp_test)
plot(pred_bp_1,bp_test$boilingpoint, abline(a= 0, b = 1))
```

```{r}
RMSEP(bp_model_1, newdata = bp_test)
```
This is not a satisfactory model. With a single latent variables RMSEP is somewhere between 90 and 100 Kelvin.
It does not perform very well at predicting the boiling points of compounds in the training set. This choice of descriptors appears not to be sufficiently predictive.

##Model 2 A data-driven model (bp_model_2)

###Get all the descriptors possible from CDK

```{r}
descNames <- unique(unlist(sapply(get.desc.categories(), get.desc.names))) #50 descriptors
dn_all <- eval.desc(mols, descNames) #286 variables
```

remove NA - containing columns as before

```{r}
anyNA(dn_all)
#remove NAs
desc_all <- dn_all[, !apply(dn_all, 2, function(x) any(is.na(x)) )]
#results in 219 variables (before was 286 variables)
```

and create one big decriptor_matrix_2 containing the boiling point

```{r}

#Add boiling point to this to make the descriptor matrix
descriptor_matrix_2 <-  cbind(desc_all, results_bp$boilingpoint)
#note rename() needs library(plyr)
descriptor_matrix_2 <- rename(descriptor_matrix_2,c("results_bp$boilingpoint" = "boilingpoint"))       #renames awkward column name (uses dplyr package)
```
Split the training and test sets.

```{r}
bp_train_2 <- descriptor_matrix_2[-(sample_indexes),]
bp_test_2 <- descriptor_matrix_2[sample_indexes,]
```
Make and cross-validate the model using plsr

```{r}
bp_model_2 <- plsr(boilingpoint ~., data = bp_train_2, ncomp = 12, validation = "CV" )
summary(bp_model_2)
```
Select the optimum number of latent variables

```{r}
plot(RMSEP(bp_model_2), legendpos = "topright")
```


Based on the elbow plot , a model with 3 latent variables appears optimum with good performance without adding  too much complexity. What's the predictive power of this data-driven model with 3 latent variables?

##STEP 3 VISUALISATION 

Predict boiling points of the test set and generate a
Scatterplot of Predicted boiling pt vs Observed boiling point in Test dataset

```{r}
pred_bp_2 <- predict(bp_model_2, ncomp = 3, newdata = bp_test_2)
plot(pred_bp_2,bp_test_2$boilingpoint, abline(a= 0, b = 1))

```


This model is more successful than bp_model_1, predicting reasonably well.

##STEP 4 CONCLUSIONS

###Report prediction errors 
```{r}
RMSEP(bp_model_2, newdata = bp_test_2)
```

It is interesting to note that, based on the earlier elbow plot depicting the results of the cross-validation RMSEP's from the training set alone,  the choice of 2, 3,4,5 or 6 latent variables for the model looked optimum, with the final model choice of 3 latent variables meeting the criterion 'As simple as possible, but not simpler than really needed', Whereas,the error measurement at 10 latent variables (ncomp = 10) gave the lowest actual error on the first particular set of independent randomly sampled test data that I ran, and on another run, was still decreasing at ncomp = 12. That number and the RMSEP will vary depending on the particular test set, and considering the multimodal distribution of the boiling points and the small (c.100) number of distinct molecules used to build the model, it can vary considerably.

###Comparison of correlations
At the start of the project, I observed that between SMILES string length and boiling point the correlation was 0.93...
Let us compare that to the correlation of bp_model_1
```{r}
cor(pred_bp_1,bp_test$boilingpoint)
```
On this particular test set,  0.935. (another run: 0.941) only a very small improvement on the SMILES model.
Compare: what is now the correlation of my final model (bp_model_2) between predicted and observed boiling points? 0.99 (on another run : 0.979)
```{r}
cor(pred_bp_2,bp_test_2$boilingpoint)
```
So, the data-driven model with few latent variables created from a large number of variables with a chemical structural meaning is an improvement on the first simple model that I created from handpicked CDK descriptors with only one latent variable. That first model had a similar predicted/oberved correlation to the initial ultra-simple SMILES stringlength vs boiling point comparison.

**The Final model (data-driven - 3 latent variables) RMSEP varies between 30 and 56, depending on the randomly selected test set.**


###Comparing this to literature : 
Wiener constructed a relationship with 3 variables n (number of Carbon atoms) and w and p.


###Reusebility of the code for other types of molecules
It would be interesting to repeat and apply to **more chemical structures** possibly alcohols. Currently there is a lack of available boiling point data in WikiData on alcohols and other organic molecules.
We need more data to create a good model because
we might expect a model trained for more complex organic molecules to be, itself, more complex (require more latent variables) in order to make similarly successful predictions.

###END.