-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathrcdk tutorial.Rmd
276 lines (235 loc) · 13 KB
/
rcdk tutorial.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
---
title: "rcdk practice"
author: "Caroline Collins 6192527"
date: "27 September 2019"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
It can be convenient to be able to manipulate chemical structures and generate chemical information with the R environment.
The CDK is a Java library for cheminformatics that supports a wide variety of cheminformatics functionality ranging from reading molecular file formats, performing ring perception and armaticity detection to fingerprint generation and molecular descriptors. The CDK website provides links to useful documentation as well as complete Javadocs
##Getting started
The goal of the rcdk package is to allow an R user to access the cheminformatics functionality of the CDK from within R. While one can use the rJava package to make direct calls to specific methods in the CDK, from R, such usage does not usually follow common R idioms. Thus rcdk aims to allow users to use the CDK classes and methods in an R-like fashion.
The library is loaded as follows
```{r}
library(rcdk)
```
The package also provides an example data set, called bpdata which contains
277 molecules, in SMILES format and
their associated boiling points (BP) in Kelvin.
The data.frame has two columns, viz., the SMILES and the BP.
Molecules names are used as row names:
```{r}
data(bpdata)
head(bpdata)
```
##Input and Output
Chemical structures come in a variety of formats and the CDK supports many of them. Many such formats are disk based and these files can be parsed and loaded by specifying their full paths
```{r}
#mols <- load.molecules( c('data1.sdf', '/some/path/data2.sdf') )
```
Note that the above function will load any file format that is supported by the CDK, so there's no need to specify formats. In addition one can specify a URL (which should start with http://) to specify remote files as well. The result of this function is a list of molecule objects. The molecule objects are of class jobjRef (provided by the rJava package). As a result,they are pretty opaque to the user and are really meant to be processed using methods from the rcdk or rJava packages.
However, since it loads all the molecules from the specified file into a list, large files can lead to out of memory errors. In such a situtation it is preferable to iterate over the file, one structure at a time. Currently this behavior is supported for SDF and SMILES files. An example of such a usage for a large SD file would be
```{r}
# iter <- iload.molecules('verybig.sdf', type='sdf')
# while(hasNext(iter)) {
# mol <- nextElem(iter)
# print(get.property(mol, "cdk:Title"))
# }
```
##Parsing SMILES
Another common way to obtain molecule objects is by parsing SMILES strings. The simplest way to do this is
```{r}
smile <- 'c1ccccc1CC(=O)C(N)CC1CCCCOC1'
mol <- parse.smiles(smile)[[1]]
```
Usage is more efficient when multiple SMILE are supplied, since then a single SMILES parser object is used to parse all the supplied SMILES.
If you plan on parsing a large number of SMILES, you may run into memory issues, due to the large size of IAtomContainer objects. In such a case, it can be useful to call the Java and R garbage collectors explicitly at the appropriate time. In addition it can be useful to explicitly allocate a large amount of memory for the JVM. For example,
```{r}
options("java.parameters"=c("-Xmx4000m"))
library(rcdk)
for (smile in smiles) {
m <- parse.smiles(smile)
## perform operations on this molecule
jcall("java/lang/System","V","gc")
gc()
}
```
Given a list of molecule objects, it is possible to serialize them to a file in some specified format. Currently, the only output formats are SMILES or SDF. To write molecules to a disk file in SDF format.
```{r}
write.molecules(mols, filename='mymols.sdf')
```
By default, if mols is a list of multiple molecules, all of them will be written to a single SDF file. If this is not desired, you can write each on to individual files (which are prefixed by the value of filename):
```{r}
write.molecules(mols, filename='mymols.sdf', together=FALSE)
```
##Generating SMILES
Finally, we can generate a SMILES representation of a molecule using
```{r}
smiles <- c('CCC', 'c1ccccc1', 'CCCC(C)(C)CC(=O)NC') mols <- parse.smiles(smiles) get.smiles(mols[[1]])
## [1] "CCC"
unlist(lapply(mols, get.smiles))
## CCC c1ccccc1 CCCC(C)(C)CC(=O)NC ## "CCC" "C1=CC=CC=C1" "CCCC(C)(C)CC(=O)NC"
```
The CDK supports a number of flavors when generating SMILES.
For example, you can generate a SMILES with or without chirality information or generate SMILES in Kekule form.
The
```{r}
smiles.flavors
```
smiles.flavors generates an object that represents the various flavors desired for SMILES output. See the SmiFlavor javadocs for the full list of possible flavors. Eaxmple usage is
```{r}
smiles <- c('CCC', 'c1ccccc1', 'CCc1ccccc1CC(C)(C)CC(=O)NC') mols <- parse.smiles(smiles) get.smiles(mols[[3]], smiles.flavors(c('UseAromaticSymbols')))
## [1] "CCc1ccccc1CC(C)(C)CC(=O)NC"
get.smiles(mols[[3]], smiles.flavors(c('Generic','CxSmiles')))
## [1] "CCC1=CC=CC=C1CC(C)(C)CC(=O)NC"
```
Using the CxSmiles flavors allows the user to encode a variety of information in the SMILES string, such as 2D or 3D coordinates.
```{r}
m <- parse.smiles('CCC')[[1]]
m <- generate.2d.coordinates(m)
get.smiles(m, smiles.flavors(c('CxSmiles')))
get.smiles(m, smiles.flavors(c('CxCoordinates')))
```
##Atoms and Bonds
Probably the most important thing to do is to get the atoms and bonds of a molecule. The code below gets the atoms and bonds as lists of jobjRef objects, which can be manipulated using rJava or via other methods of this package.
```{r}
mol <- parse.smiles('c1ccccc1C(Cl)(Br)c1ccccc1')[[1]]
atoms <- get.atoms(mol)
bonds <- get.bonds(mol)
cat('No. of atoms =', length(atoms), '\n')
```
```{r}
cat('No. of bonds =', length(bonds), '\n')
```
```{r}
unlist(lapply(atoms, get.symbol))
```
```{r}
coords <- get.point3d(atoms[[1]])
```
Once you have the 3D coordinate matrix, a quick way to check whether the molecule is flat is to do
```{r}
if ( any(apply(coords, 2, function(x) length(unique(x))) == 1) ) {
print("molecule is flat")
}
```
##Substructure matching
```{r}
mols <- parse.smiles(c('CC(C)(C)C','c1ccc(Cl)cc1C(=O)O', 'CCC(N)(N)CC'))
query <- '[#6D2]'
matches(query, mols)
```
##Molecular Descriptors
A key requirement for the predictive modeling of molecular properties and activities are molecular descriptors - numerical charaterizations of the molecular structure. The CDK implements a variety of molecular descriptors, categorized into topological, constitutional, geometric, electronic and hybrid. It is possible to evaluate all available descriptors at one go, or evaluate individual descriptors.
First, we can take a look at the available descriptor categories.
```{r}
dc <- get.desc.categories()
dc
```
Get the descriptor names for category 4 - electronic
```{r}
dn <- get.desc.names(dc[4])
dn
```
Get the descriptor names for category 1 - hybrid
```{r}
dn <- get.desc.names(dc[1])
dn
```
Each descriptor name is actually a fully qualified Java class name for the corresponding descriptor. These names can be supplied to eval.desc to evaluate a single or multiple descriptors for one or more molecules.
```{r}
aDesc <- eval.desc(mol, dn[4])
allDescs <- eval.desc(mol, dn)
```
The return value of eval.desc is a data.frame with the descriptors in the columns and the molecules in the rows. For the above example we get a single row. But given a list of molecules, we can easily get a descriptor matrix.
##build a linear regression model
to predict boiling points for the BP dataset.
First we need a set of descriptors and so we evaluate all available descriptors. Also note that since a descriptor might belong to more than one category, we should obtain a unique set of descriptor names
```{r}
descNames <- unique(unlist(sapply(get.desc.categories(), get.desc.names)))
descNames
```
For the current discussion we focus on a few, manually selected descriptors that we know will be related to boiling point.
```{r}
data(bpdata)
mols <- parse.smiles(bpdata[,1])
descNames <- c(
'org.openscience.cdk.qsar.descriptors.molecular.KierHallSmartsDescriptor',
'org.openscience.cdk.qsar.descriptors.molecular.APolDescriptor',
'org.openscience.cdk.qsar.descriptors.molecular.HBondDonorCountDescriptor')
descs <- eval.desc(mols, descNames)
class(descs)
```
```{r}
dim(descs)
```
When a descriptor value cannot be computed, it's value is set to NA. This may happen if a descriptor requires 3D coordinates, but only 2D coordinates are available. In this case, we have manually selected descriptors such that there will be no undefined values.
Given the ubiquity of certain descriptors, some of them are directly available via their own functions. Specifically, one can calculate TPSA (topological polar surface area), AlogP and XlogP without having to go through eval.desc. (Note that AlogP and XlogP assume that hydrogens are explicitly specified in the molecule. This may not be true if the molecules were obtained from SMILES)
```{r}
mol <- parse.smiles('CC(=O)CC(=O)NCN')[[1]]
convert.implicit.to.explicit(mol)
get.tpsa(mol)
```
```{r}
get.xlogp(mol)
```
```{r}
get.alogp(mol)
```
Now that we have a
##descriptor matrix,
we easily build a linear regression model.
First, remove NA's,
correlated and constant columns.
The code is shown below, but since it involves a stochastic element, we will not run it for this example. If we were to perform feature selection, then this type of reduction would have to be performed.
```{r}
descs <- descs[, !apply(descs, 2, function(x) any(is.na(x)) )]
descs <- descs[, !apply( descs, 2, function(x) length(unique(x)) == 1 )]
r2 <- which(cor(descs)^2 > .6, arr.ind=TRUE)
r2 <- r2[ r2[,1] > r2[,2] , ]
descs <- descs[, -unique(r2[,2])]
```
Note that the above correlation reduction step is pretty crude and there are better ways to do it. Given the reduced descriptor matrix, we can perform feature selection (say using leaps, caret or a GA to identify a suitable subset of descriptors. Given that we selected the descriptors by hand, we can skip this section, and directly build the model and generate a plot of predicte versus observed BP. (Note that this is a toy example and is not an example of good QSAR practice!)
```{r}
model <- lm(BP ~ khs.sCH3 + khs.sF + apol + nHBDon, data.frame(bpdata, descs))
summary(model)
```
```{r}
plot(bpdata$BP, predict(model, descs),
xlab="Observed BP", ylab="Predicted BP",
pch=19, xlim=c(100, 700), ylim=c(100, 700))
abline(0,1, col='red')
```
##Fingerprints
Fingerprints are a common representation used for a variety of purposes such as similarity searching and predictive modeling. The CDK provides a variety of fingerprints ranging from path-based hashed fingerprints to circular (specifically, an implementation fo the ECFP fingerprints) and signature fingerprints (based on the signature molecular descriptor). Some of the fingerprints are represented as binary strings and other by integer vectors. The rcdk employs the fingerprint package to support operations on the resultant fingerprints.
In this section, we present an example of using fingerprints to generate a hierarchical clustering of a set of molecules from the included boiling point dataset. We first parse the SMILES for the molecules in the dataset and then compute the fingerprints, specifying the circular type.
```{r}
data(bpdata)
mols <- parse.smiles(bpdata[,1])
fps <- lapply(mols, get.fingerprint, type='circular')
```
With the fingerprints, we can then compute a pairwise similarity matrix using the Tanimoto metric. Since R's hclust method requires a distance matrix, we convert the similarity matrix to a distance matrix
```{r}
fp.sim <- fingerprint::fp.sim.matrix(fps, method='tanimoto')
fp.dist <- 1 - fp.sim
```
Finally, we can perform the clustering. In this case we use the hclust method though any of R's clustering methods could be used.
```{r}
cls <- hclust(as.dist(fp.dist))
plot(cls, main='A Clustering of the BP dataset', labels=FALSE)
```
Another common task for fingerprints is similarity searching. That is, given a collection of target molecules, find those molecules that are similar to a query molecule. This is achieved by evaluating a similarity metric between the query and each of the target molecules. Those target molecules exceeding a user defined cutoff will be returned. With the help of the fingerprint package this is easily accomplished.
For example, we can identify all the molecules in the BP dataset that have a Tanimoto similarity of 0.3 or more with acetalehyde, and then create a tabular summary. Note that this could also be accomplished with molecular descriptors, in which case you'd probably evaluate the Euclidean distance between descriptor vectors.
```{r}
query.mol <- parse.smiles('CC(=O)')[[1]]
target.mols <- parse.smiles(bpdata[,1])
query.fp <- get.fingerprint(query.mol, type='circular')
target.fps <- lapply(target.mols, get.fingerprint, type='circular')
sims <- data.frame(sim=do.call(rbind, lapply(target.fps,
fingerprint::distance,
fp2=query.fp, method='tanimoto')))
subset(sims, sim >= 0.3)
```
end of https://cran.r-project.org/web/packages/rcdk/vignettes/using-rcdk.html