-
Notifications
You must be signed in to change notification settings - Fork 5
/
Copy pathREADME.Rmd
296 lines (231 loc) · 9.73 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# cuda.ml
<!-- badges: start -->
[![CRAN\_Status\_Badge](https://www.r-pkg.org/badges/version/cuda.ml)](https://cran.r-project.org/package=cuda.ml)
<a href="https://www.r-pkg.org/pkg/cuda.ml"><img src="https://cranlogs.r-pkg.org/badges/cuda.ml?color=brightgreen" style=""></a>
<!-- badges: end -->
The goal of {cuda.ml} is to provide a simple and intuitive R interface for
[RAPIDS cuML](https://github.com/rapidsai/cuml).
RAPIDS cuML is a suite of GPU-accelerated machine learning libraries powered by
[CUDA](https://en.wikipedia.org/wiki/CUDA).
{cuda.ml} is under active development, and currently implements R interfaces for
the algorithms listed below
(which is a subset of
[algorithms supported by RAPIDS cuML](https://github.com/rapidsai/cuml#supported-algorithms)).
### Supported Algorithms
| Category | Algorithm | Notes |
| --- | --- | --- |
| **Clustering** | Density-Based Spatial Clustering of Applications with Noise (DBSCAN) | Only single-GPU implementation is supported at the moment |
| | K-Means | Only single-GPU implementation is supported at the moment |
| | Single-Linkage Agglomerative Clustering | |
| **Dimensionality Reduction** | Principal Components Analysis (PCA) | Only single-GPU implementation is supported at the moment |
| | Truncated Singular Value Decomposition (tSVD) | Only single-GPU implementation is supported at the moment |
| | Uniform Manifold Approximation and Projection (UMAP) | Only single-GPU implementation is supported at the moment |
| | Random Projection | |
| | t-Distributed Stochastic Neighbor Embedding (TSNE) | |
| **Linear Models for Regression or Classification** | Linear Regression (OLS) | |
| | Linear Regression with Lasso or Ridge Regularization | |
| | Logistic Regression | |
| **Nonlinear Models for Regression or Classification** | Random Forest (RF) Classification | Only single-GPU implementation is supported at the moment |
| | Random Forest (RF) Regression | Only single-GPU implementation is supported at the moment |
| | Inference for decision tree-based models in XGBoost or LightGBM formats using the CuML Forest Inference Library (FIL) | Requires linkage to the Treelite C library when {cuml} is installed. Treelite is used for model loading. |
| | K-Nearest Neighbors (KNN) Classification | Uses [Faiss](https://github.com/facebookresearch/faiss) for Nearest Neighbors Query. |
| | K-Nearest Neighbors (KNN) Regression | Uses [Faiss](https://github.com/facebookresearch/faiss) for Nearest Neighbors Query. |
| | Support Vector Machine Classifier (SVC) | |
| | Epsilon-Support Vector Regression (SVR) | |
# Examples
## Using {cuda.ml} for supervised ML tasks through {parsnip}
{cuda.ml} provides {parsnip} bindings for supervised ML algorithms such as
`rand_forest`, `nearest_neighbor`, `svm_rbf`, `svm_poly`, and `svm_linear`.
The following example shows how {cuda.ml} can be used as a {parsnip} engine to
build a SVM classifier.
```{r svm-example}
library(dplyr, warn.conflicts = FALSE)
library(parsnip)
library(cuda.ml)
set.seed(11235)
train_inds <- iris %>%
mutate(ind = row_number()) %>%
group_by(Species) %>%
slice_sample(prop = 0.7)
train_data <- iris[train_inds$ind, ]
test_data <- iris[-train_inds$ind, ]
model <- svm_rbf(mode = "classification", rbf_sigma = 10, cost = 50) %>%
set_engine("cuda.ml") %>%
fit(Species ~ ., data = train_data)
preds <- predict(model, test_data)
cat("Confusion matrix:\n\n")
preds %>%
bind_cols(test_data %>% select(Species)) %>%
yardstick::conf_mat(truth = Species, estimate = .pred_class)
```
## Using {cuda.ml} for unsupervised ML tasks
The following example shows how {cuda.ml} can be used for unsupervised ML tasks
such as k-means clustering.
```{r kmeans-example}
library(cuda.ml)
clustering <- cuda_ml_kmeans(
iris[, which(names(iris) != "Species")],
k = 3, max_iters = 100
)
# Expected outcome: there is strong correlation
# between cluster labels and `iris$Species`
print(clustering)
library(dplyr, warn.conflicts = FALSE)
tibble(cluster_id = clustering$labels, species = iris$Species) %>%
group_by(cluster_id) %>% count(species)
```
## Using {cuda.ml} for visualizations
{cuda.ml} also features R interfaces for algorithms such as UMAP and t-SNE,
which are useful when one needs to visualize clusters of high-dimensional data
points by embedding them onto low-dimensional manifolds (i.e., 4 dimensions or
fewer).
For example, the code snippet below shows how `cuda_ml_umap()` can be used to
visualize the MNIST hand-written digits dataset, and also, the coloring based on
the true label of each sample demonstrates how well the UMAP algorithm
transforms different hand writings of the same digit into nearby points in a 2D
embedding:
```{r umap-example, cache = TRUE}
library(cuda.ml)
library(ggplot2)
library(magrittr)
# load mnist
source("data-raw/load-mnist.R")
str(mnist_images)
str(mnist_labels)
# flatten each image to a 1d array, combine into a matrix with 1 row per image
flatten <- function(img) {
dim(img) <- NULL
img
}
flattened_mnist_images <-
mnist_images %>% asplit(3) %>% lapply(flatten) %>% do.call(rbind, .)
# embed
embedding <- cuda_ml_umap(
flattened_mnist_images, n_components = 2, n_neighbors = 50,
local_connectivity = 15, repulsion_strength = 10
)
str(embedding$transformed_data)
# visualize
embedding$transformed_data %>%
as.data.frame() %>%
dplyr::mutate(Label = factor(mnist_labels)) %>%
ggplot(aes(x = V1, y = V2, color = Label)) +
geom_point(alpha = .5, size = .5) +
labs(title = "UMAP: Uniform Manifold Approximation and Projection",
subtitle = "Two Dimensional Embedding of MNIST")
```
From this type of visualization, we can qualitatively understand the following
about the MNIST dataset:
- The dataset can be reasonably classified into some number of categories.
- The right number of categories may be any where between 9 and 11.
- While there are some categories that are clearly distinguishable from others,
there are others that have less clear boundaries with their neighbors.
- A small fraction of data points did not fit particularly well into any of the
categories.
- Most data points belonging to the same digit category are clustered together
in the UMAP output
## Installation
In order for {cuda.ml} to work as expected, the C++/CUDA source code of
{cuda.ml} must be linked with CUDA runtime and a valid copy of the RAPIDS cuML
library.
Before installing {cuda.ml} itself, it may be worthwhile to take a quick look
through the sub-sections below on how to properly setup all of {cuda.ml}'s
required runtime dependencies.
### Quick note on installing the RAPIDS cuML library:
Although Conda is the only officially supported distribution channel at the
moment for RAPIDS cuML (i.e., see https://rapids.ai/start.html#get-rapids),
you can still build and install this library from source without relying on
Conda.
See https://github.com/yitao-li/cuml-installation-notes for build-from-source
instructions.
### Quick install instructions for Ubuntu 20-04:
#### Install deps:
```
sudo apt install -y cmake ccache libblas3 liblapack3
```
### Install CUDA
(consult https://developer.nvidia.com/cuda-downloads for other platforms)
```bash
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.4.2/local_installers/cuda-repo-ubuntu2004-11-4-local_11.4.2-470.57.02-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2004-11-4-local_11.4.2-470.57.02-1_amd64.deb
sudo apt-key add /var/cuda-repo-ubuntu2004-11-4-local/7fa2af80.pub
sudo apt-get update
sudo apt-get -y install cuda
```
### Add CUDA executables to path
(nvcc is needed for building the C++/CUDA source code of {cuda.ml})
```bash
echo "export PATH=$PATH:/usr/local/cuda/bin" >> ~/.bashrc
source ~/.bashrc
```
### Install Miniconda:
```bash
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod +x Miniconda3-latest-Linux-x86_64.sh
./Miniconda3-latest-Linux-x86_64.sh -b
# consult https://rapids.ai/start.html for alternatives
```
### Create and configure the conda env
```
# This is a relatively big download, may take a while
~/miniconda3/bin/conda create -n rapids-21.08 -c rapidsai -c nvidia -c conda-forge \
rapids-blazing=21.08 python=3.8 cudatoolkit=11.2
```
### Install cmake
CUDA dependencies require a relatively recent version of CMake, so you need to install it manually
```bash
wget https://github.com/Kitware/CMake/releases/download/v3.22.0/cmake-3.22.0.tar.gz
cd cmake-3.22.0
./bootstrap && make -j8 && sudo make install
cd ..
```
### Activate the conda env:
```bash
. ~/miniconda3/bin/activate
conda activate rapids-21.08
```
### Consider adjusting `LD_LIBRARY_PATH`
The subsequent steps may (or may not) fail without the following:
```bash
export LD_LIBRARY_PATH=~/miniconda3/envs/rapids-21.08/lib
```
If you get some error indicating a GLIBC version mismatch in the subsequent
steps, then please try adjusting `LD_LIBRARY_PATH` as a workaround.
### Consider enabling ccache
To speed up recompilation times during development, set this env var:
```bash
echo "export CUML4R_ENABLE_CCACHE=1" >> ~/.bashrc
. ~/.bashrc
```
### Install {cuda.ml} the R package:
You can install the released version of {cuda.ml} from
[CRAN](https://CRAN.R-project.org) with:
``` r
install.packages("cuda.ml")
```
And the development version from [GitHub](https://github.com/) with:
``` r
# install.packages("devtools")
devtools::install_github("mlverse/cuda.ml")
```
## Appendix
<details>
<summary>Inspect MNIST images</summary>
```{r mnist}
plot_mnist(1:64)
```
</details>