README

This repository contains the code to reproduce our benchmark of dimensionality reduction methods, with a focus on single-cell data. The same code, bundled with the input data and the outputs of the computations, is available at https://figshare.com/s/9c3a0136f12b97f1dadd

Our article is now published at https://www.nature.com/articles/nbt.4314 and an earlier version is available as a pre-print at https://www.biorxiv.org/content/early/2018/04/10/298430

To run the benchmark

If you want to re-run the dimensionality reduction methods, install the following dependencies:
- Install Python 3 and the umap-learn package (using pip or https://github.com/lmcinnes/umap)
- Install FIt-SNE (https://github.com/KlugerLab/FIt-SNE), then copy the FIt-SNE folder to this directory. Edit the fast_tsne.R file and change the default value of the "fast_tsne_path" argument to "../FIt-SNE/bin/fast_tsne"
- Install scvis (https://bitbucket.org/jerry00/scvis-dev)
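
The installation steps above can be checked with a short Python sketch before starting a long run. This is not part of the repository. It assumes it is run from the repository root, that the FIt-SNE binary sits at FIt-SNE/bin/fast_tsne (the location implied by the "../FIt-SNE/bin/fast_tsne" path above), and that scvis and R expose the command-line programs "scvis" and "Rscript". The function name check_dependencies is hypothetical.

```python
import importlib.util
import shutil
from pathlib import Path


def check_dependencies(repo_root="."):
    """Return a dict mapping each dependency name to True (found) or False."""
    root = Path(repo_root)
    return {
        # umap-learn is imported in Python under the module name "umap".
        "umap-learn": importlib.util.find_spec("umap") is not None,
        # Assumed location: the FIt-SNE folder copied into the repository root.
        "FIt-SNE binary": (root / "FIt-SNE" / "bin" / "fast_tsne").is_file(),
        # Assumption: scvis installs a command-line program named "scvis".
        "scvis": shutil.which("scvis") is not None,
        # Assumption: R is installed and "Rscript" is on the PATH.
        "Rscript": shutil.which("Rscript") is not None,
    }


if __name__ == "__main__":
    for name, found in check_dependencies().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

A "MISSING" line only means the sketch did not find the dependency where it looked; a dependency installed in a different location or Python environment may still work.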

Please remember that every script must be run with the working directory set to the folder that contains it.
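
One way to satisfy the working-directory requirement without editing the scripts is to launch each one from its own folder. The sketch below is not part of the repository. It assumes that "Rscript" is on the PATH and that the script can run non-interactively, which may not hold for every script here. The names rscript_invocation and run_r_script are hypothetical.

```python
import subprocess
from pathlib import Path


def rscript_invocation(script_path):
    """Return (command, working_directory) for running one R script."""
    script = Path(script_path).resolve()
    # Run the script by file name from inside the folder that contains it.
    return ["Rscript", script.name], script.parent


def run_r_script(script_path):
    """Run an R script with the working directory set to its own folder."""
    command, workdir = rscript_invocation(script_path)
    # check=True raises CalledProcessError if the script exits with an error.
    subprocess.run(command, cwd=workdir, check=True)


# Example call, from the repository root:
# run_r_script("Han/data_parsing.R")
```

Passing the command as a list means script names that contain spaces are handled without extra quoting.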

If you only want to regenerate the graphs, you do not need to re-run the computations: the results of most of the heavy computations are saved and bundled together with the code. The full benchmark takes about a week to complete on our 8-core computer, but if you only benchmark UMAP and t-SNE it should take only a few hours. Most scripts either run dimensionality reduction or make figures, but usually not both.

Here is a brief summary of what each script does:

** Main figure scripts **
* Generate the embeddings and corresponding main figures of the paper

 ./Han/data_parsing.R : Parsing input data for the Han[Bone marrow + PBMC] dataset
 ./Han/graphs_full_dataset.R : Filtering the Han[Bone marrow + PBMC] dataset
 ./Han/DR_on_data_subset.R : Dimensionality reduction on the subsetted Han[Bone marrow + PBMC] dataset
 ./Han/graphs_paper.R : Graphs for the Han[Bone marrow + PBMC] dataset (Fig 2CDEF)
 
 ./Samusik/Samusik_01_data_load_and_DR.R : Parsing, dimensionality reduction and graphs for the Samusik_01 dataset (Fig 2AB)

 ./Wong/RunFCStSNEone TraffickTNK 10k.R : Downsample to 10k cells per sample, then run t-SNE and UMAP on the Wong dataset. This script is used within our lab so that non-programmers can use these tools, and it is shared as is. It is an interface to the functions defined in ./Wong/FCStSNEone.R
 ./Wong/data_parsing.R : Data parsing for the Wong datasets
 ./Wong/Wong.R : Graphs for the Wong datasets (Fig 1 ABCD)
 ./Wong/FCStSNEone.R : Not meant to be run directly. It defines the functions used by ./Wong/RunFCStSNEone TraffickTNK 10k.R

** Extended benchmark scripts [Supplementary Notes] **

* Parse inputs for the benchmark [Supplementary notes and most of the supplementary figures]

 ./Han_400k/data_parsing_full_dataset.R : Parsing the Han_400k dataset
 ./Han_400k/DR.R : Running UMAP, t-SNE, FIt-SNE, FIt-SNE l.e. on the Han_400k dataset

 ./Samusik/Samusik_all_data_load_and_DR.R : Parsing the Samusik_all dataset and running UMAP, t-SNE, FIt-SNE, FIt-SNE l.e.

* Generate embeddings for the benchmark

 ./Benchmark_global/DR.R : Generates t-SNE, FIt-SNE, FIt-SNE l.e. and UMAP embeddings for subsamples of the Wong, Han_400k and Samusik datasets
 ./Benchmark_global/scvis.R : Runs scvis on the above datasets
 ./Benchmark_global/rdata/Single seeds/combine_seeds.R : Recombines the results of the benchmark if it could not be completed in one go. It also adds the scvis.R outputs to the DR.R outputs
 
* Generate plots for the benchmark
 ./Benchmark_global/graphs_benchmark.R [Fig 3, Fig 4B, Fig 5, Fig 6, Fig S5, Fig S6]
 ./Benchmark_global/random_forests.R [Fig 4A]
 
* Misc
 ./Benchmark_global/tSNE_parameters.R : Small benchmark of the perplexity and maximum-iterations parameters of t-SNE
 ./FIt-SNE/fast_tsne.R : R wrapper for FIt-SNE. Install it from a release at https://github.com/KlugerLab/FIt-SNE
 ./utils.R : Helper functions used throughout the other scripts. You may need to specify your Python executable here (it defaults to /usr/bin/python3)