Add CONTRIBUTING.md

* refactor
sam-morris · Apr 25, 2018 · 6cbcc85 · 6cbcc85
1 parent ec1889d
commit 6cbcc85
Show file tree

Hide file tree

Showing 4 changed files with 46 additions and 68 deletions.
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -0,0 +1,13 @@
+Contributions should be made through pull requests. Each PR should consist of
+a markdown file containing
+
+1. a description of the dataset including a link to the appropriate publication or
+reference.
+2. direct links to download the count matrix (in the form of an easy-to-load file, like a `rds` file containing an sparse matrix for R and an [AnnData](https://github.com/theislab/anndata) `hdf5` file or
+  a `mtx` file for python).
+3. direct links to download the metadata (in a `csv` with rows indexed by cell names).
+4. sample loading code for R and python.
+
+An example is [datasets/tabula_muris.md](datasets/tabula_muris.md)
+
+__How easy can you make it for someone to get started?__
diff --git a/README.md b/README.md
@@ -1,72 +1,11 @@
 # easy-data
 
-Easy access to a small collection of benchmark datasets for methods development, focused on supporting projects at the hca-comp-tools workshop. Add your benchmarking desiderata and your datasets below.
+Easy access to a small collection of benchmark datasets for methods development.
 
-## benchmark for what?
+# Instructions
 
-A discussion of the problems for which benchmark datasets would allow for experimentation.
+Instructions for downloading and loading each dataset are in text files in the `datasets` folder.
 
-* cell type annotation and reannotation at various levels of ontological depth
-* building and validating cell type classifiers
-* manifold alignment and batch-effect-aware analyses
-* assessing the variability in gene expression of cell types present in many organs
-* measuring sex differences in gene expression
-* measuring the variability in biological claims (like which genes are differentially expressed between populations) to be expected between different studies of the same cell types
+For example, Tabula Muris is described in [datasets/tabula_muris.md](datasets/tabula_muris.md).
 
-## There are several sources of public datasets.  The question is, what should be the characteristics of a benchmark dataset?
-At the most basic level, it should be very easy to access (e.g. free/open, in data formats that people use, easy to access).  Then there are different requirements depending on what is being benchmarked, such as:
-* clustering would want to see a mix of easy and difficult to cluster data.
-* portals would want a mix of small and large data sets to test development (quick test with small data) and scalability (test with large data)
-* for manifold alignment, datasets that have batch effect artifacts
-* trajectories would want data that actually contains trajectories e.g. developmental biology data, including time series data
-* control perturbations from well known experimental conditions are also helpful for benchmarking
-
-# datasets
-
-To add a dataset, just create a section with a description and links to download it.
-
-How easy can you make it for someone to get started?
-
-## `tabula muris`
-
-[Tabula Muris](http://tabula-muris.ds.czbiohub.org/) contains about 100,000 cells from 20 organs and tissues in mouse. The study is sex-balanced, with four male and four female mice. The organs included are skin, fat, mammary gland, heart, bladder, brain, thymus, spleen, kidney, limb muscle, tongue, marrow, trachea, pancreas, lung, large intestine, and liver. Many of these organs were processed using two methods: SMART-seq2 on FACS-sorted cells and microfluidic droplets from 10X Genomics.
-
-Below are instructions for getting four files: metadata (including annotations) and count data for each dataset.
-
-### metadata
-
-Version-controlled metadata are available on  [github](https://github.com/czbiohub/tabula-muris-vignettes/tree/master/data).
-
-[TM_droplet_metadata.csv](https://github.com/czbiohub/tabula-muris-vignettes/blob/master/data/TM_droplet_metadata.csv?raw=true)
-
-[TM_facs_metadata.csv](https://github.com/czbiohub/tabula-muris-vignettes/blob/master/data/TM_facs_metadata.csv?raw=true)
-
-### count files for R
-
-You can download complete count files as sparse matrices in `.rds` format for easy loading into `R`. Unzip [TabulaMuris.zip](https://s3.amazonaws.com/czbiohub-tabula-muris/TabulaMuris.zip). Load:
-
-```R
-tm.droplet.matrix = readRDS(here("data", "TM_droplet_mat.rds"))
-tm.droplet.metadata = read_csv(here("data", "TM_droplet_metadata.csv"))
-```
-
-### count files for Python
-
-You can download complete count files as sparse matrices in [AnnData](http://anndata.readthedocs.io/en/latest/)-formatted h5ad files for use in Python [here](https://s3.amazonaws.com/czbiohub-tabula-muris/TabulaMuris.h5ad.zip). You can load them using the [Scanpy](http://scanpy.readthedocs.io/en/latest/index.html) library:
-
-```python
-import pandas
-import scanpy
-
-tm_facs_metadata = pd.read_csv('data/TM_facs_metadata.csv')
-tm_facs_data = scanpy.anndata.read_h5ad('data/TM_facs_mat.h5ad')
-```
-### CSV and MTX files
-
-The original data release is on [FigShare](https://figshare.com/projects/Tabula_Muris_Transcriptomic_characterization_of_20_organs_and_tissues_from_Mus_musculus_at_single_cell_resolution/27733).
-
-
-## Software Packages
-
-- [CellBench](https://github.com/LuyiTian/CellBench_data)
-- [IA-SVA](https://github.com/UcarLab/IA-SVA)
+If you would like to add a dataset, follow the instructions in [CONTRIBUTING.md](CONTRIBUTING.md).
diff --git a/benchmarks.md b/benchmarks.md
@@ -0,0 +1,26 @@
+
+# benchmark for what?
+
+A discussion of the problems for which benchmark datasets would allow for experimentation.
+
+* cell type annotation and reannotation at various levels of ontological depth
+* building and validating cell type classifiers
+* manifold alignment and batch-effect-aware analyses
+* assessing the variability in gene expression of cell types present in many organs
+* measuring sex differences in gene expression
+* measuring the variability in biological claims (like which genes are differentially expressed between populations) to be expected between different studies of the same cell types
+
+There are several sources of public datasets.  The question is, what should be the characteristics of a benchmark dataset?
+At the most basic level, it should be very easy to access (e.g. free/open, in data formats that people use, easy to access).  Then there are different requirements depending on what is being benchmarked, such as:
+* clustering would want to see a mix of easy and difficult to cluster data.
+* portals would want a mix of small and large data sets to test development (quick test with small data) and scalability (test with large data)
+* for manifold alignment, datasets that have batch effect artifacts
+* trajectories would want data that actually contains trajectories e.g. developmental biology data, including time series data
+* control perturbations from well known experimental conditions are also helpful for benchmarking
+
+## Benchmarking resources
+
+## Software Packages
+
+- [CellBench](https://github.com/LuyiTian/CellBench_data)
+- [IA-SVA](https://github.com/UcarLab/IA-SVA)
diff --git a/datasets/tabula_muris.md b/datasets/tabula_muris.md
@@ -17,8 +17,8 @@ Version-controlled metadata are available on  [github](https://github.com/czbioh
 You can download complete count files as sparse matrices in `.rds` format for easy loading into `R`. Unzip [TabulaMuris.zip](https://s3.amazonaws.com/czbiohub-tabula-muris/TabulaMuris.zip). Load:
 
 ```R
-tm.droplet.matrix = readRDS(here("data", "TM_droplet_mat.rds"))
-tm.droplet.metadata = read_csv(here("data", "TM_droplet_metadata.csv"))
+tm.droplet.matrix = readRDS("TM_droplet_mat.rds")
+tm.droplet.metadata = read_csv("TM_droplet_metadata.csv")
 ```
 
 ## Count files for Python