Add CONTRIBUTING.md

* refactor

batson authored Apr 25, 2018
1 parent ec1889d commit 6cbcc85
Showing 4 changed files with 46 additions and 68 deletions.
13 changes: 13 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,13 @@
Contributions should be made through pull requests. Each PR should consist of
a markdown file containing:

1. a description of the dataset, including a link to the appropriate publication or
reference.
2. direct links to download the count matrix (in an easy-to-load format, such as an `rds` file containing a sparse matrix for R, and an [AnnData](https://github.com/theislab/anndata) `hdf5` file or
an `mtx` file for Python).
3. direct links to download the metadata (in a `csv` file with rows indexed by cell names).
4. sample loading code for R and Python (a sketch of what such a snippet might look like follows below).

An example is [datasets/tabula_muris.md](datasets/tabula_muris.md).

__How easy can you make it for someone to get started?__
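As a rough illustration of item 4, a minimal Python loading snippet for a contributed dataset might look like the sketch below. The file names are hypothetical placeholders, and the calls simply mirror the pandas/AnnData pattern used for Tabula Muris elsewhere in this repository.

```python
import pandas as pd
import scanpy as sc

# Hypothetical file names -- substitute the dataset's actual download artifacts.
metadata = pd.read_csv("my_dataset_metadata.csv", index_col=0)  # rows indexed by cell name
counts = sc.read_h5ad("my_dataset_counts.h5ad")                 # AnnData object holding the count matrix

# Quick consistency check: every cell in the count matrix should have metadata.
assert set(counts.obs_names) <= set(metadata.index)
```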
71 changes: 5 additions & 66 deletions README.md
@@ -1,72 +1,11 @@
# easy-data

-Easy access to a small collection of benchmark datasets for methods development, focused on supporting projects at the hca-comp-tools workshop. Add your benchmarking desiderata and your datasets below.
+Easy access to a small collection of benchmark datasets for methods development.

-## benchmark for what?
+# Instructions

-A discussion of the problems for which benchmark datasets would allow for experimentation.
+Instructions for downloading and loading each dataset are in text files in the `datasets` folder.

-* cell type annotation and reannotation at various levels of ontological depth
-* building and validating cell type classifiers
-* manifold alignment and batch-effect-aware analyses
-* assessing the variability in gene expression of cell types present in many organs
-* measuring sex differences in gene expression
-* measuring the variability in biological claims (like which genes are differentially expressed between populations) to be expected between different studies of the same cell types
+For example, Tabula Muris is described in [datasets/tabula_muris.md](datasets/tabula_muris.md).

-## There are several sources of public datasets. The question is, what should be the characteristics of a benchmark dataset?
-At the most basic level, it should be very easy to access (e.g. free/open, in data formats that people use, easy to access). Then there are different requirements depending on what is being benchmarked, such as:
-* clustering would want to see a mix of easy and difficult to cluster data.
-* portals would want a mix of small and large data sets to test development (quick test with small data) and scalability (test with large data)
-* for manifold alignment, datasets that have batch effect artifacts
-* trajectories would want data that actually contains trajectories e.g. developmental biology data, including time series data
-* control perturbations from well known experimental conditions are also helpful for benchmarking
-
-# datasets
-
-To add a dataset, just create a section with a description and links to download it.
-
-How easy can you make it for someone to get started?
-
-## `tabula muris`
-
-[Tabula Muris](http://tabula-muris.ds.czbiohub.org/) contains about 100,000 cells from 20 organs and tissues in mouse. The study is sex-balanced, with four male and four female mice. The organs included are skin, fat, mammary gland, heart, bladder, brain, thymus, spleen, kidney, limb muscle, tongue, marrow, trachea, pancreas, lung, large intestine, and liver. Many of these organs were processed using two methods: SMART-seq2 on FACS-sorted cells and microfluidic droplets from 10X Genomics.
-
-Below are instructions for getting four files: metadata (including annotations) and count data for each dataset.
-
-### metadata
-
-Version-controlled metadata are available on [github](https://github.com/czbiohub/tabula-muris-vignettes/tree/master/data).
-
-[TM_droplet_metadata.csv](https://github.com/czbiohub/tabula-muris-vignettes/blob/master/data/TM_droplet_metadata.csv?raw=true)
-
-[TM_facs_metadata.csv](https://github.com/czbiohub/tabula-muris-vignettes/blob/master/data/TM_facs_metadata.csv?raw=true)
-
-### count files for R
-
-You can download complete count files as sparse matrices in `.rds` format for easy loading into `R`. Unzip [TabulaMuris.zip](https://s3.amazonaws.com/czbiohub-tabula-muris/TabulaMuris.zip). Load:
-
-```R
-tm.droplet.matrix = readRDS(here("data", "TM_droplet_mat.rds"))
-tm.droplet.metadata = read_csv(here("data", "TM_droplet_metadata.csv"))
-```
-
-### count files for Python
-
-You can download complete count files as sparse matrices in [AnnData](http://anndata.readthedocs.io/en/latest/)-formatted h5ad files for use in Python [here](https://s3.amazonaws.com/czbiohub-tabula-muris/TabulaMuris.h5ad.zip). You can load them using the [Scanpy](http://scanpy.readthedocs.io/en/latest/index.html) library:
-
-```python
-import pandas as pd
-import scanpy
-
-tm_facs_metadata = pd.read_csv('data/TM_facs_metadata.csv')
-tm_facs_data = scanpy.read_h5ad('data/TM_facs_mat.h5ad')
-```
-### CSV and MTX files
-
-The original data release is on [FigShare](https://figshare.com/projects/Tabula_Muris_Transcriptomic_characterization_of_20_organs_and_tissues_from_Mus_musculus_at_single_cell_resolution/27733).
-
-
-## Software Packages
-
-- [CellBench](https://github.com/LuyiTian/CellBench_data)
-- [IA-SVA](https://github.com/UcarLab/IA-SVA)
+If you would like to add a dataset, follow the instructions in [CONTRIBUTING.md](CONTRIBUTING.md).
26 changes: 26 additions & 0 deletions benchmarks.md
@@ -0,0 +1,26 @@

# benchmark for what?

A discussion of the problems that benchmark datasets would make it possible to experiment with:

* cell type annotation and reannotation at various levels of ontological depth
* building and validating cell type classifiers
* manifold alignment and batch-effect-aware analyses
* assessing the variability in gene expression of cell types present in many organs
* measuring sex differences in gene expression
* measuring the variability in biological claims (e.g. which genes are differentially expressed between populations) that should be expected across different studies of the same cell types

There are several sources of public datasets; the question is what the characteristics of a benchmark dataset should be.
At the most basic level, it should be very easy to access (e.g. free and open, and in data formats that people actually use). Beyond that, the requirements depend on what is being benchmarked, for example (a small illustrative sketch follows this list):
* clustering benchmarks want a mix of easy- and difficult-to-cluster data
* portals want a mix of small and large datasets, to test development (quick tests with small data) and scalability (tests with large data)
* manifold alignment needs datasets with batch-effect artifacts
* trajectory inference wants data that actually contains trajectories, e.g. developmental biology data, including time series
* control perturbations from well-known experimental conditions are also helpful for benchmarking
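One way to make these desiderata concrete is a small registry that tags each dataset with benchmark-relevant attributes and filters it per task. The sketch below is purely illustrative: the attribute names and the second entry are hypothetical, and the Tabula Muris attributes are rough readings of its description rather than authoritative annotations.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Dataset:
    name: str
    n_cells: int          # approximate size: small for quick tests, large for scalability
    has_batches: bool     # multiple protocols or runs, for batch-effect-aware methods
    has_timepoints: bool  # time series / developmental data, for trajectory benchmarks
    has_controls: bool    # control perturbations from well-known experimental conditions

# Illustrative entries only; attributes are not authoritative.
REGISTRY: List[Dataset] = [
    Dataset("tabula_muris", n_cells=100_000, has_batches=True, has_timepoints=False, has_controls=False),
    Dataset("toy_timecourse", n_cells=2_000, has_batches=False, has_timepoints=True, has_controls=True),
]

def candidates_for(task: str) -> List[Dataset]:
    """Return datasets whose attributes fit a given benchmarking task."""
    if task == "manifold_alignment":
        return [d for d in REGISTRY if d.has_batches]
    if task == "trajectories":
        return [d for d in REGISTRY if d.has_timepoints]
    if task == "scalability":
        return [d for d in REGISTRY if d.n_cells >= 50_000]
    return list(REGISTRY)

print([d.name for d in candidates_for("manifold_alignment")])  # -> ['tabula_muris']
```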

## Benchmarking resources

## Software Packages

- [CellBench](https://github.com/LuyiTian/CellBench_data)
- [IA-SVA](https://github.com/UcarLab/IA-SVA)
4 changes: 2 additions & 2 deletions datasets/tabula_muris.md
@@ -17,8 +17,8 @@ Version-controlled metadata are available on [github](https://github.com/czbioh
You can download complete count files as sparse matrices in `.rds` format for easy loading into `R`. Unzip [TabulaMuris.zip](https://s3.amazonaws.com/czbiohub-tabula-muris/TabulaMuris.zip). Load:

```R
-tm.droplet.matrix = readRDS(here("data", "TM_droplet_mat.rds"))
-tm.droplet.metadata = read_csv(here("data", "TM_droplet_metadata.csv"))
+tm.droplet.matrix = readRDS("TM_droplet_mat.rds")
+tm.droplet.metadata = read_csv("TM_droplet_metadata.csv")
```

## Count files for Python