diff --git a/02-microarray/clustering_microarray_01_heatmap.Rmd b/02-microarray/clustering_microarray_01_heatmap.Rmd index 57ebb68e..7ea1a126 100644 --- a/02-microarray/clustering_microarray_01_heatmap.Rmd +++ b/02-microarray/clustering_microarray_01_heatmap.Rmd @@ -112,7 +112,7 @@ Your new analysis folder should contain: - A folder for `plots` (currently empty) - A folder for `results` (currently empty) -Your example analysis folder should now look something like this (except with respective experiment accession id and analysis notebook name you are using): +Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using): diff --git a/02-microarray/clustering_microarray_01_heatmap.html b/02-microarray/clustering_microarray_01_heatmap.html index 83f3ca2a..761f0518 100644 --- a/02-microarray/clustering_microarray_01_heatmap.html +++ b/02-microarray/clustering_microarray_01_heatmap.html @@ -1753,7 +1753,7 @@

2.6 Check out our file structure!
  • A folder for results (currently empty)
  • -

    Your example analysis folder should now look something like this (except with respective experiment accession id and analysis notebook name you are using):

    +

    Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using):

    In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. Run this chunk to double check that your files are in the right place.

    # Define the file path to the data directory
    @@ -2025,16 +2025,16 @@ 

    6 Print session info

    ## loaded via a namespace (and not attached): ## [1] Rcpp_1.0.5 pillar_1.4.6 compiler_4.0.2 RColorBrewer_1.1-2 ## [5] R.methodsS3_1.8.1 R.utils_2.10.1 tools_4.0.2 digest_0.6.25 -## [9] gtable_0.3.0 evaluate_0.14 lifecycle_0.2.0 tibble_3.0.3 +## [9] evaluate_0.14 lifecycle_0.2.0 tibble_3.0.3 gtable_0.3.0 ## [13] R.cache_0.14.0 pkgconfig_2.0.3 rlang_0.4.7 cli_2.0.2 ## [17] rstudioapi_0.11 yaml_2.2.1 xfun_0.17 dplyr_1.0.2 ## [21] styler_1.3.2 stringr_1.4.0 knitr_1.30 generics_0.0.2 ## [25] vctrs_0.3.4 hms_0.5.3 tidyselect_1.1.0 grid_4.0.2 ## [29] getopt_1.20.3 glue_1.4.2 R6_2.4.1 fansi_0.4.1 ## [33] rmarkdown_2.3 farver_2.0.3 purrr_0.3.4 readr_1.3.1 -## [37] rematch2_2.1.2 scales_1.1.1 backports_1.1.10 ellipsis_0.3.1 -## [41] htmltools_0.5.0 assertthat_0.2.1 colorspace_1.4-1 utf8_1.1.4 -## [45] stringi_1.5.3 munsell_0.5.0 crayon_1.3.4 R.oo_1.24.0
    +## [37] scales_1.1.1 backports_1.1.10 ellipsis_0.3.1 htmltools_0.5.0 +## [41] assertthat_0.2.1 colorspace_1.4-1 utf8_1.1.4 stringi_1.5.3 +## [45] munsell_0.5.0 crayon_1.3.4 R.oo_1.24.0

    Gu Z., R. Eils, and M. Schlesner, 2016 Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics.

    diff --git a/02-microarray/differential-expression_microarray_01_2-groups.Rmd b/02-microarray/differential-expression_microarray_01_2-groups.Rmd index ab752683..ae96b7e3 100644 --- a/02-microarray/differential-expression_microarray_01_2-groups.Rmd +++ b/02-microarray/differential-expression_microarray_01_2-groups.Rmd @@ -11,30 +11,30 @@ output: # Purpose of this analysis -This notebook takes data and metadata from refine.bio and identifies differentially expressed genes. +This notebook takes data and metadata from refine.bio and identifies differentially expressed genes. ⬇️ [**Jump to the analysis code**](#analysis) ⬇️ # How to run this example For general information about our tutorials and the basic software packages you will need, please see our ['Getting Started' section](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-this-tutorial-is-structured). -We recommend taking a look at our [Resources for Learning R](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#resources-for-learning-r) if you have not written code in R before. +We recommend taking a look at our [Resources for Learning R](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#resources-for-learning-r) if you have not written code in R before. ## Obtain the `.Rmd` file To run this example yourself, [download the `.Rmd` for this analysis by clicking this link](https://alexslemonade.github.io/refinebio-examples/02-microarray/differential-expression_microarray_01_2-groups.Rmd). You can open this `.Rmd` file in RStudio and follow the rest of these steps from there. (See our [section about getting started with R notebooks](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-and-use-rmds) if you are unfamiliar with `.Rmd` files.) -Clicking this link will most likely send this to your downloads folder on your computer. 
+Clicking this link will most likely send this to your downloads folder on your computer. Move this `.Rmd` file to where you would like this example and its files to be stored. -## Set up your analysis folders +## Set up your analysis folders Good file organization is helpful for keeping your data analysis project on track! -We have set up some code that will automatically set up a folder structure for you. -Run this next chunk to set up your folders! +We have set up some code that will automatically set up a folder structure for you. +Run this next chunk to set up your folders! -If you have trouble running this chunk, see our [introduction to using `.Rmd`s](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-and-use-rmds) for more resources and explanations. +If you have trouble running this chunk, see our [introduction to using `.Rmd`s](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-and-use-rmds) for more resources and explanations. ```{r} # Create the data folder if it doesn't exist @@ -63,7 +63,7 @@ In the same place you put this `.Rmd` file, you should now have three new empty ## Obtain the dataset from refine.bio -For general information about downloading data for these examples, see our ['Getting Started' section](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-the-data). +For general information about downloading data for these examples, see our ['Getting Started' section](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-the-data). Go to this [dataset's page on refine.bio](https://www.refine.bio/experiments/GSE71270/creb-overexpression-induces-leukemia-in-zebrafish-by-blocking-myeloid-differentiation-process). 
@@ -76,7 +76,7 @@ Fill out the pop up window with your email and our Terms and Conditions: It may take a few minutes for the dataset to process. -You will get an email when it is ready. +You will get an email when it is ready. ## About the dataset we are using for this example @@ -86,22 +86,22 @@ In this analysis, we will test differential expression between the control and C ## Place the dataset in your new `data/` folder -refine.bio will send you a download button in the email when it is ready. -Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in `.zip`. +refine.bio will send you a download button in the email when it is ready. +Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in `.zip`. Double clicking should unzip this for you and create a folder of the same name. - + For more details on the contents of this folder see [these docs on refine.bio](http://docs.refine.bio/en/latest/main_text.html#downloadable-files). The `` folder has the data and metadata TSV files you will need for this example analysis. -Experiment accession ids usually look something like `GSE1235` or `SRP12345`. +Experiment accession ids usually look something like `GSE1235` or `SRP12345`. Copy and paste the `GSE71270` folder into your newly created `data/` folder. ## Check out our file structure! 
-Your new analysis folder should contain: +Your new analysis folder should contain: - The example analysis `.Rmd` you downloaded - A folder called "data" which contains: @@ -110,13 +110,13 @@ Your new analysis folder should contain: - The metadata TSV - A folder for `plots` (currently empty) - A folder for `results` (currently empty) - -Your example analysis folder should now look something like this (except with respective experiment accession id and analysis notebook name you are using): + +Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using): -In order for our example here to run without a hitch, we need these files to be in these locations so we've constructed a test to check before we get started with the analysis. -Run this chunk to double check that your files are in the right place. +In order for our example here to run without a hitch, we need these files to be in these locations so we've constructed a test to check before we get started with the analysis. +Run this chunk to double check that your files are in the right place. ```{r} # Define the file path to the data directory @@ -131,13 +131,13 @@ file.exists(file.path(data_dir, "metadata_GSE71270.tsv")) If the chunk above printed out `FALSE` to either of those tests, you won't be able to run this analysis _as is_ until those files are in the appropriate place. -If the concept of a "file path" is unfamiliar to you; we recommend taking a look at our [section about file paths](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#an-important-note-about-file-paths-and-Rmds). +If the concept of a "file path" is unfamiliar to you; we recommend taking a look at our [section about file paths](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#an-important-note-about-file-paths-and-Rmds). 
# Using a different refine.bio dataset with this analysis? If you'd like to adapt an example analysis to use a different dataset from [refine.bio](https://www.refine.bio/), we recommend placing the files in the `data/` directory you created and changing the filenames and paths in the notebook to match these files (we've put comments to signify where you would need to change the code). We suggest saving plots and results to `plots/` and `results/` directories, respectively, as these are automatically created by the notebook. -From here you can customize this analysis example to fit your own scientific questions and preferences. +From here you can customize this analysis example to fit your own scientific questions and preferences. *** @@ -182,7 +182,7 @@ library(ggplot2) ## Import and set up data -Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. +Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read the both TSV files and add them as data frames to your environment. ```{r} @@ -197,11 +197,11 @@ df <- readr::read_tsv(file.path( data_dir, # Replace with path to your data file "GSE71270.tsv" # Replace with the name of your data file )) %>% - # Tuck away the Gene id column as rownames + # Tuck away the Gene ID column as rownames tibble::column_to_rownames("Gene") ``` -Let's ensure that the metadata and data are in the same sample order. +Let's ensure that the metadata and data are in the same sample order. ```{r} # Make the data in the order of the metadata @@ -214,11 +214,11 @@ all.equal(colnames(df), metadata$geo_accession) ## Set up design matrix -`limma` needs a numeric design matrix to signify which are CREB and control samples. +`limma` needs a numeric design matrix to signify which are CREB and control samples. 
Here we are using the treatments supplied in the metadata to create a design matrix where the "none" samples are assigned `0` and the "amputated" samples are assigned `1`. Note that the metadata variables that signify the treatment groups might be different across datasets and might not always be underneath the category. -The `genotype/variation` column contains group information we will be using for differential expression. +The `genotype/variation` column contains group information we will be using for differential expression. But the `/` it contains in its column name makes it more annoying to access. Accessing variable that have names with special characters like `/`, or spaces, require extra work-arounds to ignore R's normal interpretations of these characters. @@ -227,7 +227,7 @@ metadata <- metadata %>% dplyr::rename("genotype" = `genotype/variation`) # This step will not be the same (or might not be needed at all) with a different dataset ``` -Now we will create a model matrix based on our newly renamed `genotype` variable. +Now we will create a model matrix based on our newly renamed `genotype` variable. ```{r} # Create the design matrix from the genotype information @@ -236,7 +236,7 @@ des_mat <- model.matrix(~ metadata$genotype) ## Perform differential expression -After applying our data to linear model, in this example we apply empirical Bayes smoothing and Benjamini-Hochberg multiple testing correction. +After applying our data to linear model, in this example we apply empirical Bayes smoothing and Benjamini-Hochberg multiple testing correction. The `topTable()` function default is to use Benjamini-Hochberg but this can be changed to a different method using the `adjust.method` argument (see the `?topTable` help page for more about the options). ```{r} @@ -251,20 +251,20 @@ stats_df <- topTable(fit, number = nrow(df)) %>% tibble::rownames_to_column("Gene") ``` -Let's take a peek at what our results table looks like. 
+Let's take a peek at what our results table looks like. ```{r} head(stats_df) ``` -By default, results are ordered by largest `B` (the log odds value) to the smallest, which means your most differentially expressed genes should be toward the top. +By default, results are ordered by largest `B` (the log odds value) to the smallest, which means your most differentially expressed genes should be toward the top. See the help page by using `?topTable` for more information and options for this table. ## Check results by plotting one gene -To test if these results make sense, we can make a plot of one of top genes. -Let's try extracting the data for `ENSDARG00000104315` and set up its own data frame for plotting purposes. +To test if these results make sense, we can make a plot of one of top genes. +Let's try extracting the data for `ENSDARG00000104315` and set up its own data frame for plotting purposes. ```{r} top_gene_df <- df %>% @@ -284,7 +284,7 @@ top_gene_df <- df %>% )) ``` -Let's take a sneak peek at what our `top_gene_df` looks like. +Let's take a sneak peek at what our `top_gene_df` looks like. ```{r} top_gene_df @@ -299,11 +299,11 @@ ggplot(top_gene_df, aes(x = genotype, y = ENSDARG00000104315, color = genotype)) ``` These results make sense. -The overexpressing CREB group samples have much higher expression values for ENSDARG00000104315 than the control samples do. +The overexpressing CREB group samples have much higher expression values for ENSDARG00000104315 than the control samples do. ## Write results to file -The results in `stats_df` will be saved to our `results/` directory. +The results in `stats_df` will be saved to our `results/` directory. ```{r} readr::write_tsv(stats_df, file.path( @@ -325,10 +325,10 @@ EnhancedVolcano::EnhancedVolcano(stats_df, ``` In this plot, green points represent genes that meet the log2 fold change, by default the cutoff is absolute value of 1. 
-But there are no genes that meet the p value cutoff, which by default is `1e-05`. +But there are no genes that meet the p value cutoff, which by default is `1e-05`. We used the adjusted p values for our plot above, so you may want to adjust this with the `pCutoff` argument (Take a look at all the options for tailoring this plot using `?EnhancedVolcano`). -Let's make the same plot again, but adjust the `pCutoff` since we are using multiple-testing corrected p values, and this time we will assign the plot to our environment as `volcano_plot`. +Let's make the same plot again, but adjust the `pCutoff` since we are using multiple-testing corrected p values, and this time we will assign the plot to our environment as `volcano_plot`. ```{r} volcano_plot <- EnhancedVolcano::EnhancedVolcano(stats_df, @@ -342,7 +342,7 @@ volcano_plot <- EnhancedVolcano::EnhancedVolcano(stats_df, volcano_plot ``` -Let's save this plot to a PNG file. +Let's save this plot to a PNG file. ```{r} ggsave( @@ -361,11 +361,10 @@ ggsave( # Session info -At the end of every analysis, before saving your notebook, we recommend printing out your session info. -This helps make your code more reproducible by recording what versions of software and packages you used to run this. +At the end of every analysis, before saving your notebook, we recommend printing out your session info. +This helps make your code more reproducible by recording what versions of software and packages you used to run this. ```{r} # Print session info sessionInfo() ``` - diff --git a/02-microarray/differential-expression_microarray_01_2-groups.html b/02-microarray/differential-expression_microarray_01_2-groups.html index 9a123286..51ae45fa 100644 --- a/02-microarray/differential-expression_microarray_01_2-groups.html +++ b/02-microarray/differential-expression_microarray_01_2-groups.html @@ -1752,7 +1752,7 @@

    2.6 Check out our file structure!
  • A folder for results (currently empty)
  • -

    Your example analysis folder should now look something like this (except with respective experiment accession id and analysis notebook name you are using):

    +

    Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using):

    In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. Run this chunk to double check that your files are in the right place.

    # Define the file path to the data directory
    @@ -1838,7 +1838,7 @@ 

    4.2 Import and set up data

    data_dir, # Replace with path to your data file "GSE71270.tsv" # Replace with the name of your data file )) %>% - # Tuck away the Gene id column as rownames + # Tuck away the Gene ID column as rownames tibble::column_to_rownames("Gene")
    ## Parsed with column specification:
     ## cols(
    @@ -1942,7 +1942,7 @@ 

    4.5 Check results by plotting one
    ggplot(top_gene_df, aes(x = genotype, y = ENSDARG00000104315, color = genotype)) +
       geom_jitter(width = 0.2) + # We'll make this a jitter plot
       theme_classic() # This makes some aesthetic changes
    -

    +

    These results make sense. The overexpressing CREB group samples have much higher expression values for ENSDARG00000104315 than the control samples do.
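The visual check above can be backed up with a quick numeric summary. The following is a minimal sketch, not part of the notebook: `toy_gene_df` is a hypothetical stand-in for `top_gene_df`, its expression values are made up for illustration, and the `"control"` and `"CREB"` labels are assumed rather than taken from the metadata.

```r
# Hypothetical stand-in for `top_gene_df`; the values are invented for illustration
toy_gene_df <- data.frame(
  genotype = c("control", "control", "control", "CREB", "CREB", "CREB"),
  ENSDARG00000104315 = c(2.1, 1.9, 2.0, 6.0, 6.2, 5.8)
)

# Mean expression of the gene within each genotype group
group_means <- tapply(
  toy_gene_df$ENSDARG00000104315,
  toy_gene_df$genotype,
  mean
)

# Print the per-group means so they can be compared with the jitter plot
group_means
```

With the real data, the same `tapply()` call on `top_gene_df` should show a clearly higher mean for the CREB group if the plot and the `topTable()` results agree.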

    diff --git a/02-microarray/differential-expression_microarray_02_several-groups.Rmd b/02-microarray/differential-expression_microarray_02_several-groups.Rmd index 1d9ccf03..362b52d1 100644 --- a/02-microarray/differential-expression_microarray_02_several-groups.Rmd +++ b/02-microarray/differential-expression_microarray_02_several-groups.Rmd @@ -11,30 +11,30 @@ output: # Purpose of this analysis -This notebook takes data and metadata from refine.bio and identifies differentially expressed genes with more than 2 groups. +This notebook takes data and metadata from refine.bio and identifies differentially expressed genes with more than 2 groups. ⬇️ [**Jump to the analysis code**](#analysis) ⬇️ # How to run this example For general information about our tutorials and the basic software packages you will need, please see our ['Getting Started' section](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-this-tutorial-is-structured). -We recommend taking a look at our [Resources for Learning R](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#resources-for-learning-r) if you have not written code in R before. +We recommend taking a look at our [Resources for Learning R](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#resources-for-learning-r) if you have not written code in R before. ## Obtain the `.Rmd` file To run this example yourself, [download the `.Rmd` for this analysis by clicking this link](https://alexslemonade.github.io/refinebio-examples/02-microarray/differential-expression_microarray_02_several-groups.Rmd). You can open this `.Rmd` file in RStudio and follow the rest of these steps from there. (See our [section about getting started with R notebooks](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-and-use-rmds) if you are unfamiliar with `.Rmd` files.) 
-Clicking this link will most likely send this to your downloads folder on your computer. +Clicking this link will most likely send this to your downloads folder on your computer. Move this `.Rmd` file to where you would like this example and its files to be stored. -## Set up your analysis folders +## Set up your analysis folders Good file organization is helpful for keeping your data analysis project on track! -We have set up some code that will automatically set up a folder structure for you. -Run this next chunk to set up your folders! +We have set up some code that will automatically set up a folder structure for you. +Run this next chunk to set up your folders! -If you have trouble running this chunk, see our [introduction to using `.Rmd`s](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-and-use-rmds) for more resources and explanations. +If you have trouble running this chunk, see our [introduction to using `.Rmd`s](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-and-use-rmds) for more resources and explanations. ```{r} # Create the data folder if it doesn't exist @@ -63,7 +63,7 @@ In the same place you put this `.Rmd` file, you should now have three new empty ## Obtain the dataset from refine.bio -For general information about downloading data for these examples, see our ['Getting Started' section](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-the-data). +For general information about downloading data for these examples, see our ['Getting Started' section](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-the-data). Go to this [dataset's page on refine.bio](https://www.refine.bio/experiments/GSE37418/novel-mutations-target-distinct-subgroups-of-medulloblastoma). 
@@ -76,32 +76,32 @@ Fill out the pop up window with your email and our Terms and Conditions: It may take a few minutes for the dataset to process. -You will get an email when it is ready. +You will get an email when it is ready. ## About the dataset we are using for this example For this example analysis, we will use this [medulloblastoma samples](https://www.refine.bio/experiments/GSE37418/novel-mutations-target-distinct-subgroups-of-medulloblastoma). -@Robinson2012 measured microarray gene expression of 71 medulloblastoma tumor samples. -In this analysis, we will test differential expression across the medulloblastoma subtypes. +@Robinson2012 measured microarray gene expression of 71 medulloblastoma tumor samples. +In this analysis, we will test differential expression across the medulloblastoma subtypes. ## Place the dataset in your new `data/` folder -refine.bio will send you a download button in the email when it is ready. -Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in `.zip`. +refine.bio will send you a download button in the email when it is ready. +Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in `.zip`. Double clicking should unzip this for you and create a folder of the same name. - + For more details on the contents of this folder see [these docs on refine.bio](http://docs.refine.bio/en/latest/main_text.html#downloadable-files). The `` folder has the data and metadata TSV files you will need for this example analysis. -Experiment accession ids usually look something like `GSE1235` or `SRP12345`. +Experiment accession ids usually look something like `GSE1235` or `SRP12345`. Copy and paste the `GSE37418` folder into your newly created `data/` folder. ## Check out our file structure! 
-Your new analysis folder should contain: +Your new analysis folder should contain: - The example analysis `.Rmd` you downloaded - A folder called "data" which contains: @@ -110,13 +110,13 @@ Your new analysis folder should contain: - The metadata TSV - A folder for `plots` (currently empty) - A folder for `results` (currently empty) - -Your example analysis folder should now look something like this (except with respective experiment accession id and analysis notebook name you are using): + +Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using): -In order for our example here to run without a hitch, we need these files to be in these locations so we've constructed a test to check before we get started with the analysis. -Run this chunk to double check that your files are in the right place. +In order for our example here to run without a hitch, we need these files to be in these locations so we've constructed a test to check before we get started with the analysis. +Run this chunk to double check that your files are in the right place. ```{r} # Define the file path to the data directory @@ -131,13 +131,13 @@ file.exists(file.path(data_dir, "metadata_GSE37418.tsv")) If the chunk above printed out `FALSE` to either of those tests, you won't be able to run this analysis _as is_ until those files are in the appropriate place. -If the concept of a "file path" is unfamiliar to you; we recommend taking a look at our [section about file paths](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#an-important-note-about-file-paths-and-Rmds). +If the concept of a "file path" is unfamiliar to you; we recommend taking a look at our [section about file paths](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#an-important-note-about-file-paths-and-Rmds). 
# Using a different refine.bio dataset with this analysis? If you'd like to adapt an example analysis to use a different dataset from [refine.bio](https://www.refine.bio/), we recommend placing the files in the `data/` directory you created and changing the filenames and paths in the notebook to match these files (we've put comments to signify where you would need to change the code). We suggest saving plots and results to `plots/` and `results/` directories, respectively, as these are automatically created by the notebook. -From here you can customize this analysis example to fit your own scientific questions and preferences. +From here you can customize this analysis example to fit your own scientific questions and preferences. *** @@ -173,7 +173,7 @@ library(ggplot2) ## Import and set up data -Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. +Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read the both TSV files and add them as data frames to your environment. ```{r} @@ -188,23 +188,23 @@ df <- readr::read_tsv(file.path( data_dir, # Replace with path to your data file "GSE37418.tsv" # Replace with the name of your data file )) %>% - # Tuck away the Gene id column as rownames + # Tuck away the gene ID column as rownames tibble::column_to_rownames("Gene") ``` ## Removing groups that are too small We will be using the `subgroup` variable labels in our metadata to test differentially expression across. -Let's take a look at how many samples of each subgroup we have. +Let's take a look at how many samples of each subgroup we have. ```{r} metadata %>% dplyr::count(subgroup) ``` -Looks like there is one sample that has been labeled by the authors as an outlier (`SHH OUTLIIER`), as well as one group, `U`, that only has two samples. 
-We will probably want to remove the `U` samples and this outlier since their inclusion might throw off our differential expression analysis results. +Looks like there is one sample that has been labeled by the authors as an outlier (`SHH OUTLIIER`), as well as one group, `U`, that only has two samples. +We will probably want to remove the `U` samples and this outlier since their inclusion might throw off our differential expression analysis results. -Let's start out by removing the outlier and the `U` group, we can do this all at once by removing groups smaller than 3. +Let's start out by removing the outlier and the `U` group, we can do this all at once by removing groups smaller than 3. ```{r} filtered_metadata <- metadata %>% @@ -213,15 +213,15 @@ filtered_metadata <- metadata %>% dplyr::ungroup() ``` -Let's take a look at the subgroup summary again. +Let's take a look at the subgroup summary again. ```{r} metadata %>% dplyr::count(subgroup) ``` -Note that the `U` and the `SHH OUTLIER` samples are gone and only the four groups we are interested in are left. +Note that the `U` and the `SHH OUTLIER` samples are gone and only the four groups we are interested in are left. -But, we still need to filter these samples out from the expression data that's stored in `df`. +But, we still need to filter these samples out from the expression data that's stored in `df`. ```{r} # Make the data in the order of the metadata @@ -234,37 +234,37 @@ all.equal(colnames(df), filtered_metadata$geo_accession) ## Create the design matrix -`limma` needs a numeric design matrix to signify which samples are of which subtype of medulloblastoma. +`limma` needs a numeric design matrix to signify which samples are of which subtype of medulloblastoma. Now we will create a model matrix based on our `subgroup` variable. -We are using a `+ 0` in the model which sets the intercept to 0 so the subgroup effects capture expression for that group, rather than difference from the first group. 
-If you have a control group, you might want that to be the intercept. +We are using a `+ 0` in the model which sets the intercept to 0 so the subgroup effects capture expression for that group, rather than difference from the first group. +If you have a control group, you might want that to be the intercept. ```{r} # Create the design matrix des_mat <- model.matrix(~ filtered_metadata$subgroup + 0) ``` -Let's take a look at the design matrix we created. +Let's take a look at the design matrix we created. ```{r} # Print out the design matrix head(des_mat) ``` -The design matrix column names are a bit messy, so we will neaten them up by dropping the `filtered_metadata$subgroup` designation they all have. +The design matrix column names are a bit messy, so we will neaten them up by dropping the `filtered_metadata$subgroup` designation they all have. ```{r} # Make the column names less messy colnames(des_mat) <- stringr::str_remove(colnames(des_mat), "filtered_metadata\\$subgroup") ``` -Side note: If you are wondering why there are two `\` above in `"filtered_metadata\\$subgroup"`, that's called an [escape character](https://cran.r-project.org/web/packages/stringr/vignettes/regular-expressions.html#escaping). -There's a whole universe of things called [regular expressions (regex)](https://cran.r-project.org/web/packages/stringr/vignettes/regular-expressions.html) that can be super handy for string manipulations. +Side note: If you are wondering why there are two `\` above in `"filtered_metadata\\$subgroup"`, that's called an [escape character](https://cran.r-project.org/web/packages/stringr/vignettes/regular-expressions.html#escaping). +There's a whole universe of things called [regular expressions (regex)](https://cran.r-project.org/web/packages/stringr/vignettes/regular-expressions.html) that can be super handy for string manipulations. ## Perform differential expression -Now we are ready to actually start fitting our differential expression model to the data. 
-To accommodate our design that has more than 2 groups this time, we will need to do this in a couple steps. +Now we are ready to actually start fitting our differential expression model to the data. +To accommodate our design that has more than 2 groups this time, we will need to do this in a couple steps. First we need to fit our basic linear model to the data, then apply empirical Bayes smoothing. @@ -276,9 +276,9 @@ fit <- lmFit(df, design = des_mat) fit <- eBayes(fit) ``` -Now that we have our basic model fitting, we will want to make the contrasts among all our groups. +Now that we have our basic model fitting, we will want to make the contrasts among all our groups. Depending on your scientific questions, you will need to customize the next steps. -Consulting the [limma users guide](https://www.bioconductor.org/packages/devel/bioc/vignettes/limma/inst/doc/usersguide.pdf) for how to set up your model based on your hypothesis is a good idea. +Consulting the [limma users guide](https://www.bioconductor.org/packages/devel/bioc/vignettes/limma/inst/doc/usersguide.pdf) for how to set up your model based on your hypothesis is a good idea. In this contrasts matrix, we are comparing each subtype to all the other subtypes. We're dividing by three in this expression so that each group is compared to the average of the other three groups (`makeContrasts()` doesn't allow you to use functions like `mean()`; it wants a formula). @@ -293,8 +293,8 @@ contrast_matrix <- makeContrasts( ) ``` -Side note: If you did have a control group you wanted to compare each group to, you could make each contrast to that control group, so the formulate would look like `G3 = G3 - Control` for each one. -We highly recommend consulting the [limma users guide](https://bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf) for figuring out what your `makeContrasts()` and `model.matrix()` setups should look like [@Ritchie2015]. 
+Side note: If you did have a control group you wanted to compare each group to, you could make each contrast to that control group, so the formulate would look like `G3 = G3 - Control` for each one. +We highly recommend consulting the [limma users guide](https://bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf) for figuring out what your `makeContrasts()` and `model.matrix()` setups should look like [@Ritchie2015]. Now that we have the contrasts matrix set up, we can use it to re-fit the model and re-smooth it with `eBayes()`. @@ -308,9 +308,9 @@ contrasts_fit <- eBayes(contrasts_fit) Here's a [nifty article and example](http://varianceexplained.org/r/empirical_bayes_baseball/) about what the empirical Bayes smoothing is for [@bayes-estimates]. -Now let's create the results table based on the contrasts fitted model. +Now let's create the results table based on the contrasts fitted model. -This step will provide the Benjamini-Hochberg multiple testing correction. +This step will provide the Benjamini-Hochberg multiple testing correction. The `topTable()` function default is to use Benjamini-Hochberg but this can be changed to a different method using the `adjust.method` argument (see the `?topTable` help page for more about the options). ```{r} @@ -319,25 +319,25 @@ stats_df <- topTable(contrasts_fit, number = nrow(df)) %>% tibble::rownames_to_column("Gene") ``` -Let's take a peek at our results table. +Let's take a peek at our results table. ```{r} head(stats_df) ``` -For each gene, each group's fold change in expression, compared to the average of the other groups is reported. +For each gene, each group's fold change in expression, compared to the average of the other groups is reported. -By default, results are ordered from largest `F` value to the smallest, which means your most differentially expressed genes across all groups should be toward the top. 
+By default, results are ordered from largest `F` value to the smallest, which means your most differentially expressed genes across all groups should be toward the top. See the help page by using `?topTable` for more information and options for this table. ## Check results by plotting one gene -To test if these results make sense, we can make a plot of one of top genes. -Let's try extracting the data for `ENSG00000128683` and set up its own data frame for plotting purposes. -Based on the results in `stats_df`, we should expect this gene to be much higher in the `WNT` samples. +To test if these results make sense, we can make a plot of one of top genes. +Let's try extracting the data for `ENSG00000128683` and set up its own data frame for plotting purposes. +Based on the results in `stats_df`, we should expect this gene to be much higher in the `WNT` samples. -First we will need to set up the data for this gene and the subgroup labels into a data frame for plotting. +First we will need to set up the data for this gene and the subgroup labels into a data frame for plotting. ```{r} top_gene_df <- df %>% @@ -357,14 +357,14 @@ top_gene_df <- df %>% )) ``` -Let's take a sneak peek at our `top_gene_df`. +Let's take a sneak peek at our `top_gene_df`. ```{r} head(top_gene_df) ``` Now let's plot the data for `ENSG00000128683` using our `top_gene_df`. -We should expect this gene to be expressed at much higher levels in the `WNT` group samples. +We should expect this gene to be expressed at much higher levels in the `WNT` group samples. ```{r} ggplot(top_gene_df, aes(x = subgroup, y = ENSG00000128683, color = subgroup)) + @@ -373,11 +373,11 @@ ggplot(top_gene_df, aes(x = subgroup, y = ENSG00000128683, color = subgroup)) + ``` Yes! These results make sense. -The WNT samples have much higher expression of ENSG00000128683 than the other samples. +The WNT samples have much higher expression of ENSG00000128683 than the other samples. 
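As a numeric complement to the jitter plot above, the per-subgroup means can be tabulated directly. This is a minimal sketch on made-up values: `toy_gene_df`, its subgroup labels, and its expression numbers are hypothetical stand-ins for `top_gene_df`, not values from the dataset.

```r
# Hypothetical toy data standing in for `top_gene_df`; all values are invented
toy_gene_df <- data.frame(
  subgroup = c("G3", "G3", "G4", "G4", "SHH", "SHH", "WNT", "WNT"),
  ENSG00000128683 = c(5.1, 4.9, 5.3, 5.0, 4.8, 5.2, 11.7, 12.1)
)

# Mean expression per subgroup (base R, no extra packages needed)
subgroup_means <- tapply(
  toy_gene_df$ENSG00000128683,
  toy_gene_df$subgroup,
  mean
)

# Name of the subgroup with the highest mean expression
top_subgroup <- names(which.max(subgroup_means))

# Print both for inspection
subgroup_means
top_subgroup
```

With real data, swapping `toy_gene_df` for `top_gene_df` should give a quick check that the subgroup singled out in the plot is also the one with the highest mean.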
## Write results to file -The results in `stats_df` will be saved to our `results/` directory. +The results in `stats_df` will be saved to our `results/` directory. ```{r} readr::write_tsv(stats_df, file.path( @@ -389,10 +389,10 @@ readr::write_tsv(stats_df, file.path( ## Make volcano plots We'll use the `ggplot2` to make a set of volcano plots. -But first, we need to set up our data for plotting. -We will need the p values from the individual contrasts as well as the log fold changes. +But first, we need to set up our data for plotting. +We will need the p values from the individual contrasts as well as the log fold changes. -We can obtain the contrast p values from the `contrasts_fit` object and make it a longer format that the `ggplot()` function will want for plotting. +We can obtain the contrast p values from the `contrasts_fit` object and make it a longer format that the `ggplot()` function will want for plotting. ```{r} # Let's extract the contrast p values for each and convert them to -log10() @@ -422,7 +422,7 @@ log_fc_df <- stats_df %>% ) ``` -We can perform an `inner_join()` of both these datasets using both their `Gene` and `contrast` columns. +We can perform an `inner_join()` of both these datasets using both their `Gene` and `contrast` columns. ```{r} plot_df <- log_fc_df %>% @@ -441,8 +441,8 @@ Let's print out a preview of `plot_df`. head(plot_df) ``` -Let's declare what we consider to be significant levels for fold change and for -log10 p-values. -By saving this as its own variable, we only need to change these cutoffs in one place if we want to adjust later. +Let's declare what we consider to be significant levels for fold change and for -log10 p-values. +By saving this as its own variable, we only need to change these cutoffs in one place if we want to adjust later. 
```{r} # This is equivalent to p value < 0.05 @@ -452,8 +452,8 @@ p_val_cutoff <- 1.301 abs_fc_cutoff <- 5 ``` -Now we can use these cutoffs to make a new variable that declares which genes we consider significant. -We will use some logic with `dplyr::case_when()` to do this. +Now we can use these cutoffs to make a new variable that declares which genes we consider significant. +We will use some logic with `dplyr::case_when()` to do this. ```{r} plot_df <- plot_df %>% @@ -497,10 +497,10 @@ volcanoes_plot <- ggplot( volcanoes_plot ``` -Here the green points _might_ be of interest. -We recommend [ColorBrewer](https://colorbrewer2.org/) for finding different color sets if you don't like the ones we used. +Here the green points _might_ be of interest. +We recommend [ColorBrewer](https://colorbrewer2.org/) for finding different color sets if you don't like the ones we used. -Let's save these volcanoes to a PNG file. +Let's save these volcanoes to a PNG file. ```{r} ggsave( @@ -520,8 +520,8 @@ ggsave( # Session info -At the end of every analysis, before saving your notebook, we recommend printing out your session info. -This helps make your code more reproducible by recording what versions of software and packages you used to run this. +At the end of every analysis, before saving your notebook, we recommend printing out your session info. +This helps make your code more reproducible by recording what versions of software and packages you used to run this. ```{r} # Print session info diff --git a/02-microarray/differential-expression_microarray_02_several-groups.html b/02-microarray/differential-expression_microarray_02_several-groups.html index 7ffc212e..0136d22f 100644 --- a/02-microarray/differential-expression_microarray_02_several-groups.html +++ b/02-microarray/differential-expression_microarray_02_several-groups.html @@ -1752,7 +1752,7 @@

    2.6 Check out our file structure!
  • A folder for results (currently empty)
-Your example analysis folder should now look something like this (except with respective experiment accession id and analysis notebook name you are using):
+Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using):

    In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. Run this chunk to double check that your files are in the right place.

    # Define the file path to the data directory
    @@ -1831,7 +1831,7 @@ 

    4.2 Import and set up data

     data_dir, # Replace with path to your data file
     "GSE37418.tsv" # Replace with the name of your data file
   )) %>%
-  # Tuck away the Gene id column as rownames
+  # Tuck away the gene ID column as rownames
   tibble::column_to_rownames("Gene")
    ## Parsed with column specification:
     ## cols(
    @@ -1992,7 +1992,7 @@ 

    4.6 Check results by plotting one
    ggplot(top_gene_df, aes(x = subgroup, y = ENSG00000128683, color = subgroup)) +
       geom_jitter(width = 0.2) + # We'll make this a jitter plot
       theme_classic() # This makes some aesthetic changes
    -

    +

    Yes! These results make sense. The WNT samples have much higher expression of ENSG00000128683 than the other samples.

    diff --git a/02-microarray/dimension-reduction_microarray_01_pca.Rmd b/02-microarray/dimension-reduction_microarray_01_pca.Rmd index 95518e30..e99cee59 100644 --- a/02-microarray/dimension-reduction_microarray_01_pca.Rmd +++ b/02-microarray/dimension-reduction_microarray_01_pca.Rmd @@ -115,7 +115,7 @@ Your new analysis folder should contain: - A folder for `plots` (currently empty) - A folder for `results` (currently empty) -Your example analysis folder should now look something like this (except with respective experiment accession id and analysis notebook name you are using): +Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using): @@ -181,7 +181,7 @@ df <- readr::read_tsv(file.path( data_dir, # Replace with path to your data file "GSE37382.tsv" # Replace with the name of your data file )) %>% - # Tuck away the Gene id column as rownames + # Tuck away the gene ID column as rownames tibble::column_to_rownames("Gene") ``` diff --git a/02-microarray/dimension-reduction_microarray_01_pca.html b/02-microarray/dimension-reduction_microarray_01_pca.html index 544a0cd2..58c554fd 100644 --- a/02-microarray/dimension-reduction_microarray_01_pca.html +++ b/02-microarray/dimension-reduction_microarray_01_pca.html @@ -1753,7 +1753,7 @@

    2.6 Check out our file structure!
  • A folder for results (currently empty)
-Your example analysis folder should now look something like this (except with respective experiment accession id and analysis notebook name you are using):
+Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using):

    In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. Run this chunk to double check that your files are in the right place.

    # Define the file path to the data directory
    @@ -1823,7 +1823,7 @@ 

    4.2 Import and set up data

     data_dir, # Replace with path to your data file
     "GSE37382.tsv" # Replace with the name of your data file
   )) %>%
-  # Tuck away the Gene id column as rownames
+  # Tuck away the gene ID column as rownames
   tibble::column_to_rownames("Gene")
    ## Parsed with column specification:
     ## cols(
    @@ -2319,9 +2319,9 @@ 

    6 Session info

 ## [25] vctrs_0.3.4 hms_0.5.3 tidyselect_1.1.0 grid_4.0.2
 ## [29] getopt_1.20.3 glue_1.4.2 R6_2.4.1 fansi_0.4.1
 ## [33] rmarkdown_2.3 farver_2.0.3 purrr_0.3.4 readr_1.3.1
-## [37] backports_1.1.10 scales_1.1.1 ellipsis_0.3.1 htmltools_0.5.0
-## [41] assertthat_0.2.1 colorspace_1.4-1 labeling_0.3 stringi_1.5.3
-## [45] munsell_0.5.0 crayon_1.3.4 R.oo_1.24.0
+## [37] rematch2_2.1.2 scales_1.1.1 backports_1.1.10 ellipsis_0.3.1
+## [41] htmltools_0.5.0 assertthat_0.2.1 colorspace_1.4-1 labeling_0.3
+## [45] stringi_1.5.3 munsell_0.5.0 crayon_1.3.4 R.oo_1.24.0

    Brems M., 2017 A one-stop shop for principal component analysis

    diff --git a/02-microarray/dimension-reduction_microarray_02_umap.Rmd b/02-microarray/dimension-reduction_microarray_02_umap.Rmd index 86a26000..33346c4e 100644 --- a/02-microarray/dimension-reduction_microarray_02_umap.Rmd +++ b/02-microarray/dimension-reduction_microarray_02_umap.Rmd @@ -114,7 +114,7 @@ Your new analysis folder should contain: - A folder for `plots` (currently empty) - A folder for `results` (currently empty) -Your example analysis folder should now look something like this (except with respective experiment accession id and analysis notebook name you are using): +Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using): @@ -201,7 +201,7 @@ df <- readr::read_tsv(file.path( data_dir, # Replace with path to your data file "GSE37382.tsv" # Replace with the name of your data file )) %>% - # Tuck away the Gene id column as rownames + # Tuck away the gene ID column as rownames tibble::column_to_rownames("Gene") ``` diff --git a/02-microarray/dimension-reduction_microarray_02_umap.html b/02-microarray/dimension-reduction_microarray_02_umap.html index a626168e..7fed83fb 100644 --- a/02-microarray/dimension-reduction_microarray_02_umap.html +++ b/02-microarray/dimension-reduction_microarray_02_umap.html @@ -1753,7 +1753,7 @@

    2.6 Check out our file structure!
  • A folder for results (currently empty)
-Your example analysis folder should now look something like this (except with respective experiment accession id and analysis notebook name you are using):
+Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using):

    In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. Run this chunk to double check that your files are in the right place.

    # Define the file path to the data directory
    @@ -1835,7 +1835,7 @@ 

    4.2 Import and set up data

     data_dir, # Replace with path to your data file
     "GSE37382.tsv" # Replace with the name of your data file
   )) %>%
-  # Tuck away the Gene id column as rownames
+  # Tuck away the gene ID column as rownames
   tibble::column_to_rownames("Gene")
    ## Parsed with column specification:
     ## cols(
    @@ -1978,20 +1978,20 @@ 

    6 Session info

 ## [1] magrittr_1.5 ggplot2_3.3.2 umap_0.2.6.0 optparse_1.6.6
 ##
 ## loaded via a namespace (and not attached):
-## [1] Rcpp_1.0.5 RSpectra_0.16-0 pillar_1.4.6 compiler_4.0.2
-## [5] R.methodsS3_1.8.1 R.utils_2.10.1 tools_4.0.2 digest_0.6.25
-## [9] gtable_0.3.0 jsonlite_1.7.1 lattice_0.20-41 evaluate_0.14
-## [13] lifecycle_0.2.0 tibble_3.0.3 R.cache_0.14.0 pkgconfig_2.0.3
-## [17] rlang_0.4.7 Matrix_1.2-18 cli_2.0.2 rstudioapi_0.11
-## [21] yaml_2.2.1 xfun_0.17 withr_2.3.0 dplyr_1.0.2
-## [25] styler_1.3.2 stringr_1.4.0 knitr_1.30 generics_0.0.2
-## [29] askpass_1.1 vctrs_0.3.4 hms_0.5.3 tidyselect_1.1.0
-## [33] grid_4.0.2 getopt_1.20.3 reticulate_1.16 glue_1.4.2
-## [37] R6_2.4.1 fansi_0.4.1 rmarkdown_2.3 farver_2.0.3
-## [41] purrr_0.3.4 readr_1.3.1 scales_1.1.1 backports_1.1.10
-## [45] ellipsis_0.3.1 htmltools_0.5.0 assertthat_0.2.1 colorspace_1.4-1
-## [49] labeling_0.3 stringi_1.5.3 munsell_0.5.0 openssl_1.4.3
-## [53] crayon_1.3.4 R.oo_1.24.0
+## [1] reticulate_1.16 styler_1.3.2 tidyselect_1.1.0 xfun_0.17
+## [5] rematch2_2.1.2 purrr_0.3.4 lattice_0.20-41 colorspace_1.4-1
+## [9] vctrs_0.3.4 generics_0.0.2 htmltools_0.5.0 getopt_1.20.3
+## [13] yaml_2.2.1 rlang_0.4.7 R.oo_1.24.0 pillar_1.4.6
+## [17] glue_1.4.2 withr_2.3.0 R.utils_2.10.1 R.cache_0.14.0
+## [21] lifecycle_0.2.0 stringr_1.4.0 munsell_0.5.0 gtable_0.3.0
+## [25] R.methodsS3_1.8.1 evaluate_0.14 labeling_0.3 knitr_1.30
+## [29] fansi_0.4.1 Rcpp_1.0.5 readr_1.3.1 openssl_1.4.3
+## [33] backports_1.1.10 scales_1.1.1 jsonlite_1.7.1 farver_2.0.3
+## [37] RSpectra_0.16-0 hms_0.5.3 askpass_1.1 digest_0.6.25
+## [41] stringi_1.5.3 dplyr_1.0.2 grid_4.0.2 cli_2.0.2
+## [45] tools_4.0.2 tibble_3.0.3 crayon_1.3.4 pkgconfig_2.0.3
+## [49] ellipsis_0.3.1 Matrix_1.2-18 assertthat_0.2.1 rmarkdown_2.3
+## [53] rstudioapi_0.11 R6_2.4.1 compiler_4.0.2

    Konopka T., 2020 Uniform manifold approximation and projection.

    diff --git a/02-microarray/pathway_analysis_microarray_01_ortholog_mapping_kegg.Rmd b/02-microarray/pathway_analysis_microarray_01_ortholog_mapping_kegg.Rmd index 2960f834..99bd0f68 100644 --- a/02-microarray/pathway_analysis_microarray_01_ortholog_mapping_kegg.Rmd +++ b/02-microarray/pathway_analysis_microarray_01_ortholog_mapping_kegg.Rmd @@ -1,6 +1,6 @@ --- title: "KEGG pathways: mapping to mouse orthologs with `hcop`" -output: +output: html_notebook: toc: TRUE toc_float: TRUE @@ -15,8 +15,8 @@ for pathway analysis (implemented in the [`qusage` bioconductor package](https:/ `qusage` allows you to read in gene sets that are in the [GMT format](http://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#GMT:_Gene_Matrix_Transposed_file_format_.28.2A.gmt.29). -[MSigDB](http://software.broadinstitute.org/gsea/msigdb) offers genesets in this format. -[Curated gene sets](http://software.broadinstitute.org/gsea/msigdb/collections.jsp#C2) +[MSigDB](http://software.broadinstitute.org/gsea/msigdb) offers gene sets in this format. +[Curated gene sets](http://software.broadinstitute.org/gsea/msigdb/collections.jsp#C2) such as [KEGG](https://www.genome.jp/kegg/) are a good starting point for any pathway analysis. However, MSigDB only distributes human pathways. @@ -24,7 +24,7 @@ If we want to use KEGG Pathways with another species without going through [KEGG Orthology](https://www.genome.jp/kegg/ko.html), we need to map to orthologs ourselves. -We'll use the [`hcop` package](https://github.com/stephenturner/hcop) to do +We'll use the [`hcop` package](https://github.com/stephenturner/hcop) to do this. 
If you're looking for a little bit more background information (like if you run into trouble installing `hcop`), check out the notebook in our @@ -150,7 +150,7 @@ mouse_ortholog_df %>% ) ``` -We can see that the human gene _PRPS1_/`5631` maps to 4 mouse genes, one of +We can see that the human gene _PRPS1_/`5631` maps to 4 mouse genes, one of which has 10 resources supporting that mapping. ```{r} @@ -212,7 +212,7 @@ Briefly, the GMT format has one pathway per line and it follows this pattern: \t\t... ``` -We've lost the description information because it's removed by +We've lost the description information because it's removed by `qusage::read.gmt`. The description in `r kegg_file` follows this pattern: diff --git a/02-microarray/pathway_analysis_microarray_02_ora_with_webgestaltr.Rmd b/02-microarray/pathway_analysis_microarray_02_ora_with_webgestaltr.Rmd index 4313bb93..e67286bf 100644 --- a/02-microarray/pathway_analysis_microarray_02_ora_with_webgestaltr.Rmd +++ b/02-microarray/pathway_analysis_microarray_02_ora_with_webgestaltr.Rmd @@ -26,7 +26,7 @@ If you are interested in performing pathway analysis on a small study, ORA may be your best bet. There are some limitations to ORA methods to be aware such as ignoring gene-gene correlation. -See [ et al. _PLoS Comp Bio._ 2012.](https://doi.org/10.1371/journal.pcbi.1002375) +See [ et al. _PLOS Comp Bio._ 2012.](https://doi.org/10.1371/journal.pcbi.1002375) to learn more about the different types of pathway analysis and their limitations. 
diff --git a/02-microarray/pathway_analysis_microarray_03_qusage_meta_analysis.Rmd b/02-microarray/pathway_analysis_microarray_03_qusage_meta_analysis.Rmd index dd62f890..49cbd4be 100644 --- a/02-microarray/pathway_analysis_microarray_03_qusage_meta_analysis.Rmd +++ b/02-microarray/pathway_analysis_microarray_03_qusage_meta_analysis.Rmd @@ -21,7 +21,7 @@ If we're interested in pathway analysis of multiple datasets, QuSAGE allows us to perform a _meta-analysis_ by combining distributions from the QuSAGE results from each dataset. Meta-analysis with QuSAGE is described in -[ et al. _PLoS Comp Bio._ 2019.](https://doi.org/10.1371/journal.pcbi.1006899) +[ et al. _PLOS Comp Bio._ 2019.](https://doi.org/10.1371/journal.pcbi.1006899) and implemented in the [`qusage` bioconductor package](https://bioconductor.org/packages/release/bioc/html/qusage.html). The [`qusage` vignette](https://bioconductor.org/packages/release/bioc/vignettes/qusage/inst/doc/qusage.pdf) contains a section on meta-analysis. diff --git a/03-rnaseq/00-intro-to-rnaseq.Rmd b/03-rnaseq/00-intro-to-rnaseq.Rmd index 0d0eb707..821bc71e 100644 --- a/03-rnaseq/00-intro-to-rnaseq.Rmd +++ b/03-rnaseq/00-intro-to-rnaseq.Rmd @@ -2,7 +2,7 @@ title: "Introduction to RNA-seq" author: "CCDL for ALSF" output: - html_notebook: + html_notebook: toc: true toc_float: true --- @@ -29,15 +29,15 @@ output: ## Introduction to RNA-seq technology -Data analyses are generally not "one size fits all"; this is particularly true between RNA-seq vs microarray data. -This tutorial has example analyses [organized by technology](../01-getting-started/getting-started.html#about-how-this-tutorial-book-is-structured) so you can follow examples that are more closely tailored to the nature of the data at hand. +Data analyses are generally not "one size fits all"; this is particularly true between RNA-seq vs microarray data. 
+This tutorial has example analyses [organized by technology](../01-getting-started/getting-started.html#about-how-this-tutorial-book-is-structured) so you can follow examples that are more closely tailored to the nature of the data at hand. -As with all experimental methods, RNA-seq has strengths and limitations that you should consider in regards to your scientific questions. +As with all experimental methods, RNA-seq has strengths and limitations that you should consider in regards to your scientific questions. ### RNA-seq data **strengths** - RNA-seq can assay unknown transcripts, as it is not bound to a pre-determined set of probes like microarrays [@Zhong2009]. -- Its values are considered more dynamic than microarray values which are constrained to a smaller range based on background signal and probesets being saturated [@Zhong2009]. +- Its values are considered more dynamic than microarray values which are constrained to a smaller range based on background signal and probe sets being saturated [@Zhong2009]. ### RNA-seq data **limitations/biases** @@ -49,8 +49,8 @@ The nature of sequencing introduces several different kinds of biases: - **Library size or sequencing depth**: the total number of reads is not always equivalent between samples. - **Gene length**: longer genes are more likely to be observed. -@bias-blog discusses these biases in this [blog post](https://mikelove.wordpress.com/2016/09/26/rna-seq-fragment-sequence-bias/) which includes this handy figure. - +@bias-blog discusses these biases in this [blog post](https://mikelove.wordpress.com/2016/09/26/rna-seq-fragment-sequence-bias/) which includes this handy figure. + Most normalization methods, including [refine.bio's processing methods](http://docs.refine.bio/en/latest/main_text.html#rna-seq-pipelines), attempt to mitigate these biases, but these biases can never be fully negated. 
@@ -60,13 +60,13 @@ In brief, refine.bio data is quantified by Salmon using their correction algorit ### About quantile normalization refine.bio data is available for you [quantile normalized](https://en.wikipedia.org/wiki/Quantile_normalization), which can address some library size biases. -But more often than not, our example modules will recommend using the option for downloading non-quantile normalized data (note that this is RNA-seq specific, and microarray data does not have this download option). +But more often than not, our example modules will recommend using the option for downloading non-quantile normalized data (note that this is RNA-seq specific, and microarray data does not have this download option). -See here for more about the [quantile normalization process in refine.bio](http://docs.refine.bio/en/latest/main_text.html#quantile-normalization). +See here for more about the [quantile normalization process in refine.bio](http://docs.refine.bio/en/latest/main_text.html#quantile-normalization). -### More resources on RNA-seq technology +### More resources on RNA-seq technology - [StatQuest: A gentle introduction to RNA-seq](https://www.youtube.com/watch?v=tlf6wYJrwKY) [@Starmer2017-rnaseq]. - [A general background on the wet lab methods of RNA-seq](https://bitesizebio.com/13542/what-everyone-should-know-about-rna-seq/) [@Hadfield2016]. @@ -83,26 +83,26 @@ We generally like DESeq2 because it has [great documentation and helpful tutoria ### DESeq2 objects -Many R Bioconductor packages have specialized object types they want your data to be formatted as. -For DESeq2, before we can use a lot the special functions, we need to get our data into a [`DESeqDataSet` object](https://www.rdocumentation.org/packages/DESeq2/versions/1.12.3/topics/DESeqDataSet-class). -`DESeqDataSet` objects not only store your data, but additional transformations of your data, model information, etc. 
+Many R Bioconductor packages have specialized object types they want your data to be formatted as. +For DESeq2, before we can use a lot the special functions, we need to get our data into a [`DESeqDataSet` object](https://www.rdocumentation.org/packages/DESeq2/versions/1.12.3/topics/DESeqDataSet-class). +`DESeqDataSet` objects not only store your data, but additional transformations of your data, model information, etc. -From our refine.bio datasets, we will use a function `DESeqDataSetFromMatrix()` to create our [`DESeqDataSet` objects](https://www.rdocumentation.org/packages/DESeq2/versions/1.12.3/topics/DESeqDataSet-class). +From our refine.bio datasets, we will use a function `DESeqDataSetFromMatrix()` to create our [`DESeqDataSet` objects](https://www.rdocumentation.org/packages/DESeq2/versions/1.12.3/topics/DESeqDataSet-class). This DESeq2 function requires you provide counts and *not* a normalized or corrected value like [TPMs](https://www.youtube.com/watch?v=TTUrtCY2k-w). Which is why our examples advise downloading [non-quantile normalized](#about-quantile-normalization) from refine.bio. ### DESeq2 transformation methods -Our examples recommend using DESeq2 for normalizing your RNA-seq data. -You may have heard about or worked with FPKM, TPM, RPKMs; how does DESeq2's normalization compare? +Our examples recommend using DESeq2 for normalizing your RNA-seq data. +You may have heard about or worked with FPKM, TPM, RPKMs; how does DESeq2's normalization compare? This [handy table from an online Harvard Bioinformatics Core course nicely summarizes and compares these different methods](https://hbctraining.github.io/DGE_workshop_salmon/lessons/02_DGE_count_normalization.html#common-normalization-methods) [@dge-workshop-deseq2]. -For more about the steps behind DESeq2 normalization, we highly recommend this [StatQuest video](https://www.youtube.com/watch?v=UFB993xufUU) which explains it quite nicely [@Starmer2017-deseq2]. 
+For more about the steps behind DESeq2 normalization, we highly recommend this [StatQuest video](https://www.youtube.com/watch?v=UFB993xufUU) which explains it quite nicely [@Starmer2017-deseq2].

-To normalize and transform our data with DESeq2, we generally use `vst()` (variance stabilizing transformation) or `rlog()` (regularized logarithm transformation).
-[Both methods are very similar](http://master.bioconductor.org/packages/release/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html#the-variance-stabilizing-transformation-and-the-rlog).
+To normalize and transform our data with DESeq2, we generally use `vst()` (variance stabilizing transformation) or `rlog()` (regularized logarithm transformation).
+[Both methods are very similar](http://master.bioconductor.org/packages/release/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html#the-variance-stabilizing-transformation-and-the-rlog).
 Both _normalize_ your data by correcting for library size differences but they also _transform_ your data [removing the dependence of the variance on the mean](https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#count-data-transformations), meaning that low mean genes won't have inflated variance from just one or a few samples having higher values than the rest [@Love2020].
 Of the two methods, `rlog()` takes a bit longer to run [@Love2019].
-If you end up using a larger dataset and `rlog()` transformation takes a bit too long, you can switch to using `vst()` with confidence since they yield similar results given the dataset is large enough [@Love2019].
+If you end up using a larger dataset and `rlog()` transformation takes a bit too long, you can switch to using `vst()` with confidence since they yield similar results given the dataset is large enough [@Love2019].

 ### Further resources for DESeq2

@@ -120,15 +120,15 @@ But if the gene you are interested in does not have an Ensembl ID according to t

 #### What about edgeR?

-In short, both edgeR and DESeq2 are good options and we at the CCDL just went with one of our preferences! [See this blog that summarizes these – by one of the creators of DESeq2](https://mikelove.wordpress.com/2016/09/28/deseq2-or-edger/) – he agrees edgeR is also great.
+In short, both edgeR and DESeq2 are good options and we at the CCDL just went with one of our preferences! [See this blog that summarizes these – by one of the creators of DESeq2](https://mikelove.wordpress.com/2016/09/28/deseq2-or-edger/) – he agrees edgeR is also great.

-If you have strong preferences for edgeR, you can definitely use your refine.bio data with it, but we currently do not have examples of that.
-In this case, we'd refer you to [edgeR's section of this example analysis](https://kasperdanielhansen.github.io/genbioconductor/html/Count_Based_RNAseq.html) and wish you the best of luck on your data adventures [@count-based]!
+If you have strong preferences for edgeR, you can definitely use your refine.bio data with it, but we currently do not have examples of that.
+In this case, we'd refer you to [edgeR's section of this example analysis](https://kasperdanielhansen.github.io/genbioconductor/html/Count_Based_RNAseq.html) and wish you the best of luck on your data adventures [@count-based]!

 #### What if I care about isoforms?

-Unfortunately at this time, all download-ready refine.bio data is summarized to the gene level, and there's no great way to examine isoforms with this data.
-If your research needs to know transcript isoform information, you may need to look elsewhere.
+Unfortunately at this time, all download-ready refine.bio data is summarized to the gene level, and there's no great way to examine isoforms with this data.
+If your research needs to know transcript isoform information, you may need to look elsewhere.
 This [paper discusses some tools for these kinds of questions](https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-017-4002-1) [@Zhang2017].
diff --git a/03-rnaseq/00-intro-to-rnaseq.html b/03-rnaseq/00-intro-to-rnaseq.html
index 0d5b0cbd..9155d564 100644
--- a/03-rnaseq/00-intro-to-rnaseq.html
+++ b/03-rnaseq/00-intro-to-rnaseq.html
@@ -1703,7 +1703,7 @@

    0.1 Introduction to RNA-seq techn

    0.1.1 RNA-seq data strengths

    • RNA-seq can assay unknown transcripts, as it is not bound to a pre-determined set of probes like microarrays (Wang et al.).
    • -
    • Its values are considered more dynamic than microarray values which are constrained to a smaller range based on background signal and probesets being saturated (Wang et al.).
    • +
    • Its values are considered more dynamic than microarray values which are constrained to a smaller range based on background signal and probe sets being saturated (Wang et al.).
    diff --git a/03-rnaseq/clustering_rnaseq_01_heatmap.Rmd b/03-rnaseq/clustering_rnaseq_01_heatmap.Rmd
    index b924160e..ad75a277 100644
    --- a/03-rnaseq/clustering_rnaseq_01_heatmap.Rmd
    +++ b/03-rnaseq/clustering_rnaseq_01_heatmap.Rmd
    @@ -117,7 +117,7 @@ Your new analysis folder should contain:
     - A folder for `plots` (currently empty)
     - A folder for `results` (currently empty)

    -Your example analysis folder should now look something like this (except with respective experiment accession id and analysis notebook name you are using):
    +Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using):
    @@ -313,7 +313,7 @@ We've created a heatmap but although our genes and samples are clustered, there
     First let's save our clustered heatmap.

     ### Save heatmap as a PNG
    -You can easily switch this to save to a jpeg or tiff by changing the function and file name within the function to the respective file suffix.
    +You can easily switch this to save to a JPEG or tiff by changing the function and file name within the function to the respective file suffix.

     ```{r}
     # Open a PNG file
    diff --git a/03-rnaseq/clustering_rnaseq_01_heatmap.html b/03-rnaseq/clustering_rnaseq_01_heatmap.html
    index f345bbe6..1f512cfd 100644
    --- a/03-rnaseq/clustering_rnaseq_01_heatmap.html
    +++ b/03-rnaseq/clustering_rnaseq_01_heatmap.html
    @@ -1754,7 +1754,7 @@

    2.6 Check out our file structure!
  • A folder for plots (currently empty)
  • A folder for results (currently empty)
  • -

    Your example analysis folder should now look something like this (except with respective experiment accession id and analysis notebook name you are using):

    +

    Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using):

    In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. Run this chunk to double check that your files are in the right place.

    # Define the file path to the data directory
    @@ -1985,7 +1985,7 @@ 

    4.7 Create a heatmap

    First let’s save our clustered heatmap.

    4.7.1 Save heatmap as a PNG

    -

    You can easily switch this to save to a jpeg or tiff by changing the function and file name within the function to the respective file suffix.

    +

    You can easily switch this to save to a JPEG or tiff by changing the function and file name within the function to the respective file suffix.

    # Open a PNG file
     png(file.path(
       plots_dir,
    @@ -2127,11 +2127,11 @@ 

    6 Session info

     ## [49] munsell_0.5.0 AnnotationDbi_1.50.3 compiler_4.0.2
     ## [52] rlang_0.4.7 grid_4.0.2 RCurl_1.98-1.2
     ## [55] rstudioapi_0.11 bitops_1.0-6 rmarkdown_2.3
    -## [58] gtable_0.3.0 DBI_1.1.0 rematch2_2.1.2
    -## [61] R6_2.4.1 knitr_1.30 dplyr_1.0.2
    -## [64] utf8_1.1.4 bit_4.0.4 readr_1.3.1
    -## [67] stringi_1.5.3 Rcpp_1.0.5 vctrs_0.3.4
    -## [70] geneplotter_1.66.0 tidyselect_1.1.0 xfun_0.17
    +## [58] gtable_0.3.0 DBI_1.1.0 R6_2.4.1
    +## [61] knitr_1.30 dplyr_1.0.2 utf8_1.1.4
    +## [64] bit_4.0.4 readr_1.3.1 stringi_1.5.3
    +## [67] Rcpp_1.0.5 vctrs_0.3.4 geneplotter_1.66.0
    +## [70] tidyselect_1.1.0 xfun_0.17

    Gu Z., R. Eils, and M. Schlesner, 2016 Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics.

    diff --git a/03-rnaseq/differential-expression_rnaseq_01.Rmd b/03-rnaseq/differential-expression_rnaseq_01.Rmd
    index 22a4870c..21d273b6 100644
    --- a/03-rnaseq/differential-expression_rnaseq_01.Rmd
    +++ b/03-rnaseq/differential-expression_rnaseq_01.Rmd
    @@ -117,7 +117,7 @@ Your new analysis folder should contain:
     - A folder for `plots` (currently empty)
     - A folder for `results` (currently empty)

    -Your example analysis folder should now look something like this (except with respective experiment accession id and analysis notebook name you are using):
    +Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using):
    @@ -340,7 +340,7 @@ Using `lfcShrink()` can help decrease noise and preserve large differences betwe
     ```{r}
     deseq_results <- lfcShrink(deseq_object, # This is the original DESeq2 object with DESeq() already having been ran
       coef = 2, # This is based on what log fold change coefficient was used in DESeq(), the default is 2.
    -  res = deseq_results # This needs to be the DESeq results table
    +  res = deseq_results # This needs to be the DESeq2 results table
     )
     ```
    diff --git a/03-rnaseq/differential-expression_rnaseq_01.html b/03-rnaseq/differential-expression_rnaseq_01.html
    index af3d0a0f..1d5573d9 100644
    --- a/03-rnaseq/differential-expression_rnaseq_01.html
    +++ b/03-rnaseq/differential-expression_rnaseq_01.html
    @@ -1754,7 +1754,7 @@

    2.6 Check out our file structure!
  • A folder for plots (currently empty)
  • A folder for results (currently empty)
  • -

    Your example analysis folder should now look something like this (except with respective experiment accession id and analysis notebook name you are using):

    +

    Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using):

    In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. Run this chunk to double check that your files are in the right place.

    # Define the file path to the data directory
    @@ -1992,7 +1992,7 @@ 

    4.6 Run differential expression a
    ## final dispersion estimates
    ## fitting model and testing
    ## -- replacing outliers and refitting for 745 genes
    -## -- DESeq argument 'minReplicatesForReplace' = 7 
    +## -- DESeq2 argument 'minReplicatesForReplace' = 7 
     ## -- original counts are preserved in counts(dds)
    ## estimating dispersions
    ## fitting model and testing
    @@ -2001,7 +2001,7 @@

    4.6 Run differential expression a

    Here we will use lfcShrink() function to obtain shrunken log fold change estimates based on negative binomial distribution. This will add the estimates to your results table. Using lfcShrink() can help decrease noise and preserve large differences between groups (it requires that apeglm package be installed).

    deseq_results <- lfcShrink(deseq_object, # This is the original DESeq2 object with DESeq() already having been ran
       coef = 2, # This is based on what log fold change coefficient was used in DESeq(), the default is 2.
    -  res = deseq_results # This needs to be the DESeq results table
    +  res = deseq_results # This needs to be the DESeq2 results table
     )
    ## using 'apeglm' for LFC shrinkage. If used in published research, please cite:
     ##     Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
    @@ -2052,7 +2052,7 @@ 

    4.6 Run differential expression a

    4.6.1 Check results by plotting one gene

    To double check what a differentially expressed gene looks like, we can plot one with DESeq2::plotCounts() function.

    plotCounts(ddset, gene = "ENSG00000196074", intgroup = "asxl_mutation_status")
    -

    +

    The mutation group samples have higher expression of this gene than the control group, which helps assure us that the results are showing us what we are looking for.

    diff --git a/03-rnaseq/dimension-reduction_rnaseq_01_pca.Rmd b/03-rnaseq/dimension-reduction_rnaseq_01_pca.Rmd
    index 740a9009..0857a897 100644
    --- a/03-rnaseq/dimension-reduction_rnaseq_01_pca.Rmd
    +++ b/03-rnaseq/dimension-reduction_rnaseq_01_pca.Rmd
    @@ -122,7 +122,7 @@ Your new analysis folder should contain:
     - The metadata TSV
     - A folder for `plots` (currently empty)
     - A folder for `results` (currently empty)
    -Your example analysis folder should now look something like this (except with respective experiment accession id and analysis notebook name you are using):
    +Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using):
    diff --git a/03-rnaseq/dimension-reduction_rnaseq_01_pca.html b/03-rnaseq/dimension-reduction_rnaseq_01_pca.html
    index ca0e1950..50c26295 100644
    --- a/03-rnaseq/dimension-reduction_rnaseq_01_pca.html
    +++ b/03-rnaseq/dimension-reduction_rnaseq_01_pca.html
    @@ -1758,7 +1758,7 @@

    2.6 Check out our file structure!
  • A folder for plots (currently empty)
  • A folder for results (currently empty)
    -Your example analysis folder should now look something like this (except with respective experiment accession id and analysis notebook name you are using):
  • +Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using):

    In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. Run this chunk to double check that your files are in the right place.

    diff --git a/03-rnaseq/dimension-reduction_rnaseq_02_umap.Rmd b/03-rnaseq/dimension-reduction_rnaseq_02_umap.Rmd
    index 29c431aa..e06e2901 100644
    --- a/03-rnaseq/dimension-reduction_rnaseq_02_umap.Rmd
    +++ b/03-rnaseq/dimension-reduction_rnaseq_02_umap.Rmd
    @@ -122,7 +122,7 @@ Your new analysis folder should contain:
     - A folder for `plots` (currently empty)
     - A folder for `results` (currently empty)

    -Your example analysis folder should now look something like this (except with respective experiment accession id and analysis notebook name you are using):
    +Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using):
    diff --git a/03-rnaseq/dimension-reduction_rnaseq_02_umap.html b/03-rnaseq/dimension-reduction_rnaseq_02_umap.html
    index 73a1c2b8..8ebd73ab 100644
    --- a/03-rnaseq/dimension-reduction_rnaseq_02_umap.html
    +++ b/03-rnaseq/dimension-reduction_rnaseq_02_umap.html
    @@ -1755,7 +1755,7 @@

    2.6 Check out our file structure!
  • A folder for results (currently empty)
  • -

    Your example analysis folder should now look something like this (except with respective experiment accession id and analysis notebook name you are using):

    +

    Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using):

    In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. Run this chunk to double check that your files are in the right place.

    # Define the file path to the data directory
    diff --git a/04-advanced-topics/validate_differential_expression_adv_topics_00_author_de.Rmd b/04-advanced-topics/validate_differential_expression_adv_topics_00_author_de.Rmd
    index b4d894d8..da0bc8e6 100644
    --- a/04-advanced-topics/validate_differential_expression_adv_topics_00_author_de.Rmd
    +++ b/04-advanced-topics/validate_differential_expression_adv_topics_00_author_de.Rmd
    @@ -169,7 +169,7 @@ Let's pick a probe from the results to double check our output.
     stats[32, ]
     ```
     
    -According to the data above, we should see that Affymetrix probe id `8154846` should
    +According to the data above, we should see that Affymetrix probe ID `8154846` should
     be higher in `SHH` group data than in `NonSHH` group data.
     Let's extract this data and make a boxplot to check if that is what we see.
     
    diff --git a/04-advanced-topics/validate_differential_expression_adv_topics_01.Rmd b/04-advanced-topics/validate_differential_expression_adv_topics_01.Rmd
    index a40a10dd..69b63afe 100644
    --- a/04-advanced-topics/validate_differential_expression_adv_topics_01.Rmd
    +++ b/04-advanced-topics/validate_differential_expression_adv_topics_01.Rmd
    @@ -187,7 +187,7 @@ Here we are only keeping adjusted p-values that are `< 0.05`.
     This is a typical cutoff, but depending on your results and how long of gene
     lists you would like to look at, you may need to adjust this.
     Next, we will group the probes by their associated Ensembl gene IDs and count
    -how many probesets are `up` and how many are `down` based on the `logical`
    +how many probe sets are `up` and how many are `down` based on the `logical`
     variable `direction` we made in the previous section.
     
     ```{r}
    @@ -209,22 +209,22 @@ direction.summary
     
     Now we have a count of how many probes are significant in each direction.
     We will summarize these gene level probe summaries into two gene lists.
    -Some genes may have probesets that are both up and down.
    +Some genes may have probe sets that are both up and down.
     For this analysis, we will only keep genes in the significance lists if all the
    -probesets are in the same direction.
    +probe sets are in the same direction.
     
     ```{r}
     # Create an up-regulated genes list
     author.up.genes <- direction.summary %>%
       # Upregulated genes are only those that have no significant downregulated
    -  # probesets
    +  # probe sets
       dplyr::filter(down == 0) %>%
       dplyr::pull(ENSEMBL) %>%
       as.character()
     
     author.down.genes <- direction.summary %>%
       # Downregulated genes are only those that have no significant upregulated
    -  # probesets
    +  # probe sets
       dplyr::filter(up == 0) %>%
       dplyr::pull(ENSEMBL) %>%
       as.character()
    diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
    index 4dc29031..03dd24f9 100644
    --- a/CONTRIBUTING.md
    +++ b/CONTRIBUTING.md
    @@ -180,13 +180,16 @@ This will help fix some spacing and formatting issues automatically.
     #### Formatting of typical words/items:
     
       - Use "data frame" NOT data.frame or `data frame` (unless referring to the function which should be `data.frame()`)
    -  - Use "IDs" or "ID", NOT ids/id or Ids/Ids
    +  - Use "et al.", NOT "et. al."
    +  - Use "gene sets", NOT "genesets"
    +  - Use "IDs" or "ID", NOT "ids"/"id" or "Ids"/"Id"
       - Use "NA" or "NAs", NOT na/nas or Na or `NA` or `NA`s or NA's
       - Use "PNG", NOT png or `png` or .png (and etc.)
    +  - Use "probe sets", NOT "probesets"
       - Use "refine.bio", NOT "refinebio"
       - Use `.Rmd`,  NOT "Rmd" or ".Rmd"
       - Use "tidyverse", NOT "Tidyverse"
    -  - Use "TSV",  NOT tsv or `tsv` or .tsv
    +  - Use "TSV",  NOT "tsv" or `tsv` or ".tsv"
     
       - **Functions**: For function references in paragraph, use `getwd()`; with backticks and empty parentheses.
       Since function calls always involve `()` being consistent about this adding in this notation might be helpful for beginning R users referencing our examples.  
    diff --git a/components/dictionary.txt b/components/dictionary.txt
    index 23da1121..bbed692b 100644
    --- a/components/dictionary.txt
    +++ b/components/dictionary.txt
    @@ -1,7 +1,6 @@
     actin
     ADT
     al
    -al.
     AML
     AnaLysis
     ASXL
    @@ -34,7 +33,7 @@ devtools
     DGE
     directionality
     DocToc
    -Dorsoventral
    +dorsoventral
     ECM
     edgeR
     edgeR's
    @@ -44,12 +43,10 @@ Ensembl
     ENSG
     Entrez
     et
    -et.
     FACS
     functionalize
     FPKM
     GEne
    -genesets
     generalizable
     ggplot
     GitHub
    @@ -68,7 +65,7 @@ Illumina
     IRF
     isoform
     isoforms
    -jpeg
    +JPEG
     KEGG
     limma
     logFC
    @@ -85,12 +82,11 @@ orthology
     overexpressing
     overexpression
     pheatmap
    -PLoS
    +PLOS
     PLX
     PNAS
     PNG
     polymerase
    -probesets
     prostatectomy
     PRPS
     pre
    diff --git a/template/template_example.Rmd b/template/template_example.Rmd
    index 19a11678..1ebb6f36 100644
    --- a/template/template_example.Rmd
    +++ b/template/template_example.Rmd
    @@ -119,7 +119,7 @@ Your new analysis folder should contain:
     - A folder for `plots` (currently empty)  
     - A folder for `results` (currently empty)  
         
    -Your example analysis folder should now look something like this (except with respective experiment accession id and analysis notebook name you are using): 
    +Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using): 
     
     
     
    @@ -194,7 +194,7 @@ metadata <- readr::read_tsv(file.path(data_dir, # Replace with path to your meta
     df <- readr::read_tsv(file.path(data_dir, # Replace with path to your data file
                                     {{DATA_ACCESSION FILENAME}} # Replace with the name of your data file
                                     )) %>%
    -  # Tuck away the Gene id column as rownames
    +  # Tuck away the gene ID column as rownames
       tibble::column_to_rownames("Gene")
     ```