Group project for STAT540: Statistical Methods for High Dimensional Biology

JiaheZh/stat540-project

The Splice Girls

This repository is a clone of the private repository under the STAT540 organization.

Table of Contents

  1. Team Members
  2. Project Proposal
  3. Progress Report
  4. Dataset
  5. Presentation
  6. Summary
    1. Data Retrieval
    2. Principal Component Analysis
    3. Box Plots
    4. P-value Histograms
    5. P-value Density Plot
    6. Beta Density Plot
    7. Hierarchical Clustering
    8. Linear Regression
    9. Strip Plots
    10. Chromosome Plot
    11. Pathway Analysis
    12. Gene Set Enrichment Analysis
  7. Division of Labour

Team Members

| Name | Department/Program | Expertise/Interests | GitHub ID |
| --- | --- | --- | --- |
| Diana Lin | Bioinformatics | Genome/Transcriptome Assembly, Bioinformatics Pipeline Development, Mammalian Physiology, Datamining | @dy-lin |
| Almas Khan | Bioinformatics | Epigenetics, mammalian biology, microbiology | @almas2019 |
| Denitsa Vasileva | Bioinformatics | Gene Set Enrichment Analysis, Epigenetics, PPI Networks | @Deni678 |
| Nairuz Elazzabi | Medical Genetics | Neurodevelopmental Genetics, Epigenetics, Transcriptome Profiling | @Nayrouz109 |

Project Proposal

Our project proposal can be viewed here.

Progress Report

Our progress report can be found here.

Dataset

Presentation

Our presentation slides can be found here.

Our presentation document can be found here.

Summary

Data Retrieval

Aims: to retrieve the GEO dataset and output it in a readily usable form (i.e. reshaped, filtered, wrangled, etc.), using src/fetch_data.R.
Conclusions: many of the .rds objects written by the script are not present in the repository because of the GitHub file size limit. Those that are present can be found in data/raw_data/ and data/processed_data/.

Principal Component Analysis

Aims: to see if there is an underlying signal that causes the data to cluster in a certain way, using src/PCA_final.R.
Results: results can be found in the results/final/ directory as well as in the documentation docs/PCA.md.
Conclusions: in the PC1 vs PC2, PC2 vs PC3, and PC1 vs PC3 plots, the data do not cluster by any of the covariates.
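
To illustrate the kind of check described above, here is a minimal sketch in base R. It does not reproduce src/PCA_final.R: the beta matrix is simulated, and the names `betas`, `cancer_type`, and `scores` are assumptions made for the example.

```r
# Simulated stand-in for a probes x samples matrix of beta values
# (NOT the project data).
set.seed(540)
n_probes <- 200
n_samples <- 12
betas <- matrix(
  rbeta(n_probes * n_samples, shape1 = 2, shape2 = 5),
  nrow = n_probes,
  dimnames = list(paste0("probe", seq_len(n_probes)),
                  paste0("sample", seq_len(n_samples)))
)

# Hypothetical covariate: half primary, half metastatic.
cancer_type <- factor(rep(c("primary", "metastatic"), each = n_samples / 2))

# prcomp() treats rows as observations, so transpose to samples x probes.
pca <- prcomp(t(betas), center = TRUE, scale. = TRUE)

# Proportion of variance explained by each principal component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Scores on the first three PCs, one row per sample, with the covariate attached.
scores <- data.frame(pca$x[, 1:3], cancer_type = cancer_type)

# Pairwise PC plots coloured by covariate (drawn only in an interactive session).
if (interactive()) {
  pairs(scores[, c("PC1", "PC2", "PC3")],
        col = as.integer(scores$cancer_type), pch = 19)
}
```

If the samples separated by colour in any of the three panels, that covariate would be a candidate driver of the signal; in random data such as this, no separation is expected.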

Box Plots

Aims: to discern any patterns in beta value distribution across the various samples, using src/boxplot_final.R.
Results: results can be found in the results/final/ directory as well as in the documentation docs/boxplots.md.
Conclusions: there is no discernible pattern in the distribution of beta values across the samples.

P-value Histograms

Aims: To visualize the p-value distribution for two different models, cancer_type*age and cancer_type, using src/pValueHistogram.R.
Results: results can be found in the results/final/ directory as well as in the documentation docs/pValue_Histogram.md.
Conclusions: Under the cancer_type*age model, the p-values are approximately uniformly distributed, so we cannot reject the null hypothesis that the interaction between cancer_type and age does not affect methylation signal. Under the cancer_type model, many p-values fall in a peak near 0, which provides evidence to reject the null hypothesis.
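
The two histogram shapes mentioned above can be reproduced with simulated p-values. This is a sketch under stated assumptions, not the contents of src/pValueHistogram.R: the z-scores, the 20% effect fraction, and the effect size of 3 are all made up for illustration.

```r
set.seed(540)
n_tests <- 5000
n_effect <- 1000  # hypothetical: 20% of tests carry a real effect

# Scenario A: every test is null, so z-scores follow N(0, 1).
z_null <- rnorm(n_tests)

# Scenario B: the first n_effect tests have their mean shifted to 3.
has_effect <- seq_len(n_tests) <= n_effect
z_mixed <- rnorm(n_tests, mean = ifelse(has_effect, 3, 0))

# Two-sided p-values from the z-scores.
p_null <- 2 * pnorm(-abs(z_null))
p_mixed <- 2 * pnorm(-abs(z_mixed))

# Bin both sets of p-values into 20 equal-width bins on [0, 1].
breaks <- seq(0, 1, length.out = 21)
h_null <- hist(p_null, breaks = breaks, plot = FALSE)
h_mixed <- hist(p_mixed, breaks = breaks, plot = FALSE)

# Share of p-values in the first bin (p <= 0.05) for each scenario.
first_bin <- c(null = h_null$counts[1], mixed = h_mixed$counts[1]) / n_tests

# Draw the two histograms side by side (only in an interactive session).
if (interactive()) {
  op <- par(mfrow = c(1, 2))
  plot(h_null, main = "All null", xlab = "p-value")
  plot(h_mixed, main = "Some real effects", xlab = "p-value")
  par(op)
}
```

In the all-null scenario each bin holds roughly 5% of the p-values, giving a flat histogram; when some tests carry real effects, the first bin holds a visibly larger share, which is the peak near 0.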

P-value Density Plot

Aims: To compute and visualize p-value density estimates. The p-values were generated using a two-sample t-test of the two samples cancer_type$primary and cancer_type$metastatic, using src/pvalue_density_plot.R.
Results: results can be found in the results/final/ directory, as well as in the documentation docs/pValue_density.md.
Conclusions: The difference in mean methylation signal between the two samples cancer_type$primary and cancer_type$metastatic seems to be significant enough to reject the null hypothesis, as many small p-values are displayed in the plot.
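
A per-probe two-sample t-test followed by a density estimate might look like the sketch below. It is not src/pvalue_density_plot.R: the two matrices are simulated with no true group difference, and `primary`, `metastatic`, and the group size are assumptions for the example. `t.test()` defaults to the Welch (unequal variance) test; whether the project used that variant is not stated.

```r
set.seed(540)
n_probes <- 500
n_per_group <- 10

# Simulated beta values (NOT project data): probes in rows, samples in columns.
primary <- matrix(rbeta(n_probes * n_per_group, 2, 5), nrow = n_probes)
metastatic <- matrix(rbeta(n_probes * n_per_group, 2, 5), nrow = n_probes)

# One two-sample t-test per probe, keeping only the p-value.
p_values <- vapply(seq_len(n_probes), function(i) {
  t.test(primary[i, ], metastatic[i, ])$p.value
}, numeric(1))

# Kernel density estimate evaluated on [0, 1], the range of a p-value.
dens <- density(p_values, from = 0, to = 1)

# Plot the density (only in an interactive session).
if (interactive()) {
  plot(dens, main = "Density of per-probe p-values", xlab = "p-value")
}
```

Because the simulated groups share one distribution, the estimated density should be roughly flat; a density concentrated near 0 would indicate many probes with a difference in means.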

Beta Density Plot

Aims: to examine the density of beta values across two different cancer types using the src/beta_density_plot.R script.
Results: results can be found in the results/final/ directory as well as in the documentation docs/beta_density.md.
Conclusions: there is no significant difference in the density distribution of beta values between the two cancer types.

Hierarchical Clustering

Aims: To perform hierarchical clustering of beta values to see if there is a pattern of certain samples clustering together, using src/hierarchical_clustering.R.
Results: results can be found in the results/final/ directory, as well as in the documentation docs/hierarchical_clustering.md.
Conclusions: No consistent clustering pattern was seen when the samples were coloured by covariate.
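
As a rough illustration of this step, the sketch below clusters simulated samples and compares the clusters with a covariate. The Euclidean distance, average linkage, and the cut into two clusters are assumptions for the example; src/hierarchical_clustering.R may use different settings.

```r
set.seed(540)

# Simulated beta values (NOT project data): 200 probes x 12 samples.
betas <- matrix(rbeta(200 * 12, 2, 5), nrow = 200,
                dimnames = list(NULL, paste0("sample", 1:12)))

# Hypothetical covariate for the 12 samples.
cancer_type <- rep(c("primary", "metastatic"), each = 6)

# dist() measures distances between rows, so transpose to cluster samples.
sample_dist <- dist(t(betas), method = "euclidean")
tree <- hclust(sample_dist, method = "average")

# Cut the tree into two clusters and cross-tabulate against the covariate.
clusters <- cutree(tree, k = 2)
agreement <- table(cluster = clusters, cancer_type = cancer_type)

# Dendrogram labelled with sample name and covariate (interactive sessions only).
if (interactive()) {
  plot(tree, labels = paste(colnames(betas), cancer_type))
}
```

If a covariate drove the clustering, each row of `agreement` would be dominated by one covariate level; a table with counts spread across both columns matches the "no pattern" outcome.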

Linear Regression

Aims: To perform linear regression to get a better understanding of differentially methylated genes across a variety of conditions, using the script src/linear_regression.R. The conditions were age, cancer_type (primary vs metastatic), and an interaction term, cancer_type:age.
Results: results can be found in the results/final/ and results/revised/ directories, as well as in the documentation docs/Linear_regression_results.md.
Conclusions: Based on the high adjusted p-values (calculated using the Benjamini-Hochberg (B-H) method of multiple test correction), there were no significantly differentially methylated genes in any of the conditions mentioned above.
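
The model and the multiple test correction can be sketched in base R as follows. This is an assumption-laden stand-in: per-probe `lm()` fits replace limma (which the Pathway Analysis section suggests the project used, and which additionally moderates variances), and the data, `meta`, and `fit_probe` are hypothetical.

```r
set.seed(540)
n_probes <- 100
n_samples <- 20

# Hypothetical sample metadata (NOT project data).
meta <- data.frame(
  cancer_type = factor(rep(c("primary", "metastatic"), each = n_samples / 2),
                       levels = c("primary", "metastatic")),
  age = round(runif(n_samples, 40, 80))
)

# Simulated beta values with no true effects: probes in rows, samples in columns.
betas <- matrix(rbeta(n_probes * n_samples, 2, 5), nrow = n_probes)

# Fit y ~ cancer_type * age for one probe and return the p-values of the
# three non-intercept terms (cancer_type, age, and their interaction).
fit_probe <- function(y) {
  fit <- lm(y ~ cancer_type * age, data = meta)
  coef(summary(fit))[-1, "Pr(>|t|)"]
}

# One row per probe, one column per model term.
raw_p <- t(apply(betas, 1, fit_probe))
colnames(raw_p) <- c("cancer_type", "age", "cancer_type:age")

# Benjamini-Hochberg adjustment, applied separately to each term.
adj_p <- apply(raw_p, 2, p.adjust, method = "BH")

# Number of probes per term passing a 0.05 threshold after adjustment.
n_significant <- colSums(adj_p < 0.05)
```

The B-H adjusted p-value is never smaller than the raw one, so a probe that looks significant before correction can lose significance afterwards; with many probes and weak effects, this can leave no probes below the threshold.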

Strip Plots

Aims: to visualize the distribution of beta values across all samples faceted by the top ten genes, using src/strip_plot.R.
Results: results can be found in the results/final/ directory as well as in the documentation docs/strip_plots.md.
Conclusions: the top 10 genes include those that are positively differentially methylated in primary cancers, as well as genes that are negatively differentially methylated in primary cancers.

Chromosome Plot

Aims: to visualize the differentially methylated regions along all the chromosomes, using src/chr_plot.R.
Results: results can be found in the results/final/ directory as well as in the documentation docs/chr_plot.md.
Conclusions: there are differentially methylated regions along every single chromosome.

Pathway Analysis

Aims: To identify KEGG pathways in which the genes identified by limma are enriched, using src/API_pathways_KEGG.R.
Results: results can be found in the results/final/ directory as well as in the documentation docs/pathway_KEGG_top_genes.md.
Conclusions: Genes in KEGG pathways that have been previously implicated in HNSCC in the literature were shown to be enriched in our analysis.

Gene Set Enrichment Analysis

Aims: To identify gene sets enriched for the top genes identified by the limma analysis, using docs/gene_set_enrichment.md.
Results: Running docs/gene_set_enrichment.md produced the following error message:

Error in (function (annotation = NULL, aspects = c("Molecular Function", : Something went wrong. 
Blame the dev 
INFO: Data directory is /Users/almas/ermineJ.data 
DEBUG: Custom gene sets directory is /Users/almas/ermineJ.data/genesets 
DEBUG: Setting score start column to 2 
INFO: Gene symbols for each term will be output Reading GO descriptions from /Users/almas/Documents/GitHub/Repo_team_The-Splice-Girls_W2020/src/GO.xml ... 
INFO: Could not locate aspect for GO:0018704: obsolete 5-chloro-2-hydroxymuconic semialdehyde dehalogenase activity
INFO: Could not locate aspect for GO:0006494: obsolete protein amino acid terminal glycosylation 
INFO: Could not locate aspect for GO:0006496: obsolete protein amino acid terminal N-glycosylation 
INFO: Could not locate aspect for GO:0090413: obsolete negative regulation of transcription from RNA polymerase II promoter involved in fatty acid biosynthetic process 
INFO: Could not locate aspect for GO:0090412: obsolete positive regulation of transcription from RNA polymerase II promoter involved in fatty acid biosynthet

Conclusions: Software incompatibility prevented us from running this analysis.

Division of Labour

| Group Member | Scripts | Other |
| --- | --- | --- |
| Diana Lin | src/fetch_data.R, src/PCA_final.R, src/PCA_revised.R, src/boxplot_final.R, src/strip_plot.R, src/chr_plot.R | GitHub management (directory structure, READMEs, general issues), administrative tasks (deliverable submission) |
| Almas Khan | src/hierarchical_clustering.R, src/linear_regression.R | GitHub management (moving files, editing READMEs) |
| Denitsa Vasileva | src/beta_density_plot.R, src/API_pathways_KEGG.R | GitHub management (editing READMEs), administrative tasks (deliverable submission) |
| Nairuz Elazzabi | src/pValueHistogram.R, src/pvalue_density_plot.R, src/pValHistogram_Revised.R | GitHub management (editing READMEs) |
