The Splice Girls

This repository is a clone of the private repository under the STAT540 organization.

Team Members

Name	Department/Program	Expertise/Interests	GitHub ID
Diana Lin	Bioinformatics	Genome/Transcriptome Assembly, Bioinformatics Pipeline Development, Mammalian Physiology, Data Mining	@dy-lin
Almas Khan	Bioinformatics	Epigenetics, Mammalian Biology, Microbiology	@almas2019
Denitsa Vasileva	Bioinformatics	Gene Set Enrichment Analysis, Epigenetics, PPI Networks	@Deni678
Nairuz Elazzabi	Medical Genetics	Neurodevelopmental Genetics, Epigenetics, Transcriptome Profiling	@Nayrouz109

Project Proposal

Our project proposal can be viewed here.

Progress Report

Our progress report can be found here.

Dataset

GSE124052

Presentation

Our presentation slides can be found here.

Our presentation document can be found here.

Summary

Data Retrieval

Aims: to retrieve the GEO dataset and output it in a readily usable form (i.e. reshaped, filtered, wrangled, etc.), using src/fetch_data.R.
Conclusions: the script writes many .rds objects that are not present in the repository because of the GitHub file size limit. Those that are present can be found in data/raw_data/ and data/processed_data/.

Principal Component Analysis

Aims: to see if there is an underlying signal that causes the data to cluster in a certain way, using src/PCA_final.R.
Results: results can be found in the results/final/ directory as well as in the documentation docs/PCA.md.
Conclusions: in the PC1 vs PC2, PC2 vs PC3, and PC1 vs PC3 plots, the data do not cluster by any of the covariates.

Box Plots

Aims: to discern any patterns in beta value distribution across the various samples, using src/boxplot_final.R.
Results: results can be found in the results/final/ directory as well as in the documentation docs/boxplots.md.
Conclusions: there is no discernable pattern in the distribution of beta values across the samples.

P-value Histograms

Aims: To visualize the p-value distribution for two different models, cancer_type*age and cancer_type, using src/pValueHistogram.R.
Results: results can be found in the results/final/ directory as well as in the documentation docs/pValue_Histogram.md.
Conclusions: The model cancer_type*age does not provide enough statistical power: its p-values are approximately uniformly distributed, so we cannot reject the null hypothesis that the interaction between cancer_type and age does not affect the methylation signal. The model cancer_type does provide some statistical power: many of its p-values form a peak near 0, which provides evidence to reject the null hypothesis.
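
The contrast between a flat histogram and one with a peak near 0 can be illustrated with a minimal sketch. It is written in Python, not the R of src/pValueHistogram.R, and it is not the project's code. The p-values are simulated, not taken from GSE124052, and the helper names `histogram_counts` and `first_bin_ratio` are hypothetical. The sketch bins p-values and compares the count in the lowest bin with the count expected if the p-values were uniform.

```python
import random


def histogram_counts(pvalues, n_bins=20):
    """Count p-values in n_bins equal-width bins covering [0, 1]."""
    counts = [0] * n_bins
    for p in pvalues:
        # p == 1.0 would index one past the end, so clamp it into the last bin.
        counts[min(int(p * n_bins), n_bins - 1)] += 1
    return counts


def first_bin_ratio(pvalues, n_bins=20):
    """Observed count in the lowest bin divided by the count expected if uniform."""
    counts = histogram_counts(pvalues, n_bins)
    expected = len(pvalues) / n_bins
    return counts[0] / expected


rng = random.Random(0)  # fixed seed so the simulation is repeatable

# Simulated p-values with no signal: uniform on [0, 1).
null_like = [rng.random() for _ in range(10_000)]

# Simulated p-values where roughly 20% of tests carry signal; raising a
# uniform draw to the 4th power pushes it towards 0.
with_signal = [
    rng.random() ** 4 if rng.random() < 0.2 else rng.random()
    for _ in range(10_000)
]

print("lowest-bin ratio, no signal:  ", round(first_bin_ratio(null_like), 2))
print("lowest-bin ratio, with signal:", round(first_bin_ratio(with_signal), 2))
```

A ratio close to 1 corresponds to the flat shape described for cancer_type*age, and a ratio well above 1 corresponds to the peak near 0 described for cancer_type.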

P-value Density Plot

Aims: To compute and visualize p-value density estimates, using src/pvalue_density_plot.R. The p-values were generated using a two-sample t-test of the two samples cancer_type$primary and cancer_type$metastatic.
Results: results can be found in the results/final/ directory, as well as in the documentation docs/pValue_density.md.
Conclusions: The difference in mean methylation signal between the two samples cancer_type$primary and cancer_type$metastatic seems to be significant enough to reject the null hypothesis, as many small p-values are displayed in the plot.
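
As a rough illustration of the test statistic behind those p-values, here is a minimal Python sketch of a two-sample t statistic for a single probe. It assumes the Welch (unequal variance) form; the README does not say which variant src/pvalue_density_plot.R uses. The function name `welch_t` is hypothetical and the beta values are made up, not taken from GSE124052. Converting the statistic to a p-value needs the t distribution, which the Python standard library does not provide, so the sketch stops at the statistic and its degrees of freedom.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error contributed by each group (sample variance / n).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t, df


# Made-up beta values for one probe in each group (assumption, for illustration).
primary = [0.10, 0.20, 0.30]
metastatic = [0.40, 0.50, 0.60]

t_stat, dof = welch_t(primary, metastatic)
print(f"t = {t_stat:.3f}, df = {dof:.1f}")
```

Repeating this over every probe gives one p-value per probe, and it is the density of those p-values that the plot summarizes.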

Beta Density Plot

Aims: to examine the density of beta values across two different cancer types using the src/beta_density_plot.R script.
Results: results can be found in the results/final/ directory as well as in the documentation docs/beta_density.md.
Conclusions: no significant difference was observed in the density distribution of beta values between the two cancer types.

Hierarchical Clustering

Aims: To perform hierarchical clustering of beta values to see if there is a pattern of certain samples clustering together, using src/hierarchical_clustering.R.
Results: results can be found in the results/final/ directory, as well as in the documentation docs/hierarchical_clustering.md.
Conclusions: No common clustering pattern was seen when colouring by covariates.

Linear Regression

Aims: To perform linear regression to get a better understanding of differentially methylated genes across a variety of conditions, using the script src/linear_regression.R. The conditions were age, cancer_type (primary vs metastatic), and an interaction term, cancer_type:age.
Results: results can be found in the results/final/ and results/revised/ directories, as well as in the documentation docs/Linear_regression_results.md.
Conclusions: Based on the high adjusted p-values (calculated using the Benjamini-Hochberg (B-H) method of multiple testing correction), there were no significantly differentially methylated genes under any of the conditions mentioned above.
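
For readers unfamiliar with the B-H adjustment mentioned above, the following is a minimal Python sketch of the standard procedure. It is not the code in src/linear_regression.R, the function name `benjamini_hochberg` is hypothetical, and the input p-values are made up for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is the
    # minimum of its own p * m / rank and every larger-ranked value,
    # capped at 1.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up raw p-values (assumption, for illustration only).
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)

# A gene is called significant only if its adjusted p-value is below the
# chosen threshold; 0.05 here is an illustrative choice.
significant = [a < 0.05 for a in adjusted]
print(adjusted, significant)
```

The adjusted value is never smaller than the raw p-value, so when there are many tests and few very small raw p-values, none may pass the threshold, which is the situation the conclusion above describes.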

Strip Plots

Aims: to visualize the distribution of beta values across all samples faceted by the top ten genes, using src/strip_plot.R.
Results: results can be found in the results/final/ directory as well as in the documentation docs/strip_plots.md.
Conclusions: the top 10 genes include those that are positively differentially methylated in primary cancers, as well as genes that are negatively differentially methylated in primary cancers.

Chromosome Plot

Aims: to visualize the differentially methylated regions along all the chromosomes, using src/chr_plot.R.
Results: results can be found in the results/final/ directory as well as in the documentation docs/chr_plot.md.
Conclusions: there are differentially methylated regions along every single chromosome.

Pathway Analysis

Aims: To identify the KEGG pathways in which the genes identified by limma are enriched, using src/API_pathways_KEGG.R.
Results: results can be found in the results/final/ directory as well as in the documentation docs/pathway_KEGG_top_genes.md.
Conclusions: Genes in KEGG pathways that have been previously implicated in HNSCC in the literature were shown to be enriched in our analysis.

Gene Set Enrichment Analysis

Aims: To identify gene sets enriched for the top genes identified by the limma analysis, using docs/gene_set_enrichment.md.
Results: The script docs/gene_set_enrichment.md displayed the following error message:

Error in (function (annotation = NULL, aspects = c("Molecular Function", : Something went wrong. 
Blame the dev 
INFO: Data directory is /Users/almas/ermineJ.data 
DEBUG: Custom gene sets directory is /Users/almas/ermineJ.data/genesets 
DEBUG: Setting score start column to 2 
INFO: Gene symbols for each term will be output Reading GO descriptions from /Users/almas/Documents/GitHub/Repo_team_The-Splice-Girls_W2020/src/GO.xml ... 
INFO: Could not locate aspect for GO:0018704: obsolete 5-chloro-2-hydroxymuconic semialdehyde dehalogenase activity
INFO: Could not locate aspect for GO:0006494: obsolete protein amino acid terminal glycosylation 
INFO: Could not locate aspect for GO:0006496: obsolete protein amino acid terminal N-glycosylation 
INFO: Could not locate aspect for GO:0090413: obsolete negative regulation of transcription from RNA polymerase II promoter involved in fatty acid biosynthetic process 
INFO: Could not locate aspect for GO:0090412: obsolete positive regulation of transcription from RNA polymerase II promoter involved in fatty acid biosynthet

Conclusions: Software incompatibility prevented us from running this analysis.

Division of Labour

Group Member	Scripts	Other
Diana Lin	`src/fetch_data.R`, `src/PCA_final.R`, `src/PCA_revised.R`, `src/boxplot_final.R`, `src/strip_plot.R`, `src/chr_plot.R`	GitHub management (directory structure, READMEs, general issues), administrative tasks (deliverable submission)
Almas Khan	`src/hierarchical_clustering.R`, `src/linear_regression.R`	GitHub management (moving files, editing READMEs)
Denitsa Vasileva	`src/beta_density_plot.R`, `src/API_pathways_KEGG.R`	GitHub management (editing READMEs), administrative tasks (deliverable submission)
Nairuz Elazzabi	`src/pValueHistogram.R`, `src/pvalue_density_plot.R`, `src/pValHistogram_Revised.R`	GitHub management (editing READMEs)
