convert mouse gene identifiers to human ones that match data in GWAS summary data #3

Open
Jylab-Genetics opened this issue Nov 4, 2024 · 7 comments

Comments

@Jylab-Genetics

Why do we need to execute 'convert mouse gene identifiers to human ones that match data in GWAS summary data'? I don't quite understand. Are GWAS sources different from single-cell data sources?

@VincentQLai
Collaborator

Yes, they are sometimes different.

For a genome-wide association study (GWAS), whole-genome data are collected from hundreds of thousands of individuals and a statistical test is performed for each SNP, so the results are primarily encoded in terms of the human genome. Single-cell RNA-seq, by contrast, can be performed on any organism. The gene IDs therefore need to be converted so that the two data sources match.

For seismic specifically, the required input is a MAGMA gene-level Z-score file, in which SNP-level statistics are aggregated to the gene level and genes are encoded as human Entrez IDs. As a result, unless the genes in the scRNA-seq data are already encoded as human Entrez IDs, gene ID conversion is needed. Currently, seismic's built-in data structure can only handle conversions between several listed gene ID types. We plan to implement more flexible conversion options in future updates.
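To make the conversion step concrete, here is a minimal base-R sketch of relabeling a score matrix through a mapping table. It is not seismic's actual implementation: the `mapping` data frame, its column names (`mouse_symbol`, `human_entrez`), and the toy scores are all assumptions made up for illustration.

```r
# Hypothetical mouse-symbol -> human-Entrez mapping (illustrative values only)
mapping <- data.frame(
  mouse_symbol = c("Ins2", "Gcg", "Sst"),
  human_entrez = c("3630", "2641", "6750"),
  stringsAsFactors = FALSE
)

# Toy gene-by-cell-type score matrix keyed by mouse gene symbols
scores <- matrix(
  c(0.9, 0.1,
    0.2, 0.8,
    0.5, 0.5,
    0.3, 0.7),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("Ins2", "Gcg", "Sst", "Gm1234"), c("beta", "alpha"))
)

# Look up each row's gene in the mapping; unmapped genes give NA
idx <- match(rownames(scores), mapping$mouse_symbol)

# Drop genes without a mapping, then relabel the remaining rows
converted <- scores[!is.na(idx), , drop = FALSE]
rownames(converted) <- mapping$human_entrez[idx[!is.na(idx)]]
```

Genes with no entry in the mapping (here the made-up `Gm1234`) are dropped, which is why the gene IDs in the scRNA-seq data and the MAGMA file have to share an ID type before they can be matched.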

Hope this information helps address your question.

Screenshot 2024-11-04 at 12 56 03 PM

@Jylab-Genetics
Author

Thank you for your prompt response. This software demonstrates exceptional flexibility in integrating SNP and single-cell RNA-seq data, especially in gene ID conversion and data matching, offering unprecedented convenience for association analyses at the gene level and opening new avenues for research.

Regarding cross-species compatibility, I would like to confirm: does the software currently mainly support association data between humans and mice? Is it possible to extend this support to integrate human SNP information with single-cell data from non-human primates or other mammals? Such an expansion would be highly valuable for cross-species genetic association studies, helping to uncover the molecular mechanisms of specific traits across different species. I look forward to your further clarification.

@VincentQLai
Collaborator

VincentQLai commented Nov 27, 2024

Yes, our framework supports finding trait-associated cell types for any trait in either mouse or human scRNA-seq datasets. Unfortunately, as you can see from the previous screenshot, the current version of seismicGWAS uses a static internal gene mapping table that is limited to mouse and human gene symbols. In future versions, we plan to allow customized gene mapping inputs. While most current scRNA-seq datasets are human- or mouse-based, we recognize the growing use of other organisms and plan to accommodate them in upcoming updates.
If this functionality is critical for your work, please let us know, and we can prioritize its implementation.

@VincentQLai
Collaborator

VincentQLai commented Nov 28, 2024

Great news! Our latest version now supports translating gene IDs for the gene-level seismic specificity scores using a customized gene mapping table. Please refer to the function description page for the exact arguments. Here is an example of how you might create a customized gene mapping data frame and use seismicGWAS to identify associated cell types. Suppose you have reached the step where you have a specificity score matrix and would like to translate it from mouse gene symbols to human Entrez IDs using a customized gene mapping table. For example, you can obtain the homology mapping from the homologene package.

# Load the homologene package
library("homologene")

# Create a gene mapping table between mouse (Taxonomy ID: 10090) and human (Taxonomy ID: 9606)
# To look up a species' taxonomy ID, please refer to NCBI
mapping <- homologene(genes = rownames(tmfacs_sce_small), inTax = 10090, outTax = 9606)

# Translate gene IDs from mouse gene symbols (column "10090") to human Entrez IDs (column "9606_ID")
tmfacs_sscore_hsa <- translate_gene_ids(tmfacs_sscore, from = "10090", to = "9606_ID", gene_mapping_table = mapping)

# Prioritize associated cell types
t2d <- get_ct_trait_associations(tmfacs_sscore_hsa, t2d_magma)
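Before translating, it can be worth checking how well a custom mapping table covers your genes. The base-R sketch below is only an illustration: it builds a toy table that mimics the column layout assumed for the homologene() output (`10090`, `9606`, `10090_ID`, `9606_ID`), and the rows are chosen to show a many-to-one case rather than copied from real homologene results. The `genes` vector stands in for the row names of your own object.

```r
# Toy table mimicking the assumed column layout of a mouse-to-human mapping
mapping <- data.frame(
  "10090"    = c("Ins1", "Ins2", "Gcg"),
  "9606"     = c("INS", "INS", "GCG"),
  "10090_ID" = c(16333, 16334, 14526),
  "9606_ID"  = c(3630, 3630, 2641),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Stand-in for the gene symbols in the scRNA-seq data (hypothetical)
genes <- c("Ins1", "Ins2", "Gcg", "Gm1234")

# Genes that have no row in the mapping table
unmapped <- setdiff(genes, mapping[["10090"]])

# Human IDs that more than one mouse gene maps to
dup_human <- unique(mapping[["9606_ID"]][duplicated(mapping[["9606_ID"]])])

# Fraction of input genes that can be translated
coverage <- mean(genes %in% mapping[["10090"]])
```

Low coverage or many duplicated target IDs would suggest inspecting the mapping table before running the association step.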

@Jylab-Genetics
Author

Jylab-Genetics commented Nov 28, 2024 via email

@VincentQLai
Collaborator

VincentQLai commented Nov 28, 2024

Thank you for sharing your results! Our seismic framework integrates GWAS data to prioritize cell types in normal tissues that may exhibit genetic vulnerability. A variety of factors can influence the results. Below is a troubleshooting checklist to help identify potential issues:

  • Quality Control: Ensure that low-quality cells have been removed, including cells with insufficient RNA counts, cells identified as doublets, and cells with excessive mitochondrial gene expression.
  • Normalization: Verify that the data have been properly normalized. We recommend using scran for size factor calculation.
  • GWAS Summary Statistics: Ensure the GWAS summary statistics file is processed correctly. In our study, we used a window size of 35 kb upstream and 10 kb downstream. While seismic shows consistency across different window sizes, this parameter may still influence outcomes. The population structure should ideally be simple (e.g., cohorts of European ancestry) to allow MAGMA to effectively account for linkage disequilibrium (LD). Additionally, verify that the SNPs are annotated to the correct genome assembly version. If they are based on a noncanonical version, update the auxiliary files for MAGMA or use genome liftover tools to adjust SNP locations.
  • Granularity of Analysis: Select the appropriate granularity for your analysis. Genetic diseases may only affect specific subpopulations within a cell type, making signals harder to detect at broader levels. If this is the case, consider dividing the broader cell type into more granular subpopulations to enhance detection.
  • GWAS Quality: Evaluate the quality of the GWAS data. GWAS quality can vary significantly depending on donor recruitment, quality control, covariate adjustments, and the methodology used. Recent studies often exhibit better signals, but some older or specialized GWAS datasets may also produce strong, clear results.
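To illustrate what the 35 kb upstream / 10 kb downstream window in the checklist means in coordinates, here is a small base-R sketch. It only shows the arithmetic: the function name `extend_window`, the toy gene coordinates, and the assumption that "upstream" is taken relative to the gene's strand are illustrative and not taken from seismic or MAGMA code.

```r
# Extend gene coordinates by asymmetric flanks (in base pairs), respecting strand.
# On the "+" strand upstream is before the start; on "-" it is after the end.
extend_window <- function(start, end, strand,
                          upstream = 35000, downstream = 10000) {
  plus <- strand == "+"
  win_start <- ifelse(plus, start - upstream, start - downstream)
  win_end   <- ifelse(plus, end + downstream, end + upstream)
  # Clamp at 1 so windows never run off the start of the chromosome
  data.frame(win_start = pmax(win_start, 1), win_end = win_end)
}

# Two made-up genes, one on each strand
genes <- data.frame(
  start  = c(100000, 500000),
  end    = c(120000, 510000),
  strand = c("+", "-"),
  stringsAsFactors = FALSE
)

windows <- extend_window(genes$start, genes$end, genes$strand)
```

SNPs falling inside a gene's window are the ones that would be aggregated into that gene's statistic, so a different window or a mismatched genome assembly changes which SNPs contribute.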

@Jylab-Genetics
Author

Jylab-Genetics commented Nov 29, 2024 via email
