Alternative way to get genomic positions #144

Open
grst opened this issue Oct 4, 2024 · 2 comments · May be fixed by #150
Labels
enhancement New feature or request

Comments

@grst
Member

grst commented Oct 4, 2024

Description of feature

Currently, the only way to get genomic positions is by reading a GTF file. This is (a) slow, and (b) gtfparse has repeatedly caused problems.

It could be more convenient to retrieve this information from online sources such as BioMart or Bioconductor AnnotationHub.

Then gtfparse could become an optional dependency.
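
A minimal sketch of what "optional dependency" could look like in practice: the import is deferred until the GTF code path is actually used, and a missing package produces an actionable error. The helper name `import_optional` and the error wording are hypothetical, not part of any existing API.

```python
import importlib
from types import ModuleType


def import_optional(module_name: str, purpose: str) -> ModuleType:
    """Import an optional dependency, or fail with an actionable message.

    `import_optional` is a hypothetical helper used for illustration only.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as e:
        raise ImportError(
            f"`{module_name}` is required for {purpose}. "
            f"Install it with `pip install {module_name}`."
        ) from e
```

The GTF reader would then call something like `import_optional("gtfparse", "reading GTF files")` inside the function body, so users who get positions from an online source never need the package installed.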

@zktuong

zktuong commented Oct 4, 2024

Querying Biomart through scanpy works very well for me:
https://scanpy.readthedocs.io/en/stable/generated/scanpy.queries.biomart_annotations.html

import pandas as pd
import scanpy as sc
from anndata import AnnData


def query_biomart() -> pd.DataFrame:
    """
    Extract human gene annotations from Biomart.

    Returns
    -------
    pd.DataFrame
        DataFrame with gene annotations from Biomart, with the columns
        `gene_ids`, `gene_symbol`, `start`, `end` and `chromosome`.
    """
    # Needs network access and the `pybiomart` package.
    annot = sc.queries.biomart_annotations(
        "hsapiens",
        [
            "ensembl_gene_id",
            "hgnc_symbol",
            "start_position",
            "end_position",
            "chromosome_name",
        ],
        use_cache=True,
    ).rename(
        # Map the Biomart attribute names to shorter column names.
        columns={
            "ensembl_gene_id": "gene_ids",
            "hgnc_symbol": "gene_symbol",
            "start_position": "start",
            "end_position": "end",
            "chromosome_name": "chromosome",
        }
    )
    return annot
    
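def annotate_var(
    adata: AnnData, annotation: pd.DataFrame, index_key: str = "gene_ids"
) -> None:
    """
    Annotate the features within an AnnData object (modifies `adata.var` in place).

    Parameters
    ----------
    adata : AnnData
        Input AnnData object. `adata.var` must contain the column `index_key`.
    annotation : pd.DataFrame
        Gene annotation DataFrame.
    index_key : str, optional
        Column present in both `adata.var` and `annotation`, used to match genes.
    """
    for col in ["start", "end", "chromosome", index_key]:
        assert (
            col in annotation.columns
        ), f"Annotation DataFrame must contain the column named `{col}`."

    # Key the annotation by `index_key` so that lookups match the values in
    # `adata.var[index_key]`; keep the first row when a key is duplicated.
    lookup = annotation.drop_duplicates(subset=index_key).set_index(index_key)
    keys = adata.var[index_key]
    for col in lookup.columns:
        # Genes without a match in the annotation get a missing value.
        adata.var[col] = keys.map(lookup[col])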
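
For anyone who wants to see the matching step without hitting the network, here is a small offline sketch. It uses plain pandas DataFrames as stand-ins for the Biomart result and for `adata.var`; the gene IDs and coordinates are made up for illustration, and keeping the first row per duplicated ID is an assumption about how duplicates should be handled.

```python
import pandas as pd

# Toy stand-in for the Biomart result (made-up IDs and coordinates).
annotation = pd.DataFrame(
    {
        "gene_ids": ["ENSG_A", "ENSG_B", "ENSG_B"],
        "gene_symbol": ["A", "B", "B-alt"],
        "chromosome": ["1", "X", "X"],
        "start": [100, 500, 500],
        "end": [200, 900, 900],
    }
)

# Toy stand-in for `adata.var`; ENSG_C is absent from the annotation.
var = pd.DataFrame(
    {"gene_ids": ["ENSG_B", "ENSG_C", "ENSG_A"]},
    index=["B", "C", "A"],
)

# Key the annotation by gene ID, keeping the first row per duplicated ID.
lookup = annotation.drop_duplicates(subset="gene_ids").set_index("gene_ids")

# Look up each position column by gene ID; unmatched genes become NaN.
for col in ["chromosome", "start", "end"]:
    var[col] = var["gene_ids"].map(lookup[col])

print(var)
```

The lookup has to be keyed by the gene ID column rather than by the DataFrame's default integer index, otherwise no gene will match and every position ends up missing.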

@grst
Member Author

grst commented Oct 4, 2024

very nice 🤩

@grst grst linked a pull request Jan 17, 2025 that will close this issue