docs: add gentropy first steps guide (opentargets#449)
* docs: add tutorials

* docs: docufriday

* docs: minor addition to how-to run class method

* docs: fix typo

* docs: linking available datasets, methods

* docs: add `inspect_dataset` page

* docs: add whatsnext sections

* chore: center project badges

* docs: add key features to index

* test(docs): testing doc snippets proof of concept (opentargets#451)

* build(deps): bump pandas from 2.1.4 to 2.2.0 (opentargets#443)

Bumps [pandas](https://github.com/pandas-dev/pandas) from 2.1.4 to 2.2.0.
- [Release notes](https://github.com/pandas-dev/pandas/releases)
- [Commits](pandas-dev/pandas@v2.1.4...v2.2.0)

---
updated-dependencies:
- dependency-name: pandas
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps): bump scikit-learn from 1.3.2 to 1.4.0 (opentargets#444)

Bumps [scikit-learn](https://github.com/scikit-learn/scikit-learn) from 1.3.2 to 1.4.0.
- [Release notes](https://github.com/scikit-learn/scikit-learn/releases)
- [Commits](scikit-learn/scikit-learn@1.3.2...1.4.0)

---
updated-dependencies:
- dependency-name: scikit-learn
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* test(docs): testing doc snippets proof of concept

* docs: extend and test python snippets to all docs

* fix: correct typing in apply_class_method_clumping

* fix: update imports in doc tests

* chore: harmonizing finngen configuration (opentargets#454)

* chore: harmonizing finngen configuration

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* chore: finalising finngen config update

* fix: reverting .env file

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* chore(deps): bump codecov/codecov-action from 3 to 4 (opentargets#464)

Bumps [codecov/codecov-action](https://github.com/codecov/codecov-action) from 3 to 4.
- [Release notes](https://github.com/codecov/codecov-action/releases)
- [Changelog](https://github.com/codecov/codecov-action/blob/main/CHANGELOG.md)
- [Commits](codecov/codecov-action@v3...v4)

---
updated-dependencies:
- dependency-name: codecov/codecov-action
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* ci: ignore patch versions in dependabot (opentargets#462)

* build(deps-dev): bump ruff from 0.1.8 to 0.2.0 (opentargets#465)

Bumps [ruff](https://github.com/astral-sh/ruff) from 0.1.8 to 0.2.0.
- [Release notes](https://github.com/astral-sh/ruff/releases)
- [Changelog](https://github.com/astral-sh/ruff/blob/main/CHANGELOG.md)
- [Commits](astral-sh/ruff@v0.1.8...v0.2.0)

---
updated-dependencies:
- dependency-name: ruff
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps-dev): bump ipython from 8.20.0 to 8.21.0 (opentargets#466)

Bumps [ipython](https://github.com/ipython/ipython) from 8.20.0 to 8.21.0.
- [Release notes](https://github.com/ipython/ipython/releases)
- [Commits](ipython/ipython@8.20.0...8.21.0)

---
updated-dependencies:
- dependency-name: ipython
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps-dev): bump pytest-sugar from 0.9.7 to 1.0.0 (opentargets#467)

Bumps [pytest-sugar](https://github.com/Teemu/pytest-sugar) from 0.9.7 to 1.0.0.
- [Release notes](https://github.com/Teemu/pytest-sugar/releases)
- [Changelog](https://github.com/Teemu/pytest-sugar/blob/main/CHANGES.rst)
- [Commits](Teemu/pytest-sugar@v0.9.7...v1.0.0)

---
updated-dependencies:
- dependency-name: pytest-sugar
  dependency-type: direct:development
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Irene López <[email protected]>

* build(deps-dev): bump google-cloud-dataproc from 5.8.0 to 5.9.0 (opentargets#468)

Bumps [google-cloud-dataproc](https://github.com/googleapis/google-cloud-python) from 5.8.0 to 5.9.0.
- [Release notes](https://github.com/googleapis/google-cloud-python/releases)
- [Changelog](https://github.com/googleapis/google-cloud-python/blob/main/packages/google-cloud-documentai/CHANGELOG.md)
- [Commits](googleapis/google-cloud-python@google-cloud-dataproc-v5.8.0...google-cloud-dataproc-v5.9.0)

---
updated-dependencies:
- dependency-name: google-cloud-dataproc
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Irene López <[email protected]>

* fix(test): do not create session and pass it as a parameter

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Irene López <[email protected]>
Co-authored-by: Daniel Suveges <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Irene López <[email protected]>

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: David Ochoa <[email protected]>
Co-authored-by: David Ochoa <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Daniel Suveges <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
6 people authored Feb 5, 2024
1 parent 70ae88d commit 40af77f
Showing 22 changed files with 482 additions and 19 deletions.
18 changes: 10 additions & 8 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,15 +1,17 @@
[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
[![PyPI pyversions](https://img.shields.io/pypi/pyversions/gentropy.svg)](https://pypi.python.org/pypi/gentropy/)
[![PyPI version](https://badge.fury.io/py/gentropy.svg)](https://badge.fury.io/py/gentropy)
[![image](https://github.com/opentargets/gentropy/actions/workflows/release.yaml/badge.svg)](https://opentargets.github.io/gentropy/)
[![codecov](https://codecov.io/gh/opentargets/gentropy/branch/main/graph/badge.svg?token=5ixzgu8KFP)](https://codecov.io/gh/opentargets/gentropy)
[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.10527086.svg)](https://doi.org/10.5281/zenodo.10527086)

<p align="center">
<img width=100% height=250px src="https://raw.githubusercontent.com/opentargets/gentropy/dev/docs/assets/imgs/gentropy.svg">
</p>

<p align="center">
<a href="https://github.com/astral-sh/ruff"><img src="https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json" alt="Ruff" /></a>
<a href="https://pypi.python.org/pypi/gentropy/"><img src="https://img.shields.io/pypi/pyversions/gentropy.svg" alt="PyPI pyversions" /></a>
<a href="https://badge.fury.io/py/gentropy"><img src="https://badge.fury.io/py/gentropy.svg" alt="PyPI version" /></a>
<a href="https://opentargets.github.io/gentropy/"><img src="https://github.com/opentargets/gentropy/actions/workflows/release.yaml/badge.svg" alt="image" /></a>
<a href="https://codecov.io/gh/opentargets/gentropy"><img src="https://codecov.io/gh/opentargets/gentropy/branch/main/graph/badge.svg?token=5ixzgu8KFP" alt="codecov" /></a>
<a href="https://opensource.org/licenses/Apache-2.0"><img src="https://img.shields.io/badge/License-Apache_2.0-blue.svg" alt="License" /></a>
<a href="https://doi.org/10.5281/zenodo.10527086"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.10527086.svg" alt="DOI" /></a>
</p>

Open Targets Gentropy is a Python package to facilitate the interpretation and analysis of GWAS and functional genomic studies for target identification. The package contains a toolkit for the harmonisation, statistical analysis and prioritisation of genetic signals to assist drug discovery.

## Installation
1 change: 1 addition & 0 deletions docs/__init__.py
@@ -0,0 +1 @@
"""Docs package."""
3 changes: 3 additions & 0 deletions docs/howto/_howto.md
@@ -2,4 +2,7 @@

This page contains a collection of how-to guides for the project.

- [**Command line interface**](command_line/_command_line.md): Learn how to use the Gentropy CLI.
- [**Python API**](python_api/_python_api.md): Learn how to use the Gentropy Python package.

For additional information please visit [https://community.opentargets.org/](https://community.opentargets.org/)
7 changes: 7 additions & 0 deletions docs/howto/command_line/_command_line.md
@@ -0,0 +1,7 @@
---
title: Command line interface
---

# Command line interface

Gentropy steps can be run using the command line interface (CLI). This section contains a collection of how-to guides for the CLI.
@@ -41,4 +41,4 @@ In most occasions, some mandatory values will be required to run the step. For
gentropy step=gene_index step.target_path=/path/to/target step.gene_index_path=/path/to/gene_index
```

You can find more about the available steps in the [documentation](../python_api/steps/_steps.md).
You can find more about the available steps in the [documentation](../../python_api/steps/_steps.md).
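
When the same step is launched from a script, the `step=<name>` and `step.<parameter>=<value>` overrides shown in the command above can be assembled programmatically. The following is a minimal sketch, not part of Gentropy: the helper `build_gentropy_command` is hypothetical, and it only builds the argument list without running anything.

```python
import shlex


def build_gentropy_command(step: str, **step_params: str) -> list[str]:
    """Assemble an argv list for the `gentropy` CLI (hypothetical helper)."""
    argv = ["gentropy", f"step={step}"]
    # Each keyword becomes a `step.<name>=<value>` override, in the order given.
    argv += [f"step.{name}={value}" for name, value in step_params.items()]
    return argv


command = build_gentropy_command(
    "gene_index",
    target_path="/path/to/target",
    gene_index_path="/path/to/gene_index",
)

# Render the list as a single shell-safe string for logging or copy-pasting.
print(shlex.join(command))
```

The resulting list can be handed to `subprocess.run(command, check=True)`; passing a list rather than a single string avoids shell-quoting problems with paths that contain spaces.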
File renamed without changes.
5 changes: 5 additions & 0 deletions docs/howto/python_api/_python_api.md
@@ -0,0 +1,5 @@
---
title: Python API
---

This section explains how to use Gentropy in a Python environment and provides a foundational understanding of how to perform genetics analyses with the package. It is aimed at users who wish to use Gentropy in their own projects.
33 changes: 33 additions & 0 deletions docs/howto/python_api/a_creating_spark_session.md
@@ -0,0 +1,33 @@
---
title: Creating a Spark Session
---

In this section, we'll guide you through creating a Spark session using Gentropy's Session class. Gentropy uses _PySpark_, the Python API for Apache Spark, as the underlying framework for distributed computing. The Session class provides a convenient way to initialize a Spark session with pre-configured settings.

## Creating a Default Session

To begin your journey with Gentropy, start by creating a default Spark session. This is the simplest way to initialize your environment.

```python
--8<-- "src_snippets/howto/python_api/a_creating_spark_session.py:default_session"
```

The above code snippet sets up a default Spark session with pre-configured settings. This is ideal for getting started quickly without needing to tweak any configurations.

## Customizing Your Spark Session

Gentropy allows you to customize the Spark session to suit your specific needs. You can modify various parameters such as memory allocation, number of executors, and more. This flexibility is particularly useful for optimizing performance in steps that are more computationally intensive.

### Example: Increasing Driver Memory

If you require more memory for the Spark driver, you can easily adjust this setting:

```python
--8<-- "src_snippets/howto/python_api/a_creating_spark_session.py:custom_session"
```

This code snippet demonstrates how to set the memory allocated to the Spark driver to 4 gigabytes (`"spark.driver.memory": "4g"`). You can customize other Spark settings similarly, according to your project's requirements.
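
To picture how a dictionary of overrides such as `extended_spark_conf` can sit on top of pre-configured settings, here is a minimal sketch in plain Python. It is an assumption about the general pattern, not a description of how `Session` works internally: `DEFAULT_SPARK_CONF`, its values, and `merge_spark_conf` are hypothetical.

```python
from typing import Dict, Optional

# Hypothetical defaults, chosen only for this illustration.
DEFAULT_SPARK_CONF: Dict[str, str] = {
    "spark.driver.memory": "2g",
    "spark.sql.shuffle.partitions": "200",
}


def merge_spark_conf(
    defaults: Dict[str, str], overrides: Optional[Dict[str, str]] = None
) -> Dict[str, str]:
    """Return a new mapping in which user overrides win over the defaults."""
    merged = dict(defaults)  # copy so the defaults are never mutated
    merged.update(overrides or {})
    return merged


conf = merge_spark_conf(DEFAULT_SPARK_CONF, {"spark.driver.memory": "4g"})
print(conf)
```

Only the keys you pass are replaced; every other setting keeps its default value.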

## What's next?

Now that you've created a Spark session, you're ready to start using Gentropy. In the next section, we'll show you how to process a large dataset using Gentropy's powerful _SummaryStatistics_ datatype.
61 changes: 61 additions & 0 deletions docs/howto/python_api/b_create_dataset.md
@@ -0,0 +1,61 @@
---
title: Create a dataset
---

Gentropy provides a collection of `Dataset`s that encapsulate key concepts in the field of genetics. For example, to represent summary statistics, you'll use the [`SummaryStatistics`](../../python_api/datasets/summary_statistics.md) class. This datatype comes with a set of useful operations to disentangle the genetic architecture of a trait or disease.

The full list of `Dataset`s is available in the Python API [documentation](../../python_api/datasets/_datasets.md).

!!! info "Any instance of Dataset will have 2 common attributes"

- **df**: the Spark DataFrame that contains the data
- **schema**: the definition of the data structure in Spark format
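
To make the pairing of data and schema concrete, here is a toy sketch in plain Python. `ToyDataset`, its `validate` method, and the column names are hypothetical and only illustrate the idea; a Gentropy `Dataset` holds a Spark DataFrame and a Spark schema instead.

```python
from dataclasses import dataclass


@dataclass
class ToyDataset:
    """Toy stand-in for a dataset: rows plus the schema they must satisfy."""

    df: list[dict]  # stand-in for a Spark DataFrame
    schema: dict[str, type]  # column name -> expected Python type

    def validate(self) -> None:
        """Raise ValueError if a row has the wrong columns or value types."""
        for index, row in enumerate(self.df):
            if set(row) != set(self.schema):
                raise ValueError(
                    f"row {index}: columns {sorted(row)} "
                    f"do not match schema {sorted(self.schema)}"
                )
            for column, expected in self.schema.items():
                if not isinstance(row[column], expected):
                    raise ValueError(
                        f"row {index}: column {column!r} is not {expected.__name__}"
                    )


toy_schema = {"variant": str, "p_value": float}
toy = ToyDataset(df=[{"variant": "18_100_A_G", "p_value": 1e-9}], schema=toy_schema)
toy.validate()  # passes silently because the row matches the schema
```

Keeping the schema next to the data means that a mismatch is caught when the dataset is built rather than later in an analysis.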

In this section you'll learn the different ways to create a `Dataset` instance.

## Creating a dataset from parquet

All the `Dataset`s have a `from_parquet` method that allows you to create any `Dataset` instance from a parquet file or directory.

```python
--8<-- "src_snippets/howto/python_api/b_create_dataset.py:create_from_parquet_import"
path = "path/to/summary/stats"
--8<-- "src_snippets/howto/python_api/b_create_dataset.py:create_from_parquet"
```

!!! info "Parquet files"

Parquet is a columnar storage format that is widely used in the Spark ecosystem. It is the recommended format for storing large datasets. For more information about parquet, please visit [https://parquet.apache.org/](https://parquet.apache.org/).

## Creating a dataset from a data source

Alternatively, `Dataset`s can be created using a [data source](../../python_api/datasources/_datasources.md) harmonisation method. For example, to create a `SummaryStatistics` object from FinnGen's raw summary statistics, you can use the [`FinnGen`](../../python_api/datasources/finngen/summary_stats.md) data source.

```python
--8<-- "src_snippets/howto/python_api/b_create_dataset.py:create_from_source_import"
path = "path/to/finngen/summary/stats"
--8<-- "src_snippets/howto/python_api/b_create_dataset.py:create_from_source"
```

## Creating a dataset from a pandas DataFrame

If none of our data sources fit your needs, you can create a `Dataset` object from your own data. To do so, you need to transform your data to fit the `Dataset` schema.

!!! info "The schema of a Dataset is defined in Spark format"

The Dataset schemas can be found in the documentation of each Dataset. For example, the schema of the `SummaryStatistics` dataset can be found [here](../../python_api/datasets/summary_statistics.md).

You can also create a `Dataset` from a pandas DataFrame. This is useful when you want to create a `Dataset` from a small dataset that fits in memory.

```python
--8<-- "src_snippets/howto/python_api/b_create_dataset.py:create_from_pandas_import"

# Load your transformed data into a pandas-on-Spark DataFrame
path = "path/to/your/data"
custom_summary_stats_pandas_df = ps.read_csv(path)
--8<-- "src_snippets/howto/python_api/b_create_dataset.py:create_from_pandas"
```

## What's next?

In the next section, we will explore how to apply well-established algorithms that transform and analyse genetic data within the Gentropy framework.
36 changes: 36 additions & 0 deletions docs/howto/python_api/c_applying_methods.md
@@ -0,0 +1,36 @@
---
title: Applying methods
---

The available methods implement well-established algorithms that transform and analyse data. Methods usually take one or more predefined `Dataset`(s) as input and produce one or several `Dataset`(s) as output. This section explains how to apply methods to your data.

The full list of available methods can be found in the Python API [documentation](../../python_api/methods/_methods.md).

## Apply a class method

Some methods are implemented as class methods. For example, the `finemap` method is a class method of the [`PICS`](../../python_api/methods/pics.md) class. This method performs fine-mapping using the PICS algorithm. These methods usually take as input one or several `Dataset`(s) and produce one or several `Dataset`(s) as output.

```python
--8<-- "src_snippets/howto/python_api/c_applying_methods.py:apply_class_method_pics"
```

## Apply a `Dataset` instance method

Some methods are implemented as instance methods of the `Dataset` class. For example, the `window_based_clumping` method is an instance method of the `SummaryStatistics` class. This method performs window-based clumping on summary statistics.

```python
--8<-- "src_snippets/howto/python_api/c_applying_methods.py:apply_instance_method"
```

!!! info "The `window_based_clumping` method is also available as a class method"

The `window_based_clumping` method is also available as a class method of the `WindowBasedClumping` class. This method performs window-based clumping on summary statistics.

```python
# Perform window-based clumping on summary statistics
--8<-- "src_snippets/howto/python_api/c_applying_methods.py:apply_class_method_clumping"
```
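
For readers unfamiliar with window-based clumping, the general idea can be sketched in a few lines of plain Python: visit variants from most to least significant and keep a variant only if no already-kept lead lies within a fixed distance on the same chromosome. This is a simplified illustration of the technique, not Gentropy's implementation, which may differ (for example in thresholds and tie handling). `Variant` and `clump_by_window` are hypothetical names.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chromosome: str
    position: int
    p_value: float


def clump_by_window(variants: list[Variant], window: int) -> list[Variant]:
    """Greedy distance-based clumping: keep one lead variant per window."""
    leads: list[Variant] = []
    # Visit variants from most to least significant.
    for variant in sorted(variants, key=lambda v: v.p_value):
        near_existing_lead = any(
            lead.chromosome == variant.chromosome
            and abs(lead.position - variant.position) <= window
            for lead in leads
        )
        if not near_existing_lead:
            leads.append(variant)
    return leads


variants = [
    Variant("18", 1_000_000, 1e-12),
    Variant("18", 1_200_000, 1e-9),  # within 500 kb of the first: clumped away
    Variant("18", 3_000_000, 1e-10),  # far from the first: kept
    Variant("19", 1_000_000, 1e-8),  # other chromosome: kept
]
leads = clump_by_window(variants, window=500_000)
for lead in leads:
    print(lead)
```

The quadratic scan over `leads` is fine for a toy example; on genome-wide summary statistics the same logic is expressed with distributed window operations instead.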

## What's next?

Up next, we'll show you how to inspect your data to ensure its integrity and the success of your transformations.
37 changes: 37 additions & 0 deletions docs/howto/python_api/d_inspect_dataset.md
@@ -0,0 +1,37 @@
---
title: Inspect a dataset
---

We have seen how to create and transform a `Dataset` instance. This section guides you through inspecting your data to ensure its integrity and the success of your transformations.

## Inspect data in a `Dataset`

The `df` attribute of a Dataset instance is key to interacting with and inspecting the stored data.

!!! info "By accessing the df attribute, you can apply any method that you would typically use on a PySpark DataFrame. See the [PySpark documentation](https://spark.apache.org/docs/3.1.1/api/python/reference/pyspark.sql.html#dataframe-apis) for more information."

### View data samples

```python
--8<-- "src_snippets/howto/python_api/d_inspect_dataset.py:print_dataframe"
```

This method displays the first 10 rows of your dataset, giving you a snapshot of your data's structure and content.

### Understand the schema

```python
--8<-- "src_snippets/howto/python_api/d_inspect_dataset.py:get_dataset_schema"

--8<-- "src_snippets/howto/python_api/d_inspect_dataset.py:print_dataframe"
```

## Write a `Dataset` to disk

```python
--8<-- "src_snippets/howto/python_api/d_inspect_dataset.py:write_parquet"

--8<-- "src_snippets/howto/python_api/d_inspect_dataset.py:write_csv"
```

Consider the format's compatibility with your tools, and the partitioning strategy for large datasets to optimize performance.
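
As a rough illustration of what partitioning means on disk, the sketch below writes rows into one directory per value of a column, mimicking the `column=value` directory layout that Spark produces when writing with `partitionBy`. It uses only the standard library; `write_partitioned_csv` and the sample columns are hypothetical and not part of Gentropy.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path


def write_partitioned_csv(
    rows: list[dict], partition_column: str, out_dir: Path
) -> list[Path]:
    """Write rows to `<out_dir>/<column>=<value>/part-0.csv`, one directory per value."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for row in rows:
        groups[str(row[partition_column])].append(row)

    written = []
    for value, group in sorted(groups.items()):
        partition_dir = out_dir / f"{partition_column}={value}"
        partition_dir.mkdir(parents=True, exist_ok=True)
        # The partition value lives in the directory name, so drop it from the file.
        fieldnames = [name for name in group[0] if name != partition_column]
        target = partition_dir / "part-0.csv"
        with target.open("w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames, extrasaction="ignore")
            writer.writeheader()
            writer.writerows(group)
        written.append(target)
    return written


rows = [
    {"chromosome": "1", "position": 100, "p_value": 1e-8},
    {"chromosome": "2", "position": 200, "p_value": 1e-9},
    {"chromosome": "1", "position": 300, "p_value": 1e-7},
]

with tempfile.TemporaryDirectory() as tmp:
    paths = write_partitioned_csv(rows, "chromosome", Path(tmp))
    relative = [path.relative_to(tmp).as_posix() for path in paths]
    first_header = paths[0].read_text().splitlines()[0]

print(relative)
```

A reader that only needs one chromosome can then open a single directory instead of scanning the whole dataset, which is the main benefit of choosing a partition column that matches common filters.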
29 changes: 20 additions & 9 deletions docs/index.md
@@ -7,28 +7,39 @@ hide:

</br>

<img width="800" height="300" src="assets/imgs/gentropy.svg">
<div style="text-align: center;">
<img width="800" height="300" src="assets/imgs/gentropy.svg">
</div>

<style>
.md-typeset h1,
.md-content__button {
display: none;
}
</style>

<br>
</br>

[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
[![PyPI pyversions](https://img.shields.io/pypi/pyversions/gentropy.svg)](https://pypi.python.org/pypi/gentropy/)
[![PyPI version](https://badge.fury.io/py/gentropy.svg)](https://badge.fury.io/py/gentropy)
[![image](https://github.com/opentargets/gentropy/actions/workflows/release.yaml/badge.svg)](https://opentargets.github.io/gentropy/)
[![codecov](https://codecov.io/gh/opentargets/gentropy/branch/main/graph/badge.svg?token=5ixzgu8KFP)](https://codecov.io/gh/opentargets/gentropy)
[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.10527086.svg)](https://doi.org/10.5281/zenodo.10527086)

<p align="center">
<a href="https://github.com/astral-sh/ruff"><img src="https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json" alt="Ruff" /></a>
<a href="https://pypi.python.org/pypi/gentropy/"><img src="https://img.shields.io/pypi/pyversions/gentropy.svg" alt="PyPI pyversions" /></a>
<a href="https://badge.fury.io/py/gentropy"><img src="https://badge.fury.io/py/gentropy.svg" alt="PyPI version" /></a>
<a href="https://opentargets.github.io/gentropy/"><img src="https://github.com/opentargets/gentropy/actions/workflows/release.yaml/badge.svg" alt="image" /></a>
<a href="https://codecov.io/gh/opentargets/gentropy"><img src="https://codecov.io/gh/opentargets/gentropy/branch/main/graph/badge.svg?token=5ixzgu8KFP" alt="codecov" /></a>
<a href="https://opensource.org/licenses/Apache-2.0"><img src="https://img.shields.io/badge/License-Apache_2.0-blue.svg" alt="License" /></a>
<a href="https://doi.org/10.5281/zenodo.10527086"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.10527086.svg" alt="DOI" /></a>
</p>
---

Open Targets Gentropy is a Python package to facilitate the interpretation and analysis of GWAS and functional genomic studies for target identification. This package contains a toolkit for the harmonisation, statistical analysis and prioritisation of genetic signals to assist drug discovery.

#### Key Features:

- **Specialized Datatypes**: Introduces essential genetics datatypes like _StudyLocus_, _LocusToGene_, and _SummaryStatistics_.
- **Performance-Oriented**: Optimized for large-scale genetic data analysis, including locus-to-gene scoring, fine mapping, and colocalization analysis.
- **User-Friendly**: The package is designed to be intuitive, allowing both beginners and experienced researchers to conduct complex genetic analyses with ease.

## About Open Targets

Open Targets is a pre-competitive, public-private partnership that uses human genetics and genomics data to systematically identify and prioritise drug targets. Through large-scale genomic experiments and the development of innovative computational techniques, the partnership aims to help researchers select the best targets for the development of new therapies. For more information, visit the Open Targets [website](https://www.opentargets.org).
32 changes: 32 additions & 0 deletions docs/src_snippets/howto/python_api/a_creating_spark_session.py
@@ -0,0 +1,32 @@
"""Docs to create a default Spark Session."""
from gentropy.common.session import Session


def default_session() -> Session:
    """Create a default Spark Session.
    Returns:
        Session: Spark Session.
    """
    # --8<-- [start:default_session]
    from gentropy.common.session import Session

    # Create a default Spark Session
    session = Session()
    # --8<-- [end:default_session]
    return session


def custom_session() -> Session:
    """Create a custom Spark Session.
    Returns:
        Session: Spark Session.
    """
    # --8<-- [start:custom_session]
    from gentropy.common.session import Session

    # Create a Spark session with increased driver memory
    session = Session(extended_spark_conf={"spark.driver.memory": "4g"})
    # --8<-- [end:custom_session]
    return session
59 changes: 59 additions & 0 deletions docs/src_snippets/howto/python_api/b_create_dataset.py
@@ -0,0 +1,59 @@
"""Docs to create a dataset."""
from __future__ import annotations

from typing import TYPE_CHECKING

from gentropy.common.session import Session

if TYPE_CHECKING:
    from gentropy.dataset.summary_statistics import SummaryStatistics


def create_from_parquet(session: Session) -> SummaryStatistics:
    """Create a dataset from a path with parquet files."""
    # --8<-- [start:create_from_parquet_import]
    # Create a SummaryStatistics object by loading data from the specified path
    from gentropy.dataset.summary_statistics import SummaryStatistics

    # --8<-- [end:create_from_parquet_import]

    path = "tests/data_samples/sumstats_sample/GCST005523_chr18.parquet"
    # --8<-- [start:create_from_parquet]
    summary_stats = SummaryStatistics.from_parquet(session, path)
    # --8<-- [end:create_from_parquet]
    return summary_stats


def create_from_source(session: Session) -> SummaryStatistics:
    """Create a dataset from FinnGen raw summary statistics."""
    # --8<-- [start:create_from_source_import]
    # Create a SummaryStatistics object by loading raw data from FinnGen
    from gentropy.datasource.finngen.summary_stats import FinnGenSummaryStats

    # --8<-- [end:create_from_source_import]
    path = "tests/data_samples/finngen_R9_AB1_ACTINOMYCOSIS.gz"
    # --8<-- [start:create_from_source]
    summary_stats = FinnGenSummaryStats.from_source(session.spark, path)
    # --8<-- [end:create_from_source]
    return summary_stats


def create_from_pandas() -> SummaryStatistics:
    """Create a dataset from a pandas-on-Spark DataFrame."""
    # --8<-- [start:create_from_pandas_import]
    import pyspark.pandas as ps
    from gentropy.dataset.summary_statistics import SummaryStatistics

    # --8<-- [end:create_from_pandas_import]

    path = "tests/data_samples/sumstats_sample/GCST005523_chr18.parquet"
    custom_summary_stats_pandas_df = ps.read_parquet(path)
    # --8<-- [start:create_from_pandas]

    # Create a SummaryStatistics object specifying the data and schema
    custom_summary_stats_df = custom_summary_stats_pandas_df.to_spark()
    custom_summary_stats = SummaryStatistics(
        _df=custom_summary_stats_df, _schema=SummaryStatistics.get_schema()
    )
    # --8<-- [end:create_from_pandas]
    return custom_summary_stats