docs: add gentropy first steps guide (opentargets#449)

* docs: add tutorials
* docs: docufriday
* docs: minor addition to how-to run class method
* docs: fix typo
* docs: linking available datasets, methods
* docs: add `inspect_dataset` page
* docs: add whatsnext sections
* chore: center project badges
* docs: add key features to index
* test(docs): testing doc snippets proof of concept (opentargets#451)
* build(deps): bump pandas from 2.1.4 to 2.2.0 (opentargets#443)

Bumps [pandas](https://github.com/pandas-dev/pandas) from 2.1.4 to 2.2.0.
- [Release notes](https://github.com/pandas-dev/pandas/releases)
- [Commits](pandas-dev/pandas@v2.1.4...v2.2.0)

---
updated-dependencies:
- dependency-name: pandas
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps): bump scikit-learn from 1.3.2 to 1.4.0 (opentargets#444)

Bumps [scikit-learn](https://github.com/scikit-learn/scikit-learn) from 1.3.2 to 1.4.0.
- [Release notes](https://github.com/scikit-learn/scikit-learn/releases)
- [Commits](scikit-learn/scikit-learn@1.3.2...1.4.0)

---
updated-dependencies:
- dependency-name: scikit-learn
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* test(docs): testing doc snippets proof of concept
* docs: extend and test python snippets to all docs
* fix: correct typing in apply_class_method_clumping
* fix: update imports in doc tests
* chore: harmonizing finngen configuration (opentargets#454)
* chore: harmonizing finngen configuration
* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* chore: finalising finngen config update
* fix: reverting .env file

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* chore(deps): bump codecov/codecov-action from 3 to 4 (opentargets#464)

Bumps [codecov/codecov-action](https://github.com/codecov/codecov-action) from 3 to 4.
- [Release notes](https://github.com/codecov/codecov-action/releases)
- [Changelog](https://github.com/codecov/codecov-action/blob/main/CHANGELOG.md)
- [Commits](codecov/codecov-action@v3...v4)

---
updated-dependencies:
- dependency-name: codecov/codecov-action
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* ci: ignore patch versions in dependabot (opentargets#462)
* build(deps-dev): bump ruff from 0.1.8 to 0.2.0 (opentargets#465)

Bumps [ruff](https://github.com/astral-sh/ruff) from 0.1.8 to 0.2.0.
- [Release notes](https://github.com/astral-sh/ruff/releases)
- [Changelog](https://github.com/astral-sh/ruff/blob/main/CHANGELOG.md)
- [Commits](astral-sh/ruff@v0.1.8...v0.2.0)

---
updated-dependencies:
- dependency-name: ruff
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps-dev): bump ipython from 8.20.0 to 8.21.0 (opentargets#466)

Bumps [ipython](https://github.com/ipython/ipython) from 8.20.0 to 8.21.0.
- [Release notes](https://github.com/ipython/ipython/releases)
- [Commits](ipython/ipython@8.20.0...8.21.0)

---
updated-dependencies:
- dependency-name: ipython
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps-dev): bump pytest-sugar from 0.9.7 to 1.0.0 (opentargets#467)

Bumps [pytest-sugar](https://github.com/Teemu/pytest-sugar) from 0.9.7 to 1.0.0.
- [Release notes](https://github.com/Teemu/pytest-sugar/releases)
- [Changelog](https://github.com/Teemu/pytest-sugar/blob/main/CHANGES.rst)
- [Commits](Teemu/pytest-sugar@v0.9.7...v1.0.0)

---
updated-dependencies:
- dependency-name: pytest-sugar
  dependency-type: direct:development
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Irene López <[email protected]>

* build(deps-dev): bump google-cloud-dataproc from 5.8.0 to 5.9.0 (opentargets#468)

Bumps [google-cloud-dataproc](https://github.com/googleapis/google-cloud-python) from 5.8.0 to 5.9.0.
- [Release notes](https://github.com/googleapis/google-cloud-python/releases)
- [Changelog](https://github.com/googleapis/google-cloud-python/blob/main/packages/google-cloud-documentai/CHANGELOG.md)
- [Commits](googleapis/google-cloud-python@google-cloud-dataproc-v5.8.0...google-cloud-dataproc-v5.9.0)

---
updated-dependencies:
- dependency-name: google-cloud-dataproc
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Irene López <[email protected]>

* fix(test): do not create session and pass it as a parameter

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Irene López <[email protected]>
Co-authored-by: Daniel Suveges <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Irene López <[email protected]>

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: David Ochoa <[email protected]>
Co-authored-by: David Ochoa <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Daniel Suveges <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
thehyve · Feb 5, 2024 · 40af77f
1 parent 70ae88d
commit 40af77f
Showing 22 changed files with 482 additions and 19 deletions.
diff --git a/README.md b/README.md
@@ -1,15 +1,17 @@
-[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
-[![PyPI pyversions](https://img.shields.io/pypi/pyversions/gentropy.svg)](https://pypi.python.org/pypi/gentropy/)
-[![PyPI version](https://badge.fury.io/py/gentropy.svg)](https://badge.fury.io/py/gentropy)
-[![image](https://github.com/opentargets/gentropy/actions/workflows/release.yaml/badge.svg)](https://opentargets.github.io/gentropy/)
-[![codecov](https://codecov.io/gh/opentargets/gentropy/branch/main/graph/badge.svg?token=5ixzgu8KFP)](https://codecov.io/gh/opentargets/gentropy)
-[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
-[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.10527086.svg)](https://doi.org/10.5281/zenodo.10527086)
-
 <p align="center">
   <img width=100% height=250px src="https://raw.githubusercontent.com/opentargets/gentropy/dev/docs/assets/imgs/gentropy.svg">
 </p>
 
+<p align="center">
+<a href="https://github.com/astral-sh/ruff"><img src="https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json" alt="Ruff" /></a>
+<a href="https://pypi.python.org/pypi/gentropy/"><img src="https://img.shields.io/pypi/pyversions/gentropy.svg" alt="PyPI pyversions" /></a>
+<a href="https://badge.fury.io/py/gentropy"><img src="https://badge.fury.io/py/gentropy.svg" alt="PyPI version" /></a>
+<a href="https://opentargets.github.io/gentropy/"><img src="https://github.com/opentargets/gentropy/actions/workflows/release.yaml/badge.svg" alt="image" /></a>
+<a href="https://codecov.io/gh/opentargets/gentropy"><img src="https://codecov.io/gh/opentargets/gentropy/branch/main/graph/badge.svg?token=5ixzgu8KFP" alt="codecov" /></a>
+<a href="https://opensource.org/licenses/Apache-2.0"><img src="https://img.shields.io/badge/License-Apache_2.0-blue.svg" alt="License" /></a>
+<a href="https://doi.org/10.5281/zenodo.10527086"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.10527086.svg" alt="DOI" /></a>
+</p>
+
 Open Targets Gentropy is a Python package to facilitate the interpretation and analysis of GWAS and functional genomic studies for target identification. The package contains a toolkit for the harmonisation, statistical analysis and prioritisation of genetic signals to assist drug discovery.
 
 ## Installation

diff --git a/docs/__init__.py b/docs/__init__.py
@@ -0,0 +1 @@
+"""Docs package."""
diff --git a/docs/howto/_howto.md b/docs/howto/_howto.md
@@ -2,4 +2,7 @@
 
 This page contains a collection of how-to guides for the project.
 
+- [**Command line interface**](command_line/_command_line.md): Learn how to use the Gentropy CLI.
+- [**Python API**](python_api/_python_api.md): Learn how to use the Gentropy Python package.
+
 For additional information please visit [https://community.opentargets.org/](https://community.opentargets.org/)
diff --git a/docs/howto/command_line/_command_line.md b/docs/howto/command_line/_command_line.md
@@ -0,0 +1,7 @@
+---
+title: Command line interface
+---
+
+# Command line interface
+
+Gentropy steps can be run using the command line interface (CLI). This section contains a collection of how-to guides for the CLI.
diff --git a/docs/howto/run_step_in_cli.md b/docs/howto/command_line/run_step_in_cli.md
@@ -41,4 +41,4 @@ In most occassions, some mandatory values will be required to run the step. For
 gentropy step=gene_index step.target_path=/path/to/target step.gene_index_path=/path/to/gene_index
 ```
 
-You can find more about the available steps in the [documentation](../python_api/steps/_steps.md).
+You can find more about the available steps in the [documentation](../../python_api/steps/_steps.md).
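The command in this hunk passes dotted `key=value` overrides such as `step.target_path=/path/to/target`. As a rough illustration of how overrides of that shape can map onto a nested configuration, here is a minimal, self-contained sketch. `parse_overrides` and the `_name` key are hypothetical, written only for this example; they are not part of Gentropy and do not reproduce how its CLI actually parses arguments.

```python
from typing import Any


def parse_overrides(args: list[str]) -> dict[str, Any]:
    """Turn dotted key=value strings into a nested dictionary (illustrative only)."""
    config: dict[str, Any] = {}
    for arg in args:
        key, sep, value = arg.partition("=")
        if not sep or not key:
            raise ValueError(f"Expected key=value, got: {arg!r}")
        *parents, leaf = key.split(".")
        node = config
        for name in parents:
            # Promote a plain value (e.g. step=gene_index) to a mapping that
            # keeps the original value under "_name", so nested keys can be added.
            if not isinstance(node.get(name), dict):
                node[name] = {"_name": node[name]} if name in node else {}
            node = node[name]
        if isinstance(node.get(leaf), dict):
            # The nested keys arrived first; record the plain value alongside them.
            node[leaf]["_name"] = value
        else:
            node[leaf] = value
    return config


overrides = parse_overrides(
    [
        "step=gene_index",
        "step.target_path=/path/to/target",
        "step.gene_index_path=/path/to/gene_index",
    ]
)
```

Here `overrides["step"]` ends up holding the step name together with its two path parameters, which mirrors how the mandatory values are grouped under the selected step on the command line.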
diff --git a/docs/howto/run_step_using_config.md b/...wto/command_line/run_step_using_config.md
diff --git a/docs/howto/python_api/_python_api.md b/docs/howto/python_api/_python_api.md
@@ -0,0 +1,5 @@
+---
+title: Python API
+---
+
+This section explains how to use Gentropy in a Python environment and provides a foundational understanding of how to perform genetics analyses with the package. It is aimed at users who want to use Gentropy in their own projects.
diff --git a/docs/howto/python_api/a_creating_spark_session.md b/docs/howto/python_api/a_creating_spark_session.md
@@ -0,0 +1,33 @@
+---
+title: Creating a Spark Session
+---
+
+In this section, we'll guide you through creating a Spark session using Gentropy's `Session` class. Gentropy uses _PySpark_, the Python API for Apache Spark, as the underlying framework for distributed computing. The `Session` class provides a convenient way to initialize a Spark session with pre-configured settings.
+
+## Creating a Default Session
+
+To begin your journey with Gentropy, start by creating a default Spark session. This is the simplest way to initialize your environment.
+
+```python
+--8<-- "src_snippets/howto/python_api/a_creating_spark_session.py:default_session"
+```
+
+The above code snippet sets up a default Spark session with pre-configured settings. This is ideal for getting started quickly without needing to tweak any configurations.
+
+## Customizing Your Spark Session
+
+Gentropy allows you to customize the Spark session to suit your specific needs. You can modify various parameters such as memory allocation, number of executors, and more. This flexibility is particularly useful for optimizing performance in steps that are more computationally intensive.
+
+### Example: Increasing Driver Memory
+
+If you require more memory for the Spark driver, you can easily adjust this setting:
+
+```python
+--8<-- "src_snippets/howto/python_api/a_creating_spark_session.py:custom_session"
+```
+
+This code snippet demonstrates how to set the memory allocated to the Spark driver to 4 gigabytes through the `extended_spark_conf` argument. You can customize other Spark settings similarly, according to your project's requirements.
+
+## What's next?
+
+Now that you've created a Spark session, you're ready to start using Gentropy. In the next section, we'll show you how to create a dataset, using Gentropy's _SummaryStatistics_ datatype as an example.
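The custom session in this guide is configured with a plain dictionary of Spark settings. A minimal sketch of building such a dictionary with a basic sanity check on the memory string might look like the following. `driver_memory_conf` is a hypothetical helper, not a Gentropy function, and the pattern is a deliberately conservative assumption about JVM-style size strings (Spark itself accepts more spellings).

```python
import re

# Conservative check for JVM-style size strings such as "512m" or "4g".
_SIZE_PATTERN = re.compile(r"[1-9]\d*[kmgt]")


def driver_memory_conf(memory: str) -> dict[str, str]:
    """Build a Spark configuration dictionary that sets the driver memory."""
    normalised = memory.strip().lower()
    if not _SIZE_PATTERN.fullmatch(normalised):
        raise ValueError(f"Unrecognised memory size: {memory!r}")
    return {"spark.driver.memory": normalised}


spark_conf = driver_memory_conf("4g")
# The dictionary can then be handed to the session, as in the custom_session snippet:
# session = Session(extended_spark_conf=spark_conf)
```

Validating the value before the session starts keeps a typo such as `"4 gigs"` from surfacing later as a less obvious Spark start-up error.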
diff --git a/docs/howto/python_api/b_create_dataset.md b/docs/howto/python_api/b_create_dataset.md
@@ -0,0 +1,61 @@
+---
+title: Create a dataset
+---
+
+Gentropy provides a collection of `Dataset`s that encapsulate key concepts in the field of genetics. For example, to represent summary statistics, you'll use the [`SummaryStatistics`](../../python_api/datasets/summary_statistics.md) class. This datatype comes with a set of useful operations to disentangle the genetic architecture of a trait or disease.
+
+The full list of `Dataset`s is available in the Python API [documentation](../../python_api/datasets/_datasets.md).
+
+!!! info "Any instance of Dataset will have 2 common attributes"
+
+    - **df**: the Spark DataFrame that contains the data
+    - **schema**: the definition of the data structure in Spark format
+
+In this section you'll learn the different ways to create a `Dataset` instance.
+
+## Creating a dataset from parquet
+
+All the `Dataset`s have a `from_parquet` method that allows you to create any `Dataset` instance from a parquet file or directory.
+
+```python
+--8<-- "src_snippets/howto/python_api/b_create_dataset.py:create_from_parquet_import"
+path = "path/to/summary/stats"
+--8<-- "src_snippets/howto/python_api/b_create_dataset.py:create_from_parquet"
+```
+
+!!! info "Parquet files"
+
+    Parquet is a columnar storage format that is widely used in the Spark ecosystem. It is the recommended format for storing large datasets. For more information about parquet, please visit [https://parquet.apache.org/](https://parquet.apache.org/).
+
+## Creating a dataset from a data source
+
+Alternatively, `Dataset`s can be created using a [data source](../../python_api/datasources/_datasources.md) harmonisation method. For example, to create a `SummaryStatistics` object from FinnGen's raw summary statistics, you can use the [`FinnGenSummaryStats`](../../python_api/datasources/finngen/summary_stats.md) data source.
+
+```python
+--8<-- "src_snippets/howto/python_api/b_create_dataset.py:create_from_source_import"
+path = "path/to/finngen/summary/stats"
+--8<-- "src_snippets/howto/python_api/b_create_dataset.py:create_from_source"
+```
+
+## Creating a dataset from a pandas DataFrame
+
+If none of our data sources fit your needs, you can create a `Dataset` object from your own data. To do so, you need to transform your data to fit the `Dataset` schema.
+
+!!! info "The schema of a Dataset is defined in Spark format"
+
+    The Dataset schemas can be found in the documentation of each Dataset. For example, the schema of the `SummaryStatistics` dataset can be found [here](../../python_api/datasets/summary_statistics.md).
+
+A `Dataset` can be built from a pandas-on-Spark DataFrame (`pyspark.pandas`), which is converted to a Spark DataFrame with `to_spark()`. This is useful when you want to create a `Dataset` from a small dataset that fits in memory.
+
+```python
+--8<-- "src_snippets/howto/python_api/b_create_dataset.py:create_from_pandas_import"
+
+# Load your transformed data into a pandas-on-Spark DataFrame
+path = "path/to/your/data"
+custom_summary_stats_pandas_df = ps.read_csv(path)
+--8<-- "src_snippets/howto/python_api/b_create_dataset.py:create_from_pandas"
+```
+
+## What's next?
+
+In the next section, we will explore how to apply well-established algorithms that transform and analyse genetic data within the Gentropy framework.
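Before building a `Dataset` from your own data, it can help to check that the columns you have match what the schema expects. The sketch below shows one way to do that with plain Python. The column names in `REQUIRED_COLUMNS` are placeholders chosen for illustration and are an assumption, not the authoritative schema; `missing_columns` is a hypothetical helper and not part of Gentropy.

```python
from collections.abc import Iterable

# Hypothetical subset of required columns, for illustration only; the
# authoritative list is the schema in the SummaryStatistics documentation.
REQUIRED_COLUMNS = ("studyId", "variantId", "chromosome", "position", "beta")


def missing_columns(
    columns: Iterable[str], required: Iterable[str] = REQUIRED_COLUMNS
) -> list[str]:
    """Return the required column names that are absent, in schema order."""
    present = set(columns)
    return [name for name in required if name not in present]


# Column names as they might appear in a user-provided table
my_columns = ["studyId", "variantId", "chromosome", "pos", "beta"]
problems = missing_columns(my_columns)
```

In this example `problems` lists `position`, which signals that the `pos` column needs renaming before the data is passed to the `Dataset` constructor.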
diff --git a/docs/howto/python_api/c_applying_methods.md b/docs/howto/python_api/c_applying_methods.md
@@ -0,0 +1,36 @@
+---
+title: Applying methods
+---
+
+The available methods implement well-established algorithms that transform and analyse data. Methods usually take as input predefined `Dataset`(s) and produce one or several `Dataset`(s) as output. This section explains how to apply methods to your data.
+
+The full list of available methods can be found in the Python API [documentation](../../python_api/methods/_methods.md).
+
+## Apply a class method
+
+Some methods are implemented as class methods. For example, the `finemap` method is a class method of the [`PICS`](../../python_api/methods/pics.md) class. This method performs fine-mapping using the PICS algorithm. These methods usually take as input one or several `Dataset`(s) and produce one or several `Dataset`(s) as output.
+
+```python
+--8<-- "src_snippets/howto/python_api/c_applying_methods.py:apply_class_method_pics"
+```
+
+## Apply a `Dataset` instance method
+
+Some methods are implemented as instance methods of the `Dataset` class. For example, the `window_based_clumping` method is an instance method of the `SummaryStatistics` class. This method performs window-based clumping on summary statistics.
+
+```python
+--8<-- "src_snippets/howto/python_api/c_applying_methods.py:apply_instance_method"
+```
+
+!!! info "The `window_based_clumping` method is also available as a class method"
+
+    The `WindowBasedClumping` class provides the same window-based clumping of summary statistics as a class method.
+
+    ```python
+    # Perform window-based clumping on summary statistics
+    --8<-- "src_snippets/howto/python_api/c_applying_methods.py:apply_class_method_clumping"
+    ```
+
+## What's next?
+
+Up next, we'll show you how to inspect your data to ensure its integrity and the success of your transformations.
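To give an intuition for what window-based clumping does, here is a simplified, pure-Python sketch of the general idea: associations are visited from most to least significant, and one is kept as a lead only if no already-kept lead sits on the same chromosome within the window. This is an illustration under stated assumptions, not Gentropy's implementation; the `Association` type, the `window_clump` function and the 500 kb default are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Association:
    chromosome: str
    position: int
    p_value: float


def window_clump(
    associations: list[Association], window: int = 500_000
) -> list[Association]:
    """Keep the most significant association per window (simplified sketch)."""
    leads: list[Association] = []
    # Visit associations from the smallest p-value to the largest.
    for candidate in sorted(associations, key=lambda a: a.p_value):
        near_existing_lead = any(
            candidate.chromosome == lead.chromosome
            and abs(candidate.position - lead.position) <= window
            for lead in leads
        )
        if not near_existing_lead:
            leads.append(candidate)
    return leads


hits = [
    Association("1", 1_000, 1e-10),
    Association("1", 200_000, 1e-8),
    Association("1", 900_000, 1e-9),
    Association("2", 1_500, 1e-7),
]
lead_hits = window_clump(hits)
```

With these inputs the association at `1:200000` is absorbed by the stronger signal at `1:1000`, while `1:900000` and `2:1500` remain as separate leads because they fall outside the window or on another chromosome.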
diff --git a/docs/howto/python_api/d_inspect_dataset.md b/docs/howto/python_api/d_inspect_dataset.md
@@ -0,0 +1,37 @@
+---
+title: Inspect a dataset
+---
+
+We have seen how to create and transform a `Dataset` instance. This section guides you through inspecting your data to ensure its integrity and the success of your transformations.
+
+## Inspect data in a `Dataset`
+
+The `df` attribute of a Dataset instance is key to interacting with and inspecting the stored data.
+
+!!! info "By accessing the df attribute, you can apply any method that you would typically use on a PySpark DataFrame. See the [PySpark documentation](https://spark.apache.org/docs/3.1.1/api/python/reference/pyspark.sql.html#dataframe-apis) for more information."
+
+### View data samples
+
+```python
+--8<-- "src_snippets/howto/python_api/d_inspect_dataset.py:print_dataframe"
+```
+
+This method displays the first 10 rows of your dataset, giving you a snapshot of your data's structure and content.
+
+### Understand the schema
+
+```python
+--8<-- "src_snippets/howto/python_api/d_inspect_dataset.py:get_dataset_schema"
+
+--8<-- "src_snippets/howto/python_api/d_inspect_dataset.py:print_dataframe"
+```
+
+## Write a `Dataset` to disk
+
+```python
+--8<-- "src_snippets/howto/python_api/d_inspect_dataset.py:write_parquet"
+
+--8<-- "src_snippets/howto/python_api/d_inspect_dataset.py:write_csv"
+```
+
+Consider the format's compatibility with your tools, and the partitioning strategy for large datasets to optimize performance.
diff --git a/docs/index.md b/docs/index.md
@@ -7,28 +7,39 @@ hide:
 
 </br>
 
-<img width="800" height="300" src="assets/imgs/gentropy.svg">
+<div style="text-align: center;">
+    <img width="800" height="300" src="assets/imgs/gentropy.svg">
+</div>
+
 <style>
   .md-typeset h1,
   .md-content__button {
     display: none;
   }
 </style>
 
+<br>
 </br>
 
-[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
-[![PyPI pyversions](https://img.shields.io/pypi/pyversions/gentropy.svg)](https://pypi.python.org/pypi/gentropy/)
-[![PyPI version](https://badge.fury.io/py/gentropy.svg)](https://badge.fury.io/py/gentropy)
-[![image](https://github.com/opentargets/gentropy/actions/workflows/release.yaml/badge.svg)](https://opentargets.github.io/gentropy/)
-[![codecov](https://codecov.io/gh/opentargets/gentropy/branch/main/graph/badge.svg?token=5ixzgu8KFP)](https://codecov.io/gh/opentargets/gentropy)
-[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
-[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.10527086.svg)](https://doi.org/10.5281/zenodo.10527086)
-
+<p align="center">
+<a href="https://github.com/astral-sh/ruff"><img src="https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json" alt="Ruff" /></a>
+<a href="https://pypi.python.org/pypi/gentropy/"><img src="https://img.shields.io/pypi/pyversions/gentropy.svg" alt="PyPI pyversions" /></a>
+<a href="https://badge.fury.io/py/gentropy"><img src="https://badge.fury.io/py/gentropy.svg" alt="PyPI version" /></a>
+<a href="https://opentargets.github.io/gentropy/"><img src="https://github.com/opentargets/gentropy/actions/workflows/release.yaml/badge.svg" alt="image" /></a>
+<a href="https://codecov.io/gh/opentargets/gentropy"><img src="https://codecov.io/gh/opentargets/gentropy/branch/main/graph/badge.svg?token=5ixzgu8KFP" alt="codecov" /></a>
+<a href="https://opensource.org/licenses/Apache-2.0"><img src="https://img.shields.io/badge/License-Apache_2.0-blue.svg" alt="License" /></a>
+<a href="https://doi.org/10.5281/zenodo.10527086"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.10527086.svg" alt="DOI" /></a>
+</p>
 ---
 
 Open Targets Gentropy is a Python package to facilitate the interpretation and analysis of GWAS and functional genomic studies for target identification. This package contains a toolkit for the harmonisation, statistical analysis and prioritisation of genetic signals to assist drug discovery.
 
+#### Key Features:
+
+- **Specialized Datatypes**: Introduces essential genetics datatypes like _StudyLocus_, _LocusToGene_, and _SummaryStatistics_.
+- **Performance-Oriented**: Optimized for large-scale genetic data analysis, including locus-to-gene scoring, fine mapping, and colocalization analysis.
+- **User-Friendly**: The package is designed to be intuitive, allowing both beginners and experienced researchers to conduct complex genetic analyses with ease.
+
 ## About Open Targets
 
 Open Targets is a pre-competitive, public-private partnership that uses human genetics and genomics data to systematically identify and prioritise drug targets. Through large-scale genomic experiments and the development of innovative computational techniques, the partnership aims to help researchers select the best targets for the development of new therapies. For more information, visit the Open Targets [website](https://www.opentargets.org).
diff --git a/docs/src_snippets/howto/python_api/a_creating_spark_session.py b/docs/src_snippets/howto/python_api/a_creating_spark_session.py
@@ -0,0 +1,32 @@
+"""Docs to create a default Spark Session."""
+from gentropy.common.session import Session
+
+
+def default_session() -> Session:
+    """Create a default Spark Session.
+
+    Returns:
+        Session: Spark Session.
+    """
+    # --8<-- [start:default_session]
+    from gentropy.common.session import Session
+
+    # Create a default Spark Session
+    session = Session()
+    # --8<-- [end:default_session]
+    return session
+
+
+def custom_session() -> Session:
+    """Create a custom Spark Session.
+
+    Returns:
+        Session: Spark Session.
+    """
+    # --8<-- [start:custom_session]
+    from gentropy.common.session import Session
+
+    # Create a Spark session with increased driver memory
+    session = Session(extended_spark_conf={"spark.driver.memory": "4g"})
+    # --8<-- [end:custom_session]
+    return session
diff --git a/docs/src_snippets/howto/python_api/b_create_dataset.py b/docs/src_snippets/howto/python_api/b_create_dataset.py
@@ -0,0 +1,59 @@
+"""Docs to create a dataset."""
+from __future__ import annotations
+
+from typing import TYPE_CHECKING
+
+from gentropy.common.session import Session
+
+if TYPE_CHECKING:
+    from gentropy.dataset.summary_statistics import SummaryStatistics
+
+
+def create_from_parquet(session: Session) -> SummaryStatistics:
+    """Create a dataset from a path with parquet files."""
+    # --8<-- [start:create_from_parquet_import]
+    # Create a SummaryStatistics object by loading data from the specified path
+    from gentropy.dataset.summary_statistics import SummaryStatistics
+
+    # --8<-- [end:create_from_parquet_import]
+
+    path = "tests/data_samples/sumstats_sample/GCST005523_chr18.parquet"
+    # --8<-- [start:create_from_parquet]
+    summary_stats = SummaryStatistics.from_parquet(session, path)
+    # --8<-- [end:create_from_parquet]
+    return summary_stats
+
+
+def create_from_source(session: Session) -> SummaryStatistics:
+    """Create a dataset from a FinnGen raw summary statistics file."""
+    # --8<-- [start:create_from_source_import]
+    # Create a SummaryStatistics object by loading raw data from FinnGen
+    from gentropy.datasource.finngen.summary_stats import FinnGenSummaryStats
+
+    # --8<-- [end:create_from_source_import]
+    path = "tests/data_samples/finngen_R9_AB1_ACTINOMYCOSIS.gz"
+    # --8<-- [start:create_from_source]
+    summary_stats = FinnGenSummaryStats.from_source(session.spark, path)
+    # --8<-- [end:create_from_source]
+    return summary_stats
+
+
+def create_from_pandas() -> SummaryStatistics:
+    """Create a dataset from a pandas-on-Spark DataFrame."""
+    # --8<-- [start:create_from_pandas_import]
+    import pyspark.pandas as ps
+    from gentropy.dataset.summary_statistics import SummaryStatistics
+
+    # --8<-- [end:create_from_pandas_import]
+
+    path = "tests/data_samples/sumstats_sample/GCST005523_chr18.parquet"
+    custom_summary_stats_pandas_df = ps.read_parquet(path)
+    # --8<-- [start:create_from_pandas]
+
+    # Create a SummaryStatistics object specifying the data and schema
+    custom_summary_stats_df = custom_summary_stats_pandas_df.to_spark()
+    custom_summary_stats = SummaryStatistics(
+        _df=custom_summary_stats_df, _schema=SummaryStatistics.get_schema()
+    )
+    # --8<-- [end:create_from_pandas]
+    return custom_summary_stats