Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs on missing data and coordinate systems, closes #1584 #1601

Merged
merged 2 commits into from
Dec 13, 2024
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
73 changes: 73 additions & 0 deletions docs/tutorial.rst
Original file line number Diff line number Diff line change
Expand Up @@ -1439,6 +1439,79 @@ To make the example quick, we've only simulated the first 100Kb;
a more realistic example would apply it to the exons, available
as a :ref:`annotation <sec_catalog_phosin_annotations>`.

.. _sec_tute_recapitation:

Tips, tricks, and gotchas
=========================

Here are a few things about the whole process that it might be useful to know.
Maybe this will save you some time,
or let you do new things!

.. _sec_tute_missing_data:

Missing data and coordinates
----------------------------

Suppose as above that we've simulated just a portion of a chromosome,
using the `left` and `right` arguments to `species.get_contig( )`:

.. code-block:: python

species = stdpopsim.get_species("HomSap")
model = species.get_demographic_model("Africa_1T12")
contig = species.get_contig(
"chr22", left=10e6, right=20e6, mutation_rate=model.mutation_rate
)
samples = {"AFR": 100}
engine = stdpopsim.get_engine("msprime")
ts = engine.simulate(model, contig, samples)
print(
f"Sequence length: {ts.sequence_length}\n"
f" First variant: {ts.sites_position[0]}\n"
f" Last variant: {ts.sites_position[-1]}\n"
)
# Sequence length: 50818468.0
# First variant: 10000142.0
# Last variant: 19999926.0


We would like the output to preserve the coordinate system,
so all variants we'd see in a VCF file (for instance) are between
10Mb and 20Mb. (And, if you're just getting a VCF, then no need to read
the rest of this!) However, for the tree sequence to
retain the same coordinates, it must start at position 0,
and end at the sequence length of human chromosome 22.
So, the rest of the tree sequence contains "misssing data",
which is encoded as, basically, a big "tree" where no-one is
related to anyone else on those segments (in other words,
before 10Mb and after 20Mb).

This can lead to surprising things.
For instance, the first tree (the tree describing relationships
at position 0 along the sequence) has 200 roots:

.. code-block:: python

t = ts.first()
t.num_roots
# 200

Of course, that's just one root per sample: in other words,
there's actually no trees on this portion of the genome.
If we check all the trees using the `root_threshold` argument
to :meth:`tskit.TreeSequence.trees`, then we'll correctly see
that in fact all trees have fully coalesced (as they should have,
because as discussed above, we have recapitated them):

.. code-block:: python

max([t.num_roots for t in ts.trees(root_threshold=2)])
# 1

To read more about using tree sequences,
see `tskit's documentation <https://tskit.dev/tskit/docs/latest/data-model.html>`__.

.. _sec_tute_analyses:

*******************************
Expand Down
Loading