From a98a0195d49b9fce6442c7dfb7214d422a067678 Mon Sep 17 00:00:00 2001 From: Yan Wong Date: Wed, 28 Aug 2024 15:45:09 +0100 Subject: [PATCH] Add example of excluding chromosomes using site masks And switch to the preferred `zarr.open` instead of `zarr.load`. --- .../example_data.vcz/contig_id/.zarray | 4 +- docs/_static/example_data.vcz/contig_id/0 | Bin 20 -> 32 bytes docs/usage.md | 48 ++++++++++++++---- 3 files changed, 39 insertions(+), 13 deletions(-) diff --git a/docs/_static/example_data.vcz/contig_id/.zarray b/docs/_static/example_data.vcz/contig_id/.zarray index ce072443..011f6ab3 100644 --- a/docs/_static/example_data.vcz/contig_id/.zarray +++ b/docs/_static/example_data.vcz/contig_id/.zarray @@ -9,8 +9,8 @@ "id": "blosc", "shuffle": 1 }, - "dtype": "`. + ### Masks -Sites which are not used for inference will still be included in the final tree sequence, with -mutations at those sites being placed onto branches by parsimony. However, it is also possible -to completely exclude sites and samples from the final tree sequence, by specifing a `site_mask` -and/or a `sample_mask` when creating the `VariantData` object. Such sites or samples will be -completely omitted from both inference and the final tree sequence. This can be useful, for -example, to reduce the amount of computation required for an inference. +It is also possible to *completely* exclude sites and samples, by specifing a boolean +`site_mask` and/or a `sample_mask` when creating the `VariantData` object. Sites or samples with +a mask value of `True` will be completely omitted both from inference and the final tree sequence. +This can be useful, for example, if your VCF file contains multiple chromosomes (in which case +`tsinfer` will need to be run separately on each chromosome) or if you wish to select only a subset +of the chromosome for inference (e.g. to reduce computational load). If a `site_mask` is provided, +note that the ancestral alleles array only specifies alleles for the unmasked sites. + +Below, for instance, is an example of including only sites up to position six in the contig +labelled "chr1" in the `example_data.vcz` file: + +```{code-cell} +import numpy as np + +# mask out any sites not associated with the contig named "chr1" +# (for demonstration: all sites in this .vcz file are from "chr1" anyway) +chr1_index = np.where(ds.contig_id[:] == "chr1")[0] +site_mask = ds.variant_contig[:] != chr1_index +# also mask out any sites with a position >= 6 +site_mask[ds.variant_position[:] >= 6] = True + +smaller_vdata = tsinfer.VariantData( + "_static/example_data.vcz", + ancestral_allele=ancestral_allele[site_mask == False], + site_mask=site_mask, +) +print(f"The `smaller_vdata` object returns data for only {smaller_vdata.num_sites} sites") +``` ### Topology inference