Add sample mask #896
ddebea8 to e529ac0
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #896   +/-   ##
=======================================
  Coverage   87.04%   87.04%
=======================================
  Files           5        5
  Lines        1767     1767
  Branches      310      310
=======================================
  Hits         1538     1538
  Misses        140      140
  Partials       89       89
Flags with carried forward coverage won't be shown.
Nice. I think the chunk_iterator could be a bit simpler and more generic by having a mask=[dim0mask, dim1mask]
argument, and each dimension mask defaults to None (np.ones in that dimension), but we can log that as a follow up issue.
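The per-dimension mask idea suggested here could look roughly like the sketch below. `resolve_masks` is a hypothetical helper, not part of tsinfer, and it assumes that True means "keep this entry", which is how the mask appears to be used at this point in the PR.

```python
import numpy as np


def resolve_masks(shape, mask=None):
    """Return one boolean array per dimension; None means keep everything.

    Hypothetical helper for illustration only, not tsinfer's API.
    Assumes True means "keep this entry".
    """
    if mask is None:
        mask = [None] * len(shape)
    if len(mask) != len(shape):
        raise ValueError("Need one mask (or None) per dimension")
    resolved = []
    for size, dim_mask in zip(shape, mask):
        if dim_mask is None:
            # Default: an all-True mask for this dimension.
            dim_mask = np.ones(size, dtype=bool)
        dim_mask = np.asarray(dim_mask, dtype=bool)
        if dim_mask.shape != (size,):
            raise ValueError("Mask length does not match dimension size")
        resolved.append(dim_mask)
    return resolved


# Toy (sites, samples) array: drop the second site, keep every sample.
genotypes = np.arange(12).reshape(4, 3)
site_mask, sample_mask = resolve_masks(
    genotypes.shape, mask=[[True, False, True, True], None]
)
subset = genotypes[site_mask][:, sample_mask]
```

With every dimension handled the same way, a chunk iterator would only need to apply the resolved masks rather than special-case a single `mask` argument.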
Did we ever get to the root of the knotty question of whether to try to make the meaning of mask=1 and mask=0 the same between SGkit and
Good point. We want to follow sgkit, no point in thinking any harder than that.
Yeah, I plan to flip the masks. Will file an issue for it.
Would appreciate a quick review here. Seems to all be working at GeL and BMRC.
tsinfer/formats.py
Outdated
    def samples_mask(self):
        # Samples in sgkit are individuals in tskit, so we need to expand
        # the mask to cover all the samples for each individual.
        return np.repeat(self.individuals_mask, self.ploidy)
Instead of making a new cached array, can't you broadcast a view into the individuals_mask instead, so you don't need to make a copy?
Hmm, maybe not actually, since this is row-wise. Ignore me.
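One way to read this exchange, sketched with plain NumPy and toy values (the arrays below are illustrative, not taken from tsinfer): a zero-copy broadcast only exists as a 2D (individual, ploidy) view, and flattening it into the per-sample order that `np.repeat` produces forces a copy anyway.

```python
import numpy as np

ploidy = 2
individuals_mask = np.array([True, False, True])

# One entry per sample, with each individual's samples stored contiguously.
samples_mask = np.repeat(individuals_mask, ploidy)

# A broadcast gives a zero-copy view, but only in 2D (individual, ploidy) form.
two_d_view = np.broadcast_to(
    individuals_mask[:, np.newaxis], (len(individuals_mask), ploidy)
)

# Flattening the broadcast view to 1D cannot be expressed with strides,
# so NumPy has to allocate a new array here.
flat = two_d_view.reshape(-1)
```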
Although I can't actually see anywhere that samples_mask is used in the code (it is used in the tests, though). Am I missing something?
Yes, this isn't used directly, but I thought it would be a useful thing to have.
tsinfer/formats.py
Outdated
@@ -2297,9 +2299,9 @@ def __init__(self, path):
         self.path = path
         self.data = zarr.open(path, mode="r")
         genotypes_arr = self.data["call_genotype"]
-        _, self._num_individuals, self.ploidy = genotypes_arr.shape
+        _, self._num_unmasked_individuals, self.ploidy = genotypes_arr.shape
I interpreted "unmasked" as the number of individuals without the mask set, but instead it's the total number of individuals before masking. Would a better name be e.g. _total_num_individuals or num_individuals_premask, or something else maybe (also below for _num_unmasked_samples)?
@@ -2445,9 +2460,9 @@ def provenances_record(self):
         except KeyError:
             return np.array([], dtype=object)
 
-    @property
+    @functools.cached_property
     def num_samples(self):
Is it worth documenting this as the number of samples that have not been masked out (equivalent to the total number of samples in the dataset if there is no masking)?
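A docstring along these lines might capture that meaning. The function below is a hypothetical standalone stand-in for the property, and it assumes True marks an individual that is kept.

```python
import numpy as np


def num_selected_samples(individuals_keep, ploidy):
    """Return the number of samples that have not been masked out.

    Equal to the total number of samples in the dataset when nothing is
    masked. Hypothetical stand-in for the num_samples property; assumes
    True in individuals_keep means the individual is kept.
    """
    return int(np.sum(individuals_keep)) * ploidy


with_masking = num_selected_samples(np.array([True, False, True]), ploidy=2)
without_masking = num_selected_samples(np.ones(3, dtype=bool), ploidy=2)
```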
LGTM, modulo naming (and I'm not sure you use samples_mask, so do we actually need it?)
Would it be simpler to assume that there is always an individual and site mask, and just try to preserve the old API with those imposed? From tsinfer's perspective, we don't care about stuff that has been masked out, and if you want to look at the raw data you go back to sgkit. So,
Yes, I think that would be cleaner, will switch it over.
I think this is a nice pattern we could adopt here for the general iteration: https://github.com/pystatgen/vcf-zarr-publication/blob/a01de00e36d0918a7e47fb2f8c6b3a4fd810eb66/src/zarr_afdist.py#L110
So, for iterating over haplotypes we'd do:
    # Use zarr arrays to get mask chunks aligned with the main data
    # for convenience.
    z_variant_mask = zarr.array(
        variant_mask, chunks=call_genotype.chunks[0], dtype=np.int8
    )
    for v_chunk in range(call_genotype.cdata_shape[0]):
        variant_mask_chunk = z_variant_mask.blocks[v_chunk]
        count = np.sum(variant_mask_chunk)
        if count > 0:
            v_chunk = call_genotype.blocks[v_chunk]
            for j, row in enumerate(v_chunk):
                if variant_mask_chunk[j]:
                    yield row[sample_mask]
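The same chunk-skipping idea can be tried without zarr. The sketch below is a NumPy-only approximation in which plain slices stand in for zarr chunks; `iter_selected_rows` and its parameter names are hypothetical, and True is assumed to mean "keep".

```python
import numpy as np


def iter_selected_rows(genotypes, variant_select, sample_select, chunk_size):
    """Yield selected rows chunk by chunk, skipping chunks with nothing selected.

    Hypothetical sketch: slices of a NumPy array stand in for zarr chunks.
    """
    num_variants = genotypes.shape[0]
    for start in range(0, num_variants, chunk_size):
        stop = min(start + chunk_size, num_variants)
        select_chunk = variant_select[start:stop]
        if not np.any(select_chunk):
            # Nothing selected here, so the data chunk is never read.
            continue
        data_chunk = genotypes[start:stop]
        for j, row in enumerate(data_chunk):
            if select_chunk[j]:
                yield row[sample_select]


genotypes = np.arange(20).reshape(5, 4)
variant_select = np.array([False, False, True, False, True])
sample_select = np.array([True, False, True, True])
rows = [
    row.tolist()
    for row in iter_selected_rows(
        genotypes, variant_select, sample_select, chunk_size=2
    )
]
```

In this toy run the first chunk (rows 0 and 1) has no selected variants and is skipped entirely, which is the saving the pattern is after when chunks are compressed on disk.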
8817753 to 5938d9a
This is pretty much done. The test failure is odd; I can't immediately recreate it, so I will reproduce the exact environment that is failing.
LGTM, but I think we need to fix the terminology here as it's horribly confusing. Let's change all the things that we currently have as x_mask to x_select, and reserve the word "mask" to specifically mean "mask something out if true". Also change unmasked_x to selected_x, which I think would be a lot easier to follow.
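The two senses are each other's negation, which a toy example makes concrete. The array names follow the x_mask / x_select naming proposed here; the values are made up for illustration.

```python
import numpy as np

# "mask": True means mask this site out (exclude it).
sites_mask = np.array([False, True, False, False])

# "select": True means keep this site. It is the logical inverse of the mask.
sites_select = ~sites_mask

positions = np.array([10, 20, 30, 40])
selected_positions = positions[sites_select]
num_selected_sites = int(np.sum(sites_select))
```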
Ok I've done the mask renaming.
LGTM, a couple of simplifications and minor comment. Good to merge then.
tsinfer/formats.py
Outdated
        if self._sites_mask_name is None:
            return np.full(self.data["variant_position"].shape, True, dtype=bool)
        else:
            try:
Basically all of the logic for this method could be moved to the __init__, couldn't it? That would make it possible to catch these kinds of errors at init time rather than later on.
Why not also just store the sites_select array then rather than faffing with a cached_property? This is a read-only view, isn't it?
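A rough sketch of what resolving the selection at init time might look like follows. `SelectedSites` is a hypothetical stand-in class, a plain dict stands in for the zarr group, and the mask is assumed to follow the "True means masked out" sense, so the stored select array is its inverse.

```python
import numpy as np


class SelectedSites:
    """Hypothetical stand-in showing select arrays resolved in __init__.

    Not tsinfer's actual class. Assumes the named mask array uses
    True to mean "masked out".
    """

    def __init__(self, data, sites_mask_name=None):
        num_sites = data["variant_position"].shape[0]
        if sites_mask_name is None:
            select = np.full(num_sites, True, dtype=bool)
        else:
            if sites_mask_name not in data:
                # A bad mask name fails here, at construction time.
                raise ValueError(f"The sites mask {sites_mask_name!r} was not found")
            select = ~np.asarray(data[sites_mask_name], dtype=bool)
            if select.shape != (num_sites,):
                raise ValueError("The sites mask is the wrong length")
        # Store a read-only array instead of recomputing it lazily.
        select.setflags(write=False)
        self.sites_select = select
        self.num_sites = int(np.sum(select))


data = {
    "variant_position": np.array([1, 5, 9]),
    "my_mask": np.array([False, True, False]),
}
sites = SelectedSites(data, sites_mask_name="my_mask")
```

Constructing the object with a mask name that is missing from `data` raises immediately, rather than on first access to a cached property.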
Fixed in 56df800
tsinfer/formats.py
Outdated
@@ -2333,6 +2355,26 @@ def sequence_length(self):
     def num_sites(self):
         return self._num_sites
 
+    @functools.cached_property
+    def individuals_select(self):
+        if self._sgkit_samples_mask_name is None:
Same comment as sites_select - can we just compute and store this at init time?
Fixed in 56df800
tsinfer/formats.py
Outdated
@@ -305,7 +305,7 @@ def zarr_summary(array):
     return ret
 
 
-def chunk_iterator(array, indexes=None, mask=None, dimension=0):
+def chunk_iterator(array, indexes=None, mask=None, orthogonal_select=None, dimension=0):
mask should be select here, it's being used in the wrong sense currently.
Fixed in 3fdf3b9
@Mergifyio rebase
☑️ Nothing to do
Comments addressed.
Needs a couple of extra tests for weird masks, but mostly there.