-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add summary stats script #501
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There have been a few requests for summary stats for genomes - possible in a notebook but could be good to productionize/add an argument for it in this script ?
mane_select_only=True, | ||
index=freq_index, | ||
) | ||
meta_ht = meta.ht() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
meta_ht = meta.ht() | |
meta_ht = meta(data_type=data_type).ht() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wait it doesn't work like this in this branch - did meta change from main?
ht = ht.annotate_globals(num_release_samples=meta_ht.count()) | ||
ht.write( | ||
release_summary_stats( | ||
test=test, data_type="exomes", filter_name=filter_name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test=test, data_type="exomes", filter_name=filter_name | |
test=test, data_type=data_type, filter_name=filter_name |
], | ||
) | ||
mt.write( | ||
release_lof(test=test, data_type="exomes", mt=True).path, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
release_lof(test=test, data_type="exomes", mt=True).path, | |
release_lof(test=test, data_type=data_type, mt=True).path, |
) | ||
|
||
if args.summarize_gene_lof_matrix: | ||
mt = release_lof(test=test, data_type="exomes", mt=True).mt() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mt = release_lof(test=test, data_type="exomes", mt=True).mt() | |
mt = release_lof(test=test, data_type=data_type, mt=True).mt() |
) | ||
ht = default_generate_gene_lof_summary(mt) | ||
ht.write( | ||
release_lof(test=test, data_type="exomes").path, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
release_lof(test=test, data_type="exomes").path, | |
release_lof(test=test, data_type=data_type).path, |
"--summarize-gene-lof-matrix", | ||
help="Creates gene LoF matrix summary Table.", | ||
action="store_true", | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
parser.add_argument(
"--data-type",
default="exomes",
choices=["exomes","genomes"],
help="Data type (exomes or genomes) to produce summary stats for."
)
Should we get this PR in too? I can help review if needed. Want to wrap up all the stats tickets. |
Hmm so this code was instead ran in a notebook and essentially this runs get_summary_stats() to get per-callset stats (not per sample) of just about exactly what we're doing. Following this goal, I think we should maybe calculate the TOTALS of everything we're calculating per sample as sums instead from the intermediate file ? I can add the code for it - if we want per-callset instead of per-sample - stats |
so I took a run at this in #611 - just getting those stats from the existing aggregated table, while they're already there. LMK your thoughts. |
After another look, this actually looks like it's doing something a bit more complex and productionized and better versionized than the simple code I put into #611 . Running the existing summary stats method =/= calculating things from the intermediate per-sample table , though we can return 99% of the same results from it. @jkgoodrich is this worth pursuing, or is calculating plenty of per-callset stats via #611 good enough ? I think this is code you/Julia wrote Oct2023, so it should honestly be very close to being mergeable if we'd want. However, running this AND our per-sample code would involve two (expensive) aggregations over the whole dataset which is.... suboptimal ! |
I have not yet run or tested
--generate-gene-lof-matrix
or--summarize-gene-lof-matrix
since it requires a densify