Add summary stats script #501

Open

jkgoodrich wants to merge 2 commits into main from jg/add_summary_stats

Contributor

jkgoodrich commented Oct 23, 2023

I have not yet run or tested --generate-gene-lof-matrix or --summarize-gene-lof-matrix, since running them requires a densify.

jkgoodrich added 2 commits

October 22, 2023 19:48


          filter to only variants with defined genomes and exomes freq for test

d685aca


          Use mane_select_only=True in get_summary_counts

307cec2

matren395 reviewed

Contributor

matren395 left a comment

There have been a few requests for summary stats for genomes. That is possible in a notebook, but it could be good to productionize it or add an argument for it in this script.

gnomad_qc/v4/assessment/summary_stats.py

+                              mane_select_only=True,
+                              index=freq_index,
+                          )
+                          meta_ht = meta.ht()

Contributor

matren395 Oct 26, 2023

Suggested change

      
-                        meta_ht = meta.ht()
+                        meta_ht = meta(data_type=data_type).ht()

Contributor

matren395 Oct 26, 2023

Wait, it doesn't work like this in this branch. Did meta change from main?

gnomad_qc/v4/assessment/summary_stats.py

+                          ht = ht.annotate_globals(num_release_samples=meta_ht.count())
+                          ht.write(
+                              release_summary_stats(
+                                  test=test, data_type="exomes", filter_name=filter_name

Contributor

matren395 Oct 26, 2023

Suggested change

      
-                                test=test, data_type="exomes", filter_name=filter_name
+                                test=test, data_type=data_type, filter_name=filter_name

gnomad_qc/v4/assessment/summary_stats.py

+                              ],
+                          )
+                          mt.write(
+                              release_lof(test=test, data_type="exomes", mt=True).path,

Contributor

matren395 Oct 26, 2023

Suggested change

      
-                            release_lof(test=test, data_type="exomes", mt=True).path,
+                            release_lof(test=test, data_type=data_type, mt=True).path,

gnomad_qc/v4/assessment/summary_stats.py

+                          )
+                      if args.summarize_gene_lof_matrix:
+                          mt = release_lof(test=test, data_type="exomes", mt=True).mt()

Contributor

matren395 Oct 26, 2023

Suggested change

      
-                        mt = release_lof(test=test, data_type="exomes", mt=True).mt()
+                        mt = release_lof(test=test, data_type=data_type, mt=True).mt()

gnomad_qc/v4/assessment/summary_stats.py

+                          )
+                          ht = default_generate_gene_lof_summary(mt)
+                          ht.write(
+                              release_lof(test=test, data_type="exomes").path,

Contributor

matren395 Oct 26, 2023

Suggested change

      
-                            release_lof(test=test, data_type="exomes").path,
+                            release_lof(test=test, data_type=data_type).path,

gnomad_qc/v4/assessment/summary_stats.py

+                      "--summarize-gene-lof-matrix",
+                      help="Creates gene LoF matrix summary Table.",
+                      action="store_true",
+                  )

Contributor

matren395 Oct 26, 2023

parser.add_argument(
    "--data-type",
    default="exomes",
    choices=["exomes", "genomes"],
    help="Data type (exomes or genomes) to produce summary stats for.",
)
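
The suggestions in this review all thread a single data_type value through the script in place of the hard-coded "exomes". A minimal sketch of that pattern, using only argparse, might look like the following. The --data-type argument is the one proposed in this comment; summary_stats_path and its path layout are hypothetical stand-ins for resource helpers such as release_summary_stats, and are not the gnomad_qc API.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the proposed --data-type argument."""
    parser = argparse.ArgumentParser(description="Summary stats sketch.")
    parser.add_argument(
        "--data-type",
        default="exomes",
        choices=["exomes", "genomes"],
        help="Data type (exomes or genomes) to produce summary stats for.",
    )
    parser.add_argument("--test", action="store_true", help="Use test resources.")
    return parser


def summary_stats_path(data_type: str, test: bool = False) -> str:
    """Hypothetical stand-in for a resource helper; the layout is illustrative only."""
    prefix = "test" if test else "release"
    return f"{prefix}/{data_type}/summary_stats.ht"


def main(argv=None) -> str:
    args = build_parser().parse_args(argv)
    # Pass the parsed value through instead of hard-coding "exomes".
    return summary_stats_path(data_type=args.data_type, test=args.test)
```

Because the default stays "exomes", invocations that omit --data-type would keep resolving the exomes resources in this sketch.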

jkgoodrich mentioned this pull request

Suggestions to per sample counts 4.1 PR #581

Merged

Contributor

KoalaQin commented May 1, 2024

Should we get this PR in too? I can help review if needed. Want to wrap up all the stats tickets.

Contributor

matren395 commented May 1, 2024

Hmm, so this code was instead run in a notebook, and essentially it runs get_summary_stats() to get per-callset stats (not per-sample) of almost exactly what we're computing.

With that goal in mind, maybe we should instead calculate the totals of everything we compute per sample as sums from the intermediate file? I can add the code for it if we want per-callset stats rather than per-sample stats.

Contributor

matren395 commented May 1, 2024

> Should we get this PR in too? I can help review if needed. Want to wrap up all the stats tickets.

So I took a run at this in #611, just getting those stats from the existing aggregated table, since they're already there. Let me know your thoughts.

Contributor

matren395 commented May 2, 2024 (edited)

After another look, this actually looks like it's doing something a bit more complex, more productionized, and better versioned than the simple code I put into #611. Running the existing summary stats method is not the same as calculating things from the intermediate per-sample table, though we can get 99% of the same results from it. @jkgoodrich, is this worth pursuing, or is calculating plenty of per-callset stats via #611 good enough?

I think this is code you/Julia wrote in Oct 2023, so it should honestly be very close to mergeable if we want it. However, running this AND our per-sample code would involve two (expensive) aggregations over the whole dataset, which is... suboptimal!
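
To illustrate the cost concern, here is a toy sketch in plain Python (not Hail, and not the code in this PR or #611) of producing both per-sample and per-callset counts in a single pass over the data. The record layout, the category labels, and the function name are all hypothetical.

```python
from collections import Counter, defaultdict


def count_in_one_pass(variants):
    """Return (per_callset, per_sample) category counts from a single pass.

    Each variant is a hypothetical (category, carrier_sample_ids) pair.
    """
    per_callset = Counter()
    per_sample = defaultdict(Counter)
    for category, carriers in variants:
        per_callset[category] += 1  # one count per variant in the callset
        for sample in carriers:
            per_sample[sample][category] += 1  # one count per carrier
    return per_callset, dict(per_sample)


# Toy data: two LoF variants and one missense variant across three samples.
variants = [
    ("lof", ["s1", "s2"]),
    ("missense", ["s1"]),
    ("lof", ["s3"]),
]
per_callset, per_sample = count_in_one_pass(variants)
```

In this toy data, summing the per-sample "lof" counts gives 3 (one per carrier), while the per-callset "lof" count is 2 (one per variant), so the two kinds of totals are not interchangeable and it matters which one is wanted.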

