Improve batch-match coverage #998

benjeffery · 2025-02-13T12:48:10Z

Fixes #972

codecov · 2025-02-13T13:03:15Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 93.34%. Comparing base (fbff408) to head (a063456).
Report is 9 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #998      +/-   ##
==========================================
+ Coverage   93.16%   93.34%   +0.18%     
==========================================
  Files          18       18              
  Lines        6462     6458       -4     
  Branches     1097     1095       -2     
==========================================
+ Hits         6020     6028       +8     
+ Misses        300      292       -8     
+ Partials      142      138       -4

Flag	Coverage Δ
C	`93.34% <100.00%> (+0.18%)`	⬆️
python	`95.71% <100.00%> (+0.25%)`	⬆️


benjeffery · 2025-02-14T12:23:25Z

Ended up tweaking a few things here. For example, ancestors are now packed into partitions using a greedy bin-packing algorithm, and the logic for deciding how many partitions to use has been simplified.
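For readers unfamiliar with the approach, greedy bin packing of this kind can be sketched roughly as follows. This is a minimal illustration and not the code in tsinfer/inference.py: the function name `pack_greedy` and the example lengths are hypothetical, and each ancestor's length is assumed to be its work estimate.

```python
def pack_greedy(ancestor_lengths, partition_count):
    """Hypothetical helper: greedy bin packing by descending length."""
    # Place the longest ancestors first, while the bins are still nearly empty.
    order = sorted(
        ancestor_lengths, key=lambda a: ancestor_lengths[a], reverse=True
    )
    partitions = [[] for _ in range(partition_count)]
    totals = [0] * partition_count
    for ancestor in order:
        # Linear scan for the lightest bin: O(partition_count) per ancestor.
        lightest = min(range(partition_count), key=totals.__getitem__)
        partitions[lightest].append(ancestor)
        totals[lightest] += ancestor_lengths[ancestor]
    return partitions, totals


# Made-up example data: ancestor id -> length.
lengths = {0: 7, 1: 5, 2: 4, 3: 3, 4: 1}
partitions, totals = pack_greedy(lengths, 2)
print(partitions, totals)  # [[0, 3], [1, 2, 4]] [10, 10]
```

The linear scan for the lightest bin keeps the sketch short, at a cost that grows with the number of partitions for every ancestor placed.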

jeromekelleher

Looks good, small suggested implementation improvement.

jeromekelleher · 2025-02-14T12:31:34Z

tsinfer/inference.py

        if group_index == 0:
-            partitions.append(group_ancestors)
+            partitions = [
+                group_ancestors,


stray comma causing profligate whitespace

jeromekelleher · 2025-02-14T12:32:00Z

tsinfer/inference.py

-                    current_partition.append(ancestor)
-                    current_partition_work += ancestor_lengths[ancestor]
-            partitions.append(current_partition)
+            parition_count = math.ceil(total_work / min_work_per_job)


typo "parition" -> "partition"

benjeffery · 2025-02-14

Fixed. I'm sprinkling these in now to prove that a free-range human wrote the code.

jeromekelleher · 2025-02-14T12:34:16Z

tsinfer/inference.py

+            sorted_ancestors = sorted(
+                group_ancestors, key=lambda x: ancestor_lengths[x], reverse=True
+            )
+            partitions = []


Suggested change

-            partitions = []
+            partitions = [[] for _ in range(partition_count)]
+            partition_lengths = [0 for _ in range(partition_count)]

benjeffery · 2025-02-14

Superseded by the heap code.

jeromekelleher · 2025-02-14T12:43:55Z

tsinfer/inference.py

+
+            # Use greedy bin packing - place each ancestor in the bin with
+            # lowest total length
+            for ancestor in sorted_ancestors:


How about we use a heapq for this?

    heap = [(0, []) for _ in range(partition_count)]
    for ancestor in sorted_ancestors:
        sum_len, partition = heapq.heappop(heap)
        partition.append(ancestor)
        sum_len += ancestor_lengths[ancestor]
        heapq.heappush(heap, (sum_len, partition))

I think this does the same thing, but avoids the quadratic time complexity here.

benjeffery · 2025-02-14

Very nice, I should have thought of this!
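A self-contained sketch of the heap-based packing discussed in this thread might look like the following. It is an illustration under stated assumptions, not the code merged in a063456: the function name `pack_with_heap` and the example data are hypothetical, the `max(1, ...)` guard is an added assumption so that empty input still yields one partition, and a bin index is stored in each heap entry so that ties on total length never fall through to comparing the member lists.

```python
import heapq
import math


def pack_with_heap(ancestor_lengths, min_work_per_job):
    """Hypothetical helper: partition count plus heap-based greedy packing."""
    total_work = sum(ancestor_lengths.values())
    # Assumption: always use at least one partition, even for empty input.
    partition_count = max(1, math.ceil(total_work / min_work_per_job))
    order = sorted(
        ancestor_lengths, key=lambda a: ancestor_lengths[a], reverse=True
    )
    # Entries are (total length so far, bin index, members). A list that is
    # already sorted in ascending order is a valid heap.
    heap = [(0, i, []) for i in range(partition_count)]
    for ancestor in order:
        # Pop the lightest bin, add the ancestor, push it back: O(log k).
        total, index, members = heapq.heappop(heap)
        members.append(ancestor)
        heapq.heappush(
            heap, (total + ancestor_lengths[ancestor], index, members)
        )
    # Return the bins in index order for a stable result.
    return [members for _, _, members in sorted(heap, key=lambda e: e[1])]


# Made-up example data: ancestor id -> length.
lengths = {0: 7, 1: 5, 2: 4, 3: 3, 4: 1}
result = pack_with_heap(lengths, min_work_per_job=10)
print(result)  # [[0, 3], [1, 2, 4]]
```

With n ancestors and k partitions, each placement costs O(log k) rather than a scan over all k bins, so the packing loop is O(n log k).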

benjeffery · 2025-02-14T13:38:27Z

Fixed up in a063456

Remove max_num_partitions for sample batch matching

f9fba9e

benjeffery added 3 commits February 13, 2025 14:01

Bin pack ancestors in partitions

bde2058

Test force_sample_times

d9f8c71

Simplify num_samples_per_partition calc

629419b

benjeffery marked this pull request as ready for review February 14, 2025 12:22

Test stored numpy arrays in batch match

f0f3ef0

benjeffery force-pushed the batch-coverage branch from e5fa97f to f0f3ef0 Compare February 14, 2025 12:24

jeromekelleher reviewed Feb 14, 2025

Use a heap for ancestor packing

a063456

benjeffery merged commit 1aa0233 into tskit-dev:main Feb 17, 2025
12 checks passed

benjeffery deleted the batch-coverage branch February 17, 2025 13:15
