Improve batch-match coverage #998
Conversation
Codecov Report
All modified and coverable lines are covered by tests ✅

@@            Coverage Diff             @@
##             main     #998      +/-   ##
==========================================
+ Coverage   93.16%   93.34%   +0.18%
==========================================
  Files          18       18
  Lines        6462     6458       -4
  Branches     1097     1095       -2
==========================================
+ Hits         6020     6028       +8
+ Misses        300      292       -8
+ Partials      142      138       -4

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Ended up tweaking a few things here; for example, ancestors are now packed into partitions using a greedy bin-packing algorithm, and the logic for how many partitions to use has been simplified.
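For context, here is a minimal sketch of the scheme described above. The function name pack_ancestors is invented for illustration; ancestor_lengths and min_work_per_job follow the snippets discussed below, and this is not the actual tsinfer implementation:

import math

def pack_ancestors(ancestor_lengths, min_work_per_job):
    # Decide how many partitions to use from the total amount of work,
    # then greedily place the longest ancestors first into whichever
    # partition currently has the least work.
    total_work = sum(ancestor_lengths.values())
    partition_count = max(1, math.ceil(total_work / min_work_per_job))
    sorted_ancestors = sorted(
        ancestor_lengths, key=lambda a: ancestor_lengths[a], reverse=True
    )
    partitions = [[] for _ in range(partition_count)]
    partition_work = [0] * partition_count
    for ancestor in sorted_ancestors:
        # Linear scan for the least-loaded partition; the heapq suggestion
        # further down replaces this step.
        least_loaded = min(range(partition_count), key=lambda i: partition_work[i])
        partitions[least_loaded].append(ancestor)
        partition_work[least_loaded] += ancestor_lengths[ancestor]
    return partitions

Sorting longest-first before packing is the usual longest-processing-time heuristic, which keeps the partition totals reasonably balanced.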
Force-pushed the branch from e5fa97f to f0f3ef0 (Compare).
Looks good, small suggested implementation improvement.
tsinfer/inference.py (Outdated)

if group_index == 0:
    partitions.append(group_ancestors)
partitions = [
    group_ancestors,
stray comma causing profligate whitespace
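For context, assuming the project is auto-formatted with Black (an assumption on my part), the trailing comma acts as a "magic trailing comma": Black keeps a comma-terminated collection exploded across multiple lines, while removing the comma lets it collapse onto one line:

group_ancestors = [1, 2, 3]  # placeholder value, just for illustration
# With a magic trailing comma, Black keeps one element per line:
partitions = [
    group_ancestors,
]
# Without it, the same list is formatted on a single line:
partitions = [group_ancestors]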
tsinfer/inference.py (Outdated)

current_partition.append(ancestor)
current_partition_work += ancestor_lengths[ancestor]
partitions.append(current_partition)
parition_count = math.ceil(total_work / min_work_per_job)
typo "paritition" -> partition
Fixed. I'm sprinkling these in now to prove that a free-range human wrote the code.
tsinfer/inference.py (Outdated)

sorted_ancestors = sorted(
    group_ancestors, key=lambda x: ancestor_lengths[x], reverse=True
)
partitions = []
Suggested change:
- partitions = []
+ partitions = [[] for _ in range(partition_count)]
+ partition_lengths = [0 for _ in range(partition_count)]
Superseded by the heap code.
tsinfer/inference.py (Outdated)

# Use greedy bin packing - place each ancestor in the bin with
# lowest total length
for ancestor in sorted_ancestors:
How about we use a heapq for this?
import heapq

heap = [(0, []) for _ in range(partition_count)]
for ancestor in sorted_ancestors:
    # Pop the partition with the smallest total length so far, add this
    # ancestor to it, and push it back with the updated total.
    sum_len, partition = heapq.heappop(heap)
    partition.append(ancestor)
    sum_len += ancestor_lengths[ancestor]
    heapq.heappush(heap, (sum_len, partition))
I think this does the same thing, but avoids the quadratic time complexity here.
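A quick self-contained check of this suggestion, with made-up ancestor lengths and partition_count (the values are purely illustrative):

import heapq

ancestor_lengths = {0: 9.0, 1: 7.0, 2: 6.0, 3: 5.0, 4: 3.0, 5: 2.0}
partition_count = 2
sorted_ancestors = sorted(
    ancestor_lengths, key=lambda a: ancestor_lengths[a], reverse=True
)
heap = [(0, []) for _ in range(partition_count)]
for ancestor in sorted_ancestors:
    sum_len, partition = heapq.heappop(heap)
    partition.append(ancestor)
    sum_len += ancestor_lengths[ancestor]
    heapq.heappush(heap, (sum_len, partition))
# Both partitions end up with a total length of 16.0 in this example.
print(sorted((total, part) for total, part in heap))

Each ancestor still lands in the currently lightest partition, but the lookup costs O(log partition_count) per ancestor instead of a linear scan over all partitions.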
Very nice, I should have thought of this!
Fixed up in a063456
Fixes #972