
[core] Support upper bound in dynamic bucket mode #4974

Open

liyubin117 wants to merge 4 commits into master from max_dynamic_bucket

Conversation

@liyubin117 (Contributor) commented Jan 21, 2025

Purpose

In dynamic bucket mode, an unbounded number of buckets can produce an unpredictable number of small files, which causes stability problems. So we should support an upper bound on the number of buckets in dynamic bucket mode.

Linked issue: close #4942
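
To make the purpose concrete, the sketch below shows how such an upper bound could be configured on a dynamic-bucket table. The option key dynamic-bucket.max-buckets and its default are assumptions based on this PR's description; see docs/layouts/shortcodes/generated/core_configuration.html in the diff for the actual name and default.

import java.util.HashMap;
import java.util.Map;

public class DynamicBucketOptionsExample {
    public static void main(String[] args) {
        // Hypothetical table options: 'bucket' = '-1' enables dynamic bucket mode,
        // and the (assumed) 'dynamic-bucket.max-buckets' key caps how many buckets may be created.
        Map<String, String> tableOptions = new HashMap<>();
        tableOptions.put("bucket", "-1");
        tableOptions.put("dynamic-bucket.target-row-num", "2000000");
        tableOptions.put("dynamic-bucket.max-buckets", "8");
        tableOptions.forEach((k, v) -> System.out.println(k + " = " + v));
    }
}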

Tests

HashBucketAssignerTest
SimpleHashBucketAssignerTest

  • With the new feature enabled, 3 buckets are produced after inserting 12 rows (screenshots omitted); see the simulation sketch below.
  • Without the new feature, 6 buckets are produced after inserting 12 rows (screenshots omitted).
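
For illustration only, the following self-contained simulation (not the actual Paimon test code; all names here are hypothetical) mirrors the behavior shown in the screenshots: with a target of 2 rows per bucket, 12 rows fill 6 buckets when unbounded, but only 3 buckets when the cap is set to 3.

import java.util.HashMap;
import java.util.Map;

public class MaxBucketsSimulation {
    public static void main(String[] args) {
        int targetRowsPerBucket = 2;
        int maxBuckets = 3; // set to -1 to simulate the old, unbounded behavior
        Map<Integer, Integer> rowsPerBucket = new HashMap<>();
        int nextBucket = 0;
        for (int row = 0; row < 12; row++) {
            // Reuse the newest bucket until it reaches the target row count,
            // then open a new bucket unless the upper bound has been reached.
            int bucket = nextBucket;
            int rows = rowsPerBucket.getOrDefault(bucket, 0);
            if (rows >= targetRowsPerBucket && (maxBuckets == -1 || nextBucket + 1 < maxBuckets)) {
                bucket = ++nextBucket;
                rows = rowsPerBucket.getOrDefault(bucket, 0);
            }
            rowsPerBucket.put(bucket, rows + 1);
        }
        System.out.println("buckets created: " + rowsPerBucket.size()); // 3 with the cap, 6 without
    }
}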

API and Format

Documentation

docs/layouts/shortcodes/generated/core_configuration.html

@liyubin117 (Contributor, Author) commented:

@JingsongLi CI passed, PTAL, thanks!

@liyubin117 liyubin117 force-pushed the max_dynamic_bucket branch 3 times, most recently from c7b7755 to df576e2 Compare January 22, 2025 10:42
@liyubin117 liyubin117 requested a review from JingsongLi January 23, 2025 05:54
@JingsongLi (Contributor) left a comment:

I'm not sure these modifications are effective, so let me give you my suggestion:
We only need to modify the inside of PartitionIndex.assign:

Current code:

// 3. create a new bucket
for (int i = 0; i < Short.MAX_VALUE; i++) {
    if (bucketFilter.test(i) && !totalBucket.contains(i)) {
        hash2Bucket.put(hash, (short) i);
        nonFullBucketInformation.put(i, 1L);
        totalBucket.add(i);
        return i;
    }
}

// 4. too many buckets, throw exception
@SuppressWarnings("OptionalGetWithoutIsPresent")
int maxBucket = totalBucket.stream().mapToInt(Integer::intValue).max().getAsInt();
throw new RuntimeException(
        String.format(
                "Too more bucket %s, you should increase target bucket row number %s.",
                maxBucket, targetBucketRowNumber));

New code:

// 3. create a new bucket
for (int i = 0; i < max_buckets; i++) {
    if (bucketFilter.test(i) && !totalBucket.contains(i)) {
        hash2Bucket.put(hash, (short) i);
        nonFullBucketInformation.put(i, 1L);
        totalBucket.add(i);
        return i;
    }
}

// 4. exceed max_buckets, just pick a bucket for the record:
//    pick the minimum bucket (belonging to this task) for the record.
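
As a rough sketch of step 4 above (continuing the reviewer's pseudocode; assumed types, not the final implementation), falling back to the smallest bucket already owned by this task could look like:

// 4. exceed max_buckets: reuse an existing bucket instead of creating a new one.
// Assumes totalBucket is a Set<Integer> of buckets this task has already created
// and bucketFilter tests whether a bucket id belongs to this task.
int fallback =
        totalBucket.stream()
                .filter(bucketFilter::test)
                .mapToInt(Integer::intValue)
                .min()
                .orElseThrow(() -> new IllegalStateException("No bucket available for this task"));
hash2Bucket.put(hash, (short) fallback);
return fallback;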

@liyubin117 (Contributor, Author) commented Jan 23, 2025:

After an offline discussion with @JingsongLi, we have reached a consensus: we cannot just update the PartitionIndex logic, because it does not cover SimpleHashBucketAssigner. When the buckets are full, a random bucket is selected for writing.
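
For illustration, one possible shape of that fallback (names hypothetical; the PR's actual selection may be random, as described above) is to spread overflow records across the buckets that already exist:

import java.util.List;

final class FullBucketFallback {

    private FullBucketFallback() {}

    // Minimal sketch: once every allowed bucket is full, pick one of the existing
    // buckets for the record, here deterministically from the record's hash.
    static int pickExistingBucket(List<Integer> existingBuckets, int recordHash) {
        int index = Math.floorMod(recordHash, existingBuckets.size());
        return existingBuckets.get(index);
    }
}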

@liyubin117 (Contributor, Author) commented:

@JingsongLi The feature has been completed as discussed. Looking forward to your review!

Reviewer (Contributor) commented on this diff excerpt:

nonFullBucketInformation.put(i, 1L);
totalBucket.add(i);
return i;
if (-1 == maxBucketsNum || totalBucket.size() < maxBucketsNum) {

I think the check here should be that the max bucket id is smaller than maxBucketsNum.

If the job is restarted from scratch, each task keeps increasing its own buckets, so the previous judgment may be problematic.

@liyubin117 (Contributor, Author) replied:

Wise consideration! done :)
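
To make the concern concrete, here is a small self-contained illustration (hypothetical numbers, not PR code) of how a size-based check and a max-bucket-id check diverge once a task's bucket set has gaps, e.g. after a restart where other bucket ids belong to other tasks:

import java.util.Arrays;
import java.util.List;

public class RestartCheckExample {
    public static void main(String[] args) {
        // Hypothetical state after a restart: this task reloaded only buckets 2 and 5.
        List<Integer> totalBucket = Arrays.asList(2, 5);
        int maxBucketsNum = 4;

        // Size-based check: 2 < 4, so this task would still create new buckets.
        boolean sizeCheckAllows = totalBucket.size() < maxBucketsNum;

        // Max-id-based check: the highest id is already 5, which exceeds the bound.
        int maxBucketId = totalBucket.stream().mapToInt(Integer::intValue).max().getAsInt();
        boolean idCheckAllows = maxBucketId < maxBucketsNum;

        System.out.println("size check allows a new bucket:   " + sizeCheckAllows); // true
        System.out.println("max-id check allows a new bucket: " + idCheckAllows);   // false
    }
}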

@@ -137,4 +153,42 @@ public static PartitionIndex loadIndex(
}
return new PartitionIndex(mapBuilder.build(), buckets, targetBucketRowNumber);
}

public static int[] getMaxBucketsPerAssigner(int maxBuckets, int assigners) {
Reviewer (Contributor) commented on the diff above:

I cannot understand getMaxBucketsPerAssigner and getSpecifiedMaxBuckets. Why not just use the same check, maxBucketId < maxBucketsNum - 1?

@liyubin117 (Contributor, Author) replied:

Good question. After an offline discussion, we have reached a consensus; done :)
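
For readers following along, this is a minimal sketch of what splitting an upper bound across assigners could look like (purely illustrative; per the discussion above, the final change may not distribute the bound this way at all):

import java.util.Arrays;

public class MaxBucketsSplitExample {

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split(10, 3))); // [4, 3, 3]
    }

    // Spread maxBuckets as evenly as possible over the assigners,
    // handing the remainder to the first few of them.
    static int[] split(int maxBuckets, int assigners) {
        int[] result = new int[assigners];
        for (int i = 0; i < assigners; i++) {
            result[i] = maxBuckets / assigners + (i < maxBuckets % assigners ? 1 : 0);
        }
        return result;
    }
}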
