
Hash aggregate finalization parallelization #4655

Merged
merged 4 commits into master from hash-aggregate-parallelization-2 on Jan 22, 2025

Conversation

benjaminwinger
Collaborator

Fixes kuzudb/internal#10 and #4547.

I switched back to having the partitioning hash table use linear probing: entries are inserted until a fixed capacity is reached, and the table is then emptied and flushed to the global partitions. The aggregation update code only works on the data within the vectors being inserted and requires that they all be available in the hash table at once, so evicting individual entries from the hash table would require some complicated changes. To compensate, I increased the default size of the hash table significantly, as performance was poor with hash tables of the default size.
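
As a minimal sketch of that fill-then-flush flow (illustrative names; a count-only aggregate over uint64_t keys stands in for the real PartitioningAggregateHashTable, and an unordered_map stands in for the linear-probing table):

```cpp
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <vector>

struct LocalAggTable {
    static constexpr size_t kCapacity = 16384;    // fixed local capacity
    static constexpr size_t kNumPartitions = 128; // global partition count

    std::unordered_map<uint64_t, uint64_t> local; // key -> count
    std::vector<std::unordered_map<uint64_t, uint64_t>> global;

    LocalAggTable() : global(kNumPartitions) {}

    void append(const std::vector<uint64_t>& keyBatch) {
        // Flush *before* inserting: the aggregate update code needs the whole
        // batch resident at once, so individual entries are never evicted.
        if (local.size() + keyBatch.size() > kCapacity) {
            flush();
        }
        for (auto k : keyBatch) {
            ++local[k]; // insert or update the aggregate state in place
        }
    }

    void flush() {
        for (auto& [k, cnt] : local) {
            auto p = std::hash<uint64_t>{}(k) % kNumPartitions;
            global[p][k] += cnt; // merge into the per-partition global table
        }
        local.clear(); // empty the local table; its capacity stays fixed
    }
};
```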

It took a while to hunt down the bug in BaseAggregateScan::writeAggregateResultToVector. I've added an extra test to get slightly better coverage, but I think we really need some larger tests: the issue was caused by Vector re-use, which we previously would only have encountered with results containing more than DEFAULT_VECTOR_CAPACITY values, some of which must be null.
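
To illustrate the class of bug, not the exact code (a self-contained miniature with illustrative names, not kuzu's ValueVector API): when an output vector is reused across batches, the null state has to be written for every position, otherwise null bits from the previous batch leak into the next one.

```cpp
#include <cstdint>
#include <vector>

// Toy output vector: values plus a null mask, reused across batches.
struct MiniVector {
    std::vector<int64_t> values;
    std::vector<bool> nulls;
};

// Writing a batch of possibly-null results. The unconditional assignment
// to nulls[i] is the point: setting it only for null results would leave
// stale null bits behind from the previous batch on reuse.
void writeResults(MiniVector& out, const std::vector<const int64_t*>& results) {
    out.values.resize(results.size());
    out.nulls.resize(results.size());
    for (size_t i = 0; i < results.size(); i++) {
        out.nulls[i] = (results[i] == nullptr);
        if (results[i] != nullptr) {
            out.values[i] = *results[i];
        }
    }
}
```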

The performance isn't scaling well with the number of partitions at the moment. On workloads that don't need very many threads, it's much faster when built with fewer partitions: e.g. on the query in #4547, the total runtime was about 2x faster (3x for just the HashAggregate code) with 16 partitions on a machine with 12 threads than with 256 partitions (that query/dataset also seems to max out at ~14 worker threads, presumably due to the way the input is divided up). I've reduced the number of partitions to 128 as a compromise, given that's the number of threads on our largest testing machine, but will work on improving this next: it should be possible to do the partitioning logically, without physically partitioning the data, to improve cache locality.
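
For reference, the partition choice is just a function of a few hash bits, which is what makes a logical-only partitioning plausible (a sketch with illustrative names and bit choices, not the exact scheme in the code):

```cpp
#include <cstdint>

// With 128 = 2^7 partitions, the top 7 bits of the hash pick the partition.
// A coarser partitioning is just a shorter prefix of the same bits, so the
// effective partition count can change without rehashing or moving data.
constexpr int kPartitionBits = 7; // log2(128)

inline uint64_t partitionOf(uint64_t hash) {
    // Use the top bits so they stay independent of the low bits that the
    // hash table uses for slot probing.
    return hash >> (64 - kPartitionBits);
}
```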

I've increased the default buffer pool size, since anything involving aggregation was requiring a bunch of extra memory for constructing the PartitioningAggregateHashTables. That should be revertible with the optimization mentioned above, as it would also reduce the minimum memory requirements.

Performance

On the query from #4547 (msmarco v2.1, 1st segment): `MATCH (b:doc) WITH tokenize(b.segment) AS tk, OFFSET(ID(b)) AS id UNWIND tk AS t RETURN STEM(t, 'porter'), id, count(*);`
Run on a 128-thread machine (but see the earlier note about not scaling past 14 threads).

  • Before: 36GB peak memory usage; runtime 83.5s
  • After: 21GB peak memory usage; runtime 63.5s
  • Limited to 14 partitions: 24GB peak memory usage; runtime 47.8s
    (as an example of the performance improvements we might expect to achieve with better cache locality)

Note that memory usage has improved significantly: before, the per-thread AggregateHashTables would be filled completely and exist in memory simultaneously with the final merged global AggregateHashTable; now the per-thread tables are merged into the global ones in small chunks.
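
In sketch form (illustrative names and a count-only aggregate; the `tryMergeQueue` that appears later in the diff is the analogue of `tryMergeOne` here): finished local chunks go into a shared queue, and workers drain it while waiting, so each merged chunk is freed immediately instead of every per-thread table staying live until the end.

```cpp
#include <cstdint>
#include <mutex>
#include <queue>
#include <unordered_map>
#include <vector>

// A flushed chunk of one local table, already assigned a partition index.
struct Chunk {
    size_t partition;
    std::unordered_map<uint64_t, uint64_t> data; // key -> count
};

struct SharedMergeState {
    std::mutex mtx;
    std::queue<Chunk> pending;
    std::vector<std::unordered_map<uint64_t, uint64_t>> global;

    explicit SharedMergeState(size_t numPartitions) : global(numPartitions) {}

    void enqueue(Chunk c) {
        std::lock_guard<std::mutex> lock(mtx);
        pending.push(std::move(c));
    }

    // Returns false when there was nothing to merge. A single lock keeps the
    // sketch simple; real code can lock per partition so merges of different
    // partitions proceed in parallel.
    bool tryMergeOne() {
        std::lock_guard<std::mutex> lock(mtx);
        if (pending.empty()) {
            return false;
        }
        Chunk c = std::move(pending.front());
        pending.pop();
        for (auto& [k, cnt] : c.data) {
            global[c.partition][k] += cnt;
        }
        return true; // chunk destroyed here, its memory released
    }
};
```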

I ran some benchmarks using queries from ClickBench and saw an improvement of about 2-5x (I'll update with more details later).

@benjaminwinger benjaminwinger changed the title Hash aggregate parallelization 2 Hash aggregate parallelization Dec 18, 2024
@benjaminwinger benjaminwinger changed the title Hash aggregate parallelization Hash aggregate finalization parallelization Dec 18, 2024
@benjaminwinger benjaminwinger force-pushed the hash-aggregate-parallelization-2 branch from bc28abe to 141a9b1 on December 18, 2024 23:33


codecov bot commented Dec 19, 2024

Codecov Report

Attention: Patch coverage is 94.02985% with 20 lines in your changes missing coverage. Please review.

Project coverage is 86.35%. Comparing base (b70eabf) to head (31d9509).
Report is 13 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...cessor/operator/aggregate/aggregate_hash_table.cpp | 94.00% | 9 Missing ⚠️ |
| ...rc/processor/operator/aggregate/hash_aggregate.cpp | 92.77% | 6 Missing ⚠️ |
| src/include/processor/result/factorized_table.h | 66.66% | 2 Missing ⚠️ |
| src/processor/result/base_hash_table.cpp | 86.66% | 2 Missing ⚠️ |
| ...rocessor/operator/aggregate/aggregate_hash_table.h | 94.44% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4655      +/-   ##
==========================================
+ Coverage   86.29%   86.35%   +0.05%     
==========================================
  Files        1396     1397       +1     
  Lines       59848    60082     +234     
  Branches     7372     7387      +15     
==========================================
+ Hits        51645    51881     +236     
+ Misses       8036     8035       -1     
+ Partials      167      166       -1     

☔ View full report in Codecov by Sentry.

@benjaminwinger benjaminwinger force-pushed the hash-aggregate-parallelization-2 branch from 141a9b1 to 4e52c10 on December 19, 2024 14:37

@benjaminwinger benjaminwinger force-pushed the hash-aggregate-parallelization-2 branch from 4e52c10 to 0312ecf on December 19, 2024 19:16

@benjaminwinger
Collaborator Author

Benchmarks adapted from https://github.com/ClickHouse/ClickBench/, run on a 128 thread runner (2x AMD EPYC 7551)

| Query | Baseline | With Changes | DuckDB (third-party reference; equivalent SQL query) |
| --- | --- | --- | --- |
| `MATCH (h:hits) WHERE h.SearchPhrase <> '' RETURN h.SearchPhrase, COUNT(*) AS c ORDER BY c DESC LIMIT 10;` | 5.1s/4.7s | 2.3s/0.64s | 0.91s/0.30s |
| `MATCH (h:hits) WHERE h.SearchPhrase <> '' RETURN h.SearchEngineID, h.SearchPhrase, COUNT(*) AS c ORDER BY c DESC LIMIT 10;` | 6.8s/4.4s | 2.7s/0.65s | 0.90s/0.30s |
| `MATCH (h:hits) RETURN h.UserID, COUNT(*) ORDER BY COUNT(*) DESC LIMIT 10;` | 5.7s/5.2s | 2.0s/1.0s | 0.63s/0.25s |
| `MATCH (h:hits) RETURN h.UserID, h.SearchPhrase, COUNT(*) ORDER BY COUNT(*) DESC LIMIT 10;` | 14.4s/9.5s | 3.8s/1.8s | 1.5s/0.5s |

Results are Cold/Hot, where hot is on subsequent queries in the same process. OS VM caches are dropped after each set of queries.

@benjaminwinger benjaminwinger force-pushed the hash-aggregate-parallelization-2 branch from 0312ecf to 1590c35 on December 19, 2024 21:15

@benjaminwinger benjaminwinger force-pushed the hash-aggregate-parallelization-2 branch from 1590c35 to 85edfff on December 20, 2024 16:53

Contributor

@ray6080 ray6080 left a comment

Thanks Ben! I have some comments which we've discussed, and you can collapse any of them that you're already working on locally.

src/processor/operator/aggregate/hash_aggregate.cpp (outdated; resolved)
computeVectorHashes(flatKeyVectors, unFlatKeyVectors);

auto startingNumTuples = getNumEntries();
if (startingNumTuples + numFlatTuples > maxNumHashSlots ||
Contributor

I would extract the if check into a separate function, e.g. requireResize(), and refactor it together with resizeHashTableIfNecessary.

Actually, I have a question on why we check numTuples here. Ideally we should only check the number of distinct groups, which should happen inside findHashSlots.
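
Something like this is what I have in mind (hypothetical signature; the elided half of the condition above would move in here too):

```cpp
#include <cstdint>

// Pulls the resize condition out of the insert path so it reads as intent
// and can be shared with resizeHashTableIfNecessary. Only the visible part
// of the original condition is reproduced.
inline bool requireResize(uint64_t startingNumTuples, uint64_t numFlatTuples,
    uint64_t maxNumHashSlots) {
    return startingNumTuples + numFlatTuples > maxNumHashSlots /* || ... */;
}
```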

Collaborator Author

findHashSlots currently also inserts into the table, which makes things a little tricky, but I think that could be changed.

But the bigger issue is that even if we did the check after figuring out the number of distinct groups, we'd have to re-run findHashSlots whenever we empty the hash table: even ignoring the handling of duplicates, with linear probing the position an entry is inserted into may not be the position it initially hashed to, so if the original position is cleared, future lookups of that key would break.

On the other hand, with the current fixed capacity of 16384 entries (256KB at 16B per entry) and at most 2048 tuples inserted at a time, it should always be the load-factor check that causes the table to be emptied. So we could instead assert that there is enough space to hold all of them, and resize afterwards if the insertions pushed us past the load factor. But all that really does is resize once we've exceeded the load factor, instead of never exceeding it, and I don't really know which would be preferable.
Given we insert up to 2048 at a time into a table holding up to 16384 entries, I think that means a load factor of 0.66-0.79 instead of 0.54-0.66 (noting that the code defines the load factor as 1.5, which I'm fairly sure is inverted and should really be 0.66).
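
For concreteness, those ranges come from the following arithmetic (taking the 1.5 factor as capacity divided by threshold):

```text
capacity = 16384, batch <= 2048, empty/resize threshold = 16384 / 1.5 ≈ 10922

flush before inserting (current):
  occupancy just before emptying ∈ [10922 - 2048, 10922] = [8874, 10922]
  → load factor 8874/16384 ≈ 0.54 up to 10922/16384 ≈ 0.66

resize after inserting (alternative):
  occupancy can reach 10922 + 2048 = 12970
  → load factor up to 12970/16384 ≈ 0.79
```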

src/processor/operator/aggregate/aggregate_hash_table.cpp (outdated; resolved)
src/processor/operator/aggregate/hash_aggregate.cpp (outdated; resolved)
const std::vector<common::LogicalType>& distinctAggKeyTypes,
FactorizedTableSchema tableSchema)
: AggregateHashTable(memoryManager, std::move(keyTypes), std::move(payloadTypes),
aggregateFunctions, distinctAggKeyTypes, NUM_PARTITIONS * 1024, tableSchema.copy()),
Contributor

I think we should follow a heuristic rule: keep the thread-local HT small enough to fit in cache. So there shouldn't be a fixed HT capacity; instead we should calculate it dynamically from the aggregation keys and payloads and the machine's cache size. (For the cache size, one option is to query it through a library such as libcpuid; the other is to use a conservative constant, which is probably fine in many cases.)
For aggregations with lots of keys and payloads, where the row width is large, the calculation may yield a very small capacity. To avoid this, we should have a lower bound, such as 2048. (See the sketch after the reply below.)

Collaborator Author

I should probably try to find a good benchmark for comparing performance across different row widths, but for now I've set a minimum capacity of 2048 (or whatever fits in one 256KB block, if that's larger).
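
Roughly this rule, as a sketch (the constants and helper names are illustrative, not the actual implementation):

```cpp
#include <algorithm>
#include <cstdint>

constexpr uint64_t kCacheBudgetBytes = 256 * 1024; // conservative constant
constexpr uint64_t kMinCapacity = 2048;            // floor for wide rows

inline uint64_t roundDownToPowerOfTwo(uint64_t v) {
    uint64_t p = 1;
    while (p * 2 <= v) {
        p *= 2;
    }
    return p;
}

inline uint64_t localTableCapacity(uint64_t rowWidthBytes) {
    uint64_t fitting = kCacheBudgetBytes / std::max<uint64_t>(rowWidthBytes, 1);
    // Power of two keeps the slot index a simple bit-mask of the hash.
    return std::max(kMinCapacity,
        roundDownToPowerOfTwo(std::max<uint64_t>(fitting, 1)));
}
```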

std::this_thread::sleep_for(std::chrono::microseconds(500));
sharedState->tryMergeQueue();
}
sharedState->finalizeAggregateHashTable();
Contributor

Can you also think a bit about merging the aggregate scan into the aggregate operator? Ideally, we shouldn't need to wait until the next pipeline to start scanning out of the aggregate result.

Collaborator Author

Working on it, though I wonder if this would be better as a separate PR so that the rest of this can be merged first.

@benjaminwinger
Collaborator Author

Updated performance

For the query on the msmarco dataset in the PR description the runtime is now 44 seconds with a peak memory usage of 21GB.

ClickBench benchmarks show more modest improvements (compare with #4655 (comment)), and I encountered a segfault that I'm going to look into.

Benchmarks adapted from https://github.com/ClickHouse/ClickBench/, run on a 128 thread runner (2x AMD EPYC 7551)

| Query | Baseline | With Changes | DuckDB (third-party reference; equivalent SQL query) |
| --- | --- | --- | --- |
| `MATCH (h:hits) WHERE h.SearchPhrase <> '' RETURN h.SearchPhrase, COUNT(*) AS c ORDER BY c DESC LIMIT 10;` | 5.1s/4.7s | segfault | 0.91s/0.30s |
| `MATCH (h:hits) WHERE h.SearchPhrase <> '' RETURN h.SearchEngineID, h.SearchPhrase, COUNT(*) AS c ORDER BY c DESC LIMIT 10;` | 6.8s/4.4s | 2.5s/0.56s | 0.90s/0.30s |
| `MATCH (h:hits) RETURN h.UserID, COUNT(*) ORDER BY COUNT(*) DESC LIMIT 10;` | 5.7s/5.2s | 1.7s/1.0s | 0.63s/0.25s |
| `MATCH (h:hits) RETURN h.UserID, h.SearchPhrase, COUNT(*) ORDER BY COUNT(*) DESC LIMIT 10;` | 14.4s/9.5s | 3.6s/1.5s | 1.5s/0.5s |

Results are Cold/Hot, where hot is on subsequent queries in the same process. OS VM caches are dropped after each set of queries.

@benjaminwinger benjaminwinger force-pushed the hash-aggregate-parallelization-2 branch from ba7e260 to 6814b68 on January 6, 2025 15:24
@ray6080
Contributor

ray6080 commented Jan 6, 2025


I wonder if we should also benchmark a query that has a wide row in the aggregation table (more group keys and payloads)?

@ray6080 ray6080 removed the request for review from acquamarin January 6, 2025 16:26

@benjaminwinger benjaminwinger force-pushed the hash-aggregate-parallelization-2 branch 2 times, most recently from 3e5305b to eeec63a on January 8, 2025 21:27

@benjaminwinger benjaminwinger force-pushed the hash-aggregate-parallelization-2 branch from eeec63a to 32b124c on January 9, 2025 15:18

@benjaminwinger benjaminwinger force-pushed the hash-aggregate-parallelization-2 branch from 32b124c to d3355bb on January 9, 2025 22:14

github-actions bot commented Jan 9, 2025

Benchmark Result

Master commit hash: d5acb796e384d3de68e7be15f6ba5c0ef6bcdc64
Branch commit hash: fc38b22826c4ceacea0156fb71bbf98cdd752f1e

Query Group Query Name Mean Time - Commit (ms) Mean Time - Master (ms) Diff
aggregation q24 704.39 655.32 49.07 (7.49%)
aggregation q28 6164.55 11494.18 -5329.63 (-46.37%)
copy node-Comment 71224.94 N/A N/A
copy node-Forum 5549.13 N/A N/A
copy node-Organisation 1241.12 N/A N/A
copy node-Person 2261.37 N/A N/A
copy node-Place 1183.83 N/A N/A
copy node-Post 29960.62 N/A N/A
copy node-Tag 1225.78 N/A N/A
copy node-Tagclass 1163.81 N/A N/A
copy rel-comment-hasCreator 56146.07 N/A N/A
copy rel-comment-hasTag 88514.81 N/A N/A
copy rel-comment-isLocatedIn 70381.95 N/A N/A
copy rel-containerOf 15167.14 N/A N/A
copy rel-forum-hasTag 4046.54 N/A N/A
copy rel-hasInterest 3146.86 N/A N/A
copy rel-hasMember 123782.98 N/A N/A
copy rel-hasModerator 1267.49 N/A N/A
copy rel-hasType 249.50 N/A N/A
copy rel-isPartOf 226.88 N/A N/A
copy rel-isSubclassOf 255.10 N/A N/A
copy rel-knows 13512.51 N/A N/A
copy rel-likes-comment 177292.34 N/A N/A
copy rel-likes-post 68811.59 N/A N/A
copy rel-organisation-isLocatedIn 272.53 N/A N/A
copy rel-person-isLocatedIn 461.85 N/A N/A
copy rel-post-hasCreator 14756.73 N/A N/A
copy rel-post-hasTag 23521.93 N/A N/A
copy rel-post-isLocatedIn 18610.71 N/A N/A
copy rel-replyOf-comment 48402.45 N/A N/A
copy rel-replyOf-post 37341.62 N/A N/A
copy rel-studyAt 847.19 N/A N/A
copy rel-workAt 1610.83 N/A N/A
filter q14 134.58 138.10 -3.52 (-2.55%)
filter q15 132.67 139.28 -6.61 (-4.74%)
filter q16 308.54 317.88 -9.34 (-2.94%)
filter q17 455.85 454.64 1.20 (0.26%)
filter q18 1988.97 1954.54 34.43 (1.76%)
filter zonemap-node 97.37 99.50 -2.13 (-2.14%)
filter zonemap-node-lhs-cast 97.81 97.78 0.03 (0.03%)
filter zonemap-node-null 93.73 94.18 -0.45 (-0.48%)
filter zonemap-rel 5733.97 5740.84 -6.87 (-0.12%)
fixed_size_expr_evaluator q07 581.96 564.24 17.71 (3.14%)
fixed_size_expr_evaluator q08 812.93 810.79 2.14 (0.26%)
fixed_size_expr_evaluator q09 810.38 809.95 0.44 (0.05%)
fixed_size_expr_evaluator q10 246.47 244.92 1.55 (0.63%)
fixed_size_expr_evaluator q11 240.40 239.03 1.37 (0.57%)
fixed_size_expr_evaluator q12 236.51 234.93 1.57 (0.67%)
fixed_size_expr_evaluator q13 1475.30 1469.46 5.85 (0.40%)
fixed_size_seq_scan q23 117.59 116.99 0.60 (0.51%)
join q29 606.01 599.57 6.44 (1.07%)
join q30 10007.99 11008.46 -1000.47 (-9.09%)
join q31 8.07 5.94 2.13 (35.81%)
join SelectiveTwoHopJoin 52.18 55.58 -3.40 (-6.12%)
ldbc_snb_ic q35 2542.93 2676.02 -133.08 (-4.97%)
ldbc_snb_ic q36 482.29 473.46 8.84 (1.87%)
ldbc_snb_is q32 4.10 5.63 -1.53 (-27.21%)
ldbc_snb_is q33 12.32 13.35 -1.03 (-7.72%)
ldbc_snb_is q34 1.24 1.20 0.04 (3.61%)
multi-rel multi-rel-large-scan 1408.77 1300.67 108.10 (8.31%)
multi-rel multi-rel-lookup 15.62 34.51 -18.89 (-54.73%)
multi-rel multi-rel-small-scan 102.34 83.97 18.37 (21.88%)
order_by q25 140.14 144.88 -4.73 (-3.27%)
order_by q26 457.98 482.19 -24.21 (-5.02%)
order_by q27 1470.85 1492.70 -21.85 (-1.46%)
recursive_join recursive-join-bidirection 306.76 283.60 23.16 (8.17%)
recursive_join recursive-join-dense 7313.79 5441.75 1872.05 (34.40%)
recursive_join recursive-join-path 24405.65 23804.62 601.03 (2.52%)
recursive_join recursive-join-sparse 1060.08 1064.18 -4.11 (-0.39%)
recursive_join recursive-join-trail 7283.97 6066.14 1217.83 (20.08%)
scan_after_filter q01 181.99 178.64 3.36 (1.88%)
scan_after_filter q02 166.21 164.87 1.35 (0.82%)
shortest_path_ldbc100 q37 90.52 101.96 -11.44 (-11.22%)
shortest_path_ldbc100 q38 348.49 339.09 9.40 (2.77%)
shortest_path_ldbc100 q39 64.42 66.28 -1.86 (-2.80%)
shortest_path_ldbc100 q40 380.64 355.23 25.41 (7.15%)
var_size_expr_evaluator q03 2061.24 2056.20 5.04 (0.25%)
var_size_expr_evaluator q04 2265.40 2237.42 27.98 (1.25%)
var_size_expr_evaluator q05 2575.03 2626.86 -51.83 (-1.97%)
var_size_expr_evaluator q06 1327.12 1354.91 -27.80 (-2.05%)
var_size_seq_scan q19 1456.20 1468.18 -11.99 (-0.82%)
var_size_seq_scan q20 2655.80 2604.30 51.50 (1.98%)
var_size_seq_scan q21 2278.85 2309.84 -30.99 (-1.34%)
var_size_seq_scan q22 129.34 132.66 -3.32 (-2.50%)

@benjaminwinger benjaminwinger force-pushed the hash-aggregate-parallelization-2 branch 4 times, most recently from 1e0e0b2 to ada17f0 on January 17, 2025 18:48

Benchmark Result

Master commit hash: bb6d3ec271ffb8e3ca8f7c59f3a873eae2385d77
Branch commit hash: fc3699b405047dd418940c111b24779f20e1c51b

Query Group Query Name Mean Time - Commit (ms) Mean Time - Master (ms) Diff
aggregation q24 691.24 648.55 42.68 (6.58%)
aggregation q28 6104.79 12080.39 -5975.61 (-49.47%)
filter q14 127.16 137.75 -10.58 (-7.68%)
filter q15 124.83 135.63 -10.79 (-7.96%)
filter q16 302.32 317.25 -14.93 (-4.71%)
filter q17 445.46 461.54 -16.08 (-3.48%)
filter q18 1884.62 1976.29 -91.67 (-4.64%)
filter zonemap-node 89.05 99.26 -10.21 (-10.29%)
filter zonemap-node-lhs-cast 89.30 100.23 -10.93 (-10.90%)
filter zonemap-node-null 85.14 96.17 -11.03 (-11.47%)
filter zonemap-rel 5863.11 5826.81 36.29 (0.62%)
fixed_size_expr_evaluator q07 590.02 580.11 9.91 (1.71%)
fixed_size_expr_evaluator q08 807.76 809.10 -1.34 (-0.17%)
fixed_size_expr_evaluator q09 828.11 809.33 18.78 (2.32%)
fixed_size_expr_evaluator q10 248.32 245.14 3.18 (1.30%)
fixed_size_expr_evaluator q11 243.15 237.50 5.65 (2.38%)
fixed_size_expr_evaluator q12 238.92 234.88 4.04 (1.72%)
fixed_size_expr_evaluator q13 1477.59 1465.19 12.40 (0.85%)
fixed_size_seq_scan q23 118.74 117.53 1.21 (1.03%)
join q29 611.38 591.63 19.75 (3.34%)
join q30 10103.91 9880.31 223.60 (2.26%)
join q31 7.11 6.95 0.16 (2.31%)
join SelectiveTwoHopJoin 53.23 58.37 -5.14 (-8.80%)
ldbc_snb_ic q35 2635.91 2606.92 28.99 (1.11%)
ldbc_snb_ic q36 478.95 463.38 15.57 (3.36%)
ldbc_snb_is q32 7.07 6.48 0.59 (9.10%)
ldbc_snb_is q33 16.15 15.61 0.54 (3.46%)
ldbc_snb_is q34 1.38 1.33 0.05 (3.73%)
multi-rel multi-rel-large-scan 1591.61 1604.33 -12.72 (-0.79%)
multi-rel multi-rel-lookup 59.91 43.77 16.14 (36.86%)
multi-rel multi-rel-small-scan 1460.83 1449.06 11.77 (0.81%)
order_by q25 132.99 136.69 -3.70 (-2.71%)
order_by q26 449.16 465.68 -16.52 (-3.55%)
order_by q27 1445.21 1479.21 -34.00 (-2.30%)
recursive_join recursive-join-bidirection 293.78 312.94 -19.16 (-6.12%)
recursive_join recursive-join-dense 7456.15 7395.65 60.50 (0.82%)
recursive_join recursive-join-path 23746.27 23451.36 294.90 (1.26%)
recursive_join recursive-join-sparse 1066.02 1066.37 -0.35 (-0.03%)
recursive_join recursive-join-trail 7399.47 7348.22 51.25 (0.70%)
scan_after_filter q01 174.20 178.91 -4.71 (-2.63%)
scan_after_filter q02 159.95 167.34 -7.39 (-4.42%)
shortest_path_ldbc100 q37 89.33 89.08 0.24 (0.27%)
shortest_path_ldbc100 q38 377.29 355.57 21.72 (6.11%)
shortest_path_ldbc100 q39 63.36 63.84 -0.48 (-0.75%)
shortest_path_ldbc100 q40 478.18 371.90 106.28 (28.58%)
var_size_expr_evaluator q03 2076.22 2087.31 -11.09 (-0.53%)
var_size_expr_evaluator q04 2189.65 2269.74 -80.09 (-3.53%)
var_size_expr_evaluator q05 2590.44 5299.71 -2709.26 (-51.12%)
var_size_expr_evaluator q06 1316.76 1332.67 -15.92 (-1.19%)
var_size_seq_scan q19 1449.58 1455.58 -6.01 (-0.41%)
var_size_seq_scan q20 2778.13 2669.81 108.32 (4.06%)
var_size_seq_scan q21 2342.94 2310.84 32.10 (1.39%)
var_size_seq_scan q22 127.73 128.30 -0.57 (-0.44%)

@benjaminwinger benjaminwinger force-pushed the hash-aggregate-parallelization-2 branch 2 times, most recently from 180f882 to b6dd897 on January 20, 2025 15:09
@benjaminwinger benjaminwinger force-pushed the hash-aggregate-parallelization-2 branch from b6dd897 to 02def71 on January 20, 2025 15:55

Benchmark Result

Master commit hash: b70eabf68b13f248c788cde379177601b012bb0d
Branch commit hash: 80c3de18d829bde573678142711164f215d1f1b7

Query Group Query Name Mean Time - Commit (ms) Mean Time - Master (ms) Diff
aggregation q24 682.94 621.57 61.37 (9.87%)
aggregation q28 6122.99 12092.78 -5969.79 (-49.37%)
filter q14 119.61 117.16 2.46 (2.10%)
filter q15 120.81 124.18 -3.37 (-2.72%)
filter q16 298.56 296.40 2.16 (0.73%)
filter q17 443.79 443.24 0.56 (0.13%)
filter q18 1925.30 1917.28 8.02 (0.42%)
filter zonemap-node 82.28 80.66 1.62 (2.00%)
filter zonemap-node-lhs-cast 82.96 83.68 -0.72 (-0.86%)
filter zonemap-node-null 78.81 80.09 -1.28 (-1.59%)
filter zonemap-rel 5854.58 5729.78 124.79 (2.18%)
fixed_size_expr_evaluator q07 564.72 574.77 -10.05 (-1.75%)
fixed_size_expr_evaluator q08 797.52 822.01 -24.49 (-2.98%)
fixed_size_expr_evaluator q09 798.31 801.70 -3.40 (-0.42%)
fixed_size_expr_evaluator q10 231.22 236.43 -5.21 (-2.20%)
fixed_size_expr_evaluator q11 224.02 229.36 -5.35 (-2.33%)
fixed_size_expr_evaluator q12 220.24 226.17 -5.93 (-2.62%)
fixed_size_expr_evaluator q13 1462.18 1467.46 -5.29 (-0.36%)
fixed_size_seq_scan q23 102.53 111.19 -8.66 (-7.79%)
join q29 652.34 615.74 36.60 (5.94%)
join q30 11851.00 10053.85 1797.15 (17.88%)
join q31 4.47 5.57 -1.10 (-19.78%)
join SelectiveTwoHopJoin 51.88 53.70 -1.82 (-3.39%)
ldbc_snb_ic q35 2650.33 2592.62 57.71 (2.23%)
ldbc_snb_ic q36 462.10 477.79 -15.69 (-3.28%)
ldbc_snb_is q32 3.92 5.47 -1.56 (-28.44%)
ldbc_snb_is q33 14.97 10.81 4.15 (38.43%)
ldbc_snb_is q34 1.41 1.37 0.05 (3.34%)
multi-rel multi-rel-large-scan 1383.76 1408.12 -24.35 (-1.73%)
multi-rel multi-rel-lookup 10.43 44.04 -33.61 (-76.32%)
multi-rel multi-rel-small-scan 94.22 70.10 24.12 (34.40%)
order_by q25 132.69 127.80 4.89 (3.83%)
order_by q26 459.48 451.16 8.32 (1.84%)
order_by q27 1468.18 1486.94 -18.77 (-1.26%)
recursive_join recursive-join-bidirection 304.01 268.91 35.10 (13.05%)
recursive_join recursive-join-dense 7360.88 7334.43 26.45 (0.36%)
recursive_join recursive-join-path 23363.83 23605.45 -241.62 (-1.02%)
recursive_join recursive-join-sparse 1059.43 1056.81 2.62 (0.25%)
recursive_join recursive-join-trail 7353.38 7374.37 -21.00 (-0.28%)
scan_after_filter q01 170.03 165.88 4.15 (2.50%)
scan_after_filter q02 154.31 151.58 2.73 (1.80%)
shortest_path_ldbc100 q37 90.88 92.98 -2.10 (-2.26%)
shortest_path_ldbc100 q38 365.10 344.89 20.21 (5.86%)
shortest_path_ldbc100 q39 63.59 65.57 -1.98 (-3.02%)
shortest_path_ldbc100 q40 405.74 457.16 -51.42 (-11.25%)
var_size_expr_evaluator q03 2062.82 2086.24 -23.42 (-1.12%)
var_size_expr_evaluator q04 2221.76 2226.05 -4.29 (-0.19%)
var_size_expr_evaluator q05 2624.56 2565.74 58.81 (2.29%)
var_size_expr_evaluator q06 1322.23 1322.53 -0.31 (-0.02%)
var_size_seq_scan q19 1443.47 1446.67 -3.21 (-0.22%)
var_size_seq_scan q20 2686.25 2639.82 46.44 (1.76%)
var_size_seq_scan q21 2300.29 2275.65 24.64 (1.08%)
var_size_seq_scan q22 124.44 125.56 -1.12 (-0.89%)

@benjaminwinger benjaminwinger force-pushed the hash-aggregate-parallelization-2 branch from 43570b4 to 3e6d83a on January 20, 2025 22:32
@benjaminwinger benjaminwinger force-pushed the hash-aggregate-parallelization-2 branch from 3e6d83a to d6da832 on January 21, 2025 02:11

Benchmark Result

Master commit hash: 81814766536fcd3c7c25cef59f93e07842d5924b
Branch commit hash: c781a3b3ec64c3754ad3da3352c2a9004ad21b2e

Query Group Query Name Mean Time - Commit (ms) Mean Time - Master (ms) Diff
aggregation q24 691.38 633.17 58.21 (9.19%)
aggregation q28 6099.58 11418.56 -5318.98 (-46.58%)
filter q14 118.09 117.39 0.70 (0.60%)
filter q15 118.87 123.64 -4.77 (-3.86%)
filter q16 298.73 295.51 3.22 (1.09%)
filter q17 440.44 441.08 -0.64 (-0.14%)
filter q18 1887.49 1888.28 -0.79 (-0.04%)
filter zonemap-node 80.95 80.72 0.24 (0.29%)
filter zonemap-node-lhs-cast 80.77 80.53 0.24 (0.29%)
filter zonemap-node-null 76.81 78.83 -2.02 (-2.57%)
filter zonemap-rel 5690.07 5692.84 -2.77 (-0.05%)
fixed_size_expr_evaluator q07 565.90 571.37 -5.46 (-0.96%)
fixed_size_expr_evaluator q08 797.16 806.05 -8.89 (-1.10%)
fixed_size_expr_evaluator q09 796.04 799.39 -3.35 (-0.42%)
fixed_size_expr_evaluator q10 229.46 236.17 -6.71 (-2.84%)
fixed_size_expr_evaluator q11 221.71 228.95 -7.24 (-3.16%)
fixed_size_expr_evaluator q12 222.48 226.41 -3.93 (-1.73%)
fixed_size_expr_evaluator q13 1449.45 1440.14 9.31 (0.65%)
fixed_size_seq_scan q23 105.32 114.75 -9.42 (-8.21%)
join q29 644.50 620.26 24.24 (3.91%)
join q30 10222.70 10228.93 -6.23 (-0.06%)
join q31 6.06 7.02 -0.96 (-13.62%)
join SelectiveTwoHopJoin 56.14 54.58 1.56 (2.87%)
ldbc_snb_ic q35 2525.27 2502.96 22.32 (0.89%)
ldbc_snb_ic q36 484.89 496.80 -11.91 (-2.40%)
ldbc_snb_is q32 5.48 6.13 -0.65 (-10.63%)
ldbc_snb_is q33 13.91 17.09 -3.17 (-18.57%)
ldbc_snb_is q34 1.48 1.47 0.02 (1.18%)
multi-rel multi-rel-large-scan 1365.79 1388.85 -23.05 (-1.66%)
multi-rel multi-rel-lookup 33.41 32.57 0.84 (2.59%)
multi-rel multi-rel-small-scan 97.94 95.06 2.88 (3.03%)
order_by q25 119.52 123.51 -4.00 (-3.24%)
order_by q26 456.15 443.72 12.43 (2.80%)
order_by q27 1450.57 1460.16 -9.60 (-0.66%)
recursive_join recursive-join-bidirection 308.28 282.90 25.38 (8.97%)
recursive_join recursive-join-dense 7382.69 7375.60 7.09 (0.10%)
recursive_join recursive-join-path 23412.80 23552.68 -139.88 (-0.59%)
recursive_join recursive-join-sparse 1062.87 1059.17 3.70 (0.35%)
recursive_join recursive-join-trail 7348.56 7321.35 27.21 (0.37%)
scan_after_filter q01 163.59 163.57 0.02 (0.01%)
scan_after_filter q02 148.29 148.39 -0.10 (-0.07%)
shortest_path_ldbc100 q37 87.40 95.93 -8.52 (-8.88%)
shortest_path_ldbc100 q38 374.94 406.03 -31.09 (-7.66%)
shortest_path_ldbc100 q39 62.41 64.24 -1.83 (-2.85%)
shortest_path_ldbc100 q40 463.38 462.06 1.31 (0.28%)
var_size_expr_evaluator q03 2068.52 2062.08 6.44 (0.31%)
var_size_expr_evaluator q04 2206.72 2244.61 -37.90 (-1.69%)
var_size_expr_evaluator q05 2612.32 2528.43 83.88 (3.32%)
var_size_expr_evaluator q06 1318.65 1315.94 2.71 (0.21%)
var_size_seq_scan q19 1438.78 1457.88 -19.11 (-1.31%)
var_size_seq_scan q20 2617.67 2641.56 -23.89 (-0.90%)
var_size_seq_scan q21 2267.13 2270.09 -2.96 (-0.13%)
var_size_seq_scan q22 125.23 125.99 -0.76 (-0.61%)

Contributor

@ray6080 ray6080 left a comment

Looks great Ben. Have some minor comments.

Also, can you update the numbers here with the latest? I don't think we have the segfault now, right? #4655 (comment)

src/include/processor/operator/aggregate/hash_aggregate.h (outdated; resolved)
src/include/common/utils.h (resolved)
test/test_files/agg/hash_large.test (outdated; resolved)
src/include/processor/operator/aggregate/hash_aggregate.h (outdated; resolved)
src/include/processor/operator/aggregate/hash_aggregate.h (outdated; resolved)
src/include/processor/operator/aggregate/hash_aggregate.h (outdated; resolved)
auto sourcePos = sourceStartOffset + idx;
memcpy(slot.entry, sourceTable.getTuple(sourcePos),
getTableSchema()->getNumBytesPerTuple());
// TODO: Ideally we should actually copy the overflow so that the original overflow data can be freed.
Contributor

What's blocking this TODO? Are you going to address it separately, or are you not planning to address it any time soon?

Collaborator Author

It might have a significant impact on performance, so I wasn't rushing to complete it, but it probably should be addressed.

Collaborator Author

I think this would only work if we make the overflow support concurrent appends. At this point in the code that wouldn't matter, but we still wouldn't be able to free the overflow unless we also copy it when partitioning.
Concurrent appends should be possible, but I was seeing a reasonably large difference in runtime (as much as 0.5s->0.9s on one query with long strings), so it's probably best to just leave this for now.
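
For reference, by concurrent appends I mean something like a bump allocator over the overflow block (a minimal sketch, not kuzu's overflow buffer API):

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>
#include <vector>

// The cursor advances with fetch_add, so concurrent writers claim disjoint
// byte ranges without taking a lock.
struct ConcurrentOverflow {
    std::vector<uint8_t> buffer;
    std::atomic<size_t> cursor{0};

    explicit ConcurrentOverflow(size_t capacity) : buffer(capacity) {}

    // Returns nullptr when the block is exhausted; a real implementation
    // would then take a lock and chain a new block.
    uint8_t* append(const uint8_t* data, size_t len) {
        size_t offset = cursor.fetch_add(len, std::memory_order_relaxed);
        if (offset + len > buffer.size()) {
            return nullptr; // cursor stays bumped; later appends also fail over
        }
        std::memcpy(buffer.data() + offset, data, len);
        return buffer.data() + offset;
    }
};
```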


Benchmark Result

Master commit hash: bf9e8715410ea73ae063229ba470ef977d795857
Branch commit hash: 8f17e4242f3973db548b773b0e70a0904f88d97f

Query Group Query Name Mean Time - Commit (ms) Mean Time - Master (ms) Diff
aggregation q24 688.99 633.96 55.03 (8.68%)
aggregation q28 6118.08 11648.07 -5529.98 (-47.48%)
filter q14 126.21 120.14 6.07 (5.05%)
filter q15 127.01 116.28 10.73 (9.23%)
filter q16 304.36 300.51 3.85 (1.28%)
filter q17 449.02 441.25 7.78 (1.76%)
filter q18 1946.12 1887.94 58.18 (3.08%)
filter zonemap-node 89.95 81.93 8.02 (9.79%)
filter zonemap-node-lhs-cast 91.28 82.20 9.08 (11.05%)
filter zonemap-node-null 87.07 77.66 9.41 (12.12%)
filter zonemap-rel 5730.10 5780.30 -50.20 (-0.87%)
fixed_size_expr_evaluator q07 584.02 565.12 18.90 (3.35%)
fixed_size_expr_evaluator q08 816.72 793.90 22.82 (2.87%)
fixed_size_expr_evaluator q09 814.43 792.07 22.35 (2.82%)
fixed_size_expr_evaluator q10 249.20 229.73 19.48 (8.48%)
fixed_size_expr_evaluator q11 241.45 222.85 18.60 (8.35%)
fixed_size_expr_evaluator q12 238.89 218.54 20.35 (9.31%)
fixed_size_expr_evaluator q13 1463.62 1447.63 15.99 (1.10%)
fixed_size_seq_scan q23 131.07 107.81 23.27 (21.58%)
join q29 648.98 600.65 48.34 (8.05%)
join q30 10247.70 9938.11 309.59 (3.12%)
join q31 5.42 4.91 0.50 (10.28%)
join SelectiveTwoHopJoin 50.83 58.13 -7.30 (-12.56%)
ldbc_snb_ic q35 2628.70 2574.55 54.15 (2.10%)
ldbc_snb_ic q36 480.91 481.53 -0.63 (-0.13%)
ldbc_snb_is q32 3.91 6.58 -2.68 (-40.68%)
ldbc_snb_is q33 11.72 16.52 -4.80 (-29.04%)
ldbc_snb_is q34 1.41 1.37 0.04 (2.84%)
multi-rel multi-rel-large-scan 1547.62 1357.83 189.79 (13.98%)
multi-rel multi-rel-lookup 20.95 33.43 -12.48 (-37.33%)
multi-rel multi-rel-small-scan 100.71 68.70 32.00 (46.58%)
order_by q25 133.29 126.30 6.99 (5.53%)
order_by q26 454.36 443.82 10.54 (2.38%)
order_by q27 1457.34 1459.24 -1.91 (-0.13%)
recursive_join recursive-join-bidirection 279.69 293.45 -13.77 (-4.69%)
recursive_join recursive-join-dense 7377.48 6816.15 561.33 (8.24%)
recursive_join recursive-join-path 23373.46 23584.47 -211.01 (-0.89%)
recursive_join recursive-join-sparse 1059.51 1067.98 -8.48 (-0.79%)
recursive_join recursive-join-trail 7347.64 7345.45 2.19 (0.03%)
scan_after_filter q01 172.10 167.60 4.50 (2.68%)
scan_after_filter q02 157.16 149.10 8.06 (5.41%)
shortest_path_ldbc100 q37 91.93 92.98 -1.04 (-1.12%)
shortest_path_ldbc100 q38 383.90 373.31 10.58 (2.83%)
shortest_path_ldbc100 q39 61.23 68.03 -6.80 (-10.00%)
shortest_path_ldbc100 q40 457.23 391.89 65.34 (16.67%)
var_size_expr_evaluator q03 2065.52 2074.08 -8.56 (-0.41%)
var_size_expr_evaluator q04 2238.00 2209.73 28.26 (1.28%)
var_size_expr_evaluator q05 2539.53 2663.48 -123.95 (-4.65%)
var_size_expr_evaluator q06 1318.69 1341.79 -23.11 (-1.72%)
var_size_seq_scan q19 1439.01 1437.36 1.65 (0.11%)
var_size_seq_scan q20 2644.07 2695.85 -51.78 (-1.92%)
var_size_seq_scan q21 2276.46 2277.15 -0.69 (-0.03%)
var_size_seq_scan q22 126.89 123.80 3.09 (2.50%)

@benjaminwinger benjaminwinger force-pushed the hash-aggregate-parallelization-2 branch from 8632fce to 7da5e47 on January 21, 2025 22:28

@benjaminwinger benjaminwinger force-pushed the hash-aggregate-parallelization-2 branch from 7da5e47 to d8e8f69 on January 22, 2025 14:54

Benchmark Result

Master commit hash: f33c700493fb063a15e2346839adf5bca918324d
Branch commit hash: bd55e412850a632f45e57336f0c92f1f174bf5cf

Query Group Query Name Mean Time - Commit (ms) Mean Time - Master (ms) Diff
aggregation q24 690.44 657.28 33.16 (5.05%)
aggregation q28 6088.84 11216.95 -5128.11 (-45.72%)
filter q14 126.80 144.53 -17.73 (-12.27%)
filter q15 127.34 148.36 -21.02 (-14.17%)
filter q16 303.71 320.43 -16.72 (-5.22%)
filter q17 445.38 469.32 -23.94 (-5.10%)
filter q18 1923.05 1947.25 -24.20 (-1.24%)
filter zonemap-node 90.84 105.04 -14.20 (-13.52%)
filter zonemap-node-lhs-cast 89.39 107.68 -18.29 (-16.98%)
filter zonemap-node-null 85.12 101.23 -16.11 (-15.91%)
filter zonemap-rel 5825.98 5908.43 -82.45 (-1.40%)
fixed_size_expr_evaluator q07 571.15 587.58 -16.42 (-2.80%)
fixed_size_expr_evaluator q08 801.70 817.30 -15.61 (-1.91%)
fixed_size_expr_evaluator q09 804.28 818.55 -14.28 (-1.74%)
fixed_size_expr_evaluator q10 237.83 253.37 -15.53 (-6.13%)
fixed_size_expr_evaluator q11 229.70 244.93 -15.23 (-6.22%)
fixed_size_expr_evaluator q12 226.25 242.01 -15.75 (-6.51%)
fixed_size_expr_evaluator q13 1453.53 1469.38 -15.85 (-1.08%)
fixed_size_seq_scan q23 111.22 128.71 -17.49 (-13.59%)
join q29 613.69 593.41 20.29 (3.42%)
join q30 10078.72 10138.13 -59.40 (-0.59%)
join q31 7.81 8.44 -0.63 (-7.46%)
join SelectiveTwoHopJoin 55.21 57.11 -1.90 (-3.33%)
ldbc_snb_ic q35 2588.39 2704.02 -115.63 (-4.28%)
ldbc_snb_ic q36 495.23 454.99 40.24 (8.85%)
ldbc_snb_is q32 3.65 6.29 -2.64 (-41.96%)
ldbc_snb_is q33 15.37 15.35 0.03 (0.17%)
ldbc_snb_is q34 1.44 1.31 0.12 (9.34%)
multi-rel multi-rel-large-scan 1317.76 1427.65 -109.89 (-7.70%)
multi-rel multi-rel-lookup 11.56 8.94 2.63 (29.38%)
multi-rel multi-rel-small-scan 95.26 92.32 2.94 (3.18%)
order_by q25 132.21 144.36 -12.16 (-8.42%)
order_by q26 454.69 477.17 -22.48 (-4.71%)
order_by q27 1459.67 1482.99 -23.32 (-1.57%)
recursive_join recursive-join-bidirection 298.29 300.64 -2.35 (-0.78%)
recursive_join recursive-join-dense 5773.47 7402.46 -1628.98 (-22.01%)
recursive_join recursive-join-path 23027.48 23583.05 -555.56 (-2.36%)
recursive_join recursive-join-sparse 1059.24 1059.89 -0.65 (-0.06%)
recursive_join recursive-join-trail 5977.64 7361.63 -1383.99 (-18.80%)
scan_after_filter q01 171.69 186.89 -15.20 (-8.13%)
scan_after_filter q02 156.62 175.50 -18.89 (-10.76%)
shortest_path_ldbc100 q37 83.46 89.07 -5.61 (-6.29%)
shortest_path_ldbc100 q38 383.28 387.00 -3.72 (-0.96%)
shortest_path_ldbc100 q39 63.50 69.29 -5.80 (-8.37%)
shortest_path_ldbc100 q40 466.92 433.42 33.50 (7.73%)
var_size_expr_evaluator q03 2081.04 2081.77 -0.74 (-0.04%)
var_size_expr_evaluator q04 2262.96 2204.12 58.83 (2.67%)
var_size_expr_evaluator q05 2638.86 2633.92 4.93 (0.19%)
var_size_expr_evaluator q06 1334.12 1346.18 -12.07 (-0.90%)
var_size_seq_scan q19 1462.43 1468.66 -6.24 (-0.42%)
var_size_seq_scan q20 2756.83 2823.73 -66.90 (-2.37%)
var_size_seq_scan q21 2313.92 2394.63 -80.71 (-3.37%)
var_size_seq_scan q22 125.86 129.54 -3.68 (-2.84%)

@benjaminwinger benjaminwinger force-pushed the hash-aggregate-parallelization-2 branch from d8e8f69 to 7527417 on January 22, 2025 16:13

Benchmark Result

Master commit hash: 99dd5dca0fa03b0ca9af007f794b2e85c497c9a5
Branch commit hash: 29cad62b8e34b059b18b69301257a7b9e67c4885

Query Group Query Name Mean Time - Commit (ms) Mean Time - Master (ms) Diff
aggregation q24 689.11 639.72 49.39 (7.72%)
aggregation q28 6090.98 11523.78 -5432.79 (-47.14%)
filter q14 126.34 129.41 -3.08 (-2.38%)
filter q15 124.35 127.07 -2.72 (-2.14%)
filter q16 301.19 305.87 -4.68 (-1.53%)
filter q17 448.13 446.23 1.90 (0.43%)
filter q18 1927.88 1926.06 1.83 (0.09%)
filter zonemap-node 89.15 88.56 0.58 (0.66%)
filter zonemap-node-lhs-cast 88.67 89.70 -1.03 (-1.15%)
filter zonemap-node-null 86.39 85.53 0.86 (1.01%)
filter zonemap-rel 5790.68 5818.59 -27.91 (-0.48%)
fixed_size_expr_evaluator q07 571.82 583.91 -12.08 (-2.07%)
fixed_size_expr_evaluator q08 804.49 808.83 -4.34 (-0.54%)
fixed_size_expr_evaluator q09 806.07 810.24 -4.17 (-0.51%)
fixed_size_expr_evaluator q10 237.46 248.17 -10.71 (-4.31%)
fixed_size_expr_evaluator q11 230.47 240.45 -9.98 (-4.15%)
fixed_size_expr_evaluator q12 228.05 236.17 -8.12 (-3.44%)
fixed_size_expr_evaluator q13 1452.10 1458.38 -6.28 (-0.43%)
fixed_size_seq_scan q23 112.61 122.21 -9.60 (-7.86%)
join q29 627.41 613.07 14.34 (2.34%)
join q30 11015.71 10249.64 766.07 (7.47%)
join q31 5.63 4.49 1.14 (25.52%)
join SelectiveTwoHopJoin 63.15 53.25 9.90 (18.59%)
ldbc_snb_ic q35 2593.60 2620.03 -26.44 (-1.01%)
ldbc_snb_ic q36 475.72 451.14 24.58 (5.45%)
ldbc_snb_is q32 3.03 6.09 -3.07 (-50.35%)
ldbc_snb_is q33 10.79 14.13 -3.34 (-23.62%)
ldbc_snb_is q34 1.45 1.37 0.09 (6.32%)
multi-rel multi-rel-large-scan 1317.62 1410.09 -92.47 (-6.56%)
multi-rel multi-rel-lookup 31.88 33.52 -1.64 (-4.89%)
multi-rel multi-rel-small-scan 86.32 91.85 -5.53 (-6.02%)
order_by q25 131.45 134.38 -2.94 (-2.19%)
order_by q26 453.97 471.36 -17.39 (-3.69%)
order_by q27 1473.78 1470.36 3.42 (0.23%)
recursive_join recursive-join-bidirection 288.62 276.38 12.24 (4.43%)
recursive_join recursive-join-dense 5404.12 7357.94 -1953.82 (-26.55%)
recursive_join recursive-join-path 23256.53 23584.85 -328.33 (-1.39%)
recursive_join recursive-join-sparse 1063.06 1065.69 -2.63 (-0.25%)
recursive_join recursive-join-trail 5852.36 7328.53 -1476.17 (-20.14%)
scan_after_filter q01 168.68 170.29 -1.60 (-0.94%)
scan_after_filter q02 157.98 156.97 1.01 (0.64%)
shortest_path_ldbc100 q37 91.94 89.83 2.10 (2.34%)
shortest_path_ldbc100 q38 408.89 372.42 36.48 (9.80%)
shortest_path_ldbc100 q39 64.54 63.85 0.69 (1.08%)
shortest_path_ldbc100 q40 357.97 474.96 -116.99 (-24.63%)
var_size_expr_evaluator q03 2103.10 2087.52 15.58 (0.75%)
var_size_expr_evaluator q04 2256.63 2233.12 23.51 (1.05%)
var_size_expr_evaluator q05 2639.71 2644.47 -4.76 (-0.18%)
var_size_expr_evaluator q06 1328.85 1326.85 2.00 (0.15%)
var_size_seq_scan q19 1462.37 1451.82 10.55 (0.73%)
var_size_seq_scan q20 2755.56 2717.07 38.49 (1.42%)
var_size_seq_scan q21 2309.22 2299.77 9.46 (0.41%)
var_size_seq_scan q22 126.14 126.94 -0.80 (-0.63%)

@benjaminwinger
Collaborator Author

benjaminwinger commented Jan 22, 2025

Edit: I missed one thing when rebasing (one of the changes from #4709 needed to be applied to a function newly added in this PR), so these benchmarks aren't fully up to date. However, I'm not convinced the change is significant: the first time I re-ran the criterion benchmark there was a large improvement, and when I ran it again without changes it regressed in the other direction, back to more or less the initial numbers.

Updated performance (again)

For the query on the msmarco dataset in the PR description the runtime is now 41 seconds with a peak memory usage of 21GB.

ClickBench benchmarks show more modest improvements (compare with #4655 (comment)); the segfault from the earlier run is fixed.

Benchmarks adapted from https://github.com/ClickHouse/ClickBench/, run on a 128 thread runner (2x AMD EPYC 7551)

| Query | Baseline | With Changes (128 threads) | DuckDB (third-party reference; equivalent SQL query) |
| --- | --- | --- | --- |
| `MATCH (h:hits) WHERE h.SearchPhrase <> '' RETURN h.SearchPhrase, COUNT(*) AS c ORDER BY c DESC LIMIT 10;` | 5.1s/4.7s | 2.1s/0.46s | 0.91s/0.30s |
| `MATCH (h:hits) WHERE h.SearchPhrase <> '' RETURN h.SearchEngineID, h.SearchPhrase, COUNT(*) AS c ORDER BY c DESC LIMIT 10;` | 6.8s/4.4s | 2.4s/0.56s | 0.90s/0.30s |
| `MATCH (h:hits) RETURN h.UserID, COUNT(*) ORDER BY COUNT(*) DESC LIMIT 10;` | 5.7s/5.2s | 1.5s/1.0s | 0.63s/0.25s |
| `MATCH (h:hits) RETURN h.UserID, h.SearchPhrase, COUNT(*) ORDER BY COUNT(*) DESC LIMIT 10;` | 14.4s/9.5s | 3.7s/1.2s | 1.5s/0.5s |

Results are Cold/Hot, where hot is on subsequent queries in the same process. OS VM caches are dropped after each set of queries.

Performance scaling

Below are some benchmarks showing scaling, with violin plots to show the distribution; the queries were run 10 times with some warmups. The results show some discrepancies from the numbers above, which might be a measurement issue, but I think they do a good job of showing how well it scales. The ClickBench runs were done through the Python API, while the runs below were done through the Rust API, using criterion.rs to produce the plots.
Note that the thread count is cut off in all but the third query's plot due to length, but including the query in the benchmark name was the easiest way to get it into the plot.

[Violin plots: query4, query3, query1, query2]

@benjaminwinger benjaminwinger force-pushed the hash-aggregate-parallelization-2 branch from 7527417 to 31d9509 on January 22, 2025 21:50

Benchmark Result

Master commit hash: 99dd5dca0fa03b0ca9af007f794b2e85c497c9a5
Branch commit hash: 830c9a3c793e6b3dcaff8f20e015329b3e2b4cc5

Query Group Query Name Mean Time - Commit (ms) Mean Time - Master (ms) Diff
aggregation q24 694.22 639.72 54.50 (8.52%)
aggregation q28 6079.83 11523.78 -5443.95 (-47.24%)
filter q14 125.77 129.41 -3.64 (-2.81%)
filter q15 130.33 127.07 3.27 (2.57%)
filter q16 301.51 305.87 -4.36 (-1.43%)
filter q17 445.86 446.23 -0.38 (-0.08%)
filter q18 1908.36 1926.06 -17.70 (-0.92%)
filter zonemap-node 89.81 88.56 1.24 (1.40%)
filter zonemap-node-lhs-cast 90.46 89.70 0.76 (0.85%)
filter zonemap-node-null 87.50 85.53 1.98 (2.31%)
filter zonemap-rel 5800.37 5818.59 -18.23 (-0.31%)
fixed_size_expr_evaluator q07 574.87 583.91 -9.03 (-1.55%)
fixed_size_expr_evaluator q08 804.72 808.83 -4.11 (-0.51%)
fixed_size_expr_evaluator q09 806.01 810.24 -4.23 (-0.52%)
fixed_size_expr_evaluator q10 239.44 248.17 -8.72 (-3.51%)
fixed_size_expr_evaluator q11 232.24 240.45 -8.21 (-3.41%)
fixed_size_expr_evaluator q12 229.12 236.17 -7.05 (-2.98%)
fixed_size_expr_evaluator q13 1459.31 1458.38 0.93 (0.06%)
fixed_size_seq_scan q23 112.45 122.21 -9.76 (-7.99%)
join q29 638.53 613.07 25.46 (4.15%)
join q30 10234.82 10249.64 -14.82 (-0.14%)
join q31 7.48 4.49 2.99 (66.68%)
join SelectiveTwoHopJoin 54.05 53.25 0.80 (1.51%)
ldbc_snb_ic q35 2556.07 2620.03 -63.96 (-2.44%)
ldbc_snb_ic q36 485.42 451.14 34.28 (7.60%)
ldbc_snb_is q32 7.45 6.09 1.35 (22.19%)
ldbc_snb_is q33 10.26 14.13 -3.87 (-27.41%)
ldbc_snb_is q34 1.79 1.37 0.43 (31.21%)
multi-rel multi-rel-large-scan 1512.33 1410.09 102.24 (7.25%)
multi-rel multi-rel-lookup 21.31 33.52 -12.21 (-36.42%)
multi-rel multi-rel-small-scan 96.78 91.85 4.93 (5.37%)
order_by q25 134.93 134.38 0.55 (0.41%)
order_by q26 452.46 471.36 -18.90 (-4.01%)
order_by q27 1457.35 1470.36 -13.01 (-0.89%)
recursive_join recursive-join-bidirection 283.84 276.38 7.46 (2.70%)
recursive_join recursive-join-dense 7386.71 7357.94 28.77 (0.39%)
recursive_join recursive-join-path 23689.31 23584.85 104.46 (0.44%)
recursive_join recursive-join-sparse 1061.67 1065.69 -4.02 (-0.38%)
recursive_join recursive-join-trail 7348.71 7328.53 20.18 (0.28%)
scan_after_filter q01 173.65 170.29 3.37 (1.98%)
scan_after_filter q02 158.32 156.97 1.35 (0.86%)
shortest_path_ldbc100 q37 98.75 89.83 8.92 (9.93%)
shortest_path_ldbc100 q38 410.41 372.42 38.00 (10.20%)
shortest_path_ldbc100 q39 63.11 63.85 -0.74 (-1.16%)
shortest_path_ldbc100 q40 461.54 474.96 -13.41 (-2.82%)
var_size_expr_evaluator q03 2078.28 2087.52 -9.24 (-0.44%)
var_size_expr_evaluator q04 2263.20 2233.12 30.08 (1.35%)
var_size_expr_evaluator q05 2636.16 2644.47 -8.31 (-0.31%)
var_size_expr_evaluator q06 1325.93 1326.85 -0.92 (-0.07%)
var_size_seq_scan q19 1461.23 1451.82 9.41 (0.65%)
var_size_seq_scan q20 2731.86 2717.07 14.79 (0.54%)
var_size_seq_scan q21 2288.25 2299.77 -11.51 (-0.50%)
var_size_seq_scan q22 128.13 126.94 1.19 (0.94%)

@benjaminwinger benjaminwinger merged commit 10c3956 into master Jan 22, 2025
25 checks passed
@benjaminwinger benjaminwinger deleted the hash-aggregate-parallelization-2 branch January 22, 2025 23:16
@benjaminwinger benjaminwinger mentioned this pull request Jan 22, 2025