Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[feature](datalake) Add BucketShuffleJoin support for bucketed hive tables #27784

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

Nitin-Kashyap
Copy link
Contributor

@Nitin-Kashyap Nitin-Kashyap commented Nov 29, 2023

Add BucketShuffleJoin support for bucketed hive tables generated by Spark. (27783)

Proposed changes

Issue Number: close #27783

1. Original planner updated to consider BucketShuffle for bucketed hive table
2. Neerids planner updated for bucketShuffle join on hive tables.
3. Added spark style hash calculation in BE for shuffle on one side.

###Sample Output:s
NeredisPlanner
OldPlanner

image001

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

be/src/vec/columns/column_decimal.cpp Outdated Show resolved Hide resolved
be/src/vec/columns/column_map.cpp Show resolved Hide resolved
be/src/vec/columns/column_string.cpp Outdated Show resolved Hide resolved
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@Nitin-Kashyap Nitin-Kashyap force-pushed the feature-hiveBucketShuffle branch from b4464d4 to f9e42ab Compare November 30, 2023 04:48
Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

be/src/vec/columns/column_vector.cpp Outdated Show resolved Hide resolved
@Nitin-Kashyap Nitin-Kashyap force-pushed the feature-hiveBucketShuffle branch 2 times, most recently from ed212e1 to eaf29b0 Compare November 30, 2023 05:47
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@morningman morningman self-assigned this Nov 30, 2023
@morningman
Copy link
Contributor

Hi @Nitin-Kashyap , thanks for your contribution.
Could you please provide some create table stmt of hive table on spark side,
so that we can test this case?

@morningman
Copy link
Contributor

BTW, is it only suitable for "spark created" hive bucket table?
What if the hive table is created by other system with different hash function?

@Nitin-Kashyap
Copy link
Contributor Author

Nitin-Kashyap commented Dec 1, 2023

Hi @Nitin-Kashyap , thanks for your contribution. Could you please provide some create table stmt of hive table on spark side, so that we can test this case?

@morningman Please find the sample test I used for this case: -

CREATE TABLE parquet_test (
     user_id INT,
     key       VARCHAR(20),
     part      VARCAHAR(10)
)
USING parquet
PARTITIONED BY (part)
CLUSTERED BY (user_id) INTO 3 BUCKETS;

INSERT INTO parquet_test2 VALUES (31, 'U31', 'IN'),  (11,'U11','IN'), (21, 'U21', 'IN');

@Nitin-Kashyap
Copy link
Contributor Author

Nitin-Kashyap commented Dec 1, 2023

BTW, is it only suitable for "spark created" hive bucket table? What if the hive table is created by other system with different hash function?

@morningman Yes, for current scope it will understand only Spark created bucketed table, it identifies this by Properties defined by spark for bucket specification.

I plan to take up supporting for Hive, Hudi as well in some time (hopefully in next PR); for this I have left a place holder THashType [HIVE_MOD: Hive and Hudi use the same hash method] however for hudi some more changes on FE side need to do for identifing type bucket id from file path.

@Nitin-Kashyap Nitin-Kashyap force-pushed the feature-hiveBucketShuffle branch from eaf29b0 to 34c701c Compare December 2, 2023 12:19
Copy link
Contributor

github-actions bot commented Dec 2, 2023

clang-tidy review says "All clean, LGTM! 👍"

1 similar comment
Copy link
Contributor

github-actions bot commented Dec 2, 2023

clang-tidy review says "All clean, LGTM! 👍"

@Nitin-Kashyap Nitin-Kashyap force-pushed the feature-hiveBucketShuffle branch from 34c701c to d25350a Compare December 4, 2023 05:05
Copy link
Contributor

github-actions bot commented Dec 4, 2023

clang-tidy review says "All clean, LGTM! 👍"

be/src/vec/utils/util.hpp Outdated Show resolved Hide resolved
Copy link
Contributor

github-actions bot commented Dec 4, 2023

clang-tidy review says "All clean, LGTM! 👍"

@Nitin-Kashyap
Copy link
Contributor Author

run buildall

@Nitin-Kashyap Nitin-Kashyap force-pushed the feature-hiveBucketShuffle branch from 5c27041 to 4a57ca3 Compare December 13, 2024 06:12
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@Nitin-Kashyap Nitin-Kashyap force-pushed the feature-hiveBucketShuffle branch from 4a57ca3 to 471a7c5 Compare December 13, 2024 06:37
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@Nitin-Kashyap Nitin-Kashyap force-pushed the feature-hiveBucketShuffle branch from 471a7c5 to 714534c Compare December 13, 2024 08:29
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@Nitin-Kashyap Nitin-Kashyap force-pushed the feature-hiveBucketShuffle branch from 714534c to 10db37d Compare December 13, 2024 08:39
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@Nitin-Kashyap Nitin-Kashyap force-pushed the feature-hiveBucketShuffle branch from 10db37d to 7784db9 Compare December 13, 2024 08:52
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@Nitin-Kashyap Nitin-Kashyap force-pushed the feature-hiveBucketShuffle branch from 7784db9 to 3780f23 Compare December 31, 2024 14:53
@924060929
Copy link
Contributor

You should support the enable_fallback_to_original_planner=true in master branch

@Nitin-Kashyap Nitin-Kashyap force-pushed the feature-hiveBucketShuffle branch from 3780f23 to 2c88dee Compare February 3, 2025 07:22
Nitin-Kashyap and others added 2 commits February 3, 2025 15:32
… generated by Spark. (27783)

    1. Original planner updated to consider BucketShuffle for bucketed hive table
    2. Neerids planner updated for bucketShuffle join on hive tables.
    3. Added spark style hash calculation in BE for shuffle on one side.
    4. Added shuffle hash selection based on left(non-shuffling) side.
@Nitin-Kashyap Nitin-Kashyap force-pushed the feature-hiveBucketShuffle branch 2 times, most recently from 8438d6e to 6a9f9b6 Compare February 3, 2025 11:07
@Nitin-Kashyap
Copy link
Contributor Author

run buildall

@doris-robot
Copy link

TeamCity be ut coverage result:
Function Coverage: 42.02% (10992/26158)
Line Coverage: 32.30% (92801/287271)
Region Coverage: 31.45% (47584/151300)
Branch Coverage: 27.49% (24089/87632)
Coverage Report: http://coverage.selectdb-in.cc/coverage/6a9f9b65f06e596819d0f4a0c2735a3f72dac9ad_6a9f9b65f06e596819d0f4a0c2735a3f72dac9ad/report/index.html

@Nitin-Kashyap Nitin-Kashyap force-pushed the feature-hiveBucketShuffle branch from 6a9f9b6 to d68ca64 Compare February 3, 2025 16:13
@Nitin-Kashyap
Copy link
Contributor Author

run build all

@Nitin-Kashyap Nitin-Kashyap force-pushed the feature-hiveBucketShuffle branch from d68ca64 to f5af994 Compare February 3, 2025 16:17
@Nitin-Kashyap
Copy link
Contributor Author

run buildall

@doris-robot
Copy link

TPC-H: Total hot run time: 32237 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit f5af9949369589b49af19a7c592b281a686e870d, data reload: false

------ Round 1 ----------------------------------
q1	17845	5596	5417	5417
q2	2066	302	181	181
q3	10712	1236	725	725
q4	10315	963	533	533
q5	9028	2382	2162	2162
q6	202	163	130	130
q7	902	752	593	593
q8	9234	1332	1176	1176
q9	5150	4906	4892	4892
q10	6821	2335	1890	1890
q11	454	279	256	256
q12	343	356	209	209
q13	17761	3715	3088	3088
q14	230	234	217	217
q15	515	485	460	460
q16	614	628	577	577
q17	549	858	314	314
q18	7028	6462	6419	6419
q19	2069	944	524	524
q20	300	313	183	183
q21	2800	2146	1985	1985
q22	370	332	306	306
Total cold run time: 105308 ms
Total hot run time: 32237 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5605	5486	5511	5486
q2	234	322	233	233
q3	2244	2716	2320	2320
q4	1356	1821	1373	1373
q5	4282	4747	4723	4723
q6	172	164	127	127
q7	2096	1967	1773	1773
q8	2615	2884	2751	2751
q9	7325	7224	7338	7224
q10	2988	3297	2810	2810
q11	572	510	493	493
q12	662	730	569	569
q13	3567	4007	3344	3344
q14	284	290	263	263
q15	513	481	462	462
q16	665	669	645	645
q17	1235	1726	1269	1269
q18	7610	7724	7332	7332
q19	796	1119	1059	1059
q20	2021	2037	1927	1927
q21	5707	5047	4957	4957
q22	605	607	581	581
Total cold run time: 53154 ms
Total hot run time: 51721 ms

@doris-robot
Copy link

TPC-DS: Total hot run time: 191679 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit f5af9949369589b49af19a7c592b281a686e870d, data reload: false

query1	1360	962	959	959
query2	6216	1968	1971	1968
query3	10987	4401	4528	4401
query4	61077	29276	23126	23126
query5	5499	616	455	455
query6	442	216	189	189
query7	5539	521	306	306
query8	334	255	232	232
query9	8155	2658	2649	2649
query10	451	313	283	283
query11	17553	15020	15621	15020
query12	166	125	109	109
query13	1413	543	395	395
query14	10954	7372	6802	6802
query15	214	196	196	196
query16	6861	644	469	469
query17	1088	716	565	565
query18	1473	384	322	322
query19	202	186	168	168
query20	123	122	111	111
query21	212	126	103	103
query22	4674	4491	4619	4491
query23	34219	33375	33893	33375
query24	5534	2299	2312	2299
query25	448	449	391	391
query26	647	275	153	153
query27	1590	492	332	332
query28	4374	2530	2525	2525
query29	531	554	436	436
query30	212	187	162	162
query31	961	881	811	811
query32	75	59	60	59
query33	411	364	296	296
query34	731	854	524	524
query35	813	837	774	774
query36	1005	1089	941	941
query37	117	104	79	79
query38	4297	4366	4403	4366
query39	1545	1441	1442	1441
query40	201	113	103	103
query41	54	62	54	54
query42	131	104	102	102
query43	530	519	499	499
query44	1338	857	869	857
query45	195	179	164	164
query46	869	1058	647	647
query47	1913	1894	1822	1822
query48	429	410	333	333
query49	713	485	395	395
query50	654	677	400	400
query51	4308	4247	4240	4240
query52	113	103	95	95
query53	234	252	191	191
query54	489	510	452	452
query55	82	80	82	80
query56	277	255	244	244
query57	1168	1214	1186	1186
query58	268	237	235	235
query59	3138	3321	3148	3148
query60	288	273	262	262
query61	117	113	124	113
query62	722	725	682	682
query63	229	188	186	186
query64	1255	1130	763	763
query65	3262	3149	3179	3149
query66	695	425	319	319
query67	16142	15836	15513	15513
query68	5087	826	523	523
query69	479	287	254	254
query70	1221	1154	1140	1140
query71	420	290	262	262
query72	6032	3877	3879	3877
query73	803	752	367	367
query74	10154	9185	8719	8719
query75	3193	3161	2650	2650
query76	3758	1177	765	765
query77	496	395	274	274
query78	10023	9930	9368	9368
query79	3470	801	602	602
query80	783	519	447	447
query81	515	279	241	241
query82	1134	152	127	127
query83	160	172	149	149
query84	294	93	77	77
query85	747	362	292	292
query86	385	305	308	305
query87	4480	4635	4355	4355
query88	4735	2203	2168	2168
query89	409	330	291	291
query90	1602	188	191	188
query91	129	135	107	107
query92	64	97	52	52
query93	2954	845	534	534
query94	728	393	293	293
query95	328	274	257	257
query96	495	624	293	293
query97	2845	2843	2750	2750
query98	219	202	190	190
query99	1636	1338	1245	1245
Total cold run time: 312095 ms
Total hot run time: 191679 ms

@doris-robot
Copy link

ClickBench: Total hot run time: 30.69 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit f5af9949369589b49af19a7c592b281a686e870d, data reload: false

query1	0.03	0.04	0.03
query2	0.08	0.04	0.04
query3	0.23	0.06	0.05
query4	1.66	0.08	0.08
query5	0.45	0.41	0.39
query6	1.16	0.66	0.65
query7	0.02	0.02	0.02
query8	0.05	0.05	0.05
query9	0.55	0.49	0.50
query10	0.56	0.56	0.56
query11	0.17	0.11	0.12
query12	0.15	0.13	0.13
query13	0.60	0.60	0.59
query14	2.74	2.74	2.75
query15	0.91	0.85	0.83
query16	0.39	0.37	0.38
query17	1.06	1.00	1.08
query18	0.18	0.18	0.19
query19	1.94	1.77	1.95
query20	0.02	0.02	0.01
query21	15.36	0.96	0.65
query22	0.76	0.78	0.69
query23	15.04	1.52	0.67
query24	2.19	0.36	0.22
query25	0.14	0.09	0.09
query26	0.28	0.19	0.18
query27	0.08	0.08	0.08
query28	13.41	1.25	0.54
query29	12.65	4.10	3.44
query30	0.24	0.08	0.05
query31	2.85	0.62	0.38
query32	3.22	0.56	0.48
query33	3.00	3.00	3.00
query34	16.43	5.14	4.52
query35	4.60	4.60	4.54
query36	0.61	0.50	0.47
query37	0.19	0.16	0.16
query38	0.16	0.16	0.14
query39	0.05	0.04	0.04
query40	0.16	0.14	0.13
query41	0.10	0.06	0.05
query42	0.06	0.04	0.05
query43	0.05	0.05	0.04
Total cold run time: 104.58 s
Total hot run time: 30.69 s

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature] Enable BucketShuffle Join for Hive tables
7 participants