[fix](parquet) parse dict page in ColumnChunkReader::init #46372

Closed

suxiaogang223
Contributor

@suxiaogang223 suxiaogang223 commented Jan 3, 2025

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #45740

Problem Summary:
The code below can be removed once #45740 is merged, because the reader can then correctly determine whether a dictionary page exists.

    if (!_dict_checked) {
        _dict_checked = true;
        const tparquet::PageHeader* header;
        RETURN_IF_ERROR(_page_reader->get_page_header(header));
        if (header->type == tparquet::PageType::DICTIONARY_PAGE) {
            // the first page may be a dictionary page even if _metadata.__isset.dictionary_page_offset == false,
            // so we should parse the dictionary page in next_page()
            RETURN_IF_ERROR(_decode_dict_page());
            // parse the real first data page
            return next_page();
        }
    }
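
For illustration only, here is a minimal sketch of the approach named in the PR title: detect and consume a leading dictionary page once at initialization, so that `next_page()` no longer needs the lazy `_dict_checked` branch. This is not the Doris implementation. `ToyColumnChunkReader`, `PageHeader`, `PageType`, `has_dict()` and `data_pages_read()` are hypothetical stand-ins, and the sketch assumes the page headers are already available in memory.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration; not the Doris or Thrift types.
enum class PageType { DATA_PAGE, DICTIONARY_PAGE };

struct PageHeader {
    PageType type;
};

class ToyColumnChunkReader {
public:
    explicit ToyColumnChunkReader(std::vector<PageHeader> pages)
            : _pages(std::move(pages)) {}

    // Consume a leading dictionary page once, up front. The decision is based
    // on the page header itself, not on a metadata flag that may be unset.
    void init() {
        if (!_pages.empty() && _pages[0].type == PageType::DICTIONARY_PAGE) {
            _has_dict = true;
            _next = 1; // data pages start after the dictionary page
        }
        _initialized = true;
    }

    // Advance to the next data page; returns false when the chunk is exhausted.
    // No lazy dictionary check is needed here because init() already did it.
    bool next_page() {
        assert(_initialized);
        if (_next >= _pages.size()) {
            return false;
        }
        ++_next;
        ++_data_pages_read;
        return true;
    }

    bool has_dict() const { return _has_dict; }
    std::size_t data_pages_read() const { return _data_pages_read; }

private:
    std::vector<PageHeader> _pages;
    std::size_t _next = 0;
    std::size_t _data_pages_read = 0;
    bool _has_dict = false;
    bool _initialized = false;
};
```

In this toy model, a chunk whose first page is a dictionary page reports `has_dict()` as true after `init()`, and every later `next_page()` call only ever sees data pages.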

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@Thearas
Contributor

Thearas commented Jan 3, 2025

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@suxiaogang223
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 32845 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 28b7cca9401af13d4a416bdbc6a2455a037924ad, data reload: false

------ Round 1 ----------------------------------
q1	17575	6230	6068	6068
q2	2055	308	162	162
q3	10443	1261	768	768
q4	10276	863	442	442
q5	8575	2196	2018	2018
q6	208	179	146	146
q7	898	743	599	599
q8	9237	1360	1159	1159
q9	5266	4915	4900	4900
q10	6774	2323	1869	1869
q11	479	282	265	265
q12	343	366	220	220
q13	17787	3705	3154	3154
q14	226	230	216	216
q15	551	500	501	500
q16	628	617	595	595
q17	555	847	315	315
q18	6833	6407	6504	6407
q19	2590	961	553	553
q20	306	314	188	188
q21	2864	2287	1995	1995
q22	357	332	306	306
Total cold run time: 104826 ms
Total hot run time: 32845 ms

----- Round 2, with runtime_filter_mode=off -----
q1	6371	6279	6262	6262
q2	232	321	236	236
q3	2213	2635	2337	2337
q4	1397	1795	1353	1353
q5	4322	4748	4923	4748
q6	187	175	142	142
q7	2090	1949	1843	1843
q8	2692	2820	2672	2672
q9	7300	7367	7269	7269
q10	3096	3320	2798	2798
q11	566	506	488	488
q12	638	757	614	614
q13	3442	3845	3301	3301
q14	278	307	272	272
q15	579	511	500	500
q16	641	701	652	652
q17	1223	1746	1260	1260
q18	7752	7471	7241	7241
q19	873	1140	1118	1118
q20	1970	2054	1871	1871
q21	5782	5282	4857	4857
q22	619	633	581	581
Total cold run time: 54263 ms
Total hot run time: 52415 ms

@doris-robot

TPC-DS: Total hot run time: 196788 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 28b7cca9401af13d4a416bdbc6a2455a037924ad, data reload: false

query1	1292	976	927	927
query2	6488	2359	2298	2298
query3	10929	4623	4780	4623
query4	33288	23588	23318	23318
query5	4110	632	453	453
query6	282	197	183	183
query7	3996	507	296	296
query8	294	249	237	237
query9	9205	2661	2651	2651
query10	483	326	251	251
query11	17696	15274	15265	15265
query12	164	111	106	106
query13	1592	544	408	408
query14	9720	6960	6891	6891
query15	248	232	206	206
query16	7695	613	495	495
query17	1578	756	606	606
query18	1803	423	340	340
query19	218	197	176	176
query20	143	127	119	119
query21	210	133	114	114
query22	4656	4461	4731	4461
query23	35357	33506	33522	33506
query24	6602	2259	2351	2259
query25	458	458	401	401
query26	805	271	160	160
query27	2520	466	338	338
query28	6170	2496	2476	2476
query29	562	574	428	428
query30	211	188	154	154
query31	1001	935	852	852
query32	69	60	59	59
query33	499	345	324	324
query34	760	886	515	515
query35	837	854	750	750
query36	1038	1080	983	983
query37	117	105	78	78
query38	4214	4324	4318	4318
query39	1501	1498	1470	1470
query40	204	118	104	104
query41	46	46	42	42
query42	118	104	101	101
query43	534	535	511	511
query44	1362	837	833	833
query45	194	181	168	168
query46	882	1067	698	698
query47	2011	2002	1979	1979
query48	413	403	327	327
query49	725	487	420	420
query50	671	690	411	411
query51	7346	7261	7162	7162
query52	106	105	102	102
query53	225	261	191	191
query54	476	519	420	420
query55	88	81	80	80
query56	267	259	248	248
query57	1237	1242	1167	1167
query58	239	241	221	221
query59	3219	3334	3222	3222
query60	297	277	264	264
query61	118	110	108	108
query62	890	823	742	742
query63	228	195	196	195
query64	3232	1067	688	688
query65	3321	3268	3274	3268
query66	839	418	309	309
query67	16578	15783	15467	15467
query68	8808	711	513	513
query69	467	291	257	257
query70	1263	1149	1132	1132
query71	440	283	252	252
query72	6347	3877	3853	3853
query73	656	740	362	362
query74	10435	9116	9079	9079
query75	4594	3178	2686	2686
query76	4202	1188	768	768
query77	769	367	289	289
query78	10310	9996	9357	9357
query79	3034	815	592	592
query80	717	520	436	436
query81	482	266	234	234
query82	496	146	119	119
query83	190	183	150	150
query84	285	89	77	77
query85	753	376	384	376
query86	353	315	314	314
query87	4618	4689	4390	4390
query88	3136	2192	2161	2161
query89	416	332	297	297
query90	1966	188	189	188
query91	204	135	109	109
query92	63	58	51	51
query93	1399	827	536	536
query94	659	388	291	291
query95	342	270	257	257
query96	479	607	286	286
query97	2905	3041	2801	2801
query98	208	206	196	196
query99	1708	1578	1436	1436
Total cold run time: 297381 ms
Total hot run time: 196788 ms

@doris-robot

ClickBench: Total hot run time: 30.99 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 28b7cca9401af13d4a416bdbc6a2455a037924ad, data reload: false

query1	0.04	0.03	0.03
query2	0.07	0.03	0.03
query3	0.23	0.08	0.07
query4	1.61	0.10	0.11
query5	0.43	0.43	0.40
query6	1.17	0.65	0.66
query7	0.02	0.01	0.02
query8	0.04	0.04	0.03
query9	0.59	0.50	0.51
query10	0.54	0.56	0.55
query11	0.14	0.10	0.11
query12	0.14	0.11	0.12
query13	0.60	0.62	0.60
query14	2.73	2.75	2.73
query15	0.90	0.82	0.82
query16	0.39	0.39	0.38
query17	1.04	1.02	1.04
query18	0.22	0.22	0.21
query19	1.85	1.77	2.00
query20	0.02	0.01	0.01
query21	15.36	0.97	0.57
query22	0.74	1.00	0.56
query23	15.21	1.50	0.53
query24	2.99	0.86	1.26
query25	0.16	0.17	0.15
query26	0.39	0.15	0.13
query27	0.05	0.07	0.04
query28	13.52	1.53	1.05
query29	12.58	3.90	3.23
query30	0.26	0.10	0.06
query31	2.83	0.61	0.38
query32	3.24	0.54	0.46
query33	3.08	3.07	3.14
query34	16.86	5.13	4.50
query35	4.50	4.48	4.49
query36	0.65	0.49	0.48
query37	0.09	0.06	0.06
query38	0.05	0.04	0.04
query39	0.03	0.02	0.03
query40	0.17	0.14	0.13
query41	0.08	0.03	0.02
query42	0.04	0.02	0.02
query43	0.04	0.03	0.04
Total cold run time: 105.69 s
Total hot run time: 30.99 s

@suxiaogang223
Contributor Author

run external

@suxiaogang223 suxiaogang223 marked this pull request as draft January 13, 2025 01:57