[flink] Infer parallelism only in situation of parallelism is not set. #4975

leaves12138 · 2025-01-21T09:14:45Z

Purpose

parallelism.default does not take effect because scan.infer-parallelism defaults to true.

Tests

API and Format

Documentation

… (which is -1)

yunfengzhou-hub · 2025-02-07T02:23:06Z

...nk/paimon-flink-common/src/test/java/org/apache/paimon/flink/source/DataTableSourceTest.java

@@ -81,6 +81,8 @@ void testInferScanParallelism() throws Exception {
                        null);
        PaimonDataStreamScanProvider runtimeProvider = runtimeProvider(tableSource);
        StreamExecutionEnvironment sEnv1 = StreamExecutionEnvironment.createLocalEnvironment();
+        sEnv1.setParallelism(-1);
+        System.out.println(sEnv1.getParallelism());


This print seems unnecessary.

yunfengzhou-hub · 2025-02-07T02:35:04Z

...flink/paimon-flink-common/src/main/java/org/apache/paimon/flink/source/FlinkTableSource.java

@@ -149,9 +149,18 @@ protected Integer inferSourceParallelism(StreamExecutionEnvironment env) {
                    Boolean.parseBoolean(envConfig.toMap().get(FLINK_INFER_SCAN_PARALLELISM)));
        }
        Integer parallelism = options.get(FlinkConnectorOptions.SCAN_PARALLELISM);
-        if (parallelism == null && options.get(FlinkConnectorOptions.INFER_SCAN_PARALLELISM)) {
+        if (parallelism == null


I'm trying to understand the parallelism inference logic this PR wants to achieve. Please check whether the following is correct.

If the Paimon option SCAN_PARALLELISM is configured, use that value.

Else if the Flink configuration parallelism.default is configured, use that value.

Else if the Paimon option INFER_SCAN_PARALLELISM is set to true and Paimon can infer the parallelism of the source (for example, fixed bucket + unbounded stream), Paimon provides the inferred result.

Else Paimon does not set the parallelism of the source operator, and the Flink infra decides the parallelism of this operator.
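The precedence described in the four cases above can be sketched in plain Java. This is a minimal model, not Paimon's FlinkTableSource code: the class, method, and parameter names are hypothetical, -1 stands for "parallelism not set" as in the PR title, and the inference callback is assumed to return a non-positive number when nothing can be inferred.

```java
import java.util.OptionalInt;
import java.util.function.IntSupplier;

/** Toy model of the precedence discussed in this review; all names are hypothetical. */
final class SourceParallelismSketch {

    /** Marker for "no parallelism set on the environment" (assumed to be -1). */
    static final int PARALLELISM_NOT_SET = -1;

    /**
     * @param scanParallelism explicit scan parallelism option, or null if absent
     * @param envParallelism  environment default parallelism, -1 if not set
     * @param inferEnabled    whether scan parallelism inference is switched on
     * @param inference       hypothetical callback; returns <= 0 if nothing can be inferred
     * @return the parallelism to put on the source, or empty to leave it to Flink
     */
    static OptionalInt resolve(
            Integer scanParallelism,
            int envParallelism,
            boolean inferEnabled,
            IntSupplier inference) {
        // 1. An explicit scan parallelism always wins.
        if (scanParallelism != null) {
            return OptionalInt.of(scanParallelism);
        }
        // 2. A user-configured environment default is respected next.
        if (envParallelism != PARALLELISM_NOT_SET) {
            return OptionalInt.of(envParallelism);
        }
        // 3. Only when nothing was set is inference consulted.
        if (inferEnabled) {
            int inferred = inference.getAsInt();
            if (inferred > 0) {
                return OptionalInt.of(inferred);
            }
        }
        // 4. Nothing decided here; the framework chooses.
        return OptionalInt.empty();
    }

    public static void main(String[] args) {
        System.out.println(resolve(8, 4, true, () -> 16));     // explicit option set
        System.out.println(resolve(null, 4, true, () -> 16));  // environment default set
        System.out.println(resolve(null, -1, true, () -> 16)); // inference applies
        System.out.println(resolve(null, -1, false, () -> 16)); // left to Flink
    }
}
```

The point of the ordering is that inference sits below both user-visible settings, so enabling it by default cannot override a value the user configured.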

leaves12138 · 2025-02-10

Yes. In my view, if parallelism.default is set in the Flink SQL environment, we should not infer the parallelism; we should respect what the user configured. With INFER_SCAN_PARALLELISM defaulting to true we must be careful, otherwise users may be confused because the parallelism they set appears to have no effect.

We should respect the environment and the user as far as we can. Inference is only an auxiliary feature, and we can't depend on it.

JingsongLi

+1

yunfengzhou-hub · 2025-02-12T09:57:13Z

@leaves12138 Thanks for the update. From our offline discussion, I learned the following background about this PR.

In a Flink job where a Paimon source is directly followed by a Paimon sink, the Paimon writer operator's parallelism cannot be dynamically inferred by the Flink infra. The reason is that the writer operator sets its parallelism according to the upstream (source) operator, and the source operator's parallelism has already been explicitly decided.

So, in order to allow the Flink infra to change the parallelism of the writer operator, this PR changes the source operator's implementation. This approach seems indirect and incomplete, as the Flink infra still cannot set the parallelism in other situations. A better solution would be to change

writerDataStream.setParallelism(inputDataStream.getParallelism());

to

writerDataStream.getTransformation().setParallelism(inputDataStream.getParallelism(), false);

Setting the background motivation aside, this PR itself looks good to me, so I'm +1 on it. We may open a follow-up PR to use the API mentioned above so that the Flink infra can better manage writer operators.
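The effect of the second argument in the suggested call can be illustrated with a toy model. The sketch assumes that the boolean marks whether the parallelism counts as explicitly configured, and that a scheduler only adjusts operators whose parallelism is not configured. Operator and adjust are hypothetical stand-ins, not Flink classes.

```java
/** Toy model of "configured" versus "not configured" parallelism; not Flink's API. */
final class WriterParallelismSketch {

    /** Hypothetical stand-in for an operator's parallelism settings. */
    static final class Operator {
        private int parallelism = -1;
        private boolean configured = false;

        /** One-argument form: the value is treated as an explicit setting. */
        void setParallelism(int parallelism) {
            setParallelism(parallelism, true);
        }

        /** Two-argument form: the caller states whether the value is an explicit setting. */
        void setParallelism(int parallelism, boolean configured) {
            this.parallelism = parallelism;
            this.configured = configured;
        }

        int getParallelism() {
            return parallelism;
        }
    }

    /** Hypothetical scheduler step: only operators without a configured value are changed. */
    static void adjust(Operator op, int decided) {
        if (!op.configured) {
            op.parallelism = decided;
        }
    }

    public static void main(String[] args) {
        Operator hard = new Operator();
        hard.setParallelism(4); // copied from upstream as an explicit setting
        adjust(hard, 16);
        System.out.println("explicit setting after adjust: " + hard.getParallelism());

        Operator soft = new Operator();
        soft.setParallelism(4, false); // copied from upstream, but still adjustable
        adjust(soft, 16);
        System.out.println("non-configured setting after adjust: " + soft.getParallelism());
    }
}
```

Under these assumptions, the writer still starts from the upstream parallelism, but the value acts as a default rather than a constraint.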

leaves12138 added 4 commits January 21, 2025 17:09

[flink] Infer parallelism only in situation of parallelism is not set…

abc0084

… (which is -1)

Fix test failure

4b76b40

Fix test failure

c446685

Fix test failure

bf0c865

yunfengzhou-hub reviewed Feb 7, 2025

Fix comment

5de957c

leaves12138 requested a review from JingsongLi February 12, 2025 03:04

JingsongLi approved these changes Feb 12, 2025

yunfengzhou-hub mentioned this pull request Feb 13, 2025

[flink] Support parallelismConfigurable #5076

Open
