Support arbitrary number of WHEN THEN clauses in the scalar CASE function #14125

Merged (2 commits) on Oct 7, 2024

@yashmayya yashmayya commented Oct 1, 2024

  • Fixes #14126 (Scalar CASE function only supports up to 15 WHEN THEN clauses).
  • Currently, the CASE scalar function only supports up to 15 WHEN THEN clauses, and this restriction exists because there is a separate implementation for scalar CASE with 3 arguments, 4 arguments, and so on.
  • See #11566 (Support up to 15 boolean expressions for casewhen scalar function), for instance, which increased the number of supported clauses from 5 to 15.
  • The equivalent transform function, however, has no such restriction on the number of supported clauses:
    @Override
    public void init(List<TransformFunction> arguments, Map<String, ColumnContext> columnContextMap,
        boolean nullHandlingEnabled) {
      super.init(arguments, columnContextMap, nullHandlingEnabled);
      // Check that there are more than 2 arguments
      // Else statement can be omitted.
      if (arguments.size() < 2) {
        throw new IllegalArgumentException("At least two arguments are required for CASE-WHEN function");
      }
      int numWhenStatements = arguments.size() / 2;
      _whenStatements = new ArrayList<>(numWhenStatements);
      _thenStatements = new ArrayList<>(numWhenStatements);
      // Alternating WHEN and THEN clause, last one ELSE
      for (int i = 0; i < numWhenStatements; i++) {
        _whenStatements.add(arguments.get(i * 2));
        _thenStatements.add(arguments.get(i * 2 + 1));
      }
      if (arguments.size() % 2 != 0 && !isNullLiteral(arguments.get(arguments.size() - 1))) {
        _elseStatement = arguments.get(arguments.size() - 1);
      }
      _resultMetadata = new TransformResultMetadata(calculateResultType(), true, false);
      _computeThenStatements = new boolean[numWhenStatements];
    }
  • This leads to a very confusing user experience, since most users won't be aware of Pinot-internal implementation details such as transform functions vs. scalar functions and which situations use which type of function. As a result, some queries with more than 15 WHEN THEN clauses in a CASE function fail with an error like IllegalArgumentException: Unsupported function: CASE with argument types: ... or IllegalArgumentException: Unsupported function: CASE with ... arguments, whereas other queries with more than 15 WHEN THEN clauses succeed.
  • For instance, in the multi-stage query engine, a projection in a leaf stage will use the transform function variant, whereas a projection in an intermediate stage will use the scalar function variant. We need to support an arbitrary number of WHEN THEN clauses in the scalar CASE function as well.
  • Example of a query that currently fails in the multi-stage query engine (since intermediate stages can only use scalar functions, not transform functions):
SELECT CASE 
    WHEN val BETWEEN 1 AND 5 THEN 0
    WHEN val BETWEEN 5 AND 10 THEN 1
    ...
    ...
    WHEN val BETWEEN 95 AND 100 THEN 19
    ELSE 20 END
FROM (SELECT val FROM mytable ORDER BY val LIMIT 100);
  • Example of a query that currently fails in the single-stage query engine (since post-aggregation internally uses scalar functions):
SELECT CASE 
    WHEN SUM(val) BETWEEN 1 AND 5 THEN 0
    WHEN SUM(val) BETWEEN 5 AND 10 THEN 1
    ...
    ...
    WHEN SUM(val) BETWEEN 95 AND 100 THEN 19
    ELSE 20 END
FROM mytable;
  • #12278 (Support Array constructor using literal evaluation) added support for variadic-argument scalar functions. This patch makes use of that ability and also updates the PostAggregationFunction class so that it can use variadic-argument scalar functions, similar to the changes made at other call sites of scalar function invokers.
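To make the idea concrete, here is a minimal sketch of how a variadic scalar CASE could evaluate its arguments. It is an illustration only, not the code from this patch: the class name CaseWhenSketch is hypothetical, and treating a null WHEN condition as "not matched" is an assumption. The argument layout (alternating WHEN and THEN values, optional trailing ELSE) follows the transform function shown above.

```java
/** Hypothetical sketch of a variadic scalar CASE; not the actual Pinot implementation. */
final class CaseWhenSketch {
  private CaseWhenSketch() {
  }

  /**
   * Arguments alternate WHEN, THEN, WHEN, THEN, ... with an optional trailing ELSE.
   * Returns the THEN value of the first WHEN that is TRUE, otherwise the ELSE value,
   * otherwise null when no ELSE was supplied.
   */
  static Object caseWhen(Object... args) {
    int numPairs = args.length / 2;
    for (int i = 0; i < numPairs; i++) {
      // Assumption: a null (unknown) condition is treated as not matched.
      if (Boolean.TRUE.equals(args[2 * i])) {
        return args[2 * i + 1];
      }
    }
    // An odd argument count means a trailing ELSE value is present.
    return args.length % 2 == 1 ? args[args.length - 1] : null;
  }

  public static void main(String[] unused) {
    int val = 97;
    // Three WHEN THEN pairs plus an ELSE; any number of pairs is accepted.
    Object bucket = caseWhen(val <= 5, 0, val <= 10, 1, val <= 100, 19, 20);
    System.out.println(bucket); // prints 19
  }
}
```

Because the method takes a single Object... parameter, the number of WHEN THEN pairs is bounded only by the caller, which is the property the fixed-arity overloads lacked.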

codecov-commenter commented Oct 1, 2024

Codecov Report

Attention: Patch coverage is 91.66667% with 2 lines in your changes missing coverage. Please review.

Project coverage is 63.80%. Comparing base (59551e4) to head (3ea8e0d).
Report is 1140 commits behind head on master.

Files with missing lines Patch % Lines
...query/postaggregation/PostAggregationFunction.java 91.66% 0 Missing and 2 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #14125      +/-   ##
============================================
+ Coverage     61.75%   63.80%   +2.05%     
- Complexity      207     1532    +1325     
============================================
  Files          2436     2621     +185     
  Lines        133233   144028   +10795     
  Branches      20636    22046    +1410     
============================================
+ Hits          82274    91902    +9628     
- Misses        44911    45327     +416     
- Partials       6048     6799     +751     
Flag Coverage Δ
custom-integration1 100.00% <ø> (+99.99%) ⬆️
integration 100.00% <ø> (+99.99%) ⬆️
integration1 100.00% <ø> (+99.99%) ⬆️
integration2 0.00% <ø> (ø)
java-11 63.78% <91.66%> (+2.07%) ⬆️
java-21 63.69% <91.66%> (+2.06%) ⬆️
skip-bytebuffers-false 63.80% <91.66%> (+2.05%) ⬆️
skip-bytebuffers-true 63.67% <91.66%> (+35.94%) ⬆️
temurin 63.80% <91.66%> (+2.05%) ⬆️
unittests 63.80% <91.66%> (+2.05%) ⬆️
unittests1 55.47% <91.66%> (+8.58%) ⬆️
unittests2 34.34% <0.00%> (+6.61%) ⬆️


@yashmayya yashmayya force-pushed the case-when-scalar-variadic branch from 3791518 to d65c1de Compare October 1, 2024 08:28
@yashmayya yashmayya marked this pull request as ready for review October 1, 2024 09:22
@yashmayya yashmayya requested review from gortiz and Jackie-Jiang and removed request for gortiz October 1, 2024 09:22
@yashmayya yashmayya added the query label Oct 1, 2024


@ScalarFunction(nullableParameters = true, names = {"case", "caseWhen"})
public class CaseWhenScalarFunction implements PinotScalarFunction {
Contributor

Do we need to have a class for it? Seems it doesn't involve polymorphism, so isVarArg should be able to handle it. See ArrayFunctions.arrayValueConstructor as an example

Collaborator Author

We do need to validate that there are 2 or more arguments to the function and from what I can tell, we currently can't do that with the scalar function approach. It shouldn't be too difficult to add another parameter to the ScalarFunction annotation class for such a validation, but it seems cleaner to use the PinotScalarFunction class approach instead?

Contributor

Will the registered SqlStdOperatorTable.CASE perform this type matching? Ideally the type matching should happen in the SqlOperator instead of here

Collaborator Author

The standard SqlCaseOperator does have type matching logic, but actually I just realized that we don't need any validation for checking whether there are 2 or more arguments because the parser itself takes care of that in both the v1 and v2 engines. I've removed this class and moved it back as a method in ObjectFunctions.

return FUNCTION_INFO;
}

public static Object caseWhen(Object... objs) {
Contributor

In order to align the behavior with transform function, do we need to calculate the result type and do a cast in the end?

Collaborator Author

The transform function needs to do that because the interface requires it to define the result type; it is also able to do that fairly easily because all its input operands are also transform functions. Here, where all the arguments are simply objects, I'm guessing we'll need to use PinotDataType and this util method to determine the Pinot type for each argument and then try to find a common type (which will only work for numeric types)? Do we want to fail in case there are heterogeneous types (apart from the string representation of numeric types mixed with numeric types)?

Contributor

In v2, will the SqlOperator derive the correct return type, and then cast the return value to the desired type? If so, then we get consistent behavior in v2.

Collaborator Author

Yes, the SqlCaseOperator has similar logic to determine the "least restrictive" type across all the then clauses and the else clause and that is determined as the return type for the operator - https://github.com/apache/calcite/blob/78e873d39c0364f9f36055b9cbe0600dfad49c71/core/src/main/java/org/apache/calcite/sql/fun/SqlCaseOperator.java#L219-L323. We then cast the actual returned value from the scalar function to that computed type (or the Pinot equivalent type for the Calcite RelDataType) in the v2 engine. So it's only the v1 engine where the scalar function return type won't be determined based on the input operand types.
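As a rough illustration of the "find a common type, then cast" idea discussed in this thread, the sketch below picks the widest numeric type among the THEN and ELSE values and casts a result to it. Everything here is hypothetical: the SketchType enum, the widening order, and the fallback to STRING for heterogeneous values are assumptions made for the example. It does not reproduce Calcite's least-restrictive-type rules or Pinot's PinotDataType handling.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of result-type unification for CASE; not Calcite or Pinot code. */
final class ResultTypeSketch {
  /** Assumed lattice: numeric types in widening order, STRING as the fallback. */
  enum SketchType { INT, LONG, FLOAT, DOUBLE, STRING }

  private ResultTypeSketch() {
  }

  static SketchType typeOf(Object value) {
    if (value instanceof Integer) {
      return SketchType.INT;
    }
    if (value instanceof Long) {
      return SketchType.LONG;
    }
    if (value instanceof Float) {
      return SketchType.FLOAT;
    }
    if (value instanceof Double) {
      return SketchType.DOUBLE;
    }
    return SketchType.STRING;
  }

  /** Picks the widest type among the THEN/ELSE values; nulls are ignored. */
  static SketchType commonType(List<Object> results) {
    SketchType widest = null;
    for (Object result : results) {
      if (result == null) {
        continue;
      }
      SketchType type = typeOf(result);
      if (widest == null || type.compareTo(widest) > 0) {
        widest = type;
      }
    }
    // Assumption: with no non-null values, fall back to STRING.
    return widest == null ? SketchType.STRING : widest;
  }

  /** Casts a value to the target; a numeric target expects a Number input. */
  static Object castTo(Object value, SketchType target) {
    if (value == null) {
      return null;
    }
    switch (target) {
      case INT:
        return ((Number) value).intValue();
      case LONG:
        return ((Number) value).longValue();
      case FLOAT:
        return ((Number) value).floatValue();
      case DOUBLE:
        return ((Number) value).doubleValue();
      default:
        return String.valueOf(value);
    }
  }

  public static void main(String[] unused) {
    List<Object> thenAndElse = Arrays.asList(1, 2L, 3.5);
    SketchType type = commonType(thenAndElse);
    System.out.println(type + " " + castTo(1, type)); // prints DOUBLE 1.0
  }
}
```

Whether heterogeneous inputs should fall back to a string type, as assumed here, or fail outright is exactly the open question raised in the comment above.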

@yashmayya yashmayya force-pushed the case-when-scalar-variadic branch from d65c1de to 3ea8e0d Compare October 7, 2024 06:15
ColumnDataType resultType = FunctionUtils.getColumnDataType(_functionInvoker.getResultClass());
// Handle unrecognized result class with STRING
_resultType = resultType != null ? resultType : ColumnDataType.STRING;

if (!_functionInvoker.getMethod().isVarArgs()) {
Contributor

I understand this check, but is there a reason why we need to move the code after resultType is declared? As far as I can see the rest of the code is the same, right?

Collaborator Author

I think the diff is slightly confusing; all I did was separate the common parts from the part that should only be executed when the scalar function is not varargs.
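The split described here can be sketched with plain reflection. This is not the PostAggregationFunction code: the class VarArgsCheckSketch and its helper methods are hypothetical. The only API it relies on is java.lang.reflect.Method, whose isVarArgs() and getParameterCount() methods are standard. The argument-count validation runs only for fixed-arity methods, while a varargs method accepts any number of arguments.

```java
import java.lang.reflect.Method;

/** Hypothetical sketch of skipping arity checks for varargs methods. */
final class VarArgsCheckSketch {
  private VarArgsCheckSketch() {
  }

  // Two stand-in scalar functions: one fixed-arity, one variadic.
  static Object fixed(Object a, Object b) {
    return a;
  }

  static Object variadic(Object... args) {
    return args.length;
  }

  /** Looks up a method declared on this class, wrapping the checked exception. */
  static Method lookup(String name, Class<?>... parameterTypes) {
    try {
      return VarArgsCheckSketch.class.getDeclaredMethod(name, parameterTypes);
    } catch (NoSuchMethodException e) {
      throw new IllegalStateException(e);
    }
  }

  /** Validates the argument count only when the method is not varargs. */
  static void checkArgumentCount(Method method, int numArguments) {
    if (!method.isVarArgs()) {
      int expected = method.getParameterCount();
      if (expected != numArguments) {
        throw new IllegalArgumentException(
            "Expected " + expected + " arguments for " + method.getName() + ", got " + numArguments);
      }
    }
  }

  public static void main(String[] unused) {
    // A varargs method passes the check for any argument count.
    checkArgumentCount(lookup("variadic", Object[].class), 41);
    try {
      checkArgumentCount(lookup("fixed", Object.class, Object.class), 3);
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```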

@Jackie-Jiang Jackie-Jiang merged commit 5d9a794 into apache:master Oct 7, 2024
20 of 21 checks passed
@yashmayya yashmayya added the multi-stage Related to the multi-stage query engine label Oct 8, 2024
Labels
bugfix enhancement multi-stage Related to the multi-stage query engine query
Development

Successfully merging this pull request may close these issues.

Scalar CASE function only supports up to 15 WHEN THEN clauses
4 participants