Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Enhancement](function)make SUBSTRING_INDEX function DEPEND_ON_ARGUMENT #30392

Merged
merged 35 commits into from
Feb 2, 2024
Merged
Show file tree
Hide file tree
Changes from 33 commits
Commits
Show all changes
35 commits
Select commit Hold shift + click to select a range
6634f14
[fix](function) make STRLEFT and STRRIGHT function DEPEND_ON_ARGUMENT
koarz Dec 13, 2023
06a2b86
format
koarz Dec 13, 2023
ae4796a
fix left and right
koarz Dec 14, 2023
908e67a
fix
koarz Dec 14, 2023
3c822a5
format
koarz Dec 14, 2023
e036cd8
fix
koarz Dec 14, 2023
1bb782a
fix
koarz Dec 14, 2023
9911e5a
fix
koarz Dec 14, 2023
e2d92c7
format
koarz Dec 14, 2023
dcc5b5c
update doc
koarz Dec 14, 2023
cbff39b
fix
koarz Dec 15, 2023
ed500cb
fix docs
koarz Dec 15, 2023
386b65b
fix docs
koarz Dec 15, 2023
68a471a
add case
koarz Dec 18, 2023
17edeb7
fix
koarz Dec 19, 2023
ca96d22
Merge branch 'master' of github.com:koarz/doris
koarz Jan 15, 2024
e6bbf68
update
koarz Jan 16, 2024
c6de052
version
koarz Jan 16, 2024
d8d82f0
Merge branch 'master' into koarz
koarz Jan 21, 2024
9e1e2ec
fix
koarz Jan 23, 2024
c6a179b
Merge branch 'koarz' of github.com:koarz/doris into koarz
koarz Jan 23, 2024
c9f924e
Merge branch 'master' into koarz
koarz Jan 23, 2024
2796aa6
fix
koarz Jan 23, 2024
ee9715b
Merge branch 'koarz' of github.com:koarz/doris into koarz
koarz Jan 23, 2024
f5c061c
fix
koarz Jan 23, 2024
52c0692
fix old
koarz Jan 23, 2024
6009c00
Merge branch 'apache:master' into koarz
koarz Jan 24, 2024
dfbf245
fix
koarz Jan 24, 2024
889dc01
fix
koarz Jan 25, 2024
f54917d
Merge branch 'koarz' of github.com:koarz/doris into koarz
koarz Jan 25, 2024
5d27b5a
Merge branch 'koarz' of github.com:koarz/doris into koarz
koarz Jan 25, 2024
78b7553
format
koarz Jan 25, 2024
b7ecc51
fix
koarz Jan 26, 2024
1171516
fix
koarz Jan 31, 2024
2e19b44
fix
koarz Jan 31, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion be/src/agent/be_exec_version_manager.h
Original file line number Diff line number Diff line change
Expand Up @@ -64,7 +64,7 @@ class BeExecVersionManager {
* c. cleared old version of Version 2.
* d. unix_timestamp function support timestamp with float for datetimev2, and change nullable mode.
* e. change shuffle serialize/deserialize way
* f. the right function outputs NULL when the function contains NULL, substr function returns empty if start > str.length.
* f. the right function outputs NULL when the function contains NULL, substr function returns empty if start > str.length, and change some function nullable mode.
*/
constexpr inline int BeExecVersionManager::max_be_exec_version = 3;
constexpr inline int BeExecVersionManager::min_be_exec_version = 0;
Expand Down
1 change: 1 addition & 0 deletions be/src/vec/functions/function_string.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1020,6 +1020,7 @@ void register_function_string(SimpleFunctionFactory& factory) {
factory.register_alternative_function<FunctionSubstringOld<Substr2ImplOld>>();
factory.register_alternative_function<FunctionLeftOld>();
factory.register_alternative_function<FunctionRightOld>();
factory.register_alternative_function<FunctionSubstringIndexOld>();

factory.register_alias(FunctionLeft::name, "strleft");
factory.register_alias(FunctionRight::name, "strright");
Expand Down
157 changes: 155 additions & 2 deletions be/src/vec/functions/function_string.h
Original file line number Diff line number Diff line change
Expand Up @@ -1866,6 +1866,159 @@ class FunctionSubstringIndex : public IFunction {
String get_name() const override { return name; }
size_t get_number_of_arguments() const override { return 3; }

DataTypePtr get_return_type_impl(const DataTypes& arguments) const override {
return std::make_shared<DataTypeString>();
}

bool use_default_implementation_for_nulls() const override { return true; }

Status execute_impl(FunctionContext* context, Block& block, const ColumnNumbers& arguments,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: function 'execute_impl' has cognitive complexity of 86 (threshold 50) [readability-function-cognitive-complexity]

    Status execute_impl(FunctionContext* context, Block& block, const ColumnNumbers& arguments,
           ^
Additional context

be/src/vec/functions/function_string.h:1888: +1, including nesting penalty of 0, nesting level increased to 1

        if (part_number == 0 || delimiter_size == 0) {
        ^

be/src/vec/functions/function_string.h:1888: +1

        if (part_number == 0 || delimiter_size == 0) {
                             ^

be/src/vec/functions/function_string.h:1889: +2, including nesting penalty of 1, nesting level increased to 2

            for (size_t i = 0; i < input_rows_count; ++i) {
            ^

be/src/vec/functions/function_string.h:1892: +1, nesting level increased to 1

        } else if (part_number > 0) {
               ^

be/src/vec/functions/function_string.h:1893: +2, including nesting penalty of 1, nesting level increased to 2

            if (delimiter_size == 1) {
            ^

be/src/vec/functions/function_string.h:1895: +3, including nesting penalty of 2, nesting level increased to 3

                for (size_t i = 0; i < input_rows_count; ++i) {
                ^

be/src/vec/functions/function_string.h:1899: +4, including nesting penalty of 3, nesting level increased to 4

                    while (num < part_number) {
                    ^

be/src/vec/functions/function_string.h:1903: +5, including nesting penalty of 4, nesting level increased to 5

                        if (pos != nullptr) {
                        ^

be/src/vec/functions/function_string.h:1906: +1, nesting level increased to 5

                        } else {
                          ^

be/src/vec/functions/function_string.h:1908: +6, including nesting penalty of 5, nesting level increased to 6

                            num = (num == 0) ? 0 : num + 1;
                                             ^

be/src/vec/functions/function_string.h:1913: +4, including nesting penalty of 3, nesting level increased to 4

                    if (num == part_number) {
                    ^

be/src/vec/functions/function_string.h:1918: +1, nesting level increased to 4

                    } else {
                      ^

be/src/vec/functions/function_string.h:1923: +1, nesting level increased to 2

            } else {
              ^

be/src/vec/functions/function_string.h:1926: +3, including nesting penalty of 2, nesting level increased to 3

                for (size_t i = 0; i < input_rows_count; ++i) {
                ^

be/src/vec/functions/function_string.h:1930: +4, including nesting penalty of 3, nesting level increased to 4

                    while (num < part_number) {
                    ^

be/src/vec/functions/function_string.h:1934: +5, including nesting penalty of 4, nesting level increased to 5

                        if (pos < str.data + str.size) {
                        ^

be/src/vec/functions/function_string.h:1937: +1, nesting level increased to 5

                        } else {
                          ^

be/src/vec/functions/function_string.h:1939: +6, including nesting penalty of 5, nesting level increased to 6

                            num = (num == 0) ? 0 : num + 1;
                                             ^

be/src/vec/functions/function_string.h:1944: +4, including nesting penalty of 3, nesting level increased to 4

                    if (num == part_number) {
                    ^

be/src/vec/functions/function_string.h:1949: +1, nesting level increased to 4

                    } else {
                      ^

be/src/vec/functions/function_string.h:1955: +1, nesting level increased to 1

        } else {
          ^

be/src/vec/functions/function_string.h:1958: +2, including nesting penalty of 1, nesting level increased to 2

            for (size_t i = 0; i < input_rows_count; ++i) {
            ^

be/src/vec/functions/function_string.h:1965: +3, including nesting penalty of 2, nesting level increased to 3

                while (num <= part_number && offset >= 0) {
                ^

be/src/vec/functions/function_string.h:1965: +1

                while (num <= part_number && offset >= 0) {
                                          ^

be/src/vec/functions/function_string.h:1967: +4, including nesting penalty of 3, nesting level increased to 4

                    if (offset != -1) {
                    ^

be/src/vec/functions/function_string.h:1968: +5, including nesting penalty of 4, nesting level increased to 5

                        if (++num == part_number) {
                        ^

be/src/vec/functions/function_string.h:1974: +1, nesting level increased to 4

                    } else {
                      ^

be/src/vec/functions/function_string.h:1978: +3, including nesting penalty of 2, nesting level increased to 3

                num = (offset == -1 && num != 0) ? num + 1 : num;
                                                 ^

be/src/vec/functions/function_string.h:1978: +1

                num = (offset == -1 && num != 0) ? num + 1 : num;
                                    ^

be/src/vec/functions/function_string.h:1980: +3, including nesting penalty of 2, nesting level increased to 3

                if (num == part_number) {
                ^

be/src/vec/functions/function_string.h:1981: +4, including nesting penalty of 3, nesting level increased to 4

                    if (offset == -1) {
                    ^

be/src/vec/functions/function_string.h:1984: +1, nesting level increased to 4

                    } else {
                      ^

be/src/vec/functions/function_string.h:1990: +1, nesting level increased to 3

                } else {
                  ^

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: function 'execute_impl' exceeds recommended size/complexity thresholds [readability-function-size]

    Status execute_impl(FunctionContext* context, Block& block, const ColumnNumbers& arguments,
           ^
Additional context

be/src/vec/functions/function_string.h:1862: 136 lines including whitespace and comments (threshold 80)

    Status execute_impl(FunctionContext* context, Block& block, const ColumnNumbers& arguments,
           ^

size_t result, size_t input_rows_count) const override {
DCHECK_EQ(arguments.size(), 3);

// Create a zero column to simply implement
auto res = ColumnString::create();

auto& res_offsets = res->get_offsets();
auto& res_chars = res->get_chars();
res_offsets.resize(input_rows_count);
ColumnPtr content_column;
bool content_const = false;
std::tie(content_column, content_const) =
unpack_if_const(block.get_by_position(arguments[0]).column);

const auto* str_col = assert_cast<const ColumnString*>(content_column.get());

[[maybe_unused]] const auto& [delimiter_col, delimiter_const] =
unpack_if_const(block.get_by_position(arguments[1]).column);
auto delimiter = delimiter_col->get_data_at(0);
int32_t delimiter_size = delimiter.size;

[[maybe_unused]] const auto& [part_num_col, part_const] =
unpack_if_const(block.get_by_position(arguments[2]).column);
auto part_number = *((int*)part_num_col->get_data_at(0).data);

if (part_number == 0 || delimiter_size == 0) {
for (size_t i = 0; i < input_rows_count; ++i) {
StringOP::push_empty_string(i, res_chars, res_offsets);
}
} else if (part_number > 0) {
if (delimiter_size == 1) {
// If delimiter is a char, use memchr to split
for (size_t i = 0; i < input_rows_count; ++i) {
auto str = str_col->get_data_at(i);
int32_t offset = -1;
int32_t num = 0;
while (num < part_number) {
size_t n = str.size - offset - 1;
const char* pos = reinterpret_cast<const char*>(
memchr(str.data + offset + 1, delimiter.data[0], n));
if (pos != nullptr) {
offset = pos - str.data;
num++;
} else {
offset = str.size;
num = (num == 0) ? 0 : num + 1;
break;
}
}

if (num == part_number) {
StringOP::push_value_string(
std::string_view {reinterpret_cast<const char*>(str.data),
(size_t)offset},
i, res_chars, res_offsets);
} else {
StringOP::push_value_string(std::string_view(str.data, str.size), i,
res_chars, res_offsets);
}
}
} else {
StringRef delimiter_ref(delimiter);
StringSearch search(&delimiter_ref);
for (size_t i = 0; i < input_rows_count; ++i) {
auto str = str_col->get_data_at(i);
int32_t offset = -delimiter_size;
int32_t num = 0;
while (num < part_number) {
size_t n = str.size - offset - delimiter_size;
// search first match delimter_ref index from src string among str_offset to end
const char* pos = search.search(str.data + offset + delimiter_size, n);
if (pos < str.data + str.size) {
offset = pos - str.data;
num++;
} else {
offset = str.size;
num = (num == 0) ? 0 : num + 1;
break;
}
}

if (num == part_number) {
StringOP::push_value_string(
std::string_view {reinterpret_cast<const char*>(str.data),
(size_t)offset},
i, res_chars, res_offsets);
} else {
StringOP::push_value_string(std::string_view(str.data, str.size), i,
res_chars, res_offsets);
}
}
}
} else {
// if part_number is negative
part_number = -part_number;
for (size_t i = 0; i < input_rows_count; ++i) {
auto str = str_col->get_data_at(i);
auto str_str = str.to_string();
int32_t offset = str.size;
int32_t pre_offset = offset;
int32_t num = 0;
auto substr = str_str;
while (num <= part_number && offset >= 0) {
offset = (int)substr.rfind(delimiter, offset);
if (offset != -1) {
if (++num == part_number) {
break;
}
pre_offset = offset;
offset = offset - 1;
substr = str_str.substr(0, pre_offset);
} else {
break;
}
}
num = (offset == -1 && num != 0) ? num + 1 : num;

if (num == part_number) {
if (offset == -1) {
StringOP::push_value_string(std::string_view(str.data, str.size), i,
res_chars, res_offsets);
} else {
StringOP::push_value_string(
std::string_view {str.data + offset + delimiter_size,
str.size - offset - delimiter_size},
i, res_chars, res_offsets);
}
} else {
StringOP::push_value_string(std::string_view(str.data, str.size), i, res_chars,
res_offsets);
}
}
}

block.get_by_position(result).column = std::move(res);
return Status::OK();
}
};

class FunctionSubstringIndexOld : public IFunction {
public:
static constexpr auto name = "substring_index";
static FunctionPtr create() { return std::make_shared<FunctionSubstringIndex>(); }
String get_name() const override { return name; }
size_t get_number_of_arguments() const override { return 3; }

DataTypePtr get_return_type_impl(const DataTypes& arguments) const override {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: method 'get_return_type_impl' can be made static [readability-convert-member-functions-to-static]

Suggested change
DataTypePtr get_return_type_impl(const DataTypes& arguments) const override {
static DataTypePtr get_return_type_impl(const DataTypes& arguments) override {

return make_nullable(std::make_shared<DataTypeString>());
}
Expand All @@ -1889,15 +2042,15 @@ class FunctionSubstringIndex : public IFunction {
std::tie(content_column, content_const) =
unpack_if_const(block.get_by_position(arguments[0]).column);

if (auto* nullable = check_and_get_column<const ColumnNullable>(*content_column)) {
if (const auto* nullable = check_and_get_column<const ColumnNullable>(*content_column)) {
// Danger: Here must dispose the null map data first! Because
// argument_columns[0]=nullable->get_nested_column_ptr(); will release the mem
// of column nullable mem of null map
VectorizedUtils::update_null_map(null_map->get_data(), nullable->get_null_map_data());
content_column = nullable->get_nested_column_ptr();
}

auto str_col = assert_cast<const ColumnString*>(content_column.get());
const auto* str_col = assert_cast<const ColumnString*>(content_column.get());

[[maybe_unused]] const auto& [delimiter_col, delimiter_const] =
unpack_if_const(block.get_by_position(arguments[1]).column);
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -20,8 +20,8 @@
import org.apache.doris.catalog.FunctionSignature;
import org.apache.doris.nereids.exceptions.AnalysisException;
import org.apache.doris.nereids.trees.expressions.Expression;
import org.apache.doris.nereids.trees.expressions.functions.AlwaysNullable;
import org.apache.doris.nereids.trees.expressions.functions.ExplicitlyCastableSignature;
import org.apache.doris.nereids.trees.expressions.functions.PropagateNullable;
import org.apache.doris.nereids.trees.expressions.shape.TernaryExpression;
import org.apache.doris.nereids.trees.expressions.visitor.ExpressionVisitor;
import org.apache.doris.nereids.types.IntegerType;
Expand All @@ -37,7 +37,7 @@
* ScalarFunction 'substring_index'. This class is generated by GenerateFunction.
*/
public class SubstringIndex extends ScalarFunction
implements TernaryExpression, ExplicitlyCastableSignature, AlwaysNullable {
implements TernaryExpression, ExplicitlyCastableSignature, PropagateNullable {

public static final List<FunctionSignature> SIGNATURES = ImmutableList.of(
FunctionSignature.ret(VarcharType.SYSTEM_DEFAULT)
Expand Down
4 changes: 2 additions & 2 deletions gensrc/script/doris_builtins_functions.py
Original file line number Diff line number Diff line change
Expand Up @@ -1603,7 +1603,7 @@
[['money_format'], 'VARCHAR', ['DECIMAL128'], ''],
[['split_by_string'],'ARRAY_VARCHAR',['STRING','STRING'], ''],
[['split_part'], 'VARCHAR', ['VARCHAR', 'VARCHAR', 'INT'], 'ALWAYS_NULLABLE'],
[['substring_index'], 'VARCHAR', ['VARCHAR', 'VARCHAR', 'INT'], 'ALWAYS_NULLABLE'],
[['substring_index'], 'VARCHAR', ['VARCHAR', 'VARCHAR', 'INT'], 'DEPEND_ON_ARGUMENT'],
[['extract_url_parameter'], 'VARCHAR', ['VARCHAR', 'VARCHAR'], ''],

[['sub_replace'], 'VARCHAR', ['VARCHAR', 'VARCHAR', 'INT'], 'ALWAYS_NULLABLE'],
Expand Down Expand Up @@ -1658,7 +1658,7 @@
[['money_format'], 'STRING', ['DECIMAL64'], ''],
[['money_format'], 'STRING', ['DECIMAL128'], ''],
[['split_part'], 'STRING', ['STRING', 'STRING', 'INT'], 'ALWAYS_NULLABLE'],
[['substring_index'], 'STRING', ['STRING', 'STRING', 'INT'], 'ALWAYS_NULLABLE']
[['substring_index'], 'STRING', ['STRING', 'STRING', 'INT'], 'DEPEND_ON_ARGUMENT']
],


Expand Down
Loading