Signature::Coercible with user defined implicit casting #14440

Open
wants to merge 34 commits into
base: main
from

Conversation

jayzhan211
Contributor

@jayzhan211 jayzhan211 commented Feb 3, 2025

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added logical-expr Logical plan and expressions functions labels Feb 3, 2025

#[derive(Debug, Clone, Eq, PartialOrd, Hash)]
pub struct ParameterType {
pub param_type: LogicalTypeRef,
Contributor Author

param_type: target type of function signature
allowed_casts: implicit coercion allowed to cast to target type.

For example,
param_type: string
allowed_casts: binary, int

Valid: all are cast to string
func(string) 
func(binary)
func(int)
Invalid:
func(float or other types)
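
The rule described above can be sketched in a few lines of self-contained Rust. This is not the code in the PR: `Ty`, the trimmed-down `ParameterType`, `coerce`, and `string_param` are stand-ins assumed for illustration, with a plain enum in place of `LogicalTypeRef`.

```rust
/// Simplified stand-in for a logical type (hypothetical; the PR uses `LogicalTypeRef`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ty {
    String,
    Binary,
    Int,
    Float,
}

/// Target type of one parameter plus the source types that may be implicitly cast to it.
#[derive(Debug, Clone)]
pub struct ParameterType {
    pub param_type: Ty,
    pub allowed_casts: Vec<Ty>,
}

impl ParameterType {
    /// Returns the type the argument ends up with, or `None` if the argument is rejected.
    pub fn coerce(&self, arg: Ty) -> Option<Ty> {
        if arg == self.param_type || self.allowed_casts.contains(&arg) {
            Some(self.param_type)
        } else {
            None
        }
    }
}

/// The example from the comment: target `string`, casts allowed from `binary` and `int`.
pub fn string_param() -> ParameterType {
    ParameterType {
        param_type: Ty::String,
        allowed_casts: vec![Ty::Binary, Ty::Int],
    }
}
```

With this sketch, `string_param().coerce(Ty::Int)` yields the string type, while a float argument is rejected.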

Contributor

I am worried that Vec<Vec<ParameterType>> might become challenging to reason about. It would also be unclear when to use this new signature rather than Signature::Coercible 🤔

Given that this seems very similar to Signature::Coercible, and Signature::Coercible mirrors what we want pretty well, we could add new information there about the allowed coercions rather than an entirely new type signature. Something like extending Coercible with rules that could be used to coerce to the target type:

pub enum TypeSignature {
...
    Coercible(Vec<Coercion>),
...
}

Where Coercion looks like

struct Coercion {
  desired_type: TypeSignatureClass,
  allowed_casts: ... // also includes an option for User Defined?
}

🤔

Contributor Author

I would like to make a breaking change to Signature::Coercible too; my only concern is whether it would cause regressions or have a large impact on downstream users.

Contributor Author

@shehabgamin If we replace CoercibleV2 with Signature::Coercible, would it be a large change that we should be concerned about?

Contributor Author

If not, I'll replace Signature::Coercible with Signature::CoercibleV2.

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Feb 6, 2025
@jayzhan211 jayzhan211 marked this pull request as draft February 7, 2025 00:38
@github-actions github-actions bot added the common Related to common crate label Feb 7, 2025
@jayzhan211 jayzhan211 changed the title Draft: coercible signature Signature::Coercible with user defined implicit casting Feb 7, 2025
vec![
TypeSignatureClass::Native(logical_string()),
TypeSignatureClass::Native(logical_int64()),
Contributor Author

In the v1 version, any integer is cast to the default cast type of the defined NativeType, which is i64 in this case.

Contributor

I do like how much more explicit the new formulation is

// Accept all integer types but cast them to i64
Coercion {
desired_type: TypeSignatureClass::Native(logical_int64()),
allowed_casts: vec![TypeSignatureClass::Integer],
Contributor Author

Without this, i32 is rejected.
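
To make the i32 case concrete, here is a minimal sketch of why the `allowed_casts` entry matters. The enums below are trimmed-down stand-ins, not the real `NativeType` and `TypeSignatureClass`, and `matches`, `coerce`, `int64_only`, and `int64_from_any_integer` are hypothetical helpers assumed for illustration.

```rust
/// Hypothetical, trimmed-down native types (the real `NativeType` has many more variants).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NativeType {
    Int32,
    Int64,
    String,
}

/// Either one concrete native type or a whole class of types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TypeSignatureClass {
    Native(NativeType),
    Integer,
}

impl TypeSignatureClass {
    /// True if `arg` belongs to this class.
    pub fn matches(&self, arg: NativeType) -> bool {
        match self {
            TypeSignatureClass::Native(t) => *t == arg,
            TypeSignatureClass::Integer => {
                matches!(arg, NativeType::Int32 | NativeType::Int64)
            }
        }
    }
}

pub struct Coercion {
    pub desired_type: TypeSignatureClass,
    pub allowed_casts: Vec<TypeSignatureClass>,
}

impl Coercion {
    /// Returns the resulting type, or `None` if the argument is rejected.
    pub fn coerce(&self, arg: NativeType) -> Option<NativeType> {
        if self.desired_type.matches(arg) {
            return Some(arg); // already the desired type, no cast needed
        }
        let allowed = self.allowed_casts.iter().any(|c| c.matches(arg));
        match self.desired_type {
            // Cast to the concrete desired type only when the source is allowed.
            TypeSignatureClass::Native(target) if allowed => Some(target),
            _ => None,
        }
    }
}

/// Accepts i64 only.
pub fn int64_only() -> Coercion {
    Coercion {
        desired_type: TypeSignatureClass::Native(NativeType::Int64),
        allowed_casts: vec![],
    }
}

/// Accepts all integer types but casts them to i64.
pub fn int64_from_any_integer() -> Coercion {
    Coercion {
        desired_type: TypeSignatureClass::Native(NativeType::Int64),
        allowed_casts: vec![TypeSignatureClass::Integer],
    }
}
```

In this sketch `int64_only()` rejects an i32 argument, while `int64_from_any_integer()` accepts it and reports i64 as the resulting type.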

signature: Signature::coercible_v2(
vec![Coercion {
desired_type: TypeSignatureClass::Native(logical_string()),
allowed_casts: vec![TypeSignatureClass::Native(logical_binary())],
Contributor Author

Coercing binary to string is now easily customizable

@jayzhan211 jayzhan211 marked this pull request as ready for review February 7, 2025 03:57
@jayzhan211 jayzhan211 added the api change Changes the API exposed to users of the crate label Feb 7, 2025
@jayzhan211 jayzhan211 requested a review from alamb February 7, 2025 13:04
@alamb
Contributor

alamb commented Feb 7, 2025

I will try and review this carefully over the weekend

Maybe @shehabgamin has some time to take a look too

@shehabgamin
Contributor

I will try and review this carefully over the weekend

Maybe @shehabgamin has some time to take a look too

I will review over the weekend as well!

@jayzhan211
Contributor Author

jayzhan211 commented Feb 12, 2025

@jayzhan211 Should we port any relevant tests from the old PR? #14268

or close this first then revisit #14268

BTW, most of the binary-to-string conversions mentioned in #14268 might not be ideal for DataFusion. We should reconsider them.

@alamb
Contributor

alamb commented Feb 13, 2025

@jayzhan211 is this PR ready for another review?

@jayzhan211
Contributor Author

@jayzhan211 is this PR ready for another review?

yes, please

Contributor

@alamb alamb left a comment

In general I think this PR moves the coercion story forward significantly. Thank you @jayzhan211 and @shehabgamin

I think the Coercion struct (maybe enum) is really close to the general-purpose API for describing coercions that we have been desperately lacking in DataFusion for many years. Let me try some things and report back.

@@ -198,6 +198,11 @@ impl LogicalType for NativeType {
TypeSignature::Native(self)
}

/// Returns the default casted type for the given arrow type
Contributor

😍

///
/// Get all possible types for `information_schema` from the given `TypeSignature`
//
// TODO: Make this function private
Contributor

we could potentially mark it as deprecated and make the new (non pub) version with a different name

Maybe something like this (can do in a follow on PR)

    pub fn get_example_types(&self) -> Vec<Vec<DataType>> {

Contributor Author

I think we would need NativeType for get_example_types

match signature_classes {
TypeSignatureClass::Native(l) => get_data_types(l.native()),
TypeSignatureClass::Timestamp => {
vec![
Contributor

Is listing all possible DataType combination helpful?

I don't think this is helpful to be honest -- just the combination of classes

@@ -431,6 +463,35 @@ impl TypeSignature {
}
}

fn get_possible_types_from_signature_classes(
Contributor

I agree with @shehabgamin on both points

@@ -431,6 +463,35 @@ impl TypeSignature {
}
}

fn get_possible_types_from_signature_classes(
Contributor

Suggested name:

Suggested change
fn get_possible_types_from_signature_classes(
fn get_example_types(

And we can put it on TypeSignatureClass

vec![DataType::Date64]
}
TypeSignatureClass::Time => {
vec![DataType::Time64(TimeUnit::Nanosecond)]
Contributor

I think it is ok to have one example type as this is only used for information schema -- the naming is super confusing though -- I think renaming the functions would make it much better
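
A rough sketch of what a renamed `get_example_types` method on `TypeSignatureClass` could look like, returning one representative type per class. The `DataType` and `TimeUnit` enums here are simplified stand-ins for the Arrow types (the real `Timestamp` also carries a timezone), and the `Timestamp` and `Integer` arms are assumptions; only the `Date` and `Time` examples appear in the diff under review.

```rust
/// Simplified stand-ins for the Arrow types (assumption: the real ones are richer).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TimeUnit {
    Nanosecond,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum DataType {
    Int64,
    Date64,
    Time64(TimeUnit),
    Timestamp(TimeUnit),
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TypeSignatureClass {
    Timestamp,
    Date,
    Time,
    Integer,
}

impl TypeSignatureClass {
    /// One representative type per class, intended only for display in
    /// `information_schema`, not an exhaustive list of accepted types.
    pub fn get_example_types(&self) -> Vec<DataType> {
        match self {
            // Assumed example; the real list in the PR may differ.
            TypeSignatureClass::Timestamp => vec![DataType::Timestamp(TimeUnit::Nanosecond)],
            TypeSignatureClass::Date => vec![DataType::Date64],
            TypeSignatureClass::Time => vec![DataType::Time64(TimeUnit::Nanosecond)],
            // Assumed example for the integer class.
            TypeSignatureClass::Integer => vec![DataType::Int64],
        }
    }
}
```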

for (current_type, param) in current_types.iter().zip(param_types.iter()) {
let current_logical_type: NativeType = current_type.into();

fn is_matched_type(
Contributor

this function also seems like it would make more sense as a method on TypeSignatureClass

vec![
TypeSignatureClass::Native(logical_string()),
TypeSignatureClass::Native(logical_int64()),
Contributor

I do like how much more explicit the new formulation is

@@ -460,6 +521,44 @@ fn get_data_types(native_type: &NativeType) -> Vec<DataType> {
}
}

#[derive(Debug, Clone, Eq, PartialOrd)]
pub struct Coercion {
Contributor

@jayzhan211 , I believe @alamb 's question (please correct me if I'm wrong) is about creating functionality for a downstream user to override the default signature of a UDF in order to provide their own coercion rules.

This is what I was getting at

What do you think about making Coercion an enum like this (I can make a follow on PR / propose changes to this PR):

#[derive(Debug, Clone, Eq, PartialOrd)]
pub enum Coercion {
    /// The argument type must match desired_type
    Exact {
        desired_type: TypeSignatureClass,
    },

    /// The argument type must be coercible to the desired type using the implicit coercion
    Implicit {
        desired_type: TypeSignatureClass,
        implicit_coercion: ImplicitCoercion,
    },
}

Then we can eventually maybe add a "user defined coercion" variant

Let me make a PR to see what this would look like.
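
For discussion, here is a self-contained sketch of how such an enum might be resolved against an argument type. The layout of `ImplicitCoercion` (field names `allowed_source_types` and `default_casted_type`), the reduced `NativeType` and `TypeSignatureClass` enums, and the `matches` and `resolve` helpers are all assumptions made for illustration, not the final API.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NativeType {
    Int32,
    Int64,
    String,
    Binary,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TypeSignatureClass {
    Native(NativeType),
    Integer,
}

impl TypeSignatureClass {
    pub fn matches(&self, arg: NativeType) -> bool {
        match self {
            TypeSignatureClass::Native(t) => *t == arg,
            TypeSignatureClass::Integer => {
                matches!(arg, NativeType::Int32 | NativeType::Int64)
            }
        }
    }
}

/// Assumed shape: which source classes may be cast, and what they are cast to.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ImplicitCoercion {
    pub allowed_source_types: Vec<TypeSignatureClass>,
    pub default_casted_type: NativeType,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Coercion {
    /// The argument type must match `desired_type`.
    Exact { desired_type: TypeSignatureClass },
    /// The argument may also be one of the allowed source types, which is then cast.
    Implicit {
        desired_type: TypeSignatureClass,
        implicit_coercion: ImplicitCoercion,
    },
}

impl Coercion {
    /// Returns the type the argument resolves to, or `None` if it is rejected.
    pub fn resolve(&self, arg: NativeType) -> Option<NativeType> {
        match self {
            Coercion::Exact { desired_type } => desired_type.matches(arg).then_some(arg),
            Coercion::Implicit { desired_type, implicit_coercion } => {
                if desired_type.matches(arg) {
                    Some(arg)
                } else if implicit_coercion
                    .allowed_source_types
                    .iter()
                    .any(|c| c.matches(arg))
                {
                    Some(implicit_coercion.default_casted_type)
                } else {
                    None
                }
            }
        }
    }
}
```

A "user defined coercion" variant, if one were added later, would be a third arm in the same `match`.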

Contributor

@alamb alamb left a comment

Here is one software engineering PR:

@jayzhan211
Contributor Author

The enum looks good to me, but "adding a user defined coercion variant" is not clear to me. I think the Coercible signature is already user-defined: users can change the types they need and set the signature on their own function 🤔

@alamb
Contributor

alamb commented Feb 14, 2025

The enum looks good to me, but "adding a user defined coercion variant" is not clear to me. I think the Coercible signature is already user-defined: users can change the types they need and set the signature on their own function 🤔

It is true that TypeSignature::Coercible allows users to define whatever they want.

I was thinking of @shehabgamin's use case, where he wanted to implement coercion with slightly different rules about when to cast Integer --> String. He could implement that with an entirely custom coercion, but it might be nice to just adjust the rules somehow (and reuse most of the rest of the logic) 🤔

@jayzhan211
Contributor Author

jayzhan211 commented Feb 14, 2025

The enum looks good to me, but "adding a user defined coercion variant" is not clear to me. I think the Coercible signature is already user-defined: users can change the types they need and set the signature on their own function 🤔

It is true that TypeSignature::Coercible allows users to define whatever they want.

I was thinking of @shehabgamin's use case, where he wanted to implement coercion with slightly different rules about when to cast Integer --> String. He could implement that with an entirely custom coercion, but it might be nice to just adjust the rules somehow (and reuse most of the rest of the logic) 🤔

Modifying Signature::Coercible alone won't achieve what @shehabgamin wants because, regardless of the approach, they would still need to:

  1. Fork the entire function
  2. Modify its signature
  3. Register it in the function map

Even if we introduce another variant, we would still maintain a version for DataFusion, while others would have to go through the same process to customize it for their needs.

A better solution would be to separate the function signature from the function itself, as shown below:

pub struct ScalarFunctionExpr {
    fun: Arc<ScalarUDF>,
    name: String,
    args: Vec<Arc<dyn PhysicalExpr>>,
    return_type: DataType,
    nullable: bool,

    // new
    signature: TypeSignature,
}

function_expr.with_signature(customized_signature)

This change fundamentally impacts how scalar functions are structured...
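
To illustrate the idea of carrying the signature on the expression rather than on the function, here is a toy, self-contained sketch. `ScalarUdf`, `TypeSignature`, and the trimmed `ScalarFunctionExpr` below are placeholders assumed for the example, and `with_signature` is the hypothetical method proposed above, not an existing DataFusion API.

```rust
use std::sync::Arc;

/// Placeholder for a real type signature (hypothetical, heavily simplified).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TypeSignature(pub Vec<&'static str>);

/// Placeholder for `ScalarUDF`: the implementation plus its default signature.
pub struct ScalarUdf {
    pub name: &'static str,
    pub default_signature: TypeSignature,
    pub eval: fn(&str) -> String,
}

/// The expression owns a signature that starts as the function's default
/// and can be overridden without forking the function.
pub struct ScalarFunctionExpr {
    fun: Arc<ScalarUdf>,
    signature: TypeSignature,
}

impl ScalarFunctionExpr {
    pub fn new(fun: Arc<ScalarUdf>) -> Self {
        let signature = fun.default_signature.clone();
        Self { fun, signature }
    }

    /// Replace only the signature; the shared function implementation is untouched.
    pub fn with_signature(mut self, signature: TypeSignature) -> Self {
        self.signature = signature;
        self
    }

    pub fn signature(&self) -> &TypeSignature {
        &self.signature
    }

    pub fn call(&self, arg: &str) -> String {
        (self.fun.eval)(arg)
    }
}

fn upper(s: &str) -> String {
    s.to_uppercase()
}

/// A toy function whose default signature accepts only strings.
pub fn upper_udf() -> Arc<ScalarUdf> {
    Arc::new(ScalarUdf {
        name: "upper",
        default_signature: TypeSignature(vec!["string"]),
        eval: upper,
    })
}
```

Two expressions built from the same `upper_udf()` can then carry different signatures while sharing one implementation.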

Encapsulates logic from TypeSignatureClass, rename methods
@github-actions github-actions bot added the catalog Related to the catalog crate label Feb 14, 2025
@jayzhan211
Contributor Author

To support int->string in coercion,

This change is all you need

Coercion::new_exact(TypeSignatureClass::Native(logical_string())),
Coercion::new_implicit(
    TypeSignatureClass::Native(logical_string()),
    vec![TypeSignatureClass::Integer],
    NativeType::String,
),
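
A self-contained sketch of how a two-entry list like the one above could be matched positionally against argument types. Only the names `new_exact` and `new_implicit` come from the snippet; the struct layout, the reduced enums, and the `resolve`, `coerce_args`, and `example_signature` helpers are assumptions for illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NativeType {
    Int32,
    Int64,
    String,
    Float64,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TypeSignatureClass {
    Native(NativeType),
    Integer,
}

impl TypeSignatureClass {
    fn matches(&self, arg: NativeType) -> bool {
        match self {
            TypeSignatureClass::Native(t) => *t == arg,
            TypeSignatureClass::Integer => {
                matches!(arg, NativeType::Int32 | NativeType::Int64)
            }
        }
    }
}

#[derive(Debug, Clone)]
pub struct Coercion {
    desired_type: TypeSignatureClass,
    allowed_source_types: Vec<TypeSignatureClass>,
    default_casted_type: Option<NativeType>,
}

impl Coercion {
    /// The argument must already have the desired type.
    pub fn new_exact(desired_type: TypeSignatureClass) -> Self {
        Self { desired_type, allowed_source_types: vec![], default_casted_type: None }
    }

    /// The argument may also be one of `allowed_source_types`, cast to `default_casted_type`.
    pub fn new_implicit(
        desired_type: TypeSignatureClass,
        allowed_source_types: Vec<TypeSignatureClass>,
        default_casted_type: NativeType,
    ) -> Self {
        Self {
            desired_type,
            allowed_source_types,
            default_casted_type: Some(default_casted_type),
        }
    }

    fn resolve(&self, arg: NativeType) -> Option<NativeType> {
        if self.desired_type.matches(arg) {
            return Some(arg);
        }
        let target = self.default_casted_type?;
        self.allowed_source_types
            .iter()
            .any(|c| c.matches(arg))
            .then_some(target)
    }
}

/// Match each argument against the coercion at the same position.
pub fn coerce_args(
    signature: &[Coercion],
    args: &[NativeType],
) -> Result<Vec<NativeType>, String> {
    if signature.len() != args.len() {
        return Err(format!("expected {} arguments, got {}", signature.len(), args.len()));
    }
    signature
        .iter()
        .zip(args)
        .enumerate()
        .map(|(i, (coercion, &arg))| {
            coercion
                .resolve(arg)
                .ok_or_else(|| format!("argument {} of type {:?} cannot be coerced", i, arg))
        })
        .collect()
}

/// First argument: exactly string. Second argument: string, or any integer cast to string.
pub fn example_signature() -> Vec<Coercion> {
    vec![
        Coercion::new_exact(TypeSignatureClass::Native(NativeType::String)),
        Coercion::new_implicit(
            TypeSignatureClass::Native(NativeType::String),
            vec![TypeSignatureClass::Integer],
            NativeType::String,
        ),
    ]
}
```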

@alamb
Contributor

alamb commented Feb 14, 2025

I took the liberty of merging up from main to get the CI to pass again.

I hope to spend a bit longer today messing around with potential changes to Coercion for your consideration

@alamb
Contributor

alamb commented Feb 14, 2025

Modifying Signature::Coercible alone won't achieve what @shehabgamin wants because, regardless of the approach, they would still need to:

  1. Fork the entire function
  2. Modify its signature
  3. Register it in the function map

I was hoping it might be possible to make a 'wrapper' function that only overrides the signature but passes everything else in
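
One way to picture such a wrapper is a decorator that forwards every call to the inner function except the signature. The `ScalarFn` trait, `Upper`, and `WithSignature` below are hypothetical and deliberately tiny; they are not DataFusion's `ScalarUDFImpl`, just a sketch of the delegation pattern.

```rust
/// Hypothetical, minimal function trait (stand-in for a real UDF trait).
pub trait ScalarFn {
    fn name(&self) -> &str;
    fn signature(&self) -> Vec<String>;
    fn invoke(&self, arg: &str) -> String;
}

/// A toy built-in function that accepts only strings by default.
pub struct Upper;

impl ScalarFn for Upper {
    fn name(&self) -> &str {
        "upper"
    }
    fn signature(&self) -> Vec<String> {
        vec!["string".to_string()]
    }
    fn invoke(&self, arg: &str) -> String {
        arg.to_uppercase()
    }
}

/// Wrapper that overrides only the signature and passes everything else through.
pub struct WithSignature<F: ScalarFn> {
    inner: F,
    signature: Vec<String>,
}

impl<F: ScalarFn> WithSignature<F> {
    pub fn new(inner: F, signature: Vec<String>) -> Self {
        Self { inner, signature }
    }
}

impl<F: ScalarFn> ScalarFn for WithSignature<F> {
    fn name(&self) -> &str {
        self.inner.name() // delegated
    }
    fn signature(&self) -> Vec<String> {
        self.signature.clone() // overridden
    }
    fn invoke(&self, arg: &str) -> String {
        self.inner.invoke(arg) // delegated
    }
}
```

The wrapped function would still need to be registered in place of the original, so this avoids forking the implementation but not the registration step.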

Though now that I type this, perhaps what is really needed is user defined coercion for operators (like = or other comparisons 🤔 )

@alamb
Contributor

alamb commented Feb 14, 2025

To support int->string in coercion,

This change is all you need

                    Coercion::new_exact(TypeSignatureClass::Native(logical_string())),
                    Coercion::new_implicit(TypeSignatureClass::Native(logical_string()), vec![TypeSignatureClass::Integer], NativeType::String),

Maybe we could make this a builder style API too -- I hope to mess around with it later today
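
As a strawman for what a builder style API might look like, here is a small sketch. `CoercionBuilder` and its methods `allow_cast_from`, `with_default_casted_type`, and `build` are invented for this example, and the enums are reduced stand-ins; none of this is an existing or agreed API.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NativeType {
    Int64,
    String,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TypeSignatureClass {
    Native(NativeType),
    Integer,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Coercion {
    Exact {
        desired_type: TypeSignatureClass,
    },
    Implicit {
        desired_type: TypeSignatureClass,
        allowed_source_types: Vec<TypeSignatureClass>,
        default_casted_type: NativeType,
    },
}

/// Hypothetical builder: starts as an exact match and becomes implicit
/// once at least one source type is allowed.
pub struct CoercionBuilder {
    desired_type: TypeSignatureClass,
    allowed_source_types: Vec<TypeSignatureClass>,
    default_casted_type: Option<NativeType>,
}

impl CoercionBuilder {
    pub fn new(desired_type: TypeSignatureClass) -> Self {
        Self { desired_type, allowed_source_types: Vec::new(), default_casted_type: None }
    }

    pub fn allow_cast_from(mut self, source: TypeSignatureClass) -> Self {
        self.allowed_source_types.push(source);
        self
    }

    pub fn with_default_casted_type(mut self, target: NativeType) -> Self {
        self.default_casted_type = Some(target);
        self
    }

    /// Fails if casts are allowed but no target type was given.
    pub fn build(self) -> Result<Coercion, String> {
        if self.allowed_source_types.is_empty() {
            return Ok(Coercion::Exact { desired_type: self.desired_type });
        }
        match self.default_casted_type {
            Some(default_casted_type) => Ok(Coercion::Implicit {
                desired_type: self.desired_type,
                allowed_source_types: self.allowed_source_types,
                default_casted_type,
            }),
            None => Err("implicit coercion needs a default casted type".to_string()),
        }
    }
}
```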

@shehabgamin
Contributor

@jayzhan211 I will re-review by tomorrow EOD, exciting progress!

Labels
api change Changes the API exposed to users of the crate catalog Related to the catalog crate common Related to common crate functions logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt)
3 participants