Signature::Coercible with user defined implicit casting #14440
base: main
#[derive(Debug, Clone, Eq, PartialOrd, Hash)]
pub struct ParameterType {
    pub param_type: LogicalTypeRef,
param_type: target type of the function signature
allowed_casts: types that may be implicitly coerced to the target type.
For example,
param_type: string
allowed_casts: binary, int
Valid (all are cast to string):
func(string)
func(binary)
func(int)
Invalid:
func(float or other types)
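The rule described in this comment can be sketched in a few lines. This is only an illustration under simplifying assumptions: `SimpleType`, the `coerce` method, and `string_param` are hypothetical names invented here, and `allowed_casts` is modeled as a plain `Vec` rather than whatever the PR actually uses.

```rust
/// Hypothetical, simplified stand-in for DataFusion's logical types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SimpleType {
    String,
    Binary,
    Int,
    Float,
}

/// Simplified model of the `ParameterType` discussed above.
#[derive(Debug, Clone)]
pub struct ParameterType {
    /// Target type of the function signature.
    pub param_type: SimpleType,
    /// Types that may be implicitly cast to `param_type`.
    pub allowed_casts: Vec<SimpleType>,
}

impl ParameterType {
    /// Returns the type the argument is coerced to, or `None` if the
    /// argument type is neither the target nor an allowed source type.
    pub fn coerce(&self, arg: SimpleType) -> Option<SimpleType> {
        if arg == self.param_type || self.allowed_casts.contains(&arg) {
            Some(self.param_type)
        } else {
            None
        }
    }
}

/// The example from the comment: string target, binary and int allowed.
pub fn string_param() -> ParameterType {
    ParameterType {
        param_type: SimpleType::String,
        allowed_casts: vec![SimpleType::Binary, SimpleType::Int],
    }
}
```

With this model, `string_param().coerce(SimpleType::Int)` yields the string type, while a float argument is rejected.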
I am worried that Vec<Vec<ParameterType>> might become challenging to reason about. It would also be confusing when to use this new signature rather than Signature::Coercible 🤔
Given this seems very similar to Signature::Coercible, and Signature::Coercible mirrors what we want pretty well, could we add some new information there on the allowed coercions rather than an entirely new type signature? Something like extending Coercible with rules that could be used to coerce to the target type:
pub enum TypeSignature {
    ...
    Coercible(Vec<Coercion>),
    ...
}
Where Coercion looks like
struct Coercion {
    desired_type: TypeSignatureClass,
    allowed_casts: ... // also includes an option for User Defined?
}
🤔
I would like to make a breaking change to Signature::Coercible too; the only concern is whether this would cause a regression or have a large impact on downstream users.
@shehabgamin If we replace Signature::Coercible with CoercibleV2, would that be a large change we should be concerned about?
If not, I'll replace Signature::Coercible with Signature::CoercibleV2.
23953dd to 806c6a6
Signed-off-by: Jay Zhan <[email protected]>
Signed-off-by: Jay Zhan <[email protected]>
806c6a6 to c99e986
vec![
    TypeSignatureClass::Native(logical_string()),
    TypeSignatureClass::Native(logical_int64()),
In the v1 version, any integer is cast to the default cast type of the defined NativeType, which is i64 in this case.
I do like how much more explicit the new formulation is
// Accept all integer types but cast them to i64
Coercion {
    desired_type: TypeSignatureClass::Native(logical_int64()),
    allowed_casts: vec![TypeSignatureClass::Integer],
Without this, i32 is rejected.
signature: Signature::coercible_v2(
    vec![Coercion {
        desired_type: TypeSignatureClass::Native(logical_string()),
        allowed_casts: vec![TypeSignatureClass::Native(logical_binary())],
Coercing binary to string is now easily customizable
I will try and review this carefully over the weekend. Maybe @shehabgamin has some time to take a look too.
I will review over the weekend as well!
or close this first then revisit #14268. BTW, most of the binary-to-string conversions mentioned in #14268 might not be ideal for DataFusion. We should reconsider them.
@jayzhan211 is this PR ready for another review?
yes, please
In general I think this PR moves the coercion story forward significantly. Thank you @jayzhan211 and @shehabgamin
I think the Coercion struct (maybe enum) is really close to the general-purpose API for describing coercions that we have been desperately lacking these many years in DataFusion. Let me try some things and report back
@@ -198,6 +198,11 @@ impl LogicalType for NativeType {
        TypeSignature::Native(self)
    }

    /// Returns the default casted type for the given arrow type
😍
    ///
    /// Get all possible types for `information_schema` from the given `TypeSignature`
    //
    // TODO: Make this function private
We could potentially mark it as deprecated and add the new (non-pub) version under a different name.
Maybe something like this (can be done in a follow-on PR):
pub fn get_example_types(&self) -> Vec<Vec<DataType>> {
I think we would need NativeType for get_example_types
match signature_classes {
    TypeSignatureClass::Native(l) => get_data_types(l.native()),
    TypeSignatureClass::Timestamp => {
        vec![
Is listing all possible DataType combination helpful?
I don't think this is helpful to be honest -- just the combination of classes
@@ -431,6 +463,35 @@ impl TypeSignature {
    }
}

fn get_possible_types_from_signature_classes(
I agree with @shehabgamin on both points
Suggested name:
- fn get_possible_types_from_signature_classes(
+ fn get_example_types(
And we can put it on TypeSignatureClass
        vec![DataType::Date64]
    }
    TypeSignatureClass::Time => {
        vec![DataType::Time64(TimeUnit::Nanosecond)]
I think it is ok to have one example type, as this is only used for the information schema -- the naming is super confusing though -- I think renaming the functions would make it much better
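To make the "one example type per class" idea concrete, here is a rough sketch of what a renamed `get_example_types` could look like as a method on the class enum. The enums below are hypothetical stand-ins rather than DataFusion's real `TypeSignatureClass` and `DataType`, and the choice of `Int64` as the example for the integer class is an assumption made for illustration.

```rust
/// Hypothetical stand-in for a handful of Arrow data types.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ExampleDataType {
    Date64,
    Time64Nanosecond,
    Int64,
    Utf8,
}

/// Hypothetical stand-in for `TypeSignatureClass`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum SignatureClass {
    Date,
    Time,
    Integer,
    Native(ExampleDataType),
}

impl SignatureClass {
    /// Returns one representative type per class. The result is meant for
    /// display (for example in an information schema), not as an exhaustive
    /// list of every type the class accepts.
    pub fn get_example_types(&self) -> Vec<ExampleDataType> {
        match self {
            SignatureClass::Date => vec![ExampleDataType::Date64],
            SignatureClass::Time => vec![ExampleDataType::Time64Nanosecond],
            // Assumption: Int64 is used as the representative integer type.
            SignatureClass::Integer => vec![ExampleDataType::Int64],
            SignatureClass::Native(ty) => vec![ty.clone()],
        }
    }
}
```

Naming the method after "examples" rather than "possible types" signals to callers that the list is illustrative.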
for (current_type, param) in current_types.iter().zip(param_types.iter()) {
    let current_logical_type: NativeType = current_type.into();

    fn is_matched_type(
this function also seems like it would make more sense as a method on TypeSignatureClass
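A minimal sketch of that suggestion, assuming a much smaller type universe than DataFusion's: the matching logic lives on the class enum itself instead of in a free function. `SimpleNative`, `SimpleClass`, and `matches_native_type` are hypothetical names chosen for this example only.

```rust
/// Hypothetical, simplified stand-in for DataFusion's `NativeType`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SimpleNative {
    Int32,
    Int64,
    Utf8,
    Timestamp,
}

/// Simplified model of `TypeSignatureClass`: either one concrete type
/// or a whole family of types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SimpleClass {
    Native(SimpleNative),
    Integer,
    Timestamp,
}

impl SimpleClass {
    /// Returns true if `ty` belongs to this class.
    pub fn matches_native_type(&self, ty: SimpleNative) -> bool {
        match self {
            // A native class only matches exactly the same type.
            SimpleClass::Native(expected) => *expected == ty,
            // The integer class matches every integer width in this model.
            SimpleClass::Integer => {
                matches!(ty, SimpleNative::Int32 | SimpleNative::Int64)
            }
            SimpleClass::Timestamp => ty == SimpleNative::Timestamp,
        }
    }
}
```

In this model, `SimpleClass::Native(SimpleNative::Int64)` rejects an `Int32` argument, while `SimpleClass::Integer` accepts it, which mirrors why the integer class has to be listed explicitly among the allowed casts.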
@@ -460,6 +521,44 @@ fn get_data_types(native_type: &NativeType) -> Vec<DataType> {
    }
}

#[derive(Debug, Clone, Eq, PartialOrd)]
pub struct Coercion {
@jayzhan211, I believe @alamb's question (please correct me if I'm wrong) is about creating functionality for a downstream user to override the default signature of a UDF in order to provide their own coercion rules.
This is what I was getting at
What do you think about making Coercion an enum like this (I can make a follow-on PR / propose changes to this PR):
#[derive(Debug, Clone, Eq, PartialOrd)]
pub enum Coercion {
    /// The argument type must match desired_type
    Exact {
        desired_type: TypeSignatureClass,
    },
    /// The argument type must be coercible to the desired type using the implicit coercion
    Implicit {
        desired_type: TypeSignatureClass,
        implicit_coercion: ImplicitCoercion,
    },
}
Then we can eventually maybe add a "user defined coercion" variant
Let me make a PR to see what this would look like.
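As a rough, self-contained illustration of how such an enum might behave, the sketch below resolves an argument type against an `Exact` or `Implicit` rule. Everything here is a simplified assumption: `Native` and `TypeClass` are hypothetical stand-ins, the `Implicit` variant's fields are flattened instead of using an `ImplicitCoercion` struct, and `resolve` is an invented helper. Only the constructor names `new_exact` and `new_implicit` are taken from later in this thread.

```rust
/// Hypothetical, simplified stand-in for DataFusion's `NativeType`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Native {
    Int32,
    Int64,
    String,
    Binary,
}

/// Hypothetical, simplified stand-in for `TypeSignatureClass`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TypeClass {
    Native(Native),
    Integer,
}

impl TypeClass {
    fn matches(&self, ty: Native) -> bool {
        match self {
            TypeClass::Native(n) => *n == ty,
            TypeClass::Integer => matches!(ty, Native::Int32 | Native::Int64),
        }
    }
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Coercion {
    /// The argument type must match `desired_type`.
    Exact { desired_type: TypeClass },
    /// The argument may also come from `allowed_source_types`, in which
    /// case it is cast to `default_casted_type`.
    Implicit {
        desired_type: TypeClass,
        allowed_source_types: Vec<TypeClass>,
        default_casted_type: Native,
    },
}

impl Coercion {
    pub fn new_exact(desired_type: TypeClass) -> Self {
        Coercion::Exact { desired_type }
    }

    pub fn new_implicit(
        desired_type: TypeClass,
        allowed_source_types: Vec<TypeClass>,
        default_casted_type: Native,
    ) -> Self {
        Coercion::Implicit { desired_type, allowed_source_types, default_casted_type }
    }

    /// Returns the type the argument ends up with, or `None` if rejected.
    pub fn resolve(&self, arg: Native) -> Option<Native> {
        match self {
            Coercion::Exact { desired_type } => desired_type.matches(arg).then_some(arg),
            Coercion::Implicit { desired_type, allowed_source_types, default_casted_type } => {
                if desired_type.matches(arg) {
                    Some(arg)
                } else if allowed_source_types.iter().any(|c| c.matches(arg)) {
                    Some(*default_casted_type)
                } else {
                    None
                }
            }
        }
    }
}
```

Under these assumptions, an exact string rule rejects `Int32`, while an implicit string rule that lists `TypeClass::Integer` resolves `Int32` to `Native::String`.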
Here is one software engineering PR:
- Encapsulates logic from TypeSignatureClass, rename methods jayzhan211/datafusion#4
(to encapsulate this code more)
Enum looks good to me, but "adding user defined coercion variant" is not clear to me, since I think the Coercible signature is already user-definable: users can change the types they need and set it on their own function 🤔
That is true. I was thinking of @shehabgamin's use case, where he wanted to implement coercion that had slightly different rules about when to cast Integer --> String. He could implement that with an entirely custom coercion, but it might be nice to just adjust the rules somehow (and reuse most of the rest of the logic) 🤔
Modifying Signature::Coercible alone won't achieve what @shehabgamin wants because, regardless of the approach, they would still need to:
Even if we introduce another variant, we would still maintain a version for DataFusion, while others would have to go through the same process to customize it for their needs. A better solution would be to separate the function signature from the function itself, as shown below:
pub struct ScalarFunctionExpr {
    fun: Arc<ScalarUDF>,
    name: String,
    args: Vec<Arc<dyn PhysicalExpr>>,
    return_type: DataType,
    nullable: bool,
    // new
    signature: TypeSignature
}
function_expr.with_signature(customize signature)
This change fundamentally impacts how scalar functions are structured...
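The idea of carrying the signature on the expression rather than on the function can be sketched with toy types. This is not DataFusion's `ScalarFunctionExpr`: `SimpleSignature`, `SimpleUdf`, and `FunctionExpr` are hypothetical, and only the `with_signature` name comes from the comment above.

```rust
use std::sync::Arc;

/// Hypothetical, simplified signature type.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum SimpleSignature {
    /// Exactly these argument type names, in order.
    Exact(Vec<&'static str>),
    /// Any types, with a fixed number of arguments.
    Any(usize),
}

/// Hypothetical stand-in for a shared function definition.
pub struct SimpleUdf {
    pub name: String,
    pub signature: SimpleSignature,
}

/// Expression node that carries its own copy of the signature, so a
/// caller can override it without touching the shared function.
pub struct FunctionExpr {
    fun: Arc<SimpleUdf>,
    signature: SimpleSignature,
}

impl FunctionExpr {
    /// Starts from the function's default signature.
    pub fn new(fun: Arc<SimpleUdf>) -> Self {
        let signature = fun.signature.clone();
        Self { fun, signature }
    }

    /// Builder-style override of the signature for this expression only.
    pub fn with_signature(mut self, signature: SimpleSignature) -> Self {
        self.signature = signature;
        self
    }

    pub fn signature(&self) -> &SimpleSignature {
        &self.signature
    }

    pub fn name(&self) -> &str {
        &self.fun.name
    }
}
```

In this sketch the override is local to one expression; the shared `SimpleUdf` keeps its default signature, which is the separation the comment argues for.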
Encapsulates logic from TypeSignatureClass, rename methods
To support int->string in coercion, this change is all you need:
Coercion::new_exact(TypeSignatureClass::Native(logical_string())),
Coercion::new_implicit(TypeSignatureClass::Native(logical_string()), vec![TypeSignatureClass::Integer], NativeType::String),
I took the liberty of merging up from main to get the CI to pass again. I hope to spend a bit longer today messing around with potential changes to Coercion for your consideration
I was hoping it might be possible to make a 'wrapper' function that only overrides the signature but passes everything else in. Though now that I type this, perhaps what is really needed is user defined coercion for operators (like
Maybe we could make this a builder style API too -- I hope to mess around with it later today
@jayzhan211 I will re-review by tomorrow EOD, exciting progress!