
Better define Substrait "required types" #747

Open
jacques-n opened this issue Nov 21, 2024 · 4 comments

@jacques-n
Contributor

[This came up in community sync today.]

There are some situations where Substrait defines expectation of specific types. Examples include:

  • The datatype of a filter expression must be boolean
  • The output of an aggregate operation with multiple grouping sets will include an integer index as an additional column.
  • (Future) the limit and offset values can be expressions but must be expressions of type x (int64?)

This is somewhat non-intuitive since we generally don't have a formal requirement that a system support specific data types. We should evaluate whether:

  1. We have required types, OR
  2. These are simply type aliases, and Substrait users can define which concrete type maps to each alias (e.g. booleanLike=boolean and integerLike=Decimal(38,0) or similar), OR
  3. We should come up with a better option.
@westonpace
Member

I would answer 1. Some relations have required types. However, I do not believe this prevents systems with alternative representations from fully implementing Substrait.

I ran into a similar problem when mapping the Arrow type system to Substrait. The problem I ran into is "when is X a variation / implementation of Y and when are X and Y different types". I decided on the following (non-canonical) definition:

  • Every value of Y must map to a unique value of X
  • Every function must operate such that for every value i we have X(f(Y(i))) == f(i), where X and Y are the mapping functions between the two types.

In other words, an 8-bit integer is a valid variation of boolean. The forward mapping function is "false maps to 0 and true maps to 1". The reverse mapping function can be "0 maps to false and everything else maps to true". This would only be true if the engine's boolean functions (e.g. and, or, not) actually respected the boolean semantics defined in our yaml files (e.g. and(0, 1) must be 0 and so on).
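The mapping law above can be checked concretely. This is an illustrative sketch (the function names are hypothetical, not Substrait APIs) of validating that an engine's 8-bit-integer `and` is a compatible variation of boolean `and`:

```python
# Hypothetical sketch: verify that an engine's int8 representation of
# boolean satisfies the mapping law X(f(Y(i))) == f(i).

def to_i8(b: bool) -> int:
    """Y: the forward mapping, boolean -> int8 (false -> 0, true -> 1)."""
    return 1 if b else 0

def to_bool(v: int) -> bool:
    """X: the reverse mapping, int8 -> boolean (0 -> false, else true)."""
    return v != 0

def engine_and(a: int, b: int) -> int:
    """The engine's int8 implementation of the boolean `and` function."""
    return 1 if (a != 0 and b != 0) else 0

# The variation is valid only if, for all inputs, the round trip agrees
# with the boolean semantics defined in the yaml files.
for a in (False, True):
    for b in (False, True):
        assert to_bool(engine_and(to_i8(a), to_i8(b))) == (a and b)
```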

Therefore, I believe that an engine is free to use whatever physical implementation of Substrait's types it wants to use provided they are compatible variations as described above.

The datatype of filter expression must be boolean

Again, if an engine is using 8-bit integers for booleans that is fine.

The output of an aggregate operation with multiple grouping sets will include an integer index as an additional column.

Note: this is not just an integer index. It is an i32 index. If an engine wants to use 128-bit 1's complement integers internally then this is fine, that's a valid variation of i32.

(Future) the limit and offset values can be expressions but must be expressions of type x (int64?)

If we define this as an i64 and an engine wants to use a Decimal(38, 0) internally then this is fine. Decimal(38, 0) is a valid variation of i64.

However, if an engine wants to use Decimal(9, 0) (32-bit decimal) then this is NOT a valid variation of i64. There are i64 values which cannot be encoded into Decimal(9, 0). The engine will return an error when, according to the Substrait specification, it should not.
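The precision argument is easy to make mechanical. A small sketch (the helpers are illustrative, not part of any Substrait library) of why Decimal(38, 0) covers the i64 range while Decimal(9, 0) does not:

```python
# Illustrative range check: a physical type is a valid variation of i64
# only if it can represent every i64 value.
I64_MIN, I64_MAX = -2**63, 2**63 - 1

def decimal_range(precision: int) -> tuple[int, int]:
    # Decimal(p, 0) holds integers with up to p significant digits.
    bound = 10**precision - 1
    return -bound, bound

def covers_i64(precision: int) -> bool:
    lo, hi = decimal_range(precision)
    return lo <= I64_MIN and hi >= I64_MAX

print(covers_i64(38))  # True: Decimal(38, 0) is a valid variation of i64
print(covers_i64(9))   # False: some i64 values cannot be encoded
```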

@westonpace
Member

(Future) the limit and offset values can be expressions but must be expressions of type x (int64?)

One more note on this. If an engine is using Decimal(38, 0) internally and the limit expression evaluates to something that cannot be represented as an i64, then the conversion from Substrait should insert a special piece into the expression that returns an error (limit out of range) in this case.
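A minimal sketch of the guard the converter could insert (the function name and error message are hypothetical):

```python
# Guard inserted by the Substrait conversion: when the engine evaluates the
# limit expression in a wider type (e.g. Decimal(38, 0)), reject any result
# that falls outside the i64 range instead of silently truncating it.
I64_MIN, I64_MAX = -2**63, 2**63 - 1

def checked_limit(value: int) -> int:
    """Return the evaluated limit, erroring if it exceeds the i64 range."""
    if not (I64_MIN <= value <= I64_MAX):
        raise ValueError("limit out of range")
    return value
```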

@westonpace
Member

westonpace commented Nov 21, 2024

That being said, I'm not all that bothered by option 2. I'd say you are starting to formalize dialects. We already have ways to express dialect-style variations in functions (options, different sets of supported functions, etc.) and it makes sense we'd have ways to support dialect-style variations in relations too. I think it's fine to say "some engines have a limit relation that supports i32 and some have a limit relation that supports i64" or something like that.

@ingomueller-net
Contributor

Let me throw a couple of ideas into the discussion.

With the exceptions of the Boolean in filter and the grouping set ID in aggregate, we leave types completely open, right? This means that a producer uses whatever types it sees fit and the consumer can decide what to do with them. Hopefully, producers use types that are as standard as possible, but they don't have to. If a consumer does not natively support a type in a plan it is given, it can (1) decide not to execute it or (2) try to reproduce the behavior of the types in the plan with the types it does natively support.

I think that the latter case corresponds to (or is an instance of) the mapping that @westonpace described above. In other words: the consumer does whatever it must to simulate the behavior of the type specified in the plan such that the observable result is exactly as if the consumer actually supported the type. For me, that seems to be perfectly fine and in line with the rest of the spec. Consumers need to produce the behavior specified by the plan no matter how they achieve it. As another example, they have the freedom to execute an emit clause as part of the plan node of the corresponding rel in their plan format or to execute it as a dedicated node afterwards. Since the observable result is the same, they respect the spec. Let's also note that this describes the situation for the whole spec, i.e., something that exists outside of the discussion of filter, aggregate, and fetch: a producer and a consumer can only exchange plans if they already have a common understanding of types.

In this spirit, I think one could argue to leave the types in question up to the producer/consumer pair as well. For aggregate and fetch, we only have to require that it's a type suitable for counting. (We could even think about lifting the requirements in aggregate to allow for any type that supports identity test.) If a producer/consumer pair shares the types -- no problem. If the consumer can simulate the types in the plan, that's also fine. Otherwise, that producer/consumer pair cannot exchange these plans but that's something that can happen anyways.

Concretely, for the three cases, I think this would mean:

  • Filter would use a "Boolean-like" type instead of boolean, i.e., a type where one value means that the tuple is part of the result and the other value means it is not.
  • The type of the grouping set ID would be "integer-like", i.e., a type with enumerable values that are used for the groups in this order, or "ID-like", i.e., a type that supports == and !=, and it would be plan-defined, i.e., the AggregateRel would define the type of the corresponding column in a to-be-added parameter.
  • The (future) limit and offset expressions would be "integer-like" (as defined above) and the exact type would be derivable from the expression in the plan (possibly enforcing that both expressions need to have the same type).

In order to reduce divergence, the spec could also strongly suggest to use boolean and i32/i64, respectively.
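As a sketch of what this alias-based approach might look like in practice: a producer/consumer pair would agree on which concrete type each alias resolves to. The alias names and the mapping table below are purely illustrative, not part of the Substrait spec:

```python
# Hypothetical dialect definition: each "required type" alias maps to the
# concrete type this producer/consumer pair has agreed on. The spec would
# only strongly suggest defaults (boolean, i32/i64).
DIALECT = {
    "booleanLike": "boolean",        # filter condition type
    "integerLike": "decimal<38,0>",  # grouping-set ID / limit / offset type
}

def resolve(alias: str) -> str:
    """Resolve a required-type alias to the dialect's concrete type."""
    return DIALECT[alias]
```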

I think that this is better than forcing a particular type because (1) it does not help in situations where the producer and consumer do not share a type they could use, i.e., a forced type doesn't magically make incompatible producer/consumer pairs compatible, and (2) it would break producer/consumer pairs that can otherwise agree on a different type.
