Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

design doc: MIR typechecking using representation types #27239

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
47 changes: 24 additions & 23 deletions doc/developer/design/20240522_mir_representation_types.md
Original file line number Diff line number Diff line change
Expand Up @@ -48,33 +48,35 @@ We know these implicit coercions interfere with joins; they may also interfere w

We will not rethink other components of MIR types, like nullability or unique keys.

In general, we may want a "smart index selector", which can appropriately handle mixed-type lookups, i.e., it should let us look up an `i32` in an `i64`-sized index without needing to build a separate `i32`-sized arrangement. Such an approach generalizes `eq-indx` (see ["Alternatives"](#alternatives)).

## Solution Proposal

Having MIR be typechecked in terms of representation types demands a few changes, which can be broken into three PRs:

1. Rename the existing `ScalarType` to `SqlScalarType`. (We may want to rename `ColumnType` and `RelationType`, as well.)
2. Have MIR work with `ReprScalarType` (see [MVP](#minimal-viable-prototype)).
a. Introduce the `ReprScalarType` datatype (see [MVP](#minimal-viable-prototype)).
b. Change the optimizer to work with `ReprScalarType` throughout.
c. Change the adapter to record the `SqlScalarType` of a query and update its nullability information using the `ReprScalarType` at the end of optimization.
1. Rename the existing `ScalarType` to `HirScalarType`. (We may want to rename `ColumnType` and `RelationType`, as well.)
2. Have MIR work with `MirScalarType` (see [MVP](#minimal-viable-prototype)).
a. Introduce the `MirScalarType` datatype (see [MVP](#minimal-viable-prototype)).
b. Change the optimizer to work with `MirScalarType` throughout.
c. Change the adapter to record the `HirScalarType` of a query and update its nullability information using the `MirScalarType` at the end of optimization.
3. Elide noop casts, confirming that we plan the problematic joins better.

PR #2 will have a large diff.
On the plus side, it should be a largely mechanical change: there is a straightforward way to project a `SqlScalarType` to a `ReprScalarType`, and most uses of types in the optimizer will be parametric.
On the plus side, it should be a largely mechanical change: there is a straightforward way to project a `HirScalarType` to a `MirScalarType`, and most uses of types in the optimizer will be parametric.

## Minimal Viable Prototype

The representation types will look like the following:

```rust
pub enum ReprScalarType {
pub enum MirScalarType {
Bool,
Int16,
Int32,
Int64,
UInt8, // also includes SqlScalarType::PgLegacyChar
UInt8, // also includes HirScalarType::PgLegacyChar
UInt16,
UInt32, // also includes SqlScalarType::{Oid,RegClass,RegProc,RegType}
UInt32, // also includes HirScalarType::{Oid,RegClass,RegProc,RegType}
UInt64,
Float32,
Float64,
Expand All @@ -87,23 +89,27 @@ pub enum ReprScalarType {
Interval,
Bytes,
Jsonb,
String, // also includes SqlScalarType::{VarChar,Char,PgLegacyName}
String, // also includes HirScalarType::{VarChar,Char,PgLegacyName}
Uuid,
Array(Box<ReprScalarType>), // also includes SqlScalarType::Int2Vector
List { element_type: Box<ReprScalarType> }, // also includes SqlScalarType::Record
Map { value_type: Box<ReprScalarType> },
Range { element_type: Box<ReprScalarType> },
Array(Box<MirScalarType>), // also includes HirScalarType::Int2Vector
List { element_type: Box<MirScalarType> }, // also includes HirScalarType::Record
Map { value_type: Box<MirScalarType> },
Range { element_type: Box<MirScalarType> },
MzAclItem,
AclItem,
}

pub struct ReprColumnType {
pub scalar_type: ReprScalarType,
pub struct MirColumnType {
pub scalar_type: MirScalarType,
pub nullable: bool,
}
```

We'll need to define a projection from `SqlScalarType` to `ReprScalarType` (see [`ScalarType::is_instance_of()`](https://github.com/MaterializeInc/materialize/blob/e4578164fb28a204b43c58ab8ff6c1d3e3b4427f/src/repr/src/scalar.rs#L947) for the correspondence).
Since LIR is effectively untyped, there's no need for us to define `LirScalarType` or the like. We already use `MirScalarExpr` inside of `Mfp`s in LIR.

We'll need to define a projection from `HirScalarType` to `MirScalarType` (see [`ScalarType::is_instance_of()`](https://github.com/MaterializeInc/materialize/blob/e4578164fb28a204b43c58ab8ff6c1d3e3b4427f/src/repr/src/scalar.rs#L947) for the correspondence).

In `EXPLAIN PLAN`s, we will report `MirScalarType`s with `r`-prefixed names, where `r` is short for "representation", e.g., `rstring` is the `MirScalarType` corresponding to `text`, `varchar`, etc.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this prefix should be m or mir if we're going with MirScalarType?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure the word mir is shown anywhere to users in our current EXPLAIN output. Very happy to be told what to do here (the r came from earlier discussion).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd go with mir (e.g., mirstring) to start. I think it's the clearest. Since EXPLAIN output isn't stable we have a ton of flexibility to change it if we get user complaints.


## Alternatives

Expand All @@ -123,11 +129,6 @@ Such rewriting feels a bit risky/annoying/confusing, and I'm not sure how it wou

## Open questions

How should we report `ReprScalarType`s to the user in `EXPLAIN PLAN`s?
There's a potential for confusion if we report something as type `string` that the user expected to have type `VARCHAR(18)` or something similar.
Apparently we already treat `string` as a synonym for `text`... can we somehow choose names that will be unambiguous?
Say, `mzstring` or `charseq` or `utf8`?

It's possible that (as Nikhil worries) somewhere in the pgwire or WS layers calls `.typ()` on a `MirRelationExpr` and expects a `SqlScalarType` but will now find a `ReprScalarType`.
It's possible that (as Nikhil worries) somewhere in the pgwire or WS layers calls `.typ()` on a `MirRelationExpr` and expects a `HirScalarType` but will now find a `MirScalarType`.
I don't know those layers well, but we could derisk some of the work of PR #2 by checking for that in advance.
(In any case, the Rust type checker will be eager to tell us about it.)