Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

alter_table: Add design doc #30019

Merged
merged 3 commits into from
Oct 24, 2024

Conversation

ParkMyCar
Copy link
Member

Rendered

Motivation

Adds the design doc that was previously discussed in Notion.

Tips for reviewer

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

@ParkMyCar ParkMyCar force-pushed the alter_table/design-doc branch 3 times, most recently from 2928d36 to 2145c54 Compare October 15, 2024 22:35
@ParkMyCar ParkMyCar force-pushed the alter_table/design-doc branch from 2145c54 to 2106a29 Compare October 15, 2024 22:37
@ParkMyCar ParkMyCar marked this pull request as ready for review October 15, 2024 22:39
@ParkMyCar
Copy link
Member Author

@antiguru, @petrosagg, @jkosh44. This is the same design doc from Notion, but with the discussion from the comments added in! Would appreciate any more feedback y'all have, and a stamp when it looks good :)

Copy link
Contributor

@jkosh44 jkosh44 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

}
```

### Relationships
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how much related it is, but worth thinking about: We associate read holds with global IDs. If multiple global IDs reference the same shard, we need to make sure that the read holds are unified at some level. I'm not familiar enough with how this part of Mz works, but it would be good to gain some confidence it does the right thing. cc @teskje

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally! I chatted with some Persist folks about this, specifically the use of CriticalSinceHandles in the storage-controller. Once I have something more concrete to test I'll check in with @teskje

Comment on lines 151 to 157
In the durable Catalog we will reuse the existing id allocator that currently mints `GlobalId`s to
mint `CatalogItemId`s. We will create a new id allocator specifically for `GlobalId`s that will be
initialized to the same value as the original allocator, this prevents accidental `GlobalId` re-use
if they are persisted outside the Catalog. Additionally we will extend the existing
[ItemValue](https://github.com/MaterializeInc/materialize/blob/b579caa68b6d287426dead8626c0adc885205740/src/catalog/protos/objects.proto#L119-L126)
protobuf type to include a map of `VERSION -> GlobalId`. Externally to map between `CatalogItemId`s
and `GlobalId`s we’ll introduce a new Catalog table, `mz_internal.mz_collection_ids`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What requires that catalog item IDs start at the current global ID counter? Please spell it out, otherwise this looks like an unnecessarily complex implementation decision.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated! Let me know if what I wrote makes sense

Comment on lines 173 to 174
Our Compute layer will operate entirely on `GlobalId`s. Other than some refactoring of what Catalog
APIs our Compute layer uses, I don’t anticipate any material changes here.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's some ambiguity around how we select indexes: They're by global ID, so a view that depends on a earlier version of a table would not pick up an index on a later version of the same table. This can be a problem, and we need a way to make users aware.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good callout, I updated this section to call out we should provide notices to users for this case

Copy link
Member

@antiguru antiguru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@ParkMyCar
Copy link
Member Author

Thanks for the reviews!

@ParkMyCar ParkMyCar merged commit 64d94fc into MaterializeInc:main Oct 24, 2024
7 checks passed
ParkMyCar added a commit that referenced this pull request Oct 25, 2024
This PR re-keys everything in the durable Catalog on `CatalogItemId`
instead of `GlobalId`, and shims everything above the durable Catalog
objects using two new methods `GlobalId::to_item_id()`, and
`CatalogItemId::to_global_id()`.

Now every item in the Catalog has the following fields:

* `id: CatalogItemId`, a stable _external_ identifier (i.e. in Catalog
tables like `mz_objects`) that is the same for the entire lifetime of
the object.
* `global_id: GlobalId`, a stable _internal_ identifier for this object
that can be used by storage and compute.
* `extra_versions: BTreeMap<Version, GlobalId>`, mapping of versions of
an object to the `GlobalId`s used by compute and storage to refer to a
specific version.

This de-coupling of `CatalogItemId` and `GlobalId` achieves two things:

1. Externally objects have a stable identifier, even as they are
`ALTER`-ed. This is required for external tools like dbt and Terraform
that track objects by ID.
2. Internally a `GlobalId` always refers to the same `RelationDesc` +
Persist Shard, this maintains the concept from the formalism that a
`GlobalId` is never re-assigned to a new pTVC.

The implementation of `ALTER TABLE ... ADD COLUMN ...` will thus
allocate a new `GlobalId` which will immutably refer to that specific
version of the table.

#### Other Changes

Along with `ItemKey` and `ItemValue` I updated the following Catalog
types:

* `GidMappingValue`: replaced the `id` field with `catalog_id` and
`global_id`, used to identify builtin catalog objects.
* `ClusterIntrospectionSourceIndexValue`: replaced the `index_id` field
with `catalog_id` and `global_id`, used to identify builtin
introspection source indexes.
* `CommentKey`: replaced `GlobalId` with `CatalogItemId`, used to
identify comments on objects.
* `SourceReferencesKey`: replaced `GlobalId` with `CatalogItemId`, used
to track references between a Source and the subsources/tables that read
from it.

#### Partial Progress

Today `CatalogItemId` is 1:1 with `GlobalId`, this allows us to
implement the `to_item_id` and `to_global_id` shim methods. Until we
support `ALTER TABLE ... ADD COLUMN ...` we can freely convert between
the two. This allows us to break this change up among multiple PRs
instead of a single massive change.

#### Initial Migration

Because `CatalogItemId` and `GlobalId` are currently 1:1, this allows us
to migrate the raw values of the IDs, e.g. `GlobalId::User(42)` becomes
`CatalogItemId::User(42)`, which is exactly what this PR does.

### Motivation

Progress towards
MaterializeInc/database-issues#8233
Implements changes described in
#30019

### Tips for reviewer

This PR is split into two commits:

1. The durable catalog migration, and updates to durable Catalog
objects. I would appreciate the most thorough reviews on this commit.
4. Shimming all calling code to convert between `CatalogItemId` and
`GlobalId`.

### Checklist

- [ ] This PR has adequate test coverage / QA involvement has been duly
considered. ([trigger-ci for additional test/nightly
runs](https://trigger-ci.dev.materialize.com/))
- [ ] This PR has an associated up-to-date [design
doc](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/design/README.md),
is a design doc
([template](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/design/00000000_template.md)),
or is sufficiently small to not require a design.
  <!-- Reference the design in the description. -->
- [ ] If this PR evolves [an existing `$T ⇔ Proto$T`
mapping](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/command-and-response-binary-encoding.md)
(possibly in a backwards-incompatible way), then it is tagged with a
`T-proto` label.
- [ ] If this PR will require changes to cloud orchestration or tests,
there is a companion cloud PR to account for those changes that is
tagged with the release-blocker label
([example](MaterializeInc/cloud#5021)).
<!-- Ask in #team-cloud on Slack if you need help preparing the cloud
PR. -->
- [ ] If this PR includes major [user-facing behavior
changes](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/guide-changes.md#what-changes-require-a-release-note),
I have pinged the relevant PM to schedule a changelog post.
ParkMyCar added a commit that referenced this pull request Nov 12, 2024
This PR switches the Adapter/Catalog from referencing items with a
`GlobalId`, to a `CatalogItemId`.

Internally the compute and storage layers of Materialize will still
reference objects by their `GlobalId`, but conceptually a `GlobalId`
refers to a "collection of data", while a `CatalogItemId` refers to the
database object. This switch allows us to associate multiple `GlobalId`s
(aka "collections of data") with a single Catalog object. The primary
motivator for this is supporting live schema migrations, or `ALTER TABLE
... ADD COLUMN ...`.

When adding a new column to a table we'll create a new `GlobalId`, this
ensures that the schema (aka `RelationDesc`) associated with a
`GlobalId` never changes.

### Design

The durable Catalog migration to add `CatalogItemId` took place in
#30163.

What took a bit of thought is where to place the associated `GlobalId`s
for each object, we have the following requirements:

* `Table`: multiple `GlobalId`s
* `Source`: single `GlobalId` (maybe eventually multiple?)
* `Log`: single `GlobalId`
* `View`: single `GlobalId`
* `MaterializedView`: single `GlobalId`
* `ContinualTask`: single `GlobalId`
* `Sink`: single `GlobalId`
* `Index`: single `GlobalId`
* `Type`: no `GlobalId`s
* `Func`: no `GlobalId`s
* `Secret`: no `GlobalId`s
* `Connection`: no `GlobalId`s


`Type`, `Func`, `Secret`, and `Connection` are never referenced by the
compute and storage layer so they don't need a `GlobalId`. Meanwhile
`Table`s can be evolved so they need to support multiple `GlobalId`s.
What complicates this a bit is how we use the `create_sql` for an object
as our durable record, while we could come up with some syntax like
`CREATE TABLE [<name> WITH u1 u2 u3] ...` as a way to associate multiple
`GlobalId`s with an object, this didn't feel great.

What I landed on was persisting a single `GlobalId` for all objects, and
then including an `extra_versions: BTreeMap<Version, GlobalId>` at the
durable layer. While not all types need a `GlobalId`, associating a
single `GlobalId` for each object makes things easier to reason about.
And while I would like to "design away invalid states", trying to
associate multiple `GlobalId`s with only tables added a bunch of
boilerplate which I didn't think was worth it.

But at the in-memory layer is where we enforce only `Table`s can have
multiple `GlobalId`s. Instead of sticking a `global_id: GlobalId` field
on CatalogEntry, which might feel natural, I put them on the inner `enum
CatalogItem` variants. If a caller has just a `CatalogEntry`, they're
forced to reason about their item possibly having multiple `GlobalId`s
associated with it. But if they match on the inner `CatalogItem` they
can get the inner object and generally get the single `GlobalId`, or if
it's a table, reason about which `GlobalId` to use.

### Motivation

Implements the approach described in:
#30019
Progress towards:
MaterializeInc/database-issues#8233

### Tips for reviewer

I tried splitting up this PR into multiple smaller ones, but the amount
of shimming required to convert back and forth between `GlobalId` and
`CatalogItemId` made the changes very noisy. As such, I split this
single PR into multiple commits where each commit migrates a single
logical code path.

Commits that require the most attention are annotated with a ⭐, and
tagged by the teams most directly impacted.

1. ⭐ [storage] Migrates code paths for `Connection`s and `Secret`s to
use `CatalogItemId`. This doesn't require much attention, it's largely
just a find and replace since these objects are only ever referenced by
the Catalog.
* Something to call out though is this does change Protobuf definitions
which AFAIK are not durably persisted anywhere. But someone from Storage
should double check me.
2. ⭐ [adapter] This updates our in-memory objects to store a `GlobalId`
per-object, for `Table`s we store multiple `GlobalId`s. To review this
commit I would start by looking at `src/catalog/src/memory/objects.rs`
to get an understanding of where `GlobalId`s live and how the individual
objects were updated. It also updates our durable Catalog transactions
to remove the shims introduced in
#30163. What's subtle
here is we also change the `allocate_user_item_ids(...)` API to return
_both_ a `CatalogItemId` and `GlobalId` with the same inner value. This
ensures that that string representation of a `CatalogItemId` for an
object will always be the same as the `GlobalId`, this allows us to
sidestep the issue for now of updating builtin tables to map between
`CatalogItemId` and `GlobalId`
3. ⭐ [adapter] Migrates the builtin objects codepaths to associate both
a `CatalogItemId` and a `GlobalId` with a builtin object. Builtin
objects are handled separately from normal objects, hence having to
migrate the separate code path.
4. ⭐ [adapter] Changes SQL name resolution to use `CatalogItemId`. It
also changes the `ResolvedIds` newtype from a `BTreeSet<GlobalId>` to a
`BTreeMap<CatalogItemId, BTreeSet<GlobalId>>`. This way after name
resolution we not only know what objects a statement refers to, but also
their underlying collections. This also makes the inner field of
`ResolvedIds` private and migrates its callsites.
5. ⭐ [adapter] Adds new APIs to the Catalog to get items by their
associated `GlobalId`, and adds a new trait called
`CatalogCollectionItem`. I would start reviewing this commit by looking
at `src/sql/src/catalog.rs` to get an idea of the API surface.
* The new `CatalogCollectionItem` is really only necessary to support
`ALTER TABLE` and isn't strictly necessary for this change, but I found
that it adds context to the change. The `desc(...)` method is removed
from `trait CatalogItem` and moved to `trait CatalogCollectionItem`. You
can only get a `CatalogCollectionItem` (and thus `RelationDesc`) if you
have a `GlobalId` or a `CatalogItemId` + `Version`.
6. [compute] Updates the `trait OptimizerCatalog` to allow getting an
item by a `CatalogItemId`, and updates implementations to use new
Catalog APIs. This commit is small and only needs to be skimmed.
7. [adapter] Introduces a new `DependencyIds` field on some Catalog
items. `ResolvedIds` are derived from name resolution, meanwhile
`DependencyIds` are derived from the `HirRelationExpr` created after
planning.
* This was added because the `CatalogEntry::uses` method wants to return
a list of `CatalogItemId`s, but the `HirRelationExpr` only knows about
`GlobalId`s. We could map from `GlobalId` -> `CatalogItemId` when
calling the method, but this was a decent refactor on it's own, so I
opted to cache these dependencies during planning on the few types of
objects that need them.
8. [adapter] Migrates the Coordinator startup/bootstrap codepath to use
the new APIs introduced in [5]. No new logic is introduced here, it's
largely just a find and replace like `entry.id()` to
`mat_view.global_id()`.
9. ⭐ [adapter] Updates all create plans (i.e. `PlanCreate*`) to
associate a `GlobalId` with each item.
10. [adapter] Updates plans for `COPY` and `INSERT` to use
`CatalogItemId` instead of `GlobalId`, since we also want to insert into
collections at their latest version. This commit is largely a find and
replace and can be skimmed.
11. [adapter] Updates all drop plans (i.e. `PlanDrop*`) to use
`CatalogItemId`, this is pretty much a find and replace and can be
skimmed.
12. [adapter] Updates all alter plans (i.e. `PlanAlter*`) to use
`CatalogItemId`, this is pretty much a find and replace and can be
skimmed.
13. [adapter] Updates all code paths for `COMMENT ON` and Webhook
Sources to use `CatalogItemId`. This is a fairly small commit and can
definitely be skimmed.
14. ⭐ [adapter/compute]Updates all Peek and Subscribe plans to use
`CatalogItemId`. This commit is interesting and could use some extra
thought since it's where we map from a catalog item to it's collection,
and is one of the primary motivators of this work.
15. [adapter] Updates all `SHOW` and `INSPECT` plans to use
`CatalogItemId`. This commit is quite small and can be skimmed.
16. [adapter] Updates Catalog Consistency Checks and `PlanValidity` to
use `CatalogItemId`. This commit is quite small and can be skimmed.
17. [adapter] Updates `EXPLAIN` plans to use the new `GlobalId`-based
Catalog APIs. It's pretty straight forward and probably doesn't require
too much attention.
18. [adapter] This commit is a bit of a "grab bag", it updates code
paths for auto-routing queries to `mz_catalog_server`, cluster
scheduling, and "caught up" checks to map between `CatalogItemId` and
`GlobalId`. It could use a bit of extra attention.
19. ⭐ [adapter/persist] Updates `ScalarType`s to use `CatalogItemId`
instead of `GlobalId`.
* IMO this is one of the sketchiest commits of the entire PR because we
durably persist `ScalarType`s in the Persist schema registry. This
change should be okay though since the protobuf type of `CatalogItemId`
is exactly the same as `GlobalId` (minus the `Explain` variant) so it is
wire compatible.
20. [adapter] Migrates all builtin table writes (aka the `pack_*`
method`) to use `CatalogItemId`.
21. Deletes the `GlobalId::to_item_id` and `CatalogItemId::to_global_id`
shim methods.

### Checklist

- [x] This PR has adequate test coverage / QA involvement has been duly
considered. ([trigger-ci for additional test/nightly
runs](https://trigger-ci.dev.materialize.com/))
- [x] This PR has an associated up-to-date [design
doc](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/design/README.md),
is a design doc
([template](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/design/00000000_template.md)),
or is sufficiently small to not require a design.
  <!-- Reference the design in the description. -->
- [x] If this PR evolves [an existing `$T ⇔ Proto$T`
mapping](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/command-and-response-binary-encoding.md)
(possibly in a backwards-incompatible way), then it is tagged with a
`T-proto` label.
- [x] If this PR will require changes to cloud orchestration or tests,
there is a companion cloud PR to account for those changes that is
tagged with the release-blocker label
([example](MaterializeInc/cloud#5021)).
<!-- Ask in #team-cloud on Slack if you need help preparing the cloud
PR. -->
- [x] If this PR includes major [user-facing behavior
changes](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/guide-changes.md#what-changes-require-a-release-note),
I have pinged the relevant PM to schedule a changelog post.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants