Docs: Refactor site navigation bar #12289

Merged
merged 1 commit into from
Feb 18, 2025

49 changes: 0 additions & 49 deletions site/docs/concepts/catalog.md

This file was deleted.

42 changes: 36 additions & 6 deletions site/docs/terms.md
@@ -20,39 +20,69 @@ title: "Terms"

# Terms

### Snapshot
## Catalog
Contributor

I like this, the catalog.md was always a bit lonely as it was the only concepts page.

Contributor

I actually lean toward keeping catalog.md as a separate page, since its content is more than just terms; we could simply keep the page in the new Specification section. Catalog is an important concept and deserves more detail than a simple term definition, which the existing page captured.

Collaborator Author

I agree catalogs deserve more details but having both catalogs and terms looks confusing to me. Isn't catalog a term?

Contributor

Agreed that catalog is a term, and it is good to clarify that on the terms page. But the content also talks a lot more about implementations. Anyway, I can go either way on this one.

Contributor

nit: what about taking Catalog one level up?

Currently it's the first subtab under "Terms". Would it make sense to make it its own subtab in "Specification"?

Screenshot 2025-02-18 at 9 06 17 AM


### Overview

You may think of Iceberg as a format for managing data in a single table, but the Iceberg library needs a way to keep track of those tables by name. Tasks like creating, dropping, and renaming tables are the responsibility of a catalog. Catalogs manage a collection of tables that are usually grouped into namespaces. The most important responsibility of a catalog is tracking a table's current metadata, which is provided by the catalog when you load a table.
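
As a rough illustration of these responsibilities, the sketch below models a catalog as a mapping from a namespace and table name to the location of that table's current metadata. `ToyCatalog`, its method names, and the example path are hypothetical and are not the Iceberg API.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCatalog:
    """Hypothetical in-memory catalog for illustration; not the Iceberg API."""

    # (namespace, table name) -> location of the table's current metadata
    tables: dict = field(default_factory=dict)

    def create_table(self, namespace: str, name: str, metadata_location: str) -> None:
        if (namespace, name) in self.tables:
            raise ValueError(f"table already exists: {namespace}.{name}")
        self.tables[(namespace, name)] = metadata_location

    def load_table(self, namespace: str, name: str) -> str:
        # Loading a table hands back its current metadata location.
        return self.tables[(namespace, name)]

    def rename_table(self, namespace: str, old: str, new: str) -> None:
        if (namespace, new) in self.tables:
            raise ValueError(f"table already exists: {namespace}.{new}")
        self.tables[(namespace, new)] = self.tables.pop((namespace, old))

    def drop_table(self, namespace: str, name: str) -> None:
        del self.tables[(namespace, name)]


catalog = ToyCatalog()
catalog.create_table("db", "events", "s3://example-bucket/db/events/metadata/v1.metadata.json")
catalog.rename_table("db", "events", "page_events")
current = catalog.load_table("db", "page_events")
```

The point of the sketch is only that the catalog owns the name-to-metadata mapping; a real catalog also has to make updates to that mapping safe under concurrent writers.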

The first step when using an Iceberg client is almost always initializing and configuring a catalog. The configured catalog is then used by compute engines to execute catalog operations. When multiple types of compute engines use a shared Iceberg catalog, they can share a common data layer.

A catalog is almost always configured through the processing engine, which passes along a set of properties during initialization. Different processing engines have different ways to configure a catalog. When configuring a catalog, it’s best to refer to the [Iceberg documentation](../docs/latest/configuration.md#catalog-properties) as well as the docs for the specific processing engine being used. Ultimately, these configurations boil down to a common set of catalog properties that are passed to the Iceberg catalog.
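
To make the last point concrete, here is a minimal sketch of how engine-level settings reduce to plain catalog properties. It assumes Spark-style keys of the form `spark.sql.catalog.<name>.<property>`; the helper `catalog_properties`, the catalog name `demo`, and all values are made up for illustration, and this is not how any engine is actually implemented.

```python
def catalog_properties(engine_conf: dict, catalog_name: str) -> dict:
    """Strip the engine-specific prefix, leaving only catalog properties."""
    prefix = f"spark.sql.catalog.{catalog_name}."
    return {
        key[len(prefix):]: value
        for key, value in engine_conf.items()
        if key.startswith(prefix)
    }


# Example engine configuration (values are placeholders).
spark_conf = {
    "spark.sql.catalog.demo": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.demo.type": "rest",
    "spark.sql.catalog.demo.uri": "http://localhost:8181",
    "spark.sql.catalog.demo.warehouse": "s3://example-bucket/warehouse",
    "spark.sql.shuffle.partitions": "200",  # unrelated engine setting
}

props = catalog_properties(spark_conf, "demo")
```

Here `props` holds only `type`, `uri`, and `warehouse`; other engines use different key layouts but arrive at the same kind of property map.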

### Catalog Implementations

Iceberg catalogs are flexible and can be implemented using almost any backend system. They can be plugged into any Iceberg runtime, and allow any processing engine that supports Iceberg to load the tracked Iceberg tables. Iceberg also comes with a number of catalog implementations that are ready to use out of the box.

These include:

* REST: a server-side catalog that’s exposed through a REST API
* Hive Metastore: tracks namespaces and tables using a Hive metastore
* JDBC: tracks namespaces and tables in a simple JDBC database
* Nessie: a transactional catalog that tracks namespaces and tables in a database with git-like version control

There are more catalog types in addition to the ones listed here, as well as custom catalogs developed to include specialized functionality.

### Decoupling Using the REST Catalog

The REST catalog was introduced in the Iceberg 0.14.0 release and provides greater control over how Iceberg catalogs are implemented. Instead of using technology-specific logic contained in the catalog clients, the implementation details of a REST catalog live on the catalog server. If you’re familiar with Hive, this is somewhat similar to the Hive Thrift service that allows access to a Hive server over a single port. The server-side logic can be written in any language and use any custom technology, as long as the API follows the [Iceberg REST OpenAPI specification](https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml).

A great benefit of the REST catalog is that it allows you to use a single client to talk to any catalog backend. This increased flexibility makes it easier to make custom catalogs compatible with engines like Athena or Starburst without requiring the inclusion of a JAR in the classpath.
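
The sketch below hints at why a single client suffices: the client only needs the server's base URI and builds the same request paths regardless of what backs the server. The helper name `load_table_url` and the local URI are hypothetical, and the path layout assumes a single-level namespace with no catalog-specific prefix; the linked OpenAPI specification is the authoritative source for the actual routes.

```python
from urllib.parse import quote


def load_table_url(base_uri: str, namespace: str, table: str) -> str:
    """Build the URL a client might request to load a table.

    Assumes a single-level namespace and no path prefix.
    """
    ns = quote(namespace, safe="")
    tbl = quote(table, safe="")
    return f"{base_uri.rstrip('/')}/v1/namespaces/{ns}/tables/{tbl}"


url = load_table_url("http://localhost:8181/", "db", "events")
```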

## Snapshot

A **snapshot** is the state of a table at some time.

Each snapshot lists all of the data files that make up the table's contents at the time of the snapshot. Data files are stored across multiple [manifest](#manifest-file) files, and the manifests for a snapshot are listed in a single [manifest list](#manifest-list) file.

### Manifest list
## Manifest list

A **manifest list** is a metadata file that lists the [manifests](#manifest-file) that make up a table snapshot.

Each manifest file in the manifest list is stored with information about its contents, like partition value ranges, used to speed up metadata operations.

### Manifest file
## Manifest file

A **manifest file** is a metadata file that lists a subset of data files that make up a snapshot.

Each data file in a manifest is stored with a [partition tuple](#partition-tuple), column-level stats, and summary information used to prune splits during [scan planning](docs/latest/performance.md#scan-planning).
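
The relationship between a snapshot, its manifest list, manifests, and data files can be sketched with a few plain classes. All class and field names here are hypothetical simplifications (real metadata files carry many more fields), and `plan_files` is only a toy version of pruning by partition value ranges.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    path: str
    partition: tuple  # partition tuple stored with the file


@dataclass
class Manifest:
    data_files: list  # subset of the snapshot's data files


@dataclass
class ManifestListEntry:
    manifest: Manifest
    lower: tuple  # smallest partition tuple in the manifest
    upper: tuple  # largest partition tuple in the manifest


@dataclass
class Snapshot:
    snapshot_id: int
    manifest_list: list  # one entry per manifest


def plan_files(snapshot: Snapshot, partition: tuple) -> list:
    """Return data files for one partition, skipping manifests whose range excludes it."""
    matches = []
    for entry in snapshot.manifest_list:
        if not (entry.lower <= partition <= entry.upper):
            continue  # pruned without opening the manifest
        matches.extend(f for f in entry.manifest.data_files if f.partition == partition)
    return matches


m1 = Manifest([DataFile("a.parquet", ("2025-02-17",)), DataFile("b.parquet", ("2025-02-18",))])
m2 = Manifest([DataFile("c.parquet", ("2025-02-19",))])
snapshot = Snapshot(
    snapshot_id=1,
    manifest_list=[
        ManifestListEntry(m1, ("2025-02-17",), ("2025-02-18",)),
        ManifestListEntry(m2, ("2025-02-19",), ("2025-02-19",)),
    ],
)
paths = [f.path for f in plan_files(snapshot, ("2025-02-18",))]
```

In this example only the first manifest is read; the second is skipped based on the range kept in the manifest list.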

### Partition spec
## Partition spec

A **partition spec** is a description of how to [partition](docs/latest/partitioning.md) data in a table.

A spec consists of a list of source columns and transforms. A transform produces a partition value from a source value. For example, `date(ts)` produces the date associated with a timestamp column named `ts`.
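
A minimal sketch of that idea, assuming a spec is represented as a list of (source column, transform) pairs. The names `date_transform` and `apply_spec` are hypothetical and only mirror the `date(ts)` example above.

```python
from datetime import date, datetime


def date_transform(ts: datetime) -> date:
    """Produce the date associated with a timestamp value."""
    return ts.date()


# A spec as a list of (source column, transform) pairs.
spec = [("ts", date_transform)]


def apply_spec(row: dict, spec: list) -> tuple:
    """Apply each transform to its source column to get the partition values."""
    return tuple(transform(row[column]) for column, transform in spec)


row = {"id": 1, "ts": datetime(2025, 2, 18, 9, 6, 17)}
values = apply_spec(row, spec)
```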

### Partition tuple
## Partition tuple

A **partition tuple** is a tuple or struct of partition data stored with each data file.

All values in a partition tuple are the same for all rows stored in a data file. Partition tuples are produced by transforming values from row data using a partition spec.

Iceberg stores partition values unmodified, unlike Hive tables that convert values to and from strings in file system paths and keys.
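
The following sketch contrasts the two approaches under simplified assumptions: rows are grouped so that each hypothetical data file has a single partition tuple whose values keep their original type, while a Hive-style path segment carries the value as a string that must be parsed back. The row data and the `day` column are made up for illustration.

```python
from collections import defaultdict
from datetime import date

rows = [
    {"id": 1, "day": date(2025, 2, 17)},
    {"id": 2, "day": date(2025, 2, 18)},
    {"id": 3, "day": date(2025, 2, 18)},
]

# Group rows so every (hypothetical) data file shares one partition tuple.
files = defaultdict(list)
for row in rows:
    files[(row["day"],)].append(row["id"])

# Hive-style layout: the value becomes a string inside a path segment.
hive_path = f"day={date(2025, 2, 18)}"
hive_value = hive_path.split("=", 1)[1]  # a str, not a date
```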

### Snapshot log (history table)
## Snapshot log (history table)

The **snapshot log** is a metadata log of how the table's current snapshot has changed over time.

17 changes: 8 additions & 9 deletions site/nav.yml
Contributor

👍
Except for concepts/catalog.md, whose content is merged into site/docs/terms.md, nothing else changed. The tabs are just moved under "Specification".

Contributor

ok i also double checked that moving from one tab to another doesn't affect the URL.

For example, "Table Spec" is moved from under "Project" to under "Specification", but the URL is still /spec 👍

Contributor

the only URL that changed (removed) is concepts/catalog

@@ -43,21 +43,20 @@ nav:
   - Project:
     - Community: community.md
     - Contributing: contribute.md
-    - REST Catalog Spec: https://editor-next.swagger.io/?url=https://raw.githubusercontent.com/apache/iceberg/main/open-api/rest-catalog-open-api.yaml
-    - Table Spec: spec.md
-    - View spec: view-spec.md
-    - Puffin spec: puffin-spec.md
-    - AES GCM Stream spec: gcm-stream-spec.md
-    - Implementation status: status.md
     - Multi-engine support: multi-engine-support.md
     - How to release: how-to-release.md
-    - Terms: terms.md
   - ASF:
     - Sponsorship: https://www.apache.org/foundation/thanks.html
     - Events: https://www.apache.org/events/current-event.html
     - Privacy: https://privacy.apache.org/policies/privacy-policy-public.html
     - License: https://www.apache.org/licenses/
     - Security: https://www.apache.org/security/
     - Sponsors: https://www.apache.org/foundation/thanks.html
-  - Concepts:
-    - Catalogs: concepts/catalog.md
+  - Specification:
+    - Terms: terms.md
+    - REST Catalog Spec: https://editor-next.swagger.io/?url=https://raw.githubusercontent.com/apache/iceberg/main/open-api/rest-catalog-open-api.yaml
+    - Table Spec: spec.md
+    - View spec: view-spec.md
+    - Puffin spec: puffin-spec.md
+    - AES GCM Stream spec: gcm-stream-spec.md
+    - Implementation status: status.md