Support for Identity Columns in Apache Iceberg #12297

Open
1 of 3 tasks
nqvuong1998 opened this issue Feb 17, 2025 · 1 comment
Labels
improvement PR that improves existing functionality

Comments

@nqvuong1998

nqvuong1998 commented Feb 17, 2025

Feature Request / Improvement

Summary:
Apache Iceberg should support identity columns similar to Delta Lake. This feature would allow users to define identity columns in Iceberg tables, where unique values are automatically generated when not explicitly provided during writes.

Motivation:
Currently, Apache Iceberg does not provide built-in support for identity columns. In contrast, Delta Lake allows defining identity columns that generate unique values when users do not explicitly provide them. This feature simplifies the handling of primary keys and auto-incrementing IDs in use cases such as:

  • Maintaining unique row identifiers in tables without requiring external sequence management.

  • Enabling better support for incremental ingestion scenarios where records require unique IDs.

  • Reducing complexity for users transitioning from traditional databases that support auto-incrementing primary keys.

Proposed Implementation:

  • Introduce a new table property (e.g., identity.column=true) to enable identity columns on specific fields.

  • Define syntax for identity column declaration during table creation (e.g., CREATE TABLE ... (id BIGINT IDENTITY, name STRING)).

  • Implement automatic value generation for identity columns when an explicit value is not provided.

  • Ensure compatibility with Iceberg’s partitioning, snapshot isolation, and metadata management.
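To make the "automatic value generation" item concrete, here is a minimal sketch of how a writer could fill in missing identity values from a stored high-water mark. Everything in it is an assumption for illustration: the function name `assign_identity`, the `high_water_mark` and `step` parameters, and the idea of persisting the mark in table metadata are hypothetical and are not part of any Iceberg API.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def assign_identity(
    rows: List[Row],
    column: str,
    high_water_mark: int,
    step: int = 1,
) -> Tuple[List[Row], int]:
    """Fill in missing identity values and return the new high-water mark.

    Hypothetical helper, not an Iceberg API. Rows that already carry an
    explicit value are passed through unchanged, and no uniqueness check
    is performed against those explicit values.
    """
    if step == 0:
        raise ValueError("step must be non-zero")

    next_value = high_water_mark
    out: List[Row] = []
    for row in rows:
        new_row = dict(row)  # copy so the caller's input is not mutated
        if new_row.get(column) is None:
            next_value += step
            new_row[column] = next_value
        out.append(new_row)
    return out, next_value


# Example: the table's last generated value was 0.
rows = [{"name": "a"}, {"id": 100, "name": "b"}, {"id": None, "name": "c"}]
filled, new_mark = assign_identity(rows, "id", high_water_mark=0)
```

In a real implementation the new high-water mark would presumably have to be committed atomically with the snapshot, so that concurrent writers cannot hand out the same values. That is where the compatibility with snapshot isolation mentioned above comes in.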

Alternatives Considered:

  • Using externally managed sequences or UUIDs. These approaches introduce additional complexity and overhead.

  • Leveraging application-side logic to generate unique values, which is not as efficient as native support.
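For comparison, the application-side workaround can be sketched as follows, using random UUIDs from the Python standard library. The helper name `with_uuid_keys` and the default column name `id` are hypothetical. Note that the resulting keys are strings rather than `BIGINT` values and carry no ordering, which is part of the overhead mentioned above.

```python
import uuid
from typing import Any, Dict, List

Row = Dict[str, Any]


def with_uuid_keys(rows: List[Row], column: str = "id") -> List[Row]:
    """Assign a random UUID to every row that lacks a key (hypothetical helper)."""
    out: List[Row] = []
    for row in rows:
        new_row = dict(row)  # copy so the caller's input is not mutated
        if new_row.get(column) is None:
            # uuid4() is random, so no coordination between writers is needed,
            # but the values are 36-character strings with no ordering.
            new_row[column] = str(uuid.uuid4())
        out.append(new_row)
    return out


keyed = with_uuid_keys([{"name": "a"}, {"name": "b"}, {"id": "fixed", "name": "c"}])
```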

Additional Context:
Delta Lake’s identity column feature is described here. A similar implementation in Iceberg would improve usability and adoption.

Query engine

  • Spark
  • Trino
  • StarRocks

Willingness to contribute

  • I can contribute this improvement/feature independently
  • I would be willing to contribute this improvement/feature with guidance from the Iceberg community
  • I cannot contribute this improvement/feature at this time
@nqvuong1998 nqvuong1998 added the improvement PR that improves existing functionality label Feb 17, 2025
@RussellSpitzer
Member

RussellSpitzer commented Feb 17, 2025 via email
