Support for Identity Columns in Apache Iceberg #12297

nqvuong1998 · 2025-02-17T07:48:35Z

Feature Request / Improvement

Summary:
Apache Iceberg should support identity columns similar to Delta Lake. This feature would allow users to define identity columns in Iceberg tables, where unique values are automatically generated when not explicitly provided during writes.

Motivation:
Currently, Apache Iceberg does not provide built-in support for identity columns. In contrast, Delta Lake allows defining identity columns that generate unique values when users do not explicitly provide them. This feature simplifies the handling of primary keys and auto-incrementing IDs in use cases such as:

- Maintaining unique row identifiers in tables without requiring external sequence management.
- Enabling better support for incremental ingestion scenarios where records require unique IDs.
- Reducing complexity for users transitioning from traditional databases that support auto-incrementing primary keys.

Proposed Implementation:

- Introduce a new table property (e.g., identity.column=true) to enable identity columns on specific fields.
- Define syntax for identity column declaration during table creation (e.g., CREATE TABLE ... (id BIGINT IDENTITY, name STRING)).
- Implement automatic value generation for identity columns when an explicit value is not provided.
- Ensure compatibility with Iceberg’s partitioning, snapshot isolation, and metadata management.
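
To make the value-generation point more concrete, here is a minimal sketch of one way it could work. It is an illustration only, not an Iceberg API or an agreed design. The assumption is that a per-column high-water mark is kept alongside the table metadata, and that each writer reserves a contiguous block of values through an atomic compare-and-swap, retrying on conflict. Every name in it (IdentityState, FakeCatalog, reserve_ids, fill_identity) is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class IdentityState:
    """Hypothetical per-column state; not part of the Iceberg spec."""
    high_water_mark: int = 0  # last identity value handed out
    version: int = 0          # bumped on every successful commit


class CommitConflict(Exception):
    """Raised when another writer committed first."""


class FakeCatalog:
    """In-memory stand-in for a catalog offering atomic compare-and-swap."""

    def __init__(self):
        self._state = IdentityState()

    def load(self):
        # Return a copy so callers cannot mutate the committed state.
        return IdentityState(self._state.high_water_mark, self._state.version)

    def commit(self, base_version, new_mark):
        # Succeeds only if nobody committed since base_version was loaded.
        if base_version != self._state.version:
            raise CommitConflict("state changed since it was loaded")
        self._state = IdentityState(new_mark, base_version + 1)


def reserve_ids(catalog, count, max_retries=5):
    """Reserve `count` consecutive values; reload and retry on conflict."""
    for _ in range(max_retries):
        state = catalog.load()
        start = state.high_water_mark + 1
        end = state.high_water_mark + count
        try:
            catalog.commit(state.version, end)
        except CommitConflict:
            continue
        return range(start, end + 1)
    raise CommitConflict("gave up after %d attempts" % max_retries)


def fill_identity(rows, column, catalog):
    """Fill `column` only where the writer did not supply a value."""
    missing = sum(1 for r in rows if r.get(column) is None)
    ids = iter(reserve_ids(catalog, missing)) if missing else iter(())
    out = []
    for r in rows:
        row = dict(r)  # copy, so the input rows are left untouched
        if row.get(column) is None:
            row[column] = next(ids)
        out.append(row)
    return out


catalog = FakeCatalog()
batch = [{"id": None, "name": "a"}, {"id": 100, "name": "b"}, {"name": "c"}]
print(fill_identity(batch, "id", catalog))
```

Two things this sketch deliberately leaves open: values reserved by a write that later fails are skipped, so the sequence would be unique but not gap-free, and explicitly supplied values (such as 100 above) are not checked against generated ones.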

Alternatives Considered:

- Using externally managed sequences or UUIDs, but these approaches introduce additional complexity and overhead.
- Leveraging application-side logic to generate unique values, which is not as efficient as native support.
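
For comparison, the application-side UUID alternative might look like the sketch below. It assumes rows are plain dictionaries and that the ID column is stored as a string; the helper name assign_uuids is hypothetical.

```python
import uuid


def assign_uuids(rows, column="id"):
    """Return copies of rows, filling `column` with a random UUID where absent."""
    out = []
    for r in rows:
        row = dict(r)
        if row.get(column) is None:
            # uuid4 needs no coordination between writers, but the result
            # is a 128-bit random value, not an ordered BIGINT.
            row[column] = str(uuid.uuid4())
        out.append(row)
    return out


print(assign_uuids([{"id": None, "name": "a"}, {"id": "fixed", "name": "b"}]))
```

This avoids any shared counter, but every writing application has to remember to call it, and the values are neither sequential nor compact.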

Additional Context:
Delta Lake’s identity column feature is described here. A similar implementation in Iceberg would improve usability and adoption.

Query engine

Spark
Trino
StarRocks

Willingness to contribute

I can contribute this improvement/feature independently
I would be willing to contribute this improvement/feature with guidance from the Iceberg community
I cannot contribute this improvement/feature at this time

RussellSpitzer · 2025-02-17T18:33:27Z

I'm not quite sure how the "Proposed Implementation" would actually work. We probably need some actual details here. I would recommend starting a new design doc. The proposal should also integrate with the already existing concept of identifier fields: https://iceberg.apache.org/spec/#identifier-field-ids . The concept of auto-incrementing or generating values for rows probably also needs considerably more discussion, and probably its own design doc.


nqvuong1998 added the "improvement" label ("PR that improves existing functionality") on Feb 17, 2025
