Support for Shallow Clone / Zero Copy Cloning in Apache Iceberg #12263

nqvuong1998 · 2025-02-14T03:28:37Z

Feature Request / Improvement

Description: I would like to request a feature similar to Databricks' Shallow Clone or Snowflake's Zero Copy Cloning in Apache Iceberg. This feature would enable users to create a new Iceberg table that references the same underlying data files as an existing table without duplicating storage.

Motivation: Currently, Iceberg supports snapshot-based branching and time-travel capabilities, but it does not provide a mechanism to create a "cloned" table that references existing data without copying it. Introducing a shallow clone feature would provide several benefits:

Storage Efficiency: Avoids unnecessary duplication of data files, reducing storage costs.
Fast Table Creation: Enables near-instant table creation, as only metadata needs to be managed.
Flexible Data Management: Supports use cases such as testing, experimentation, and versioning without physical data replication.

Proposed Solution: The implementation could leverage Iceberg’s metadata layer to create a new table with the same data files as an existing table, while allowing future modifications to be independent. Key considerations:

The cloned table should inherit the snapshot of the source table at the time of cloning.
Future writes to the cloned table should create new data files, without affecting the source table.
Clones should support metadata-based optimizations such as compaction and partition pruning.
Optional: Ability to specify whether metadata updates (e.g., schema changes) in the source table should propagate to the clone.
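
The considerations above can be pictured with a small toy model. This is only a sketch under stated assumptions: `Table` and `shallow_clone` are hypothetical names invented for illustration, not Iceberg APIs, and a table is reduced to the set of data file paths it references.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    """Toy stand-in for a table: a name plus the data file paths it references."""
    name: str
    data_files: set[str] = field(default_factory=set)

    def append(self, path: str) -> None:
        # A write adds a new file reference; existing files are never modified.
        self.data_files = self.data_files | {path}


def shallow_clone(source: Table, new_name: str) -> Table:
    # Hypothetical clone: copy only the file references, never the files.
    return Table(name=new_name, data_files=set(source.data_files))


source = Table("db.events", {"events/a.parquet", "events/b.parquet"})
clone = shallow_clone(source, "db.events_clone")

clone.append("events_clone/c.parquet")  # write to the clone only
source.append("events/d.parquet")       # later write to the source only

# Files still shared by both tables after they diverge.
shared = source.data_files & clone.data_files
print(sorted(shared))
```

In this model the two tables share only the files that existed at clone time, and each later write is visible in exactly one of them.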

Query engine

Spark
Trino
StarRocks

Willingness to contribute

I can contribute this improvement/feature independently
I would be willing to contribute this improvement/feature with guidance from the Iceberg community
I cannot contribute this improvement/feature at this time

databius · 2025-02-14T04:35:00Z

+1
It is a great feature.

RussellSpitzer · 2025-02-14T15:57:32Z

Can you elaborate on how it would differ from branching?

nqvuong1998 · 2025-02-17T07:17:44Z

Hi @RussellSpitzer,

While Iceberg already supports branching, this feature differs in the following ways:

Branching creates an isolated version of a table's metadata that can diverge over time, while shallow cloning creates a new table reference that does not inherit future changes from the source unless explicitly refreshed.
Branches maintain a complete history of changes and allow commits, merges, and rollbacks, whereas shallow clones are meant for lightweight table duplication without maintaining lineage.
Shallow clones focus on quick duplication of datasets for different workloads (e.g., testing, experimentation) without affecting the original table structure, unlike branches that are designed for collaborative versioning and long-term dataset evolution.

Fokko · 2025-02-17T09:47:37Z

> Future writes to the cloned table should create new data files, without affecting the source table.

I don't think that's an issue: all the (meta)data in Iceberg is immutable. What could happen is that the original table progresses and, at some point, the snapshot that the other table was cloned from expires. This would then break the cloned table. This is something that we need to figure out as part of location ownership (#9133).
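
The failure mode described here can be sketched with a toy model. The assumptions are loud: `expire_snapshot` is a hypothetical helper, not the Iceberg procedure, and it only knows about the snapshots of the table it is run on, which is why it cannot see the clone.

```python
def expire_snapshot(table_snapshots: dict[int, set[str]], snapshot_id: int) -> set[str]:
    """Drop one snapshot and return the files that this table no longer references."""
    removed = table_snapshots.pop(snapshot_id)
    still_referenced = set().union(*table_snapshots.values())
    # Only this table's remaining snapshots are consulted; clones are invisible here.
    return removed - still_referenced


source_snapshots = {
    1: {"a.parquet", "b.parquet"},
    2: {"b.parquet", "c.parquet"},  # a.parquet was replaced in snapshot 2
}

# Hypothetical clone taken while snapshot 1 was current.
clone_files = set(source_snapshots[1])

deleted = expire_snapshot(source_snapshots, 1)
dangling = clone_files & deleted  # files the clone still needs but the source removed
print(sorted(dangling))
```

Here the clone ends up pointing at a file that the source table considered safe to delete.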

I agree with @RussellSpitzer that this is very similar to branching, and it also overlaps with creating a view of a specific version of the table.

> unlike branches that are designed for collaborative versioning and long-term dataset evolution.

Branches are pretty flexible, and I think it could also work for your use-case here. For example, see write-audit-publish.

RussellSpitzer · 2025-02-17T18:42:38Z

I don't follow these points:

> Branching creates an isolated version of a table's metadata that can diverge over time, while shallow cloning creates a new table reference that does not inherit future changes from the source unless explicitly refreshed.

Branches do not inherit future changes from source?

> Branches maintain a complete history of changes and allow commits, merges, and rollbacks, whereas shallow clones are meant for lightweight table duplication without maintaining lineage.

Branches don't maintain the complete history; they are essentially just a tag in the metadata.json. While they allow other operations to be performed on top of them, I'm not sure how that's different from a shallow clone.

> Shallow clones focus on quick duplication of datasets for different workloads (e.g., testing, experimentation) without affecting the original table structure, unlike branches that are designed for collaborative versioning and long-term dataset evolution.

What stops a branch from being used for testing or experimentation? How would this affect the original table?

databius · 2025-02-18T01:48:42Z

I do think the purpose of branching in Iceberg is different from the purpose of cloning.

In our case, we want to copy production data to a separate environment. It must be safe: production data has to be protected in all situations and isolated from any activity in the clone environment.

In BigQuery, it's pretty simple to CLONE a terabyte-scale table in just a few tens of seconds.

Iceberg branching is very useful for the WAP (write-audit-publish) pattern, but it is not that safe. Say you switch to another branch using Spark:

SET spark.wap.branch=tmp;

Then drop the table:

DROP TABLE your_catalog.your_dataset.your_table PURGE;

Then say goodbye to your data.

RussellSpitzer · 2025-02-18T15:14:22Z

> SET spark.wap.branch=tmp;
>
> Then drop the table:
>
> DROP TABLE your_catalog.your_dataset.your_table PURGE;
>
> Then say goodbye to your data.

If a shallow clone shares the data files of the original table, how is that different? This is what I'm getting at. What are we looking for here that doesn't exist within branching? What features specifically do we want to add or change?

databius · 2025-02-19T02:34:52Z

> What are we looking for here that doesn't exist within branching?

My case is copying production data to a separate environment. For example, in BigQuery:

CREATE TABLE myproject.myDataset_backup.myTableClone
CLONE myproject.myDataset.myTable;

I can then perform any operation on myTableClone, including dropping it, without affecting the original table.

RussellSpitzer · 2025-02-19T04:29:31Z

Again, how is this different than a branch? Are you just saying you want a branch with a different catalog identifier?

databius · 2025-02-19T04:56:50Z

> Again, how is this different than a branch?

Some differences:

1. Access control.
2. In some cases I need to run a script that drops a table. Experimenting with it on a cloned table does not affect the original table, whereas using another branch may remove my original data files.

RussellSpitzer · 2025-02-19T05:16:22Z

Only number 1 is a difference here, and it's at the catalog level. A branch cannot affect files in another branch. If you delete a snapshot from one branch but it still exists in another branch, it won't be lost.

Number 1, access control, is a catalog concern at the moment, but I'm afraid that with vended credentials based on table location, you wouldn't be able to separate a branch (or shallow clone) from its source except at the engine level, for something like Trino. That said, if a separate identity is what you are looking for, we probably have to make a catalog alias, which is again on the catalog side.
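
The point about branches not affecting each other's files can be illustrated with a simplified reachability check. This is a toy sketch, not Iceberg's retention logic: `unreachable_files` is a hypothetical helper, and each branch is assumed to keep alive only the single snapshot it points at.

```python
def unreachable_files(snapshots: dict[int, set[str]], refs: dict[str, int]) -> set[str]:
    """Files referenced only by snapshots that no branch points at."""
    live = set().union(*(snapshots[sid] for sid in refs.values()))
    everything = set().union(*snapshots.values())
    return everything - live


snapshots = {
    1: {"a.parquet", "b.parquet"},
    2: {"b.parquet", "c.parquet"},  # main moved on and no longer uses a.parquet
}

# Two branches inside the same table: main is at snapshot 2, tmp still at snapshot 1.
with_tmp = unreachable_files(snapshots, {"main": 2, "tmp": 1})

# Once the tmp branch is gone, nothing references snapshot 1 any more.
without_tmp = unreachable_files(snapshots, {"main": 2})

print(sorted(with_tmp), sorted(without_tmp))
```

Because both branches live in the same table metadata, cleanup can see every reference, so `a.parquet` only becomes removable after the last branch that needs it is dropped.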

nqvuong1998 added the "improvement" label (PR that improves existing functionality) on Feb 14, 2025
