Support for Shallow Clone / Zero Copy Cloning in Apache Iceberg #12263

Open
1 of 3 tasks
nqvuong1998 opened this issue Feb 14, 2025 · 11 comments
Labels
improvement PR that improves existing functionality

Comments

@nqvuong1998
nqvuong1998 commented Feb 14, 2025

Feature Request / Improvement

Description: I would like to request a feature similar to Databricks' Shallow Clone or Snowflake's Zero Copy Cloning in Apache Iceberg. This feature would enable users to create a new Iceberg table that references the same underlying data files as an existing table without duplicating storage.

Motivation: Currently, Iceberg supports snapshot-based branching and time-travel capabilities, but it does not provide a mechanism to create a "cloned" table that references existing data without copying it. Introducing a shallow clone feature would provide several benefits:

  • Storage Efficiency: Avoids unnecessary duplication of data files, reducing storage costs.

  • Fast Table Creation: Enables near-instant table creation, as only metadata needs to be managed.

  • Flexible Data Management: Supports use cases such as testing, experimentation, and versioning without physical data replication.

Proposed Solution: The implementation could leverage Iceberg’s metadata layer to create a new table with the same data files as an existing table, while allowing future modifications to be independent. Key considerations:

  • The cloned table should inherit the snapshot of the source table at the time of cloning.

  • Future writes to the cloned table should create new data files, without affecting the source table.

  • Clones should support metadata-based optimizations such as compaction and partition pruning.

  • Optional: Ability to specify whether metadata updates (e.g., schema changes) in the source table should propagate to the clone.
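
The key considerations above can be illustrated with a small toy model. This is only a sketch under stated assumptions: the `Snapshot`, `Table`, and `shallow_clone` names and the file paths are hypothetical and are not part of any Iceberg API. It shows a clone that starts from the source's current file set by copying metadata only, and whose later writes leave the source untouched.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """An immutable snapshot: an id plus the set of data files it references."""
    snapshot_id: int
    data_files: frozenset


@dataclass
class Table:
    """Toy table (hypothetical model): a name and an append-only snapshot list."""
    name: str
    snapshots: list = field(default_factory=list)

    def current(self) -> Snapshot:
        return self.snapshots[-1]

    def append(self, new_files) -> Snapshot:
        """Commit a new snapshot that adds files; earlier snapshots are untouched."""
        base = self.current().data_files if self.snapshots else frozenset()
        snap = Snapshot(len(self.snapshots) + 1, base | frozenset(new_files))
        self.snapshots.append(snap)
        return snap


def shallow_clone(source: Table, clone_name: str) -> Table:
    """Create a new table whose first snapshot reuses the source's current files.

    Only metadata is copied; no data file is read or rewritten.
    """
    return Table(clone_name, [Snapshot(1, source.current().data_files)])


# Illustrative paths only.
source = Table("db.events")
source.append(["s3://bucket/events/data/a.parquet",
               "s3://bucket/events/data/b.parquet"])

clone = shallow_clone(source, "db.events_clone")
clone.append(["s3://bucket/events_clone/data/c.parquet"])

print(sorted(source.current().data_files))  # still only a.parquet and b.parquet
print(sorted(clone.current().data_files))   # a.parquet, b.parquet, plus c.parquet
```

The model deliberately ignores file cleanup, which is where the two tables stop being independent.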

Query engine

  • Spark

  • Trino

  • StarRocks

Willingness to contribute

  • I can contribute this improvement/feature independently
  • I would be willing to contribute this improvement/feature with guidance from the Iceberg community
  • I cannot contribute this improvement/feature at this time
nqvuong1998 added the improvement label on Feb 14, 2025
@databius

+1
It is a great feature.

@RussellSpitzer
Member

Can you elaborate how it would be different than branching?

@nqvuong1998
Author

Hi @RussellSpitzer,

While Iceberg already supports branching, this feature differs in the following ways:

  • Branching creates an isolated version of a table's metadata that can diverge over time, while shallow cloning creates a new table reference that does not inherit future changes from the source unless explicitly refreshed.

  • Branches maintain a complete history of changes and allow commits, merges, and rollbacks, whereas shallow clones are meant for lightweight table duplication without maintaining lineage.

  • Shallow clones focus on quick duplication of datasets for different workloads (e.g., testing, experimentation) without affecting the original table structure, unlike branches that are designed for collaborative versioning and long-term dataset evolution.

@Fokko
Contributor

Fokko commented Feb 17, 2025

Future writes to the cloned table should create new data files, without affecting the source table.

I don't think that's an issue, since all the (meta)data in Iceberg is immutable. What could happen is that the original table progresses and, at some point, the snapshot that the other table was cloned from expires. This would then break the cloned table. This is something that we need to figure out as part of the location ownership #9133
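
A rough sketch of that failure mode, as a toy model rather than Iceberg's actual expiration code: the `expire_snapshots` function below is hypothetical and only mimics the idea that expiry deletes files no retained snapshot of the same table references. Because the source table knows nothing about the clone, files the clone still needs are removed.

```python
def expire_snapshots(snapshots, keep_ids, storage):
    """Toy expiry (hypothetical): drop snapshots not in keep_ids and delete
    files that no retained snapshot of THIS table references.
    It has no knowledge of any clone in another table."""
    retained = {sid: files for sid, files in snapshots.items() if sid in keep_ids}
    still_needed = set().union(*retained.values())
    expired_files = set().union(*snapshots.values()) - still_needed
    storage.difference_update(expired_files)  # simulate physical deletion
    return retained


# Snapshot 2 stands for a rewrite that replaced a/b with c.
source_snapshots = {1: {"a.parquet", "b.parquet"}, 2: {"c.parquet"}}
storage = {"a.parquet", "b.parquet", "c.parquet"}

# The clone was taken at snapshot 1 and references the same files.
clone_files = set(source_snapshots[1])

# The source table moves on and expires snapshot 1.
source_snapshots = expire_snapshots(source_snapshots, keep_ids={2}, storage=storage)

missing = clone_files - storage
print(sorted(missing))  # ['a.parquet', 'b.parquet']
```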

I agree with @RussellSpitzer that this is very similar to branching, and it also overlaps with creating a view of a specific version of the table.

unlike branches that are designed for collaborative versioning and long-term dataset evolution.

Branches are pretty flexible, and I think it could also work for your use-case here. For example, see write-audit-publish.

@RussellSpitzer
Member

I don't follow these points

  • Branching creates an isolated version of a table's metadata that can diverge over time, while shallow cloning creates a new table reference that does not inherit future changes from the source unless explicitly refreshed.

Branches do not inherit future changes from source?

  • Branches maintain a complete history of changes and allow commits, merges, and rollbacks, whereas shallow clones are meant for lightweight table duplication without maintaining lineage.

Branches don't maintain the complete history; they are essentially just a tag in the metadata.json. While they allow other operations to be performed on top of them, I'm not sure how that's different from a shallow clone.

  • Shallow clones focus on quick duplication of datasets for different workloads (e.g., testing, experimentation) without affecting the original table structure, unlike branches that are designed for collaborative versioning and long-term dataset evolution.

What stops a branch being used for testing or experimentation? How would this affect the original table?
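
To make the point about a branch being essentially a named reference in metadata.json concrete, here is a minimal sketch. The dictionary is a trimmed-down stand-in modeled loosely on the `refs` map in table metadata, and `create_branch` is a hypothetical helper, not an Iceberg API. Creating the branch adds one entry that points at an existing snapshot and copies no snapshots or data files.

```python
import json

# Trimmed-down stand-in for a table's metadata.json; only the fields
# relevant to this illustration are included.
metadata = {
    "current-snapshot-id": 3,
    "snapshots": [{"snapshot-id": 1}, {"snapshot-id": 2}, {"snapshot-id": 3}],
    "refs": {"main": {"snapshot-id": 3, "type": "branch"}},
}


def create_branch(meta, name, snapshot_id):
    """Hypothetical helper: return new metadata with one extra named reference."""
    known = {s["snapshot-id"] for s in meta["snapshots"]}
    if snapshot_id not in known:
        raise ValueError(f"unknown snapshot {snapshot_id}")
    if name in meta["refs"]:
        raise ValueError(f"ref {name!r} already exists")
    # Deep copy: the old metadata is left as-is and a new version is produced.
    updated = json.loads(json.dumps(meta))
    updated["refs"][name] = {"snapshot-id": snapshot_id, "type": "branch"}
    return updated


branched = create_branch(metadata, "tmp", 2)
print(json.dumps(branched["refs"], indent=2))
```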

@databius

databius commented Feb 18, 2025

I do think the purpose of branching in Iceberg is different from the purpose of cloning.

In our case, we want to copy production data to a separate environment. It must be safe, ensuring production data is protected in all situations, separate from any activities on the cloning environment.

In BigQuery, it's pretty simple to CLONE a TB table in just a few tens of seconds.

Iceberg branching is very useful for the WAP (write-audit-publish) pattern, but it is not that safe. Say you switch to another branch using Spark:

SET spark.wap.branch=tmp;

Then drop the table:

DROP TABLE your_catalog.your_dataset.your_table PURGE;

Then say goodbye to your data.

@RussellSpitzer
Member

SET spark.wap.branch=tmp;

Then drop the table:

DROP TABLE your_catalog.your_dataset.your_table PURGE;

Then say goodbye to your data.

If a shallow clone shares the data files of the original table, how is that different? This is what I'm getting at. What are we looking for here that doesn't exist within branching? What features specifically do we want to add or change?

@databius

What are we looking for here that doesn't exist within branching?

My case is copying production data to a separate environment. Example:

CREATE TABLE myproject.myDataset_backup.myTableClone
CLONE myproject.myDataset.myTable;

I can then perform any operation on myTableClone including dropping it without affecting the original table.

@RussellSpitzer
Member

Again, how is this different than a branch? Are you just saying you want a branch with a different catalog identifier?

@databius

Again, how is this different than a branch?

Some differences:

  1. Access control
  2. In some cases I need to run a script that drops a table. Experimenting with it on a cloned table does not affect the original table, whereas using another branch may remove my original data files

@RussellSpitzer
Member

Only item 1 is a difference here, and it's at the catalog level. A branch cannot affect files in another branch. If you delete a snapshot from one branch but it still exists in another branch, it won't be lost.
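
The claim that a snapshot removed from one branch survives while another branch still holds it can be sketched with a toy reachability check. This is an illustration under simplifying assumptions, not Iceberg's cleanup logic: `unreferenced_files` and the snapshot-to-file mapping are hypothetical, and each branch is modeled as a plain list of retained snapshot ids.

```python
def unreferenced_files(snapshot_files, branches):
    """Return files that no snapshot retained on ANY branch still references."""
    live_ids = set().union(*branches.values())
    live_files = set().union(*(snapshot_files[i] for i in live_ids))
    all_files = set().union(*snapshot_files.values())
    return all_files - live_files


# Snapshot id -> data files it references (illustrative values).
snapshot_files = {1: {"a.parquet"}, 2: {"b.parquet"}, 3: {"x.parquet"}}

# Both branches share snapshot 1; snapshot 3 exists only on "tmp".
branches = {"main": [1, 2], "tmp": [1, 3]}

# Remove the shared snapshot from "tmp" only: "main" still retains it.
branches["tmp"].remove(1)
after_removing_shared = unreferenced_files(snapshot_files, branches)
print(sorted(after_removing_shared))  # []

# Drop the "tmp" branch entirely: only its exclusive file becomes orphaned.
del branches["tmp"]
after_dropping_branch = unreferenced_files(snapshot_files, branches)
print(sorted(after_dropping_branch))  # ['x.parquet']
```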

As for number 1, access control is a catalog concern at the moment, but I'm afraid that with vended credentials based on table location, you wouldn't be able to separate a branch (or shallow clone) from its source except at the engine level, for something like Trino. That said, if a separate identity is what you are looking for, we would probably have to make a catalog alias, which is again on the catalog side.


4 participants