Support for Shallow Clone / Zero Copy Cloning in Apache Iceberg #12263
Comments
+1 |
Can you elaborate how it would be different than branching? |
Hi @RussellSpitzer, while Iceberg already supports branching, this feature differs in the following ways:
|
I don't think that's an issue, since all the (meta)data in Iceberg is immutable. What could happen is that the original table progresses, and at some point the snapshot that the other table was cloned from expires. This would then break the cloned table. This is something that we need to figure out as part of the location ownership work (#9133). I agree with @RussellSpitzer that this is very similar to branching, and it also overlaps with creating a view of a specific version of the table.
Branches are pretty flexible, and I think it could also work for your use-case here. For example, see write-audit-publish. |
I don't follow these points
Branches do not inherit future changes from source?
Branches don't maintain the complete history, they are essentially just a tag in the metadata.json and while they can allow for other operations to be performed on top of them I'm not sure how that's different than a shallow clone.
What stops a branch being used for testing or experimentation? How would this affect the original table? |
I do think the purpose of branching in Iceberg is different from the purpose of cloning. In our case, we want to copy production data to a separate environment. It must be safe, ensuring production data is protected in all situations and isolated from any activity in the cloned environment. In BigQuery, it's pretty simple to CLONE a TB table in just a few tens of seconds. Iceberg branching is very useful for the write-audit-publish (WAP) pattern, but it is not that safe. Let's say you switch to another branch using Spark.
Then drop the table:
Then say goodbye to your data. |
If a shallow clone shares the data files of the original table, how is that different? This is what I'm getting at. What are we looking for here that doesn't exist within branching? What features specifically do we want to add or change? |
My case is copying production data to a separate environment. Example:
I can then perform any operation on |
Again, how is this different than a branch? Are you just saying you want a branch with a different catalog identifier? |
Some differences:
|
Only number 1 is a difference here, and it's at the catalog level. A branch cannot affect files in another branch. If you delete a snapshot from one branch but it still exists in another branch, it won't be lost. As for number 1, access control is a catalog concern at the moment, but I'm afraid that with vended credentials based on table location, you wouldn't be able to separate a branch (or shallow clone) from its source except at the engine level, for something like Trino. That said, if the separate identity is what you are looking for, we probably have to make a catalog alias, which is again on the catalog side. |
Feature Request / Improvement
Description: I would like to request a feature similar to Databricks' Shallow Clone or Snowflake's Zero Copy Cloning in Apache Iceberg. This feature would enable users to create a new Iceberg table that references the same underlying data files as an existing table without duplicating storage.
Motivation: Currently, Iceberg supports snapshot-based branching and time-travel capabilities, but it does not provide a mechanism to create a "cloned" table that references existing data without copying it. Introducing a shallow clone feature would provide several benefits:
Storage Efficiency: Avoids unnecessary duplication of data files, reducing storage costs.
Fast Table Creation: Enables near-instant table creation, as only metadata needs to be managed.
Flexible Data Management: Supports use cases such as testing, experimentation, and versioning without physical data replication.
Proposed Solution: The implementation could leverage Iceberg’s metadata layer to create a new table with the same data files as an existing table, while allowing future modifications to be independent. Key considerations:
The cloned table should inherit the snapshot of the source table at the time of cloning.
Future writes to the cloned table should create new data files, without affecting the source table.
Clones should support metadata-based optimizations such as compaction and partition pruning.
Optional: Ability to specify whether metadata updates (e.g., schema changes) in the source table should propagate to the clone.
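To make the first two considerations concrete, here is a minimal toy model in Python. It is only a sketch of the intended semantics, not Iceberg code: `Snapshot`, `Table`, and `shallow_clone` are hypothetical names invented for illustration, and the file paths are made up. It assumes snapshots are immutable sets of data file paths, so a clone can start from the source's current snapshot without copying any files.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """Immutable view of a table: an id plus the data files it references."""
    snapshot_id: int
    data_files: frozenset[str]


@dataclass
class Table:
    """Hypothetical stand-in for a table's metadata (not the Iceberg API)."""
    name: str
    snapshots: list[Snapshot] = field(default_factory=list)

    def current(self) -> Snapshot:
        return self.snapshots[-1]

    def append(self, new_files: set[str]) -> None:
        # A write never mutates an existing snapshot; it adds a new one
        # that references the previous files plus the newly written ones.
        prev = self.current().data_files if self.snapshots else frozenset()
        next_id = self.current().snapshot_id + 1 if self.snapshots else 1
        self.snapshots.append(Snapshot(next_id, prev | frozenset(new_files)))


def shallow_clone(source: Table, name: str) -> Table:
    # Only metadata is created: the clone starts from the source's current
    # snapshot and shares its data files. No data file is copied.
    return Table(name, [source.current()])


prod = Table("prod.orders")
prod.append({"prod/a.parquet", "prod/b.parquet"})

dev = shallow_clone(prod, "dev.orders")
dev.append({"dev/c.parquet"})  # the write lands only in the clone's metadata

print(sorted(prod.current().data_files))
print(sorted(dev.current().data_files))
```

The model deliberately leaves out the hard part raised in the comments: if the source table expires the shared snapshot and deletes its files, the clone breaks, so a real design would need some notion of file or location ownership.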
Query engine
Spark
Trino
StarRocks
Willingness to contribute