Feature: MERGE/Upsert Support #1534
Conversation
@kevinjqliu - I'm doing this work on behalf of my company, and when I ran my tests I used a standard Python virtual environment (venv); I haven't yet figured out how to get Poetry to work inside my company's firewall. So I'm not sure whether those are errors I can address or if someone else can pitch in here.
@mattmartin14, what's going on man!? Thanks for working on this and most impressively thanks for the comprehensive description. Out of curiosity, did you discuss your approach with anyone before putting this together? This is good but a few flags for OSS contributions to lower the upfront back and forth:
Contributing to OSS has a different focus than internal code, so hopefully these help. This does look well thought out in terms of implementation, but performance should be a second or third priority in favor of having code history that everyone in the community can wrap their heads around. I'd suggest getting these addressed before Fokko and Kevin scan it. I'll be happy to do a quick glance once the tests are running and there's some consensus around datafusion. PR number one, yeah!
Thanks @bitsondatadev for all this great feedback. I'll get working on your suggestions, push an update next week, and address all your concerns.
Thanks @mattmartin14 for the PR! And thanks @bitsondatadev on the tips on working in OSS. I certainly had to learn a lot of these over the years. A couple things I think we can address first.
This has been a much anticipated and frequently requested feature in the community. Issue #402 has been tracking it, with many eyes on it. I think we still need to figure out the best approach to support this feature. Like you mentioned in the description, as we build out more and more engine-like features, it becomes harder to support more complex and data-intensive workloads such as MERGE INTO. We have been able to use pyarrow for query processing, but it has its own limitations. For more compute-intensive workloads, such as the Bucket and Truncate transforms, we were able to leverage Rust (iceberg-rust) to handle the computation. Looking at #402, I don't see any concrete plans on how we can support MERGE INTO. I've added this as an agenda item for the monthly pyiceberg sync and will post the update. Please join us if you have time!
I'm very interested in exploring datafusion and ways we can leverage it for this project. As I mentioned above, we currently use pyarrow to handle most of the compute, so it will be interesting to evaluate datafusion as an alternative. Datafusion has its own ecosystem of an expression API, a DataFrame API, and a runtime, all of which are good complements to pyiceberg. It has integrations on the Rust side as well, something I have started exploring in apache/iceberg-rust#865. That said, I think we need a wider discussion and alignment on how to integrate with datafusion. It's a good time to start thinking about it! I've added this as another discussion item on the monthly sync.
Compute intensive workloads are generally a bottleneck in python. I am excited for future pyiceberg <> iceberg-rust integration where we can leverage rust to perform those computations.
This is an interesting observation, and I think I've seen someone else run into this issue before. We'd want to address this separately; it's something we might want to explore by using datafusion's expression API to replace our own parser.
@kevinjqliu @Fokko @bitsondatadev - the issues should be resolved. I got Poetry working within my company's firewall; I've also removed the dead code and added the license headers to each file. Please take a look.
Also - I added datafusion to the Poetry toml file and lock, and it appears that you all need to resolve the conflict here, as it's not letting me.
Also @kevinjqliu - to address your question on datafusion: when I looked into this feature, I explored these three options for an Arrow processing engine:
I ultimately decided that datafusion would make the most sense, given what it had going for it:
Hope this helps explain how I arrived at that conclusion. Just using native pyarrow to try to process the data would be a very large uphill battle, as we would effectively have to build our own data processing engine with it (e.g., hash joins, sorting, optimizations, etc.). I figured it does not make sense to reinvent the wheel when we can instead use an engine that is already out there (datafusion) and put it to good use. I took a look at the attachment you posted for upcoming pyiceberg sync meetings, but did not see any 2025 meetings listed. I'd be glad to attend to discuss this further, if needed. Thanks,
Thanks for working on this @mattmartin14. There is some work to be done here, mostly because we pull the table into memory, and then perform the operation, which defeats the purpose of Iceberg because we don't use the statistics to optimize the query. I left a few comments
Hi all, I think if we tackle the basic merge/upsert pattern (when matched update all, when not matched insert all), that would cover 90% of merge use cases. For things that require a more involved upsert using multiple matched predicates, we should direct users to Spark SQL, since that is already baked into the platform. If anyone disagrees with that directional statement, please let me know.
@Fokko - I've updated the PR to include the pred code you provided to pre-filter the Iceberg table, thus avoiding loading it all into memory. I was also able to reuse that same pred code as the overwrite_filter further down. Thanks; I think this PR is starting to look in good shape. The only outstanding item I see is whether we want to rename "merge_rows" to "merge". I added some comments on that thread and would like to know your thoughts. Thanks,
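To make the pattern being discussed concrete, here is a rough sketch of the pre-filter-then-overwrite flow, assuming pyiceberg's `Table.scan(row_filter=...)` and `Table.overwrite(..., overwrite_filter=...)` APIs. The catalog name, table name, and the placeholder diff step are illustrative only, not the PR's actual code:

```python
import pyarrow as pa
from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import In

# Hypothetical identifiers, for illustration only.
catalog = load_catalog("default")
tbl = catalog.load_table("db.customers")

source = pa.table({"cust_id": [1, 2], "cust_name": ["smith", "jones"], "age": [30, 40]})

# Pre-filter: only read the target rows whose keys appear in the source,
# instead of loading the whole table into memory.
key_pred = In("cust_id", source["cust_id"].to_pylist())
target_subset = tbl.scan(row_filter=key_pred).to_arrow()

# ... diff target_subset against source to find the rows that actually changed ...
rows_to_update = source  # placeholder for the real comparison logic

# Reuse the same predicate as the overwrite filter so that only the affected
# rows are deleted and rewritten, rather than the entire table.
tbl.overwrite(rows_to_update, overwrite_filter=key_pred)
```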
@Fokko - I added some additional smoke tests for situations where the primary key is a string or a date; the filter list code you wrote works fine for ints and strings, but on dates I'm getting a type error like this:

TypeError: Invalid literal value: datetime.date(2021, 1, 1)

For reference, here is the function to help jog your memory. Do you know how we can update this function to handle situations where a date is one of the join columns?

def get_filter_list(df: pyarrow_table, join_cols: list) -> BooleanExpression:
    # Distinct key values present in the source dataframe.
    unique_keys = df.select(join_cols).group_by(join_cols).aggregate([])
    pred = None
    if len(join_cols) == 1:
        # Single key column: one IN predicate over the distinct key values.
        pred = In(join_cols[0], unique_keys[0].to_pylist())
    else:
        # Composite key: OR together an AND-of-equalities per distinct key tuple.
        pred = Or(*[
            And(*[
                EqualTo(col, row[col])
                for col in join_cols
            ])
            for row in unique_keys.to_pylist()
        ])
    return pred
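One possible workaround (my own assumption, not something settled in this thread) would be to normalize date keys to ISO-8601 strings before building the predicate, relying on pyiceberg coercing string literals to the column's type when the expression is bound to the table schema:

```python
import datetime
from pyiceberg.expressions import In

def _to_predicate_value(val):
    # Hypothetical helper: datetime.date literals trip the "Invalid literal value"
    # error above, so fall back to an ISO-8601 string, which pyiceberg should be
    # able to coerce to a date literal when the predicate is bound.
    if isinstance(val, datetime.date):
        return val.isoformat()
    return val

# Example with a single date join column (column name is made up).
keys = [datetime.date(2021, 1, 1), datetime.date(2021, 1, 2)]
pred = In("order_date", [_to_predicate_value(k) for k in keys])
```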
Trying another push on the table init; found a whitespace issue.
woot! Thanks! can you run |
I thought I had already pushed the fixes for the vendor facebook / hive metastore stuff? Is that not picked up?
@kevinjqliu - I saw I pushed the hive metastore and ttypes changes, but I didn't push your changes to fb303; LMK if you still need me to run a checkout.
The changes to the vendor files still aren't showing up here. Could you try something like checking the vendor directory out from the upstream main branch, assuming that you have a remote pointing at apache/iceberg-python?
Hey @kevinjqliu - I ran git remote -v and this is what I got:

github [email protected]:apache-iceberg-python.git (fetch)
github [email protected]:apache-iceberg-python.git (push)
origin https://github.com/StateFarmIns/iceberg-python.git (fetch)
origin https://github.com/StateFarmIns/iceberg-python.git (push)

I'm not really sure where to go from here. I tried git fetch origin main, which worked, followed by git checkout origin main/vendor, and got this error:

error: pathspec 'main/vendor' did not match any file(s) known to git

Any thoughts?
In this case, the right alias is github (the remote pointing at the upstream repo), so try git fetch github followed by git checkout github/main -- vendor; mind the space before vendor.
Sorry @kevinjqliu - that did not work; we have the iceberg-python project forked to https://github.com/StateFarmIns/iceberg-python. When I try to run git fetch github main, I get a timeout on my side (firewall problem). Can you let me know what changes you want done? I thought I redid the vendor files based on what you provided earlier, so I'm surprised things are still out of sync.
Ah, that's odd. Let's do this the manual way for these files:
copy/paste the files from https://github.com/apache/iceberg-python
@kevinjqliu - I reran those. I'm now getting a hash conflict on the Poetry lock file. Can you help produce the correct file again so I can resolve it?
Thanks for driving this @mattmartin14 🙌 I believe this looks good; there are some files touched that are not needed.
Co-authored-by: Fokko Driesprong <[email protected]>
@Fokko, @kevinjqliu - I think we are good on the vendor files; I just need your help one more time computing the poetry.lock file, since if I do it on my end, it will embed our private State Farm repo on nearly every line.
@Fokko @kevinjqliu - for context on this poetry lock file issue that keeps coming up: if I run poetry lock locally, it will embed my company's private repo throughout the file on hundreds of lines. I know this open source project is getting constant updates, so this is a tough game of catch-me-if-you-can. So I need one of you to rerun the poetry lock file with datafusion as a dependency, if you don't mind, and upload it to a separate branch so I can pull it down manually. As an FYI - I've raised this issue with our internal OSS team, since in the long run we will want something more sustainable so I can continue to contribute to the project. But for now, I need your help regenerating the pyproject.toml and poetry.lock files. Thanks,
I've tried something a little different to resolve the lock file conflict; I'll see if it works. I added my company's repo with the priority==explicit option, and it looks like that might have prevented a bunch of insertions.
Looks like there are some changes that we previously removed that are now back :/ @mattmartin14, do you mind if I push these commits to a new PR from my fork so I can make the necessary changes?
That's fine, go ahead and do it. Teamwork makes the dream work!
Closes #402

This PR adds the `upsert` function to the `Table` class and supports the following upsert operations:
- when matched update all
- when not matched insert all

This PR is a remake of #1534 due to some infrastructure issues. For additional context, please refer to that PR.

Co-authored-by: VAA7RQ <[email protected]>
Co-authored-by: VAA7RQ <[email protected]>
Co-authored-by: mattmartin14 <[email protected]>
Co-authored-by: Fokko Driesprong <[email protected]>
The final stage of this PR was moved to new PR #1660 due to some infrastructure challenges on my end. Closing this one now.
Hi,
This PR adds basic upsert functionality to pyiceberg. It also addresses issue #402.
An upsert is an operation that combines an update and an insert into a table, in this case an Iceberg table. What makes an upsert unique is that it is wrapped in a single transaction, so either the update and the insert both succeed together, or, if one fails, the entire operation fails.
To illustrate how an upsert functions with a simple example, let's assume we have the following pyarrow dataframe as our source (input):
And we have the following iceberg table as our target that we want to upsert our source input to:
In this example, let's assume our primary key (join column) is cust_id; our upsert function will perform the following operations on the target Iceberg table:
Please note: cust_id = 1 exists in both the source and target tables, but since the non-key columns (cust_name, age) have not changed, we will not update that row; this avoids unnecessary I/O.
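The source and target tables from the original description did not survive the copy into this thread, so here is a hypothetical reconstruction of the scenario. The catalog name, table name, and row values are made up, and the `tbl.upsert(...)` call assumes the API shape that eventually landed via PR #1660 rather than quoting code from this PR:

```python
import pyarrow as pa
from pyiceberg.catalog import load_catalog

# Hypothetical catalog / table identifiers, for illustration only.
catalog = load_catalog("default")
tbl = catalog.load_table("db.customers")  # assume it currently holds (1, "A", 21) and (2, "B", 33)

# Source rows to upsert: cust_id 1 is unchanged, cust_id 2 has a new age, cust_id 3 is new.
source = pa.table({
    "cust_id": [1, 2, 3],
    "cust_name": ["A", "B", "C"],
    "age": [21, 34, 45],
})

# Expected outcome per the description above:
#   cust_id 1 -> skipped (no non-key changes), cust_id 2 -> updated, cust_id 3 -> inserted,
#   all within a single transaction.
tbl.upsert(source, join_cols=["cust_id"])
```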
When I originally submitted this PR, I was using Apache Datafusion as the data processing engine to determine the rows eligible for an update and rows that needed to be inserted. After some iterations and discussion, the pyiceberg team reached a consensus that for now, we will not introduce datafusion into pyiceberg as a dependency; instead, this feature uses the pyarrow compute engine to determine the rows eligible for updates and inserts.
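For readers curious what "using the pyarrow compute engine" can look like, below is a rough sketch of splitting a source table into insert candidates and changed-row update candidates using pyarrow joins and compute functions. This is my own illustration, under the assumption of at least one non-key column, and not the code in this PR:

```python
import pyarrow as pa
import pyarrow.compute as pc

def split_upsert_rows(source: pa.Table, target: pa.Table, join_cols: list[str]):
    """Illustrative only: split source rows into (rows_to_update, rows_to_insert)."""
    # Keys present in the source but not in the target -> insert candidates.
    rows_to_insert = source.join(target.select(join_cols), keys=join_cols, join_type="left anti")

    # Keys present in both -> compare non-key columns so unchanged rows are skipped.
    non_key_cols = [c for c in source.column_names if c not in join_cols]
    matched = source.join(target, keys=join_cols, join_type="inner", right_suffix="_tgt")

    changed = None
    for col in non_key_cols:
        # Note: nulls would need extra care; pc.equal returns null when either side is null.
        diff = pc.invert(pc.equal(matched[col], matched[f"{col}_tgt"]))
        changed = diff if changed is None else pc.or_(changed, diff)

    rows_to_update = matched.filter(changed).select(source.column_names)
    return rows_to_update, rows_to_insert

src = pa.table({"cust_id": [1, 2, 3], "cust_name": ["A", "B", "C"], "age": [21, 34, 45]})
tgt = pa.table({"cust_id": [1, 2], "cust_name": ["A", "B"], "age": [21, 33]})
updates, inserts = split_upsert_rows(src, tgt, ["cust_id"])  # -> update cust_id 2, insert cust_id 3
```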
This feature will initially support the following upsert operations:
- when matched update all
- when not matched insert all
Down the road, we could potentially enhance this feature to handle more upsert predicates and also work on performance for larger tables. For now, we wanted to get the basic functionality in the hands of developers.
Thanks,
Matt Martin