-
Notifications
You must be signed in to change notification settings - Fork 93
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Fix issue #285 : save hive partitioned dataset using NativeExecutionEngine and DaskExecutionEngine #306
[WIP] Fix issue #285 : save hive partitioned dataset using NativeExecutionEngine and DaskExecutionEngine #306
Conversation
I looked at the tests here and from what I see, the only failure is in the
So the issue here appears to be that the added tests support partitioned files as output (which is what the PR is for), but duckdb fails the test because it doesn't support partitioning yet. How do we proceed with this @goodwanghan ? |
fugue_test/builtin_suite.py
Outdated
# TODO: in test below, once issue #288 is fixed, use dag.load instead of pd.read_parquet | ||
pd.testing.assert_frame_equal( | ||
pd.read_parquet(path3).sort_values('a').reset_index(drop=True), | ||
pd.DataFrame({'c': pd.Categorical([6, 2]), 'a': [1, 7]}).reset_index(drop=True), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we assert c is int
type?
The failure on duckdb test is expected because duckdb itself does not have partitioning feature. But it is quite straightforward to convert duckdbdataframe to arrow table ( |
I apologize for the delay, my daughter was born in Feb, so I am extremely busy recently. The PR looks great, we just need to change duckdb io function to pass the test suite. You can see we have other work items to redesign the IO part for all engines. Because the current design is a bit messy. But please continue working on this PR and merge first. The code refactoring can happen later. As long as we have good unit tests, we can ship this feature first. Thank you! |
Hi! No problem at all with the delay.
Congrats for the recent birth of your daughter!
I'll try to fix the duckdb io function that is failing to pass the test.
I did some prototyping this week to try to fix issue #288 related to
reading of hive partitioned dataset with Native and Dask execution engine.
I think I'm close to a solution in native mode, but it seems more dificult
to deal with dask categorical type, as category values are set to string
type. So the inference of partition column type is more tricky.
Laurent
…On Fri, Mar 4, 2022, 11:24 PM Han Wang ***@***.***> wrote:
I apologize for the delay, my daughter was born in Feb, so I am extremely
busy recently.
The PR looks great, we just need to change duckdb io function to pass the
test suite.
You can see we have other work items to redesign the IO part for all
engines. Because the current design is a bit messy. But please continue
working on this PR and merge first. The code refactoring can happen later.
As long as we have good unit tests, we can ship this feature first.
Thank you!
—
Reply to this email directly, view it on GitHub
<#306 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AFZ7LJVWTKVI4ALB4ETAFQTU6KERLANCNFSM5PRIYHVQ>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
ecosystem and downloads badge
* plugin * update
@LaurentErreca do you plan to finalize this PR soon? |
Hi, yes i hope i'll find the needed time soon!
Laurent
…On Sat, Mar 26, 2022, 10:23 PM Han Wang ***@***.***> wrote:
@LaurentErreca <https://github.com/LaurentErreca> do you plan to finalize
this PR soon?
—
Reply to this email directly, view it on GitHub
<#306 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AFZ7LJRYXQZAJSYA5ZK25T3VB553DANCNFSM5PRIYHVQ>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Upgrading black version
I've been working on this issue. For duckdb to pass the test, it must be able to write hive partitioned dataset. I propose to use arrow with param |
Yes exactly, we should just convert it to arrow and save (only if this partition key is specified). It is straightforward to convert duckdbdataframe to arrow table ( |
Codecov Report
@@ Coverage Diff @@
## master #306 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 101 102 +1
Lines 9480 9511 +31
=========================================
+ Hits 9480 9511 +31
Continue to review full report at Codecov.
|
It looks good you only have a linting issue left @LaurentErreca |
This is done by replacing: |
I think you should use |
fugue_duckdb/_io.py
Outdated
@@ -70,7 +70,7 @@ def save_df( | |||
NotImplementedError(f"{mode} is not supported"), | |||
) | |||
p = FileParser(uri, format_hint).assert_no_glob() | |||
if p.file_format not in self._format_save: | |||
if (p.file_format not in self._format_save) or ("partition_cols" in kwargs): | |||
self._fs.makedirs(os.path.dirname(uri), recreate=True) | |||
ldf = ArrowDataFrame(df.native.arrow()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use ArrowDataFrame(df.as_arrow())
instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok
Enable saving hive partitioned dataset using Native (pandas) or Dask execution engine. Work still in progress as test_take with Dask is failing.
Related to #285