Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[docs] Add documentation for file-based metastore #24511

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

steveburnett
Copy link
Contributor

@steveburnett steveburnett commented Feb 6, 2025

Description

Add how to configure Presto to use a file-based hms to installation/deployment.rst.

Motivation and Context

@ethanyzhang suggested this would be a good addition to the Presto documentation in an internal discussion that @nmahadevuni contributed the configuration in. I discussed where such information would fit best in the Presto documentation with @tdcmeehan.

Impact

Documentation. Readers wanting to try out Presto quickly can bypass the need for the steps in Configure Hive MetaStore.

Test Plan

Local doc build. Screenshot with existing text above and below included for context.
Screenshot 2025-02-06 at 4 41 07 PM

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

General Changes
* Add documentation for file-based Hive metastore to :doc:`/connector/file-based-metastore`.

@steveburnett steveburnett requested review from elharo and a team as code owners February 6, 2025 21:52
@prestodb-ci prestodb-ci added the from:IBM PR from IBM label Feb 6, 2025
@prestodb-ci prestodb-ci requested review from a team, psnv03 and NivinCS and removed request for a team February 6, 2025 21:52
@steveburnett steveburnett self-assigned this Feb 6, 2025
@github-actions github-actions bot added the docs label Feb 6, 2025
@majetideepak
Copy link
Collaborator

@steveburnett I have an issue here with some more details
#19112
We need to list the restrictions as well. file-based metastore does not support partitioning for example. @nmahadevuni should confirm

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For testing purposes Presto can be configured to use a local directory as a Hive
Metastore. Set the following properties in ``etc/config.properties``:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The metastore concept applies only to Hive and Lakehouse Connectors (Iceberg, Delta, and Hudi), so we should probably mention it.

Additionally, these properties are catalog properties and must be added to etc/catalog/catalog_name.properties valid only for the above connectors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, will do!

@imjalpreet
Copy link
Member

file-based meta store does not support partitioning

@majetideepak I haven't used the file-based Hive metastore much myself, but did you encounter any issues when trying to create partitioned tables?

Ideally, it should be possible and I can see that partition-specific metadata calls are implemented even for the FileHiveMetastore.

https://github.com/prestodb/presto/blob/master/presto-hive-metastore/src/main/java/com/facebook/presto/hive/metastore/file/FileHiveMetastore.java

@ethanyzhang
Copy link
Contributor

@imjalpreet should this be in the hive.properties or config.properties?

@imjalpreet
Copy link
Member

should this be in the hive.properties or config.properties?

@ethanyzhang These should be in the catalog properties file. I also added a review comment above.

@steveburnett
Copy link
Contributor Author

steveburnett commented Feb 6, 2025

@steveburnett I have an issue here with some more details #19112 We need to list the restrictions as well. file-based metastore does not support partitioning for example. @nmahadevuni should confirm

Thanks for the additional information @majetideepak!

With the new information in #19112, I am going to move this topic from where I initially put it in this PR as a small topic in Deploying Presto.

I thought about moving it into the Hive Connector doc, but as it is relevant to "Hive and Lakehouse Connectors (Iceberg, Delta, and Hudi)" I think I will move it to its own page in /installation and include Deepak's instructions how to use it, which will be a big help to readers.

Copy link
Member

@imjalpreet imjalpreet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make more sense to include this in the connectors section?

Right now, metastore properties are spread across different parts of the Hive Connector documentation, mainly in Hive Configuration Properties, Metastore Configuration Properties, and Glue Configuration Properties.

Perhaps we could extract the metastore-related documentation into a separate subsection, similar to Hive Security. This would make sense since metastore properties are relevant not just to Hive, but also to other connectors like Iceberg, Delta, and Hudi.

@majetideepak
Copy link
Collaborator

Ideally, it should be possible and I can see that partition-specific metadata calls are implemented even for the FileHiveMetastore.

@imjalpreet I remember seeing some issue with partitions and file metastore. It could be due to my setup. I think its good to test this once before documenting.

@hantangwangd
Copy link
Member

I use file bases HMS for development environment, just confirmed again that it supports Hive partitioned tables. Or maybe you encountered some specific problems when using partitioned tables with file bases HMS @majetideepak?

Configuring a File-Based Metastore
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For testing purposes Presto can be configured to use a local directory as a Hive
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also mentioned that it is also very suitable for developing purpose?

@majetideepak
Copy link
Collaborator

@hantangwangd If you use it, then it's good. My issue was likely related to my setup.

hive.metastore.catalog.dir=file:///<catalog-dir>

Replace ``<catalog-dir>`` in the example with the path to a directory on the
local filesystem that is relative to the Presto installation directory
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The path need not be relative to the installation directory. It can be any path on the local filesystem.

@nmahadevuni
Copy link
Member

@majetideepak I have used partitioned tables too recently with file based HMS. I didn't have any issues.

@steveburnett steveburnett force-pushed the steveburnett-file-based-hms branch from 03c832f to 35be659 Compare February 7, 2025 16:52
@steveburnett steveburnett changed the title [docs] Add file-based metastore config to deployment.rst [docs] Add documentation for file-based metastore Feb 7, 2025
@steveburnett
Copy link
Contributor Author

Hi everyone, thanks for your feedback! I took @imjalpreet's suggestion and moved this to a separate page in /connector, corrected the directory path that @nmahadevuni noted, added text describing the supported connectors and corrected the instructions for the properties file to the connector property file, and added usage examples from @majetideepak's #19112.

I welcome everyone to review again and comment with new corrections and additions.


For testing or developing purposes, Presto can be configured to use a filesystem
directory as a Hive Metastore. This can be a directory on the local filesystem
or a non-local file system such as Amazon S3.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is currently supported only on local file system. S3 is not supported. Please remove this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I sourced that from #19112, but if it's not supported then I can remove it, yes.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I verified everything in #19112. Does not hurt to test again since its been a while.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently it do not support HDFS or S3, but we can make some small changes to let it support all of them for test or development environment (that is to say, not safe in production environment).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, that helps!

If HDFS and S3 are not currently supported, then I will remove the reference to S3 and write the doc as "supported only on local file system". When the "changes [are made] to let it support all of them for test or development environment", I can update the doc then to add mention that HDFS and S3 are optional for test or dev only.


CREATE SCHEMA hive.warehouse;

This query creates a folder as ``/data/hive_data/warehouse``.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's specify that this happens if the user gave /data/hive_data/ as the path for hive.metastore.catalog.dir

for other file formats such as CSV, the user must either

* manually specify the schema as shown in the example above
* provide ``.prestoSchema`` and ``.prestoPermissions`` files
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did we verify this with a test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took this content from #19112, didn't test it personally. @majetideepak, does this work as described?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested this. I also now remember that this is where partitioning gets tricky.

@majetideepak
Copy link
Collaborator

@steveburnett, @hantangwangd, @nmahadevuni
The file-based metastore approach is also useful for reading existing files.
I tested all the content in #19112
It does not cover partitioning. We could cover partitioning in a separate PR.

@nmahadevuni
Copy link
Member

Since we are mentioning one can provide .prestoSchema and .prestoPermissions for existing data, is there any format or steps on how to create these? @majetideepak

@steveburnett steveburnett force-pushed the steveburnett-file-based-hms branch from 35be659 to a0a7ea5 Compare February 7, 2025 18:13
@steveburnett
Copy link
Contributor Author

Hi everyone! I revised based on the feedback from @nmahadevuni, @majetideepak, and @hantangwangd. PTAL when you can.

Specific open questions I haven't addressed in this update:

  • what I should say or remove from the current draft about partitioning. Should I add a sentence in the Overview "Partitioning with file-based metastores is not supported."?

  • what I should add about "provide .prestoSchema and .prestoPermissions files", or remove that line entirely?

@majetideepak
Copy link
Collaborator

majetideepak commented Feb 7, 2025

@steveburnett Let's remove Reading Existing Data Files with a File-based Metastore from this PR since both the open questions are related to that.
I believe @imjalpreet mentioned that partitioning works when you create the table with file based metastore.

@steveburnett steveburnett force-pushed the steveburnett-file-based-hms branch from a0a7ea5 to a915d6f Compare February 7, 2025 21:38
@steveburnett
Copy link
Contributor Author

@steveburnett Let's remove Reading Existing Data Files with a File-based Metastore from this PR since both the open questions are related to that. I believe @imjalpreet mentioned that partitioning works when you create the table with file based metastore.

Done and done, thanks! PTAL.

Copy link
Member

@hantangwangd hantangwangd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the work @steveburnett.

@nmahadevuni
Copy link
Member

Thank you @steveburnett.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@steveburnett should we create a dedicated sub-section for Hive Metastore instead of just “File Hive Metastore”? We could consolidate all metastore-related information from the Hive connector documentation into this section to keep it organized and less scattered.

* :doc:`/connector/hudi`
* :doc:`/connector/iceberg`

Partitioned tables are supported when created in the file-based metastore.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe this can be removed since it's expected behaviour, we could have mentioned if it wasn't supported.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
docs from:IBM PR from IBM
Projects
Status: 🆕 Unprioritized
Development

Successfully merging this pull request may close these issues.

8 participants