
feat(java): support set options for spark datasource api #3366

Open
wants to merge 3 commits into base: main

Conversation


@SaintBacchus SaintBacchus commented Jan 10, 2025

Use option() to set options for the Lance Spark datasource, for example reading a specific version:

spark
    .read()
    .format("lance")
    .option("version", "1")
    .load(LanceConfig.getDatasetUri(dbPath, TestUtils.TestTable1Config.datasetName));
  • Why do it this way
    The Spark data source API converts the LanceIdentifier into an IdentifierImpl, and IdentifierImpl only carries namespaces and a name, so the options passed to the data source are lost.

  • How to do it
    Inspired by Iceberg, which puts the options in the name, I designed the LanceIdentifier name with the format name#key1=value1&key2=value2 (see the sketch below). These options only cover read and write options. Storage options should still be set in the Spark configuration rather than via option, since the AK/SK format is complex.
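
A minimal sketch of the name encoding described above. The helper below is hypothetical, not the PR's actual API; it only illustrates the name#key1=value1&key2=value2 convention:

  import java.util.LinkedHashMap;
  import java.util.Map;

  // Hypothetical helper, not the PR's implementation: encodes and decodes
  // options carried inside the identifier name as name#key1=value1&key2=value2.
  final class NameWithOptions {

    static String encode(String name, Map<String, String> options) {
      if (options.isEmpty()) {
        return name;
      }
      StringBuilder sb = new StringBuilder(name).append('#');
      String sep = "";
      for (Map.Entry<String, String> e : options.entrySet()) {
        sb.append(sep).append(e.getKey()).append('=').append(e.getValue());
        sep = "&";
      }
      return sb.toString();
    }

    static Map<String, String> decode(String encodedName) {
      Map<String, String> options = new LinkedHashMap<>();
      int hash = encodedName.indexOf('#');
      if (hash < 0) {
        return options;
      }
      for (String pair : encodedName.substring(hash + 1).split("&")) {
        String[] kv = pair.split("=", 2);
        options.put(kv[0], kv.length > 1 ? kv[1] : "");
      }
      return options;
    }
  }

For example, NameWithOptions.encode("s3://bucket/table.lance", Map.of("version", "1")) would yield s3://bucket/table.lance#version=1 (the path here is a placeholder).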

@SaintBacchus SaintBacchus requested a review from yanghua January 10, 2025 09:46
@github-actions github-actions bot added the enhancement and java labels Jan 10, 2025
System.arraycopy(namespace, 0, this.namespace, 0, namespace.length);
this.namespace[namespace.length] = SEPARATOR;
int i = namespace.length + 1;
for (Map.Entry<String, String> entry : options.entrySet()) {
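
For context, a rough reconstruction of what the surrounding constructor appears to do, based only on the lines quoted above (the array sizing and the loop body are assumptions, not the PR's exact code):

  // Reconstruction sketch; assumes this.namespace was allocated with
  // namespace.length + 1 + options.size() slots.
  System.arraycopy(namespace, 0, this.namespace, 0, namespace.length);
  this.namespace[namespace.length] = SEPARATOR; // "#####" marker between namespace and options
  int i = namespace.length + 1;
  for (Map.Entry<String, String> entry : options.entrySet()) {
    this.namespace[i++] = entry.getKey() + "=" + entry.getValue();
  }
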
Contributor:

If the following test can be supported, it will save my day: we could read Lance with pure Spark SQL.

  @Test
  void versionInSQL() {
    String uri = LanceConfig.getDatasetUri(dbPath, TestUtils.TestTable1Config.datasetName);
    uri += "#####version=1";
    String sql = "SELECT * FROM lance.`" + uri + "`";
    Dataset<Row> df = spark.sql(sql);
    assertEquals(2, df.count());
  }

Collaborator Author:

Is this a common usage pattern for a Spark data source?

Contributor:

I don't know, but I think this is reasonable. I would also like to hear others' advice.

Comment on lines 24 to 25
* dataset URI and the namespace. The namespace is an array of strings, which contains the namespace
* of the dataset and the options. The options are key-value pairs, which are separated by "#####".

Contributor:

What is the purpose of the namespace? Why does it store a serialized copy of the options?

Collaborator Author:

The namespace stands for the catalog and database name. In this case, the catalog's load-table and create-table paths convert the LanceIdentifier into an IdentifierImpl, which only has namespaces and a name. Iceberg's approach is to add the options to the path stored in the name; I store the options in the namespaces.
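
To illustrate the point (not the PR's code): Spark's Identifier interface exposes only a namespace and a name, so any extra state attached to a custom LanceIdentifier is lost once Spark hands back a plain identifier. The URI below is a placeholder.

  import org.apache.spark.sql.connector.catalog.Identifier;

  public class IdentifierShape {
    public static void main(String[] args) {
      // Identifier.of returns Spark's plain IdentifierImpl-backed identifier.
      Identifier id = Identifier.of(new String[] {"default"}, "s3://bucket/table.lance");
      System.out.println(String.join(".", id.namespace())); // default
      System.out.println(id.name());                        // the dataset URI
      // There is no id.options(): extra state must be encoded in the name or the namespace.
    }
  }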

Collaborator:

Maybe we can wait until the design of Lance Catalog lands?

Collaborator Author:

The datasource API does not depend on the catalog, and AI users may not use the catalog at all.

Collaborator:

Ok, I checked: this file existed before and was not introduced recently. I will review it later.

@yanghua yanghua (Collaborator) left a comment:

Left some comments.

public class LanceIdentifier implements Identifier {
private final String[] namespace = new String[] {"default"};
public static final String SEPARATOR = "#####";

Collaborator:

Is ##### a customary separator or delimiter in the Spark ecosystem? We may need some comments or remarks describing its purpose.
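
One possible wording for such a remark, offered only as a suggestion (the hash sequence is not a Spark-wide convention; it is this connector's own marker):

  /**
   * Marker inserted into the namespace array to separate the dataset's real
   * namespace from the serialized key=value read/write options that follow.
   * This is specific to the Lance Spark connector, not a Spark convention.
   */
  public static final String SEPARATOR = "#####";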

.format("lance")
.option("version", "1")
.load(LanceConfig.getDatasetUri(dbPath, TestUtils.TestTable1Config.datasetName));
assertEquals(2, df.count());

@yanghua yanghua (Collaborator) Jan 13, 2025:

Can we add some description of why the count is 2, to make the assertion more readable? Did the first version write 2 records?
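
One way to make the expectation self-describing, purely as a suggestion (assumes JUnit 5's assertEquals(expected, actual, message) overload; the message wording is made up here):

  // Suggestion only: explain the magic number in the assertion itself.
  assertEquals(2, df.count(), "version 1 of the test dataset was written with 2 records");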

@SaintBacchus SaintBacchus marked this pull request as draft January 14, 2025 02:59
@SaintBacchus (Collaborator Author):

Maybe this isn't a good way to implement this case. Marking it as a draft first.

@SaintBacchus SaintBacchus marked this pull request as ready for review January 24, 2025 03:12
@SaintBacchus (Collaborator Author) commented Jan 24, 2025:

@yanghua @wjones127 @chenkovsky please review it again now. I put the read and write options in the Lance URI and removed them from the namespaces, as Iceberg does (see the usage sketch below).
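
A rough sketch of the usage this revision aims to enable, reusing spark, Dataset, Row, and the test helpers shown earlier, and the #####key=value convention from the SQL test quoted above; the exact final URI syntax may differ from this:

  String uri = LanceConfig.getDatasetUri(dbPath, TestUtils.TestTable1Config.datasetName);

  // Options via the DataFrame API ...
  Dataset<Row> viaOption =
      spark.read().format("lance").option("version", "1").load(uri);

  // ... or encoded directly in the URI, which also works from pure Spark SQL.
  Dataset<Row> viaSql = spark.sql("SELECT * FROM lance.`" + uri + "#####version=1`");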
