feat(java): support set options for spark datasource api #3366

SaintBacchus · 2025-01-10T09:46:14Z

Use option to set options for the Lance Spark data source, such as the read version:

spark
    .read()
    .format("lance")
    .option("version", "1")
    .load(LanceConfig.getDatasetUri(dbPath, TestUtils.TestTable1Config.datasetName));

Why do it this way
The Spark data source API converts the LanceIdentifier into an IdentifierImpl, and IdentifierImpl only carries the namespace and the name, without the options set on the data source.
How it works
Inspired by Iceberg, which puts the options in the name, I designed the LanceIdentifier name with the format name#key1=value1&key2=value2. These options only set the read and write options. Storage options should be set in the Spark configuration rather than through option, since the AK/SK format is complex.
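To make the name#key1=value1&key2=value2 format concrete, here is a minimal sketch of splitting such a name into a dataset path and an options map, and of building it back. The class LanceUriOptions, its methods, and the batch_size key are hypothetical names used only for illustration; they are not taken from this PR. The sketch assumes that the first '#' starts the options and that keys and values contain no '#', '&' or '=' (no escaping is handled).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not the class from this PR.
// It splits "path#key1=value1&key2=value2" into a path and an options map.
final class LanceUriOptions {
  final String path;
  final Map<String, String> options;

  private LanceUriOptions(String path, Map<String, String> options) {
    this.path = path;
    this.options = options;
  }

  // Assumption: the first '#' starts the options; no escaping is supported.
  static LanceUriOptions parse(String name) {
    Map<String, String> options = new LinkedHashMap<>();
    int hash = name.indexOf('#');
    if (hash < 0) {
      return new LanceUriOptions(name, options); // plain path, no options
    }
    String path = name.substring(0, hash);
    for (String pair : name.substring(hash + 1).split("&")) {
      if (pair.isEmpty()) {
        continue; // tolerate "path#" and stray '&'
      }
      int eq = pair.indexOf('=');
      if (eq <= 0) {
        throw new IllegalArgumentException("Malformed option: " + pair);
      }
      options.put(pair.substring(0, eq), pair.substring(eq + 1));
    }
    return new LanceUriOptions(path, options);
  }

  // Inverse of parse: appends the options to the path in map iteration order.
  static String format(String path, Map<String, String> options) {
    if (options.isEmpty()) {
      return path;
    }
    StringBuilder sb = new StringBuilder(path).append('#');
    boolean first = true;
    for (Map.Entry<String, String> e : options.entrySet()) {
      if (!first) {
        sb.append('&');
      }
      sb.append(e.getKey()).append('=').append(e.getValue());
      first = false;
    }
    return sb.toString();
  }
}
```

With this sketch, LanceUriOptions.parse("/tmp/db/table1.lance#version=1") yields the path /tmp/db/table1.lance and a single option version=1, which a reader could then apply when opening the dataset.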

SaintBacchus · 2025-01-10T09:49:09Z

java/spark/src/main/java/com/lancedb/lance/spark/LanceIdentifier.java

+      System.arraycopy(namespace, 0, this.namespace, 0, namespace.length);
+      this.namespace[namespace.length] = SEPARATOR;
+      int i = namespace.length + 1;
+      for (Map.Entry<String, String> entry : options.entrySet()) {


Since Iceberg also needs to put the options in the path, LanceIdentifier puts the options in the namespace.
https://github.com/apache/iceberg/blob/fc923b3af65b0e3cb28a9afb69f7fd05c88f62ca/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/IcebergSource.java#L129

@chenkovsky

If the following test can be supported, it will save my day: we could read Lance with pure Spark SQL.

@Test
void versionInSQL() {
  String uri = LanceConfig.getDatasetUri(dbPath, TestUtils.TestTable1Config.datasetName);
  uri += "#####version=1";
  String sql = "SELECT * FROM lance.`" + uri + "`";
  Dataset<Row> df = spark.sql(sql);
  assertEquals(2, df.count());
}

Is this a common use of the Spark data source?

I don't know, but I think it is reasonable. I would also like to hear others' advice.

wjones127 · 2025-01-10T16:47:30Z

java/spark/src/main/java/com/lancedb/lance/spark/LanceIdentifier.java

+ * dataset URI and the namespace. The namespace is an array of strings, which contains the namespace
+ * of the dataset and the options. The options are key-value pairs, which are separated by "#####".


What is the purpose of the namespace? Why does it store a serialized copy of the options?

The namespace stands for the catalog and database name. In this case, the catalog's load-table and create-table paths convert the LanceIdentifier into an IdentifierImpl, which only has a namespace and a name. Iceberg's approach is to add the options to the path stored in the name; I store the options in the namespace.
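As an illustration of the namespace approach described above, here is a rough sketch of how options could be appended to a namespace array after a "#####" marker and recovered later. The class NamespaceOptions and its methods are hypothetical, and storing one "key=value" string per array element is an assumption; the diff excerpt in this thread only shows the namespace being copied, the separator being appended, and a loop over the option entries.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not the code from this PR.
final class NamespaceOptions {
  static final String SEPARATOR = "#####";

  // Layout: [namespace..., "#####", "key1=value1", "key2=value2", ...]
  static String[] encode(String[] namespace, Map<String, String> options) {
    String[] out = Arrays.copyOf(namespace, namespace.length + 1 + options.size());
    out[namespace.length] = SEPARATOR;
    int i = namespace.length + 1;
    for (Map.Entry<String, String> entry : options.entrySet()) {
      // Assumption: each option is stored as a single "key=value" element.
      out[i++] = entry.getKey() + "=" + entry.getValue();
    }
    return out;
  }

  // Reads back the options that follow the separator; empty if there is none.
  static Map<String, String> decode(String[] encoded) {
    Map<String, String> options = new LinkedHashMap<>();
    int sep = Arrays.asList(encoded).indexOf(SEPARATOR);
    if (sep < 0) {
      return options;
    }
    for (int i = sep + 1; i < encoded.length; i++) {
      int eq = encoded[i].indexOf('=');
      if (eq <= 0) {
        throw new IllegalArgumentException("Malformed option: " + encoded[i]);
      }
      options.put(encoded[i].substring(0, eq), encoded[i].substring(eq + 1));
    }
    return options;
  }
}
```

Because the options travel inside the namespace array, they survive the conversion to an identifier that keeps only a namespace and a name, at the cost of mixing two unrelated concepts in one field.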

Maybe we can wait until the Lance Catalog design lands?

The data source API does not depend on the catalog.
AI users may not use the catalog.

OK, I checked: this file already existed and was not introduced recently. I will review it later.

yanghua

Left some comments.

yanghua · 2025-01-13T11:21:30Z

java/spark/src/main/java/com/lancedb/lance/spark/LanceIdentifier.java

 public class LanceIdentifier implements Identifier {
-  private final String[] namespace = new String[] {"default"};
+  public static final String SEPARATOR = "#####";


Is ##### a customary separator or delimiter in the Spark ecosystem? We may need a comment or remark describing its purpose.

yanghua · 2025-01-13T11:33:41Z

java/spark/src/test/java/com/lancedb/lance/spark/read/SparkConnectorReadTest.java

+            .format("lance")
+            .option("version", "1")
+            .load(LanceConfig.getDatasetUri(dbPath, TestUtils.TestTable1Config.datasetName));
+    assertEquals(2, df.count());


Can we add a description of why the expected count is 2, to make the assertion more readable? Did the first version write 2 records?

SaintBacchus · 2025-01-14T14:09:05Z

Maybe this is not a good way to implement this. Marking it as draft first.

SaintBacchus · 2025-01-24T03:14:08Z

@yanghua @wjones127 @chenkovsky please review again. I now put the read and write options in the Lance URI and removed them from the namespace, as Iceberg does.

SaintBacchus requested a review from yanghua January 10, 2025 09:46

github-actions bot added the enhancement and java labels Jan 10, 2025

SaintBacchus commented Jan 10, 2025

wjones127 reviewed Jan 10, 2025

yanghua reviewed Jan 13, 2025

SaintBacchus marked this pull request as draft January 14, 2025 02:59

SaintBacchus added 3 commits January 24, 2025 10:59

feat(java): support set options for spark datasource api

62d36c6

fix java ut

8359699

set options in url

78ea88a

SaintBacchus force-pushed the LanceSparkId branch from 6df2f49 to 78ea88a Compare January 24, 2025 02:59

SaintBacchus marked this pull request as ready for review January 24, 2025 03:12

SaintBacchus mentioned this pull request Jan 24, 2025

Improve spark data source for lance #3260

Open

19 tasks
