
Commit febfd4d: fix

vagetablechicken committed Jan 30, 2024 (1 parent: 6989179)

Showing 6 changed files with 46 additions and 15 deletions.
docs/en/integration/offline_data_sources/hive.md (1 addition, 1 deletion)
@@ -122,7 +122,7 @@ LOAD DATA INFILE 'hive://db1.t1' INTO TABLE db1.t1 OPTIONS(deep_copy=true, sql='

Exporting data to Hive sources is facilitated through the API [`SELECT INTO`](../../openmldb_sql/dql/SELECT_INTO_STATEMENT.md), which employs a distinct URI format, `hive://[db].table`, to seamlessly transfer data to the Hive data warehouse. Here are some key considerations:

- - If you omit specifying a database name, the default database name used will be `default_Db`.
+ - If you omit the Hive database name, the default Hive database `default` will be used.
- When a database name is explicitly provided, that database must already exist; the system does not currently support automatic creation of non-existent databases.
- If the designated Hive table name is absent, the system will automatically create a table with that name in Hive.
- Among the `OPTIONS` parameters, only the export `mode` takes effect; other parameters have no influence. See the sketch below.
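A minimal sketch (table and column names are illustrative, not from the original document; the `mode` value follows the `SELECT INTO` documentation):

```sql
-- Export the result of an OpenMLDB query into Hive table db1.t1;
-- mode='overwrite' replaces any existing data at the target.
SELECT c1, c2 FROM db1.t1 INTO OUTFILE 'hive://db1.t1' OPTIONS(mode='overwrite');
```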
docs/en/integration/offline_data_sources/iceberg.md (4 additions, 4 deletions)
@@ -90,11 +90,11 @@ LOAD DATA INFILE 'iceberg://hive_prod.db1.t1' INTO TABLE db1.t1 OPTIONS(deep_cop

## Export OpenMLDB Data to Iceberg

- Exporting data to Hive sources is facilitated through the API [`SELECT INTO`](../../openmldb_sql/dql/SELECT_INTO_STATEMENT.md), which employs a distinct URI format, `iceberg://[catalog].[db].table`, to seamlessly transfer data to the Iceberg data warehouse. Here are some key considerations:
+ Exporting data to Iceberg sources is facilitated through the API [`SELECT INTO`](../../openmldb_sql/dql/SELECT_INTO_STATEMENT.md), which employs a distinct URI format, `iceberg://[catalog].[db].table`, to seamlessly transfer data to the Iceberg data warehouse. Here are some key considerations:

- - If you omit specifying a database name, the default database name used will be `default_Db`. TODO?
- - When a database name is explicitly provided, it's imperative that the database already exists. Currently, the system does not support the automatic creation of non-existent databases.
- - In the event that the designated Hive table name is absent, the system will automatically generate a table with the corresponding name within the Hive environment.
+ - If you omit the Iceberg database name, the default Iceberg database `default` will be used.
+ - When an Iceberg database name is explicitly provided, that database must already exist; the system does not currently support automatic creation of non-existent databases.
+ - If the designated Iceberg table name is absent, the system will automatically create a table with that name in Iceberg.
- Among the `OPTIONS` parameters, only the export `mode` takes effect; other parameters have no influence.

For example:
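A minimal sketch (catalog, database, and table names are illustrative, reusing the `hive_prod` catalog shown earlier in the document):

```sql
-- Export the query result into Iceberg table db1.t1 under the hive_prod catalog.
SELECT c1, c2 FROM t1 INTO OUTFILE 'iceberg://hive_prod.db1.t1' OPTIONS(mode='overwrite');
```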
docs/zh/integration/offline_data_sources/hive.md (2 additions, 2 deletions)
@@ -33,7 +33,7 @@
- taskmanager.properties: add `spark.hadoop.hive.metastore.uris=thrift://...` to the `spark.default.conf` configuration item, then restart the taskmanager.
- CLI: add this configuration item to the ini conf and launch the CLI with `--spark_conf`; see [Spark Client Configuration File](../../reference/client_config/client_spark_config.md).

- - hive-site.xml: you place the Hive configuration file `hive-site.xml` into the Spark home's `conf/` (if the `HADOOP_CONF_DIR` environment variable is configured, the file can also be placed in `HADOOP_CONF_DIR`). Sample `hive-site.xml`:
+ - hive-site.xml: you can place the Hive configuration file `hive-site.xml` into the Spark home's `conf/` (if the `HADOOP_CONF_DIR` environment variable is configured, the file can also be placed in `HADOOP_CONF_DIR`). Sample `hive-site.xml`:

```xml
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <!-- illustrative value: point this at the Hive metastore in your cluster -->
    <value>thrift://localhost:9083</value>
  </property>
</configuration>
```

@@ -122,7 +122,7 @@ LOAD DATA INFILE 'hive://db1.t1' INTO TABLE db1.t1 OPTIONS(deep_copy=true, sql='

Export to Hive data sources is supported through the [`SELECT INTO`](../../openmldb_sql/dql/SELECT_INTO_STATEMENT.md) API, which uses the special URI format `hive://[db].table` to export to the Hive data warehouse. Note:

- - If no database name is specified, the default database name `default_db` is used
+ - If no Hive database name is specified, the Hive default database `default` is used
- If a database name is specified, that database must already exist; automatic creation of non-existent databases is currently not supported
- If the specified Hive table name does not exist, a table with that name is automatically created in Hive
- Among the `OPTIONS` parameters, only the export `mode` takes effect; the other parameters have no effect (see the sketch below)
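A minimal sketch (table names are illustrative; `mode` values beyond the default `error_if_exists` are assumed from the `SELECT INTO` documentation):

```sql
-- Append the query result to Hive table db1.t1 instead of overwriting it.
SELECT c1, c2 FROM db1.t1 INTO OUTFILE 'hive://db1.t1' OPTIONS(mode='append');
```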
docs/zh/integration/offline_data_sources/iceberg.md (35 additions, 5 deletions)
@@ -41,12 +41,42 @@ spark.default.conf=spark.sql.catalog.rest_prod=org.apache.iceberg.spark.SparkCat

For the complete Iceberg catalog configuration, see [Iceberg Catalog Configuration](https://iceberg.apache.org/docs/latest/spark-configuration/).

Once any of these configurations succeeds, access Iceberg tables using the format `<catalog_name>.<db_name>.<table_name>`. If you don't want to use the `<catalog_name>` prefix, set `spark.sql.catalog.default=<catalog_name>` in the configuration. You can also add `spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog` and `spark.sql.catalog.spark_catalog.type=hive` to merge the Iceberg catalog into the Spark catalog (non-Iceberg tables still live in the Spark catalog), so that Iceberg tables can be accessed as `<db_name>.<table_name>`. A sketch of this variant follows.
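A minimal sketch of the session-catalog variant in `taskmanager.properties` (assuming `spark.default.conf` accepts semicolon-separated Spark options, as in the earlier example):

```properties
# Merge the Iceberg catalog into the Spark session catalog so tables resolve as <db_name>.<table_name>
spark.default.conf=spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog;spark.sql.catalog.spark_catalog.type=hive
```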

### Debugging Information

When importing data from Iceberg, you can check the task logs to confirm whether the task actually read the source data.
After successfully connecting to the Iceberg Hive Catalog, you should see messages similar to the following in the logs:

```
24/01/30 09:01:05 INFO SharedState: Setting hive.metastore.warehouse.dir ('hdfs://namenode:19000/user/hive/warehouse') to the value of spark.sql.warehouse.dir.
24/01/30 09:01:05 INFO SharedState: Warehouse path is 'hdfs://namenode:19000/user/hive/warehouse'.
...
24/01/30 09:01:06 INFO HiveUtils: Initializing HiveMetastoreConnection version 2.3.9 using Spark classes.
24/01/30 09:01:06 INFO HiveClientImpl: Warehouse location for Hive client (version 2.3.9) is hdfs://namenode:19000/user/hive/warehouse
24/01/30 09:01:06 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
24/01/30 09:01:06 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
24/01/30 09:01:06 INFO HiveMetaStore: 0: Opening raw store with implementation class:org.apache.hadoop.hive.metastore.ObjectStore
24/01/30 09:01:06 INFO ObjectStore: ObjectStore, initialize called
24/01/30 09:01:06 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
24/01/30 09:01:06 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
24/01/30 09:01:07 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
24/01/30 09:01:07 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is POSTGRES
24/01/30 09:01:07 INFO ObjectStore: Initialized ObjectStore
24/01/30 09:01:08 INFO HiveMetaStore: Added admin role in metastore
24/01/30 09:01:08 INFO HiveMetaStore: Added public role in metastore
24/01/30 09:01:08 INFO HiveMetaStore: No user is added in admin role, since config is empty
24/01/30 09:01:08 INFO HiveMetaStore: 0: get_database: default
```

When exporting to Iceberg, check the task logs; you should see messages similar to the following:

```
INFO ReaderImpl: Reading ORC rows from
24/01/30 09:57:29 INFO AtomicReplaceTableAsSelectExec: Start processing data source write support: IcebergBatchWrite(table=nyc.taxis_out, format=PARQUET). The input RDD has 1 partitions.
...
24/01/30 09:57:31 INFO AtomicReplaceTableAsSelectExec: Data source write support IcebergBatchWrite(table=nyc.taxis_out, format=PARQUET) committed.
...
24/01/30 09:57:31 INFO HiveTableOperations: Committed to table hive_prod.nyc.taxis_out with the new metadata location hdfs://namenode:19000/user/hive/iceberg_storage/nyc.db/taxis_out/metadata/00001-038d8b81-04a6-4a19-bb83-275eb4664937.metadata.json
24/01/30 09:57:31 INFO BaseMetastoreTableOperations: Successfully committed to table hive_prod.nyc.taxis_out in 224 ms
```

## Data Format
@@ -90,9 +90,9 @@ LOAD DATA INFILE 'iceberg://hive_prod.db1.t1' INTO TABLE t1 OPTIONS(deep_copy=fa

Exporting data from OpenMLDB to an Iceberg table uses the [`SELECT INTO`](../../openmldb_sql/dql/SELECT_INTO_STATEMENT.md) statement, which uses the special URI format `iceberg://[catalog].[db].table` to export data seamlessly to the Iceberg table. Important notes:

- - If no database name is specified, the default database name `default_db` is used TODO
- - If a database name is specified, that database must already exist; automatic creation of non-existent databases is currently not supported
- - If the specified Hive table name does not exist, a table with that name is automatically created in Hive
+ - If no Iceberg database name is specified, the Iceberg default database `default` is used
+ - If an Iceberg database name is specified, that database must already exist; automatic creation of non-existent databases is currently not supported
+ - If the specified Iceberg table name does not exist, a table with that name is automatically created in Iceberg
- Among the `OPTIONS` parameters, only the export `mode` takes effect; the other parameters have no effect

For example:
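A minimal sketch (mirroring the English document; names reuse the `hive_prod` catalog configured above):

```sql
SELECT c1, c2 FROM t1 INTO OUTFILE 'iceberg://hive_prod.db1.t1' OPTIONS(mode='overwrite');
```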
docs/zh/integration/offline_data_sources/index.rst (2 additions, 1 deletion)
@@ -6,4 +6,5 @@
:maxdepth: 1

hive
- s3
+ s3
+ iceberg
TestLoadDataPlan.scala (2 additions, 2 deletions)
@@ -162,8 +162,8 @@ class TestLoadDataPlan extends SparkTestSuite with Matchers {
fail("unreachable")
}

println("deep load data with invalid format option")
a[IllegalArgumentException] should be thrownBy {
println("deep load data with invalid format option, catalog will throw exception")
a[org.apache.spark.sql.catalyst.parser.ParseException] should be thrownBy {
openmldbSession.openmldbSql(s"load data infile '$testFileWithHeader' into table $db.$table " +
"options(format='txt', mode='overwrite');")
fail("unreachable")
