
Commit febfd4d: fix

vagetablechicken committed Jan 30, 2024 (1 parent: 6989179)

Showing 6 changed files with 46 additions and 15 deletions.
docs/en/integration/offline_data_sources/hive.md (1 addition, 1 deletion)
@@ -122,7 +122,7 @@ LOAD DATA INFILE 'hive://db1.t1' INTO TABLE db1.t1 OPTIONS(deep_copy=true, sql='

Exporting data to Hive sources is facilitated through the API [`SELECT INTO`](../../openmldb_sql/dql/SELECT_INTO_STATEMENT.md), which employs a distinct URI format, `hive://[db].table`, to seamlessly transfer data to the Hive data warehouse. Here are some key considerations:

- - If you omit specifying a database name, the default database name used will be `default_Db`.
+ - If you omit the Hive database name, the default Hive database `default` will be used.
- When a database name is explicitly provided, that database must already exist; the system does not currently support automatic creation of non-existent databases.
- If the designated Hive table name is absent, the system will automatically create a table with that name in Hive.
- Among the `OPTIONS` parameters, only the export `mode` takes effect; other parameters have no influence. See the sketch below.
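A minimal sketch (table and column names are illustrative, not from the original document; the `mode` value follows the `SELECT INTO` documentation):

```sql
-- Export the result of an OpenMLDB query into Hive table db1.t1;
-- mode='overwrite' replaces any existing data at the target.
SELECT c1, c2 FROM db1.t1 INTO OUTFILE 'hive://db1.t1' OPTIONS(mode='overwrite');
```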
docs/en/integration/offline_data_sources/iceberg.md (4 additions, 4 deletions)
@@ -90,11 +90,11 @@ LOAD DATA INFILE 'iceberg://hive_prod.db1.t1' INTO TABLE db1.t1 OPTIONS(deep_cop

## Export OpenMLDB Data to Iceberg

- Exporting data to Hive sources is facilitated through the API [`SELECT INTO`](../../openmldb_sql/dql/SELECT_INTO_STATEMENT.md), which employs a distinct URI format, `iceberg://[catalog].[db].table`, to seamlessly transfer data to the Iceberg data warehouse. Here are some key considerations:
+ Exporting data to Iceberg sources is facilitated through the API [`SELECT INTO`](../../openmldb_sql/dql/SELECT_INTO_STATEMENT.md), which employs a distinct URI format, `iceberg://[catalog].[db].table`, to seamlessly transfer data to the Iceberg data warehouse. Here are some key considerations:

- - If you omit specifying a database name, the default database name used will be `default_Db`. TODO?
- - When a database name is explicitly provided, it's imperative that the database already exists. Currently, the system does not support the automatic creation of non-existent databases.
- - In the event that the designated Hive table name is absent, the system will automatically generate a table with the corresponding name within the Hive environment.
+ - If you omit the Iceberg database name, the default Iceberg database `default` will be used.
+ - When an Iceberg database name is explicitly provided, that database must already exist; the system does not currently support automatic creation of non-existent databases.
+ - If the designated Iceberg table name is absent, the system will automatically create a table with that name in Iceberg.
- Among the `OPTIONS` parameters, only the export `mode` takes effect; other parameters have no influence.

For example:
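A minimal sketch (catalog, database, and table names are illustrative, reusing the `hive_prod` catalog shown earlier in the document):

```sql
-- Export the query result into Iceberg table db1.t1 under the hive_prod catalog.
SELECT c1, c2 FROM t1 INTO OUTFILE 'iceberg://hive_prod.db1.t1' OPTIONS(mode='overwrite');
```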
docs/zh/integration/offline_data_sources/hive.md (2 additions, 2 deletions)
@@ -33,7 +33,7 @@
- taskmanager.properties: add `spark.hadoop.hive.metastore.uris=thrift://...` to the `spark.default.conf` configuration item, then restart the taskmanager.
- CLI: add this configuration item to the ini conf and launch the CLI with `--spark_conf`; see [Spark Client Configuration File](../../reference/client_config/client_spark_config.md).

- - hive-site.xml: you place the Hive configuration file `hive-site.xml` into the Spark home's `conf/` (if the `HADOOP_CONF_DIR` environment variable is configured, the file can also be placed in `HADOOP_CONF_DIR`). Sample `hive-site.xml`:
+ - hive-site.xml: you can place the Hive configuration file `hive-site.xml` into the Spark home's `conf/` (if the `HADOOP_CONF_DIR` environment variable is configured, the file can also be placed in `HADOOP_CONF_DIR`). Sample `hive-site.xml`:

```xml
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <!-- illustrative value: point this at the Hive metastore in your cluster -->
    <value>thrift://localhost:9083</value>
  </property>
</configuration>
```

@@ -122,7 +122,7 @@ LOAD DATA INFILE 'hive://db1.t1' INTO TABLE db1.t1 OPTIONS(deep_copy=true, sql='

Export to Hive data sources is supported through the [`SELECT INTO`](../../openmldb_sql/dql/SELECT_INTO_STATEMENT.md) API, which uses the special URI format `hive://[db].table` to export to the Hive data warehouse. Note:

- - If no database name is specified, the default database name `default_db` is used
+ - If no Hive database name is specified, the Hive default database `default` is used
- If a database name is specified, that database must already exist; automatic creation of non-existent databases is currently not supported
- If the specified Hive table name does not exist, a table with that name is automatically created in Hive
- Among the `OPTIONS` parameters, only the export `mode` takes effect; the other parameters have no effect (see the sketch below)
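A minimal sketch (table names are illustrative; `mode` values beyond the default `error_if_exists` are assumed from the `SELECT INTO` documentation):

```sql
-- Append the query result to Hive table db1.t1 instead of overwriting it.
SELECT c1, c2 FROM db1.t1 INTO OUTFILE 'hive://db1.t1' OPTIONS(mode='append');
```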
docs/zh/integration/offline_data_sources/iceberg.md (35 additions, 5 deletions)
@@ -41,12 +41,42 @@ spark.default.conf=spark.sql.catalog.rest_prod=org.apache.iceberg.spark.SparkCat

For the complete Iceberg catalog configuration, see [Iceberg Catalog Configuration](https://iceberg.apache.org/docs/latest/spark-configuration/).

Once any of these configurations succeeds, access Iceberg tables using the format `<catalog_name>.<db_name>.<table_name>`. If you don't want to use the `<catalog_name>` prefix, set `spark.sql.catalog.default=<catalog_name>` in the configuration. You can also add `spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog` and `spark.sql.catalog.spark_catalog.type=hive` to merge the Iceberg catalog into the Spark catalog (non-Iceberg tables still live in the Spark catalog), so that Iceberg tables can be accessed as `<db_name>.<table_name>`. A sketch of this variant follows.
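A minimal sketch of the session-catalog variant in `taskmanager.properties` (assuming `spark.default.conf` accepts semicolon-separated Spark options, as in the earlier example):

```properties
# Merge the Iceberg catalog into the Spark session catalog so tables resolve as <db_name>.<table_name>
spark.default.conf=spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog;spark.sql.catalog.spark_catalog.type=hive
```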

### Debugging Information

When importing data from Iceberg, you can check the task logs to confirm whether the task actually read the source data.
After successfully connecting to the Iceberg Hive Catalog, you should see messages similar to the following in the logs:

```
24/01/30 09:01:05 INFO SharedState: Setting hive.metastore.warehouse.dir ('hdfs://namenode:19000/user/hive/warehouse') to the value of spark.sql.warehouse.dir.
24/01/30 09:01:05 INFO SharedState: Warehouse path is 'hdfs://namenode:19000/user/hive/warehouse'.
...
24/01/30 09:01:06 INFO HiveUtils: Initializing HiveMetastoreConnection version 2.3.9 using Spark classes.
24/01/30 09:01:06 INFO HiveClientImpl: Warehouse location for Hive client (version 2.3.9) is hdfs://namenode:19000/user/hive/warehouse
24/01/30 09:01:06 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
24/01/30 09:01:06 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
24/01/30 09:01:06 INFO HiveMetaStore: 0: Opening raw store with implementation class:org.apache.hadoop.hive.metastore.ObjectStore
24/01/30 09:01:06 INFO ObjectStore: ObjectStore, initialize called
24/01/30 09:01:06 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
24/01/30 09:01:06 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
24/01/30 09:01:07 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
24/01/30 09:01:07 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is POSTGRES
24/01/30 09:01:07 INFO ObjectStore: Initialized ObjectStore
24/01/30 09:01:08 INFO HiveMetaStore: Added admin role in metastore
24/01/30 09:01:08 INFO HiveMetaStore: Added public role in metastore
24/01/30 09:01:08 INFO HiveMetaStore: No user is added in admin role, since config is empty
24/01/30 09:01:08 INFO HiveMetaStore: 0: get_database: default
```

When exporting to Iceberg, check the task logs; you should see messages similar to the following:

```
INFO ReaderImpl: Reading ORC rows from
24/01/30 09:57:29 INFO AtomicReplaceTableAsSelectExec: Start processing data source write support: IcebergBatchWrite(table=nyc.taxis_out, format=PARQUET). The input RDD has 1 partitions.
...
24/01/30 09:57:31 INFO AtomicReplaceTableAsSelectExec: Data source write support IcebergBatchWrite(table=nyc.taxis_out, format=PARQUET) committed.
...
24/01/30 09:57:31 INFO HiveTableOperations: Committed to table hive_prod.nyc.taxis_out with the new metadata location hdfs://namenode:19000/user/hive/iceberg_storage/nyc.db/taxis_out/metadata/00001-038d8b81-04a6-4a19-bb83-275eb4664937.metadata.json
24/01/30 09:57:31 INFO BaseMetastoreTableOperations: Successfully committed to table hive_prod.nyc.taxis_out in 224 ms
```

## Data Format
@@ -90,9 +90,9 @@ LOAD DATA INFILE 'iceberg://hive_prod.db1.t1' INTO TABLE t1 OPTIONS(deep_copy=fa

Exporting data from OpenMLDB to an Iceberg table uses the [`SELECT INTO`](../../openmldb_sql/dql/SELECT_INTO_STATEMENT.md) statement, which uses the special URI format `iceberg://[catalog].[db].table` to export data seamlessly to the Iceberg table. Important notes:

- - If no database name is specified, the default database name `default_db` is used TODO
- - If a database name is specified, that database must already exist; automatic creation of non-existent databases is currently not supported
- - If the specified Hive table name does not exist, a table with that name is automatically created in Hive
+ - If no Iceberg database name is specified, the Iceberg default database `default` is used
+ - If an Iceberg database name is specified, that database must already exist; automatic creation of non-existent databases is currently not supported
+ - If the specified Iceberg table name does not exist, a table with that name is automatically created in Iceberg
- Among the `OPTIONS` parameters, only the export `mode` takes effect; the other parameters have no effect

For example:
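A minimal sketch (mirroring the English document; names reuse the `hive_prod` catalog configured above):

```sql
SELECT c1, c2 FROM t1 INTO OUTFILE 'iceberg://hive_prod.db1.t1' OPTIONS(mode='overwrite');
```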
docs/zh/integration/offline_data_sources/index.rst (2 additions, 1 deletion)
@@ -6,4 +6,5 @@
:maxdepth: 1

hive
- s3
+ s3
+ iceberg
TestLoadDataPlan.scala (2 additions, 2 deletions)
@@ -162,8 +162,8 @@ class TestLoadDataPlan extends SparkTestSuite with Matchers {
fail("unreachable")
}

println("deep load data with invalid format option")
a[IllegalArgumentException] should be thrownBy {
println("deep load data with invalid format option, catalog will throw exception")
a[org.apache.spark.sql.catalyst.parser.ParseException] should be thrownBy {
openmldbSession.openmldbSql(s"load data infile '$testFileWithHeader' into table $db.$table " +
"options(format='txt', mode='overwrite');")
fail("unreachable")
