Commit

docs

vagetablechicken committed Jan 30, 2024
1 parent 95502fa commit 6989179
Showing 4 changed files with 143 additions and 37 deletions.
2 changes: 1 addition & 1 deletion docs/en/integration/offline_data_sources/hive.md
@@ -102,7 +102,7 @@ Importing data from Hive sources is facilitated through the API [`LOAD DATA INFI

- Both offline and online engines are capable of importing data from Hive sources.
- The Hive data import feature supports soft connections. This approach minimizes the need for redundant data copies and ensures that OpenMLDB can access Hive's most up-to-date data at any given time. To activate the soft link mechanism for data import, utilize the `deep_copy=false` parameter.
-- The `OPTIONS` parameter offers two valid settings: `deep_copy`, `mode` and `sql`.
+- The `OPTIONS` parameter offers three valid settings: `deep_copy`, `mode` and `sql`.

For example:

70 changes: 37 additions & 33 deletions docs/en/integration/offline_data_sources/iceberg.md
@@ -2,13 +2,13 @@

## Introduction

-[Apache Iceberg](https://iceberg.apache.org/) is a table format that offers a host of features, including schema evolution, partitioning, and metadata management. OpenMLDB supports the use of Iceberg as an offline storage engine for reading and exporting feature computation data
+[Apache Iceberg](https://iceberg.apache.org/) is an open table format for huge analytic datasets. Iceberg adds tables to compute engines including Spark, Trino, PrestoDB, Flink, Hive and Impala using a high-performance table format that works just like a SQL table. OpenMLDB supports the use of Iceberg as an offline storage engine for importing data and exporting feature computation data.

## Configuration

### Installation

-For users employing [The OpenMLDB Spark Distribution Version](../../tutorial/openmldbspark_distribution.md), specifically v0.8.5 and newer iterations, the essential Iceberg dependencies are already integrated. If you are working with an alternative Spark distribution or different iceberg version, you can download the corresponding Iceberg dependencies from the [Iceberg release](https://iceberg.apache.org/releases/) and add them to the Spark classpath/jars. For example, if you are using OpenMLDB Spark, you should download `x.x.x Spark 3.2_12 runtime Jar`(x.x.x is iceberg version) and add it to `jars/` in Spark home.
+For users employing [The OpenMLDB Spark Distribution Version](../../tutorial/openmldbspark_distribution.md), specifically v0.8.5 and newer iterations, the essential Iceberg 1.4.3 dependencies are already integrated. If you are working with an alternative Spark distribution or a different Iceberg version, you can download the corresponding Iceberg dependencies from the [Iceberg release](https://iceberg.apache.org/releases/) page and add them to the Spark classpath/jars. For example, if you are using OpenMLDB Spark, you should download the `x.x.x Spark 3.2_12 runtime Jar` (where x.x.x is the Iceberg version) and add it to `jars/` in the Spark home.

### Configuration

@@ -25,58 +25,62 @@ For example, set hive catalog in `taskmanager.properties(.template)`:
spark.default.conf=spark.sql.catalog.hive_prod=org.apache.iceberg.spark.SparkCatalog;spark.sql.catalog.hive_prod.type=hive;spark.sql.catalog.hive_prod.uri=thrift://metastore-host:port
```

-### Debug Information

+If you need to create Iceberg tables, you also need to configure `spark.sql.catalog.hive_prod.warehouse`.

+Set hadoop catalog:

-## Data Format

-Currently, it only supports the following Hive data format:
+```properties
+spark.default.conf=spark.sql.catalog.hadoop_prod=org.apache.iceberg.hadoop.HadoopCatalog;spark.sql.catalog.hadoop_prod.type=hadoop;spark.sql.catalog.hadoop_prod.warehouse=hdfs://hadoop-namenode:port/warehouse
+```
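
For table creation through the hive catalog, a minimal sketch combining the hive catalog entry shown earlier with the `spark.sql.catalog.hive_prod.warehouse` setting mentioned above (the HDFS path is a placeholder):

```properties
spark.default.conf=spark.sql.catalog.hive_prod=org.apache.iceberg.spark.SparkCatalog;spark.sql.catalog.hive_prod.type=hive;spark.sql.catalog.hive_prod.uri=thrift://metastore-host:port;spark.sql.catalog.hive_prod.warehouse=hdfs://namenode:port/warehouse
```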

-| OpenMLDB Data Format | Hive Data Format |
-| -------------------- | ---------------- |
-| BOOL | BOOL |
-| SMALLINT | SMALLINT |
-| INT | INT |
-| BIGINT | BIGINT |
-| FLOAT | FLOAT |
-| DOUBLE | DOUBLE |
-| DATE | DATE |
-| TIMESTAMP | TIMESTAMP |
-| STRING | STRING |
+Set rest catalog:

-## Quickly Create Tables Through the `LIKE` Syntax TODO
+```properties
+spark.default.conf=spark.sql.catalog.rest_prod=org.apache.iceberg.spark.SparkCatalog;spark.sql.catalog.rest_prod.catalog-impl=org.apache.iceberg.rest.RESTCatalog;spark.sql.catalog.rest_prod.uri=http://iceberg-rest:8181/
+```

-We offer the convenience of utilizing the `LIKE` syntax to facilitate the creation of tables with identical schemas in OpenMLDB, leveraging existing Hive tables. This is demonstrated in the example below.
+For the full configuration of the Iceberg catalog, see [Iceberg Catalog Configuration](https://iceberg.apache.org/docs/latest/spark-configuration/).

+### Debug Information

-```sql
-CREATE TABLE db1.t1 LIKE HIVE 'hive://hive_db.t1';
--- SUCCEED
-```
+When you import data from Iceberg, you can check the task log to confirm whether the task read the source data.
+```
+INFO ReaderImpl: Reading ORC rows from
+```
+TODO

+## Data Format

-It's worth noting that there are certain known issues associated with using the `LIKE` syntax for creating tables based on Hive shortcuts:
+For the Iceberg schema, see [Iceberg Schema](https://iceberg.apache.org/spec/#schema). Currently, only the following Iceberg data formats are supported:

-- When employing the default timeout configuration via the command line, the table creation process might exhibit a timeout message despite the execution being successful. The final outcome can be verified by utilizing the `SHOW TABLES` command. If you need to adjust the timeout duration, refer to [Adjusting Configuration](../../openmldb_sql/ddl/SET_STATEMENT.md#offline-commands-configuration-details).
-- Should the Hive table contain column constraints (such as `NOT NULL`), these particular constraints won't be incorporated into the newly created table.
+| OpenMLDB Data Format | Iceberg Data Format |
+| -------------------- | ------------------- |
+| BOOL | bool |
+| INT | int |
+| BIGINT | long |
+| FLOAT | float |
+| DOUBLE | double |
+| DATE | date |
+| TIMESTAMP | timestamp |
+| STRING | string |
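
For instance, an OpenMLDB table declared with the types in the left column would map onto the Iceberg types on the right; a sketch (table and column names are placeholders):

```sql
CREATE TABLE t_types (c1 BOOL, c2 INT, c3 BIGINT, c4 FLOAT, c5 DOUBLE, c6 DATE, c7 TIMESTAMP, c8 STRING);
```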

## Import Iceberg Data to OpenMLDB

Importing data from Iceberg sources is facilitated through the API [`LOAD DATA INFILE`](../../openmldb_sql/dml/LOAD_DATA_STATEMENT.md). This operation employs a specialized URI format, `hive://[db].table`, to seamlessly import data from Hive. Here are some important considerations:
Importing data from Iceberg sources is facilitated through the API [`LOAD DATA INFILE`](../../openmldb_sql/dml/LOAD_DATA_STATEMENT.md). This operation employs a specialized URI format, `hive://[db].table`, to seamlessly import data from Iceberg. Here are some important considerations:

-- Both offline and online engines are capable of importing data from Hive sources.
-- The Hive data import feature supports soft connections. This approach minimizes the need for redundant data copies and ensures that OpenMLDB can access Hive's most up-to-date data at any given time. To activate the soft link mechanism for data import, utilize the `deep_copy=false` parameter.
-- The `OPTIONS` parameter offers two valid settings: `deep_copy`, `mode` and `sql`.
+- Both offline and online engines are capable of importing data from Iceberg sources.
+- The Iceberg data import feature supports soft connections. This approach minimizes the need for redundant data copies and ensures that OpenMLDB can access Iceberg's most up-to-date data at any given time. To activate the soft link mechanism for data import, utilize the `deep_copy=false` parameter.
+- The `OPTIONS` parameter offers three valid settings: `deep_copy`, `mode` and `sql`.

-For example, load data from iceberg configured as hive catalog:
+For example, load data from Iceberg configured as hive catalog:

```sql
LOAD DATA INFILE 'iceberg://hive_prod.db1.t1' INTO TABLE t1 OPTIONS(deep_copy=false);
-- or
LOAD DATA INFILE 'hive_prod.db1.t1' INTO TABLE t1 OPTIONS(deep_copy=false, format='iceberg');
```

-The data loading process also supports using SQL queries to filter specific data from Hive tables. It's important to note that the SQL syntax must comply with SparkSQL standards. The table name used should be the registered name without the `hive://` prefix.
+The data loading process also supports using SQL queries to filter specific data from Iceberg tables. It's important to note that the SQL syntax must comply with SparkSQL standards. The table name used should be the registered name without the `iceberg://` prefix.

For example:

@@ -86,7 +90,7 @@ LOAD DATA INFILE 'iceberg://hive_prod.db1.t1' INTO TABLE db1.t1 OPTIONS(deep_cop

## Export OpenMLDB Data to Iceberg

-Exporting data to Hive sources is facilitated through the API [`SELECT INTO`](../../openmldb_sql/dql/SELECT_INTO_STATEMENT.md), which employs a distinct URI format, `iceberg://[catalog].[db].table`, to seamlessly transfer data to the Hive data warehouse. Here are some key considerations:
+Exporting data to Iceberg sources is facilitated through the API [`SELECT INTO`](../../openmldb_sql/dql/SELECT_INTO_STATEMENT.md), which employs a distinct URI format, `iceberg://[catalog].[db].table`, to seamlessly transfer data to the Iceberg data warehouse. Here are some key considerations:

- If you omit specifying a database name, the default database name used will be `default_Db`. TODO?
- When a database name is explicitly provided, it's imperative that the database already exists. Currently, the system does not support the automatic creation of non-existent databases.
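
For example, mirroring the export example in the Chinese version of this page (catalog, database, and table names are placeholders):

```sql
SELECT col1, col2, col3 FROM t1 INTO OUTFILE 'iceberg://hive_prod.db1.t1';
```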
6 changes: 3 additions & 3 deletions docs/zh/integration/offline_data_sources/hive.md
@@ -26,14 +26,14 @@

### Configuration

-Currently, OpenMLDB only supports connecting to Hive through the metastore service. You can choose one of the following two configuration methods to access the Hive data source.
+Currently, OpenMLDB only supports connecting to Hive through the metastore service. You can choose one of the following two configuration methods to access the Hive data source. A Hive environment set up for testing is simple and usually only requires `hive.metastore.uris` to be configured. In production environments, however, more Hive settings may be needed, so the `hive-site.xml` approach is recommended.

-- spark.conf: you can configure `spark.hadoop.hive.metastore.uris` in the Spark conf. There are two ways:
+- spark.conf: you can configure `spark.hadoop.hive.metastore.uris` and related settings in the Spark conf. There are two ways:

  - taskmanager.properties: add `spark.hadoop.hive.metastore.uris=thrift://...` to the configuration item `spark.default.conf`, then restart the taskmanager.
  - CLI: add this configuration item to the ini conf and start the CLI with `--spark_conf`; see [Client Spark Configuration File](../../reference/client_config/client_spark_config.md).

-- hive-site.xml: you can configure `hive.metastore.uris` in `hive-site.xml` and place the configuration file in the Spark home's `conf/` (if the `HADOOP_CONF_DIR` environment variable is configured, the file can also be placed in `HADOOP_CONF_DIR`). Sample `hive-site.xml`:
+- hive-site.xml: place Hive's `hive-site.xml` configuration file in the Spark home's `conf/` (if the `HADOOP_CONF_DIR` environment variable is configured, the file can also be placed in `HADOOP_CONF_DIR`). Sample `hive-site.xml`:

```xml
<configuration>
  <!-- ... hive.metastore.uris and other settings; remainder collapsed in this diff ... -->
```
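
For the spark.conf route above, a minimal `taskmanager.properties` sketch (the metastore host is a placeholder; 9083 is the common default port):

```properties
spark.default.conf=spark.hadoop.hive.metastore.uris=thrift://metastore-host:9083
```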
102 changes: 102 additions & 0 deletions docs/zh/integration/offline_data_sources/iceberg.md
@@ -0,0 +1,102 @@
# Iceberg

## Introduction

[Apache Iceberg](https://iceberg.apache.org/) is an open table format for huge analytic datasets. Iceberg adds tables to compute engines such as Spark, Trino, PrestoDB, Flink, Hive, and Impala, using a high-performance table format that works just like a SQL table. OpenMLDB supports using Iceberg as an offline storage engine for importing data and exporting feature computation data.

## Configuration

### Installation

[The OpenMLDB Spark distribution](../../tutorial/openmldbspark_distribution.md) v0.8.5 and later already includes the Iceberg 1.4.3 dependency. If you need to use it with another Iceberg version or another Spark distribution, you can download the corresponding Iceberg dependency from the [Iceberg release](https://iceberg.apache.org/releases/) page and add it to the Spark classpath/jars. For example, if you are using OpenMLDB Spark, you should download the `x.x.x Spark 3.2_12 runtime Jar` (where x.x.x is the Iceberg version) and add it to `jars/` in the Spark home.

### Configuration

You need to add the catalog configuration to the Spark configuration. There are two ways:

- taskmanager.properties(.template): add the Iceberg configuration to the configuration item `spark.default.conf`, then restart the taskmanager.
- CLI: add the configuration item to the ini conf and start the CLI with `--spark_conf`; see [Client Spark Configuration File](../../reference/client_config/client_spark_config.md).

For Iceberg configuration details, see [Iceberg Configuration](https://iceberg.apache.org/docs/latest/spark-configuration/).

For example, set the hive catalog in `taskmanager.properties(.template)`:

```properties
spark.default.conf=spark.sql.catalog.hive_prod=org.apache.iceberg.spark.SparkCatalog;spark.sql.catalog.hive_prod.type=hive;spark.sql.catalog.hive_prod.uri=thrift://metastore-host:port
```

If you need to create Iceberg tables, you also need to configure `spark.sql.catalog.hive_prod.warehouse`.

Set the hadoop catalog:

```properties
spark.default.conf=spark.sql.catalog.hadoop_prod=org.apache.iceberg.hadoop.HadoopCatalog;spark.sql.catalog.hadoop_prod.type=hadoop;spark.sql.catalog.hadoop_prod.warehouse=hdfs://hadoop-namenode:port/warehouse
```

Set the rest catalog:

```properties
spark.default.conf=spark.sql.catalog.rest_prod=org.apache.iceberg.spark.SparkCatalog;spark.sql.catalog.rest_prod.catalog-impl=org.apache.iceberg.rest.RESTCatalog;spark.sql.catalog.rest_prod.uri=http://iceberg-rest:8181/
```

For the full Iceberg catalog configuration, see [Iceberg Catalog Configuration](https://iceberg.apache.org/docs/latest/spark-configuration/).

### Debug Information

When you import data from Iceberg, you can check the task log to confirm whether the task read the source data.

```
INFO ReaderImpl: Reading ORC rows from
```

## Data Format

For the Iceberg schema, see [Iceberg Schema](https://iceberg.apache.org/spec/#schema). Currently, only the following Iceberg data formats are supported:

| OpenMLDB Data Format | Iceberg Data Format |
| -------------------- | ------------------- |
| BOOL | bool |
| INT | int |
| BIGINT | long |
| FLOAT | float |
| DOUBLE | double |
| DATE | date |
| TIMESTAMP | timestamp |
| STRING | string |
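
For illustration, a hypothetical Iceberg table covering the mapped types could be created through Spark SQL in the `hive_prod` catalog configured above (database, table, and column names are placeholders):

```sql
CREATE TABLE hive_prod.db1.t_types (
  c1 boolean, c2 int, c3 bigint, c4 float,
  c5 double, c6 date, c7 timestamp, c8 string
) USING iceberg;
```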

## Import Iceberg Data to OpenMLDB

Importing data from Iceberg tables uses the [`LOAD DATA INFILE`](../../openmldb_sql/dml/LOAD_DATA_STATEMENT.md) statement. It uses the special URI format `iceberg://[catalog].[db].table` to import data from Iceberg seamlessly. Here are some important considerations:

- Both the offline and online engines can import data from Iceberg tables.
- Offline import supports soft links, while online import does not. When using a soft link, specify `deep_copy=false` in the import OPTIONS.
- Only three parameters are valid for Iceberg table imports: `deep_copy`, `mode`, and `sql`. Other format parameters such as `delimiter` and `quote` have no effect.

For example, import data through the Iceberg Hive catalog:

```sql
LOAD DATA INFILE 'iceberg://hive_prod.db1.t1' INTO TABLE t1 OPTIONS(deep_copy=false);
-- or
LOAD DATA INFILE 'hive_prod.db1.t1' INTO TABLE t1 OPTIONS(deep_copy=false, format='iceberg');
```

Data import also supports the `sql` parameter to filter specific data from the table for import. Note that the SQL must conform to SparkSQL syntax, and the table name used is the registered name without the `iceberg://` prefix.

```sql
LOAD DATA INFILE 'iceberg://hive_prod.db1.t1' INTO TABLE t1 OPTIONS(deep_copy=false, sql='select * from t1 where id > 100');
```

## Export OpenMLDB Data to Iceberg

Exporting data from OpenMLDB to Iceberg tables uses the [`SELECT INTO`](../../openmldb_sql/dql/SELECT_INTO_STATEMENT.md) statement, which uses the special URI format `iceberg://[catalog].[db].table` to export data to Iceberg tables seamlessly. Here are some important considerations:

- If no database name is specified, the default database name `default_db` is used. TODO
- If a database name is specified, that database must already exist; automatic creation of non-existent databases is currently not supported.
- If the specified Iceberg table name does not exist, a table with that name is created automatically in Iceberg.
- In the `OPTIONS` parameter, only the export mode `mode` takes effect; no other parameters have any effect.

For example:

```sql
SELECT col1, col2, col3 FROM t1 INTO OUTFILE 'iceberg://hive_prod.db1.t1';
```
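
Since only `mode` takes effect, a write-behavior sketch might look like the following, assuming `mode` accepts the usual `SELECT INTO` values such as `overwrite` (an assumption, not confirmed by this page):

```sql
SELECT col1, col2, col3 FROM t1 INTO OUTFILE 'iceberg://hive_prod.db1.t1' OPTIONS(mode='overwrite');
```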
