Commit

merge conflicts

ranxianglei.rxl committed Jan 2, 2025
2 parents db912e4 + 18f46f7 commit c7a3776
Showing 1,184 changed files with 38,637 additions and 10,664 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/publish_snapshot.yml

@@ -64,6 +64,6 @@ jobs:
          echo "<password>$ASF_PASSWORD</password>" >> $tmp_settings
          echo "</server></servers></settings>" >> $tmp_settings
-         mvn --settings $tmp_settings clean deploy -Dgpg.skip -Drat.skip -DskipTests -Papache-release
+         mvn --settings $tmp_settings clean deploy -Dgpg.skip -Drat.skip -DskipTests -Papache-release,spark3
          rm $tmp_settings
2 changes: 1 addition & 1 deletion .github/workflows/utitcase-jdk11.yml

@@ -53,7 +53,7 @@ jobs:
          jvm_timezone=$(random_timezone)
          echo "JVM timezone is set to $jvm_timezone"
          test_modules="!paimon-e2e-tests,!org.apache.paimon:paimon-hive-connector-3.1,"
-         for suffix in 3.5 3.4 3.3 3.2 common; do
+         for suffix in 3.5 3.4 3.3 3.2 ut; do
            test_modules+="!org.apache.paimon:paimon-spark-${suffix},"
          done
          test_modules="${test_modules%,}"
2 changes: 1 addition & 1 deletion .github/workflows/utitcase-spark-3.x.yml

@@ -54,7 +54,7 @@ jobs:
          jvm_timezone=$(random_timezone)
          echo "JVM timezone is set to $jvm_timezone"
          test_modules=""
-         for suffix in common_2.12 3.5 3.4 3.3 3.2; do
+         for suffix in ut 3.5 3.4 3.3 3.2; do
            test_modules+="org.apache.paimon:paimon-spark-${suffix},"
          done
          test_modules="${test_modules%,}"
2 changes: 1 addition & 1 deletion .github/workflows/utitcase-spark-4.x.yml

@@ -54,7 +54,7 @@ jobs:
          jvm_timezone=$(random_timezone)
          echo "JVM timezone is set to $jvm_timezone"
          test_modules=""
-         for suffix in common_2.13 4.0; do
+         for suffix in ut 4.0; do
            test_modules+="org.apache.paimon:paimon-spark-${suffix},"
          done
          test_modules="${test_modules%,}"
2 changes: 1 addition & 1 deletion .github/workflows/utitcase.yml

@@ -54,7 +54,7 @@ jobs:
          jvm_timezone=$(random_timezone)
          echo "JVM timezone is set to $jvm_timezone"
          test_modules="!paimon-e2e-tests,"
-         for suffix in 3.5 3.4 3.3 3.2 common_2.12; do
+         for suffix in 3.5 3.4 3.3 3.2 ut; do
            test_modules+="!org.apache.paimon:paimon-spark-${suffix},"
          done
          test_modules="${test_modules%,}"
5 changes: 5 additions & 0 deletions LICENSE

@@ -270,6 +270,11 @@
  paimon-format/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java
  paimon-format/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java
  from https://parquet.apache.org/ version 1.14.0

+ paimon-common/src/main/java/org/apache/paimon/data/variant/GenericVariant.java
+ paimon-common/src/main/java/org/apache/paimon/data/variant/GenericVariantBuilder.java
+ paimon-common/src/main/java/org/apache/paimon/data/variant/GenericVariantUtil.java
+ from https://spark.apache.org/ version 4.0.0-preview2
+
  MIT License
  -----------
2 changes: 1 addition & 1 deletion docs/content/append-table/bucketed.md

@@ -196,4 +196,4 @@ The `spark.sql.sources.v2.bucketing.enabled` config is used to enable bucketing
  Spark will recognize the specific distribution reported by a V2 data source through SupportsReportPartitioning, and
  will try to avoid shuffle if necessary.

- The costly join shuffle will be avoided if two tables have same bucketing strategy and same number of buckets.
+ The costly join shuffle will be avoided if two tables have the same bucketing strategy and same number of buckets.
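For context, a hedged sketch of how this is typically exercised from Spark SQL (the table names `t1`/`t2` and the bucket column `id` are hypothetical; only `spark.sql.sources.v2.bucketing.enabled` comes from the text above):

```sql
-- Assumed setup: t1 and t2 are Paimon bucketed tables created with the
-- same bucket key and the same bucket count.
SET spark.sql.sources.v2.bucketing.enabled = true;

-- With matching bucketing on both sides, Spark can plan this join
-- without a shuffle.
SELECT t1.id, t1.v, t2.v
FROM t1 JOIN t2 ON t1.id = t2.id;
```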
4 changes: 2 additions & 2 deletions docs/content/append-table/query-performance.md

@@ -35,7 +35,7 @@ filtering, if the filtering effect is good, a query that would have taken minutes completes in
  milliseconds.

  Often the data distribution alone does not filter effectively, so what if we could sort the data by the field in the `WHERE` condition?
- You can take a look to [Flink COMPACT Action]({{< ref "maintenance/dedicated-compaction#sort-compact" >}}) or
+ You can take a look at [Flink COMPACT Action]({{< ref "maintenance/dedicated-compaction#sort-compact" >}}) or
  [Flink COMPACT Procedure]({{< ref "flink/procedures" >}}) or [Spark COMPACT Procedure]({{< ref "spark/procedures" >}}).

## Data Skipping By File Index

@@ -54,7 +54,7 @@ file is too small, it will be stored directly in the manifest, otherwise in the
  corresponds to an index file, which has a separate file definition and can contain different types of indexes with
  multiple columns.

- Different file index may be efficient in different scenario. For example bloom filter may speed up query in point lookup
+ Different file indexes may be efficient in different scenarios. For example, a bloom filter may speed up queries in point-lookup
  scenarios, while a bitmap may consume more space but can result in greater accuracy.

`Bloom Filter`:
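As a hedged illustration of how a bloom-filter file index is typically declared (the option keys `file-index.bloom-filter.columns` and the per-column `.items` hint are assumptions based on Paimon's option naming; the table and columns are hypothetical):

```sql
CREATE TABLE orders (
    order_id BIGINT,
    buyer_id BIGINT,
    dt       STRING
) WITH (
    -- build a bloom-filter index on order_id for point lookups
    'file-index.bloom-filter.columns' = 'order_id',
    -- assumed hint for the expected number of distinct items
    'file-index.bloom-filter.order_id.items' = '50000'
);
```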
5 changes: 5 additions & 0 deletions docs/content/cdc-ingestion/kafka-cdc.md

@@ -198,10 +198,15 @@ To use this feature through `flink run`, run the following shell command.
  kafka_sync_database
  --warehouse <warehouse-path> \
  --database <database-name> \
+ [--table_mapping <table-name>=<paimon-table-name>] \
  [--table_prefix <paimon-table-prefix>] \
  [--table_suffix <paimon-table-suffix>] \
+ [--table_prefix_db <paimon-table-prefix-by-db>] \
+ [--table_suffix_db <paimon-table-suffix-by-db>] \
  [--including_tables <table-name|name-regular-expr>] \
  [--excluding_tables <table-name|name-regular-expr>] \
+ [--including_dbs <database-name|name-regular-expr>] \
+ [--excluding_dbs <database-name|name-regular-expr>] \
  [--type_mapping to-string] \
  [--partition_keys <partition_keys>] \
  [--primary_keys <primary-keys>] \
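For illustration, a hedged invocation using the new database-level flags might look like the following (the warehouse path, topic, and name patterns are hypothetical; the `--kafka_conf` and `--table_conf` flags follow the pattern of the surrounding documentation):

```bash
<FLINK_HOME>/bin/flink run \
    /path/to/paimon-flink-action-{{< version >}}.jar \
    kafka_sync_database \
    --warehouse hdfs:///path/to/warehouse \
    --database test_db \
    --table_mapping topic_orders=orders \
    --including_dbs 'app_db_.*' \
    --excluding_tables 'tmp_.*' \
    --kafka_conf properties.bootstrap.servers=127.0.0.1:9092 \
    --kafka_conf topic=order_topic \
    --kafka_conf value.format=canal-json \
    --table_conf bucket=4
```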
2 changes: 1 addition & 1 deletion docs/content/cdc-ingestion/mysql-cdc.md

@@ -261,5 +261,5 @@ to avoid potential name conflict.

## FAQ
1. Chinese characters in records ingested from MySQL are garbled.
-    * Try to set `env.java.opts: -Dfile.encoding=UTF-8` in `flink-conf.yaml`
+    * Try to set `env.java.opts: -Dfile.encoding=UTF-8` in `flink-conf.yaml` (Flink version < 1.19) or `config.yaml` (Flink version >= 1.19)
      (the option was renamed to `env.java.opts.all` in Flink 1.17).
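As a hedged illustration of the Flink >= 1.19 form (the nested layout is assumed from Flink's standard YAML configuration; verify against your Flink version):

```yaml
# config.yaml (Flink >= 1.19): nested form of env.java.opts.all.
# For Flink 1.17/1.18, set env.java.opts.all in flink-conf.yaml instead;
# before 1.17 the key was env.java.opts.
env:
  java:
    opts:
      all: -Dfile.encoding=UTF-8
```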
179 changes: 179 additions & 0 deletions docs/content/concepts/data-types.md

@@ -0,0 +1,179 @@
---
title: "Data Types"
weight: 7
type: docs
aliases:
- /concepts/data-types.html
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Data Types

A data type describes the logical type of a value in the table ecosystem. It can be used to declare input and/or output types of operations.

All data types supported by Paimon are as follows:

<table class="table table-bordered">
<thead>
<tr>
<th class="text-left" style="width: 10%">DataType</th>
<th class="text-left" style="width: 30%">Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>BOOLEAN</code></td>
<td><code>Data type of a boolean with a (possibly) three-valued logic of TRUE, FALSE, and UNKNOWN.</code></td>
</tr>
<tr>
<td><code>CHAR</code><br>
<code>CHAR(n)</code>
</td>
<td><code>Data type of a fixed-length character string.</code><br><br>
<code>The type can be declared using CHAR(n) where n is the number of code points. n must have a value between 1 and 2,147,483,647 (both inclusive). If no length is specified, n is equal to 1. </code>
</td>
</tr>
<tr>
<td><code>VARCHAR</code><br>
<code>VARCHAR(n)</code><br><br>
<code>STRING</code>
</td>
<td><code>Data type of a variable-length character string.</code><br><br>
<code>The type can be declared using VARCHAR(n) where n is the maximum number of code points. n must have a value between 1 and 2,147,483,647 (both inclusive). If no length is specified, n is equal to 1. </code><br><br>
<code>STRING is a synonym for VARCHAR(2147483647).</code>
</td>
</tr>
<tr>
<td><code>BINARY</code><br>
<code>BINARY(n)</code><br><br>
</td>
<td><code>Data type of a fixed-length binary string (=a sequence of bytes).</code><br><br>
<code>The type can be declared using BINARY(n) where n is the number of bytes. n must have a value between 1 and 2,147,483,647 (both inclusive). If no length is specified, n is equal to 1.</code>
</td>
</tr>
<tr>
<td><code>VARBINARY</code><br>
<code>VARBINARY(n)</code><br><br>
<code>BYTES</code>
</td>
<td><code>Data type of a variable-length binary string (=a sequence of bytes).</code><br><br>
<code>The type can be declared using VARBINARY(n) where n is the maximum number of bytes. n must have a value between 1 and 2,147,483,647 (both inclusive). If no length is specified, n is equal to 1.</code><br><br>
<code>BYTES is a synonym for VARBINARY(2147483647).</code>
</td>
</tr>
<tr>
<td><code>DECIMAL</code><br>
<code>DECIMAL(p)</code><br>
<code>DECIMAL(p, s)</code>
</td>
<td><code>Data type of a decimal number with fixed precision and scale.</code><br><br>
<code>The type can be declared using DECIMAL(p, s) where p is the number of digits in a number (precision) and s is the number of digits to the right of the decimal point in a number (scale). p must have a value between 1 and 38 (both inclusive). s must have a value between 0 and p (both inclusive). The default value for p is 10. The default value for s is 0.</code>
</td>
</tr>
<tr>
<td><code>TINYINT</code></td>
<td><code>Data type of a 1-byte signed integer with values from -128 to 127.</code></td>
</tr>
<tr>
<td><code>SMALLINT</code></td>
<td><code>Data type of a 2-byte signed integer with values from -32,768 to 32,767.</code></td>
</tr>
<tr>
<td><code>INT</code></td>
<td><code>Data type of a 4-byte signed integer with values from -2,147,483,648 to 2,147,483,647.</code></td>
</tr>
<tr>
<td><code>BIGINT</code></td>
<td><code>Data type of an 8-byte signed integer with values from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807.</code></td>
</tr>
<tr>
<td><code>FLOAT</code></td>
<td><code>Data type of a 4-byte single precision floating point number.</code><br><br>
<code>Compared to the SQL standard, the type does not take parameters.</code>
</td>
</tr>
<tr>
<td><code>DOUBLE</code></td>
<td><code>Data type of an 8-byte double precision floating point number.</code></td>
</tr>
<tr>
<td><code>DATE</code></td>
<td><code>Data type of a date consisting of year-month-day with values ranging from 0000-01-01 to 9999-12-31.</code><br><br>
<code>Compared to the SQL standard, the range starts at year 0000.</code>
</td>
</tr>
<tr>
<td><code>TIME</code><br>
<code>TIME(p)</code>
</td>
<td><code>Data type of a time without time zone consisting of hour:minute:second[.fractional] with up to nanosecond precision and values ranging from 00:00:00.000000000 to 23:59:59.999999999.</code><br><br>
<code>The type can be declared using TIME(p) where p is the number of digits of fractional seconds (precision). p must have a value between 0 and 9 (both inclusive). If no precision is specified, p is equal to 0.</code>
</td>
</tr>
<tr>
<td><code>TIMESTAMP</code><br>
<code>TIMESTAMP(p)</code>
</td>
<td><code>Data type of a timestamp without time zone consisting of year-month-day hour:minute:second[.fractional] with up to nanosecond precision and values ranging from 0000-01-01 00:00:00.000000000 to 9999-12-31 23:59:59.999999999.</code><br><br>
<code>The type can be declared using TIMESTAMP(p) where p is the number of digits of fractional seconds (precision). p must have a value between 0 and 9 (both inclusive). If no precision is specified, p is equal to 6.</code>
</td>
</tr>
<tr>
<td><code>TIMESTAMP WITH TIME ZONE</code><br>
<code>TIMESTAMP(p) WITH TIME ZONE</code>
</td>
<td><code>Data type of a timestamp with time zone consisting of year-month-day hour:minute:second[.fractional] zone with up to nanosecond precision and values ranging from 0000-01-01 00:00:00.000000000 +14:59 to 9999-12-31 23:59:59.999999999 -14:59.</code><br><br>
<code>This type fills the gap between time zone free and time zone mandatory timestamp types by allowing the interpretation of UTC timestamps according to the configured session time zone. A conversion from and to int describes the number of seconds since epoch. A conversion from and to long describes the number of milliseconds since epoch.</code>
</td>
</tr>
<tr>
<td><code>ARRAY&lt;t&gt;</code></td>
<td><code>Data type of an array of elements with same subtype.</code><br><br>
<code>Compared to the SQL standard, the maximum cardinality of an array cannot be specified but is fixed at 2,147,483,647. Also, any valid type is supported as a subtype.</code><br><br>
<code>The type can be declared using ARRAY&lt;t&gt; where t is the data type of the contained elements.</code>
</td>
</tr>
<tr>
<td><code>MAP&lt;kt, vt&gt;</code></td>
<td><code>Data type of an associative array that maps keys (including NULL) to values (including NULL). A map cannot contain duplicate keys; each key can map to at most one value.</code><br><br>
<code>There is no restriction of element types; it is the responsibility of the user to ensure uniqueness.</code><br><br>
<code>The type can be declared using MAP&lt;kt, vt&gt; where kt is the data type of the key elements and vt is the data type of the value elements.</code>
</td>
</tr>
<tr>
<td><code>MULTISET&lt;t&gt;</code></td>
<td><code>Data type of a multiset (=bag). Unlike a set, it allows for multiple instances for each of its elements with a common subtype. Each unique value (including NULL) is mapped to some multiplicity.</code><br><br>
<code>There is no restriction of element types; it is the responsibility of the user to ensure uniqueness.</code><br><br>
<code>The type can be declared using MULTISET&lt;t&gt; where t is the data type of the contained elements.</code>
</td>
</tr>
<tr>
<td><code>ROW&lt;n0 t0, n1 t1, ...&gt;</code><br>
<code>ROW&lt;n0 t0 'd0', n1 t1 'd1', ...&gt;</code>
</td>
<td><code>Data type of a sequence of fields.</code><br><br>
<code>A field consists of a field name, field type, and an optional description. The most specific type of a row of a table is a row type. In this case, each column of the row corresponds to the field of the row type that has the same ordinal position as the column.</code><br><br>
<code>Compared to the SQL standard, an optional field description simplifies the handling with complex structures.</code><br><br>
<code>A row type is similar to the STRUCT type known from other non-standard-compliant frameworks.</code><br><br>
<code>The type can be declared using ROW&lt;n0 t0 'd0', n1 t1 'd1', ...&gt; where n is the unique name of a field, t is the logical type of a field, d is the description of a field.</code>
</td>
</tr>
</tbody>
</table>
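A brief, hypothetical example pulling several of these types together (the table name and columns are invented for illustration):

```sql
CREATE TABLE product_events (
    id         BIGINT,
    name       VARCHAR(255),
    price      DECIMAL(10, 2),      -- 10 digits, 2 to the right of the point
    in_stock   BOOLEAN,
    tags       ARRAY<STRING>,
    attributes MAP<STRING, STRING>,
    created_at TIMESTAMP(3)         -- millisecond precision
);
```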
2 changes: 1 addition & 1 deletion docs/content/concepts/spec/_index.md

@@ -1,7 +1,7 @@
  ---
  title: Specification
  bookCollapseSection: true
- weight: 6
+ weight: 8
  ---
  <!--
  Licensed to the Apache Software Foundation (ASF) under one
40 changes: 37 additions & 3 deletions docs/content/concepts/spec/datafile.md

@@ -83,11 +83,45 @@
  The name of data file is `data-${uuid}-${id}.${format}`. For the append table, the file stores the data of the table
  without adding any new columns. But for the primary key table, each row of data stores additional system columns:

- 1. `_VALUE_KIND`: row is deleted or added. Similar to RocksDB, each row of data can be deleted or added, which will be
+ ## Table with Primary Key Data File
+
+ 1. Primary key columns, prefixed with `_KEY_` to avoid conflicts with the table's own columns. These are optional;
+    Paimon 1.0 and above can retrieve the primary key fields from the value columns.
+ 2. `_VALUE_KIND`: TINYINT, whether the row is deleted or added. Similar to RocksDB, each row of data can be deleted or added, which will be
     used for updating the primary key table.
- 2. `_SEQUENCE_NUMBER`: this number is used for comparison during updates, determining which data came first and which
+ 3. `_SEQUENCE_NUMBER`: BIGINT, this number is used for comparison during updates, determining which data came first and which
     data came later.
- 3. `_KEY_` prefix to key columns, this is to avoid conflicts with columns of the table.
+ 4. Value columns: all columns declared in the table.

For example, consider the data file for this table:

```sql
CREATE TABLE T (
    a INT PRIMARY KEY NOT ENFORCED,
    b INT,
    c INT
);
```

Its file has 6 columns: `_KEY_a`, `_VALUE_KIND`, `_SEQUENCE_NUMBER`, `a`, `b`, `c`.

When `data-file.thin-mode` is enabled, its file has 5 columns: `_VALUE_KIND`, `_SEQUENCE_NUMBER`, `a`, `b`, `c`.
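A hedged sketch of enabling thin mode (assuming the `data-file.thin-mode` option named above is settable as a table option; the syntax follows the Flink SQL examples on this page):

```sql
CREATE TABLE T (
    a INT PRIMARY KEY NOT ENFORCED,
    b INT,
    c INT
) WITH (
    -- assumption: 'data-file.thin-mode' is a per-table option;
    -- it drops the redundant _KEY_a column from data files
    'data-file.thin-mode' = 'true'
);
```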

## Table without Primary Key Data File

The data file stores only the value columns, i.e. all columns declared in the table.

For example, consider the data file for this table:

```sql
CREATE TABLE T (
    a INT,
    b INT,
    c INT
);
```

Its file has 3 columns: `a`, `b`, `c`.

## Changelog File
28 changes: 26 additions & 2 deletions docs/content/concepts/spec/fileindex.md

@@ -154,11 +154,35 @@ BSI file index format (V1)
+-------------------------------------------------+
| has positive value (1 byte)                     |
+-------------------------------------------------+
| positive BSI serialized (if has positive value) |
+-------------------------------------------------+
| has negative value (1 byte)                     |
+-------------------------------------------------+
| negative BSI serialized (if has negative value) |
+-------------------------------------------------+
</pre>

BSI serialized format (V1):
<pre>
BSI serialized format (V1)
+-------------------------------------------------+
| version (1 byte)                                |
+-------------------------------------------------+
| min value (8 bytes long)                        |
+-------------------------------------------------+
| max value (8 bytes long)                        |
+-------------------------------------------------+
| serialized existence bitmap                     |
+-------------------------------------------------+
| bit slice bitmap count (4 bytes int)            |
+-------------------------------------------------+
| serialized bit 0 bitmap                         |
+-------------------------------------------------+
| serialized bit 1 bitmap                         |
+-------------------------------------------------+
| serialized bit 2 bitmap                         |
+-------------------------------------------------+
| ...                                             |
+-------------------------------------------------+
</pre>
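To ground the bit-slice layout above, here is a rough, self-contained sketch of the BSI idea (this is not Paimon's actual serializer; the `ToyBsi` class and its methods are invented for illustration):

```java
import java.util.BitSet;

/**
 * A toy bit-sliced index: one bitmap per bit of (value - min),
 * plus an existence bitmap, mirroring the layout sketched above.
 */
class ToyBsi {
    final long min;
    final BitSet existence = new BitSet(); // rows that have a value
    final BitSet[] slices;                 // slices[i]: rows whose bit i is 1

    ToyBsi(long min, long max) {
        this.min = min;
        int bits = 64 - Long.numberOfLeadingZeros(Math.max(1L, max - min));
        slices = new BitSet[bits];
        for (int i = 0; i < bits; i++) slices[i] = new BitSet();
    }

    void add(int row, long value) {
        existence.set(row);
        long v = value - min; // store the offset so all slices are non-negative
        for (int i = 0; i < slices.length; i++) {
            if (((v >>> i) & 1L) == 1L) slices[i].set(row);
        }
    }

    long get(int row) { // reassemble the value for one row
        long v = 0L;
        for (int i = 0; i < slices.length; i++) {
            if (slices[i].get(row)) v |= 1L << i;
        }
        return v + min;
    }
}
```

Serialization would then write the min value, max value, existence bitmap, slice count, and each slice bitmap in order, matching the boxes in the diagram.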