Add history mode guide #49

6 changes: 4 additions & 2 deletions development-guide.md
Expand Up @@ -105,12 +105,14 @@ This operation should report all columns in the destination table, including Fiv
- This operation might be requested for a table that does not exist in the destination. In that case, it should NOT fail; simply ignore the request and return `success = true`.
- `utc_delete_before` has millisecond precision.

#### WriteBatchRequest
- `replace_files` is for the `upsert` operation, where rows are inserted if they don't exist or updated if they do. Each row always provides values for all columns. Set the `_fivetran_synced` column in the destination to the values coming in from the CSV files.

- `update_files` is for the `update` operation, where modified columns have actual values and unmodified columns have the special value `unmodified_string` from `CsvFileParams`. Soft-deleted rows arrive here as well. Update the `_fivetran_synced` column in the destination with the values coming in from the CSV files.

- `delete_files` is for the `hard delete` operation. Use the primary key columns (or the `_fivetran_id` system column for primary-keyless tables) to perform `DELETE FROM`.

> Note: To handle history mode `replace_files`, `update_files` and `delete_files`, follow the [How to Handle History Mode Batch Files](how-to-handle-history-mode-batch-files.md) guide.

Also, Fivetran will deduplicate operations so that each primary key shows up only once in any of the operations.
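
As a rough illustration of the `update_files` rule above, the following Python sketch merges one incoming row into an existing destination row and leaves alone every column that carries the unmodified marker. The marker value, the `merge_update` helper and the column names are hypothetical; in a real connector the marker comes from `unmodified_string` in `CsvFileParams`.

```python
# Hypothetical marker; the real value is supplied in CsvFileParams.unmodified_string.
UNMODIFIED = "unmodified-marker"

def merge_update(existing, incoming, unmodified=UNMODIFIED):
    """Return a copy of `existing` with only the modified columns overwritten."""
    merged = dict(existing)
    for column, value in incoming.items():
        if value != unmodified:
            # The column carries an actual value, so it replaces the old one.
            merged[column] = value
    return merged

existing_row = {"id": 1, "col1": "abc", "col2": 7, "_fivetran_synced": 100}
incoming_row = {"id": 1, "col1": "xyz", "col2": UNMODIFIED, "_fivetran_synced": 101}

merged_row = merge_update(existing_row, incoming_row)
```

Here `col2` keeps its old value because the incoming row marks it as unmodified, while `col1` and `_fivetran_synced` take the incoming values.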

221 changes: 221 additions & 0 deletions how-to-handle-history-mode-batch-files.md
@@ -0,0 +1,221 @@
# What is History Mode

History mode lets you capture every available version of each record from Fivetran source connectors.
To keep all versions of the records, three new system columns are added to tables with history mode enabled.


Column | Type | Description
--- | --- | ---
_fivetran_active | Boolean | TRUE if it is the currently active record. FALSE if it is a historical version of the record. Only one version of the record can be TRUE.
_fivetran_start | TimeStamp | The time when the record was first created or modified in the source.
_fivetran_end | TimeStamp | The value for this column depends on whether the record is active. If the record is not active, then the `_fivetran_end` value will be the `_fivetran_start` of the next version of the record minus 1 millisecond. If the record is deleted, then the value will be the same as the timestamp of the delete operation. If the record is active, then `_fivetran_end` is set to the maximum TIMESTAMP value.
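
To make the relationship between the three columns concrete, here is a small Python sketch that derives `_fivetran_end` and `_fivetran_active` for the versions of a single record from their `_fivetran_start` values. Timestamps are modeled as epoch milliseconds, and `TMAX` and `version_windows` are hypothetical names; the actual maximum timestamp depends on the destination. The deleted-record case is not covered.

```python
# Assumed stand-in for the destination's maximum TIMESTAMP value (epoch ms).
TMAX = 253402300799999

def version_windows(starts):
    """Return (start, end, active) triples for one record's versions."""
    ordered = sorted(starts)
    windows = []
    for i, start in enumerate(ordered):
        if i + 1 < len(ordered):
            # Historical version: ends 1 ms before the next version starts.
            windows.append((start, ordered[i + 1] - 1, False))
        else:
            # Latest version: still active, so it ends at the maximum timestamp.
            windows.append((start, TMAX, True))
    return windows

windows = version_windows([1000, 2000, 3000])
```

Only the last triple is active, which matches the rule that at most one version of a record has `_fivetran_active` set to TRUE.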


## Points to remember in history mode

- In `WriteBatchRequest`, we pass a new boolean field, `history_mode`, which indicates whether the connector is in history mode.
- If the existing table is not empty, the batch file also includes a boolean column, `_fivetran_earliest`. If an `upsert` contains multiple versions of the same record in a single flush, `_fivetran_earliest` is `TRUE` for the version with the earliest `_fivetran_start` and `FALSE` for all other versions.
- For each `replace`, `update` and `delete` batch file, DELETE the existing records in the destination table whose `_fivetran_start` is greater than or equal to the `_fivetran_start` of the matching records in the batch file (refer to [Example 1](#example-1) and [Example 2](#example-2)).

> IMPORTANT: Do not add the `_fivetran_earliest` column to the destination table. It is provided for convenience to easily identify the earliest record and can be used to optimize the data load query.

See the following example of a `replace` batch file in history mode:

Id(PK) | COL1 | _fivetran_start(PK) | _fivetran_end | _fivetran_active | _fivetran_earliest
---|---------|---------------------| --- |------------------| ---
1 | abc | T1 | T2-1 | FALSE | TRUE
2 | xyz | T1 | TMAX | TRUE | TRUE
1 | pqr | T2 | T3-1 | FALSE | FALSE
1 | def | T3 | TMAX | TRUE | FALSE
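
The sample batch file above can be modeled in Python to show the two things a destination typically does with `_fivetran_earliest`: find the earliest `_fivetran_start` per primary key, and drop the column before loading. This is only a sketch; the dictionary layout, the integer timestamps and the helper names are assumptions made for illustration.

```python
T1, T2, T3 = 1000, 2000, 3000   # illustrative epoch-millisecond timestamps
TMAX = 253402300799999          # assumed maximum timestamp of the destination

# The sample replace batch file, one dict per CSV row.
batch = [
    {"id": 1, "col1": "abc", "_fivetran_start": T1, "_fivetran_end": T2 - 1,
     "_fivetran_active": False, "_fivetran_earliest": True},
    {"id": 2, "col1": "xyz", "_fivetran_start": T1, "_fivetran_end": TMAX,
     "_fivetran_active": True, "_fivetran_earliest": True},
    {"id": 1, "col1": "pqr", "_fivetran_start": T2, "_fivetran_end": T3 - 1,
     "_fivetran_active": False, "_fivetran_earliest": False},
    {"id": 1, "col1": "def", "_fivetran_start": T3, "_fivetran_end": TMAX,
     "_fivetran_active": True, "_fivetran_earliest": False},
]

def earliest_start_by_key(rows):
    """Map each primary key to the _fivetran_start of its earliest version."""
    return {r["id"]: r["_fivetran_start"] for r in rows if r["_fivetran_earliest"]}

def strip_earliest(rows):
    """Drop the helper column, which must not reach the destination table."""
    return [{k: v for k, v in r.items() if k != "_fivetran_earliest"} for r in rows]

earliest = earliest_start_by_key(batch)
loadable = strip_earliest(batch)
```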

## How to Handle Replaces, Updates and Deletes

### Replace

#### Example 1
This example describes a case where the `_fivetran_start` column value of the destination row is less than `_fivetran_start` of the matching row in the batch file.

Suppose the existing table in the destination is as follows:

Id(PK) | COL1 | COL2 | _fivetran_start(PK) | _fivetran_end | _fivetran_active | _fivetran_synced
--- |------|----| --- | --- | --- | ---
1 | abc | 1 |T1 | T2-1 | FALSE | T100
1 | pqr | 2 | T2 | TMAX | TRUE | T101
2 | mno | 3 | T2 | TMAX | TRUE | T103

At the source, new records are added:

Id(PK) | COL1 | COL2 | Timestamp | Type
--- | --- | --- |-----------| ---
1 | def |1 | T3 | Upserted
1 | ghi | 1 | T4 | Upserted

The replace batch file will be as follows:

Id(PK) | COL1 | COL2 | _fivetran_start(PK) | _fivetran_end | _fivetran_active | _fivetran_earliest | _fivetran_synced
--- |------|-------|---------------------| --- | --- | --- | ---
1 | def | 1 | T3 | T4-1 | FALSE | TRUE | T104
1 | ghi | 1| T4 | TMAX | TRUE | FALSE | T105


The final destination table will be as follows:

Id(PK) | COL1 | COL2 | _fivetran_start(PK) | _fivetran_end | _fivetran_active | _fivetran_synced
--- |---|--------|---------------------| --- |------------------| ---
1 | abc | 1 | T1 | T2-1 | FALSE | T100
1 | pqr | 2 | T2 | T3-1 | FALSE | T101
2 | mno | 3 | T2 | TMAX | TRUE | T103
1 | def | 1 | T3 | T4-1 | FALSE | T104
1 | ghi | 1 | T4 | TMAX | TRUE | T105

**Explanation:**
- We got new records for id = 1.
- We check for the corresponding earliest record (`_fivetran_earliest` = TRUE), and delete the existing records from the destination table where the `_fivetran_start` column value is greater than or equal to the `_fivetran_start` column value of the matching rows in the batch file. In this example, no such records exist.
- We set the value of `_fivetran_end` of the active record in the destination table to `_fivetran_start`-1 of the `_fivetran_earliest` record of the batch file.
- We set the `_fivetran_active` column value for the above updated record to FALSE.
- We insert new records in the destination table _as is_, excluding the `_fivetran_earliest` column.
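
The steps above can be sketched in Python against the data of this example. The in-memory list of dictionaries stands in for the destination table, timestamps are modeled as epoch milliseconds, and `apply_replace`, `row` and `TMAX` are hypothetical names; a real destination would usually express the same steps as SQL statements.

```python
T1, T2, T3, T4 = 1000, 2000, 3000, 4000   # illustrative timestamps
TMAX = 253402300799999                    # assumed maximum timestamp

def row(id_, col1, col2, start, end, active, synced, earliest=None):
    r = {"id": id_, "col1": col1, "col2": col2, "_fivetran_start": start,
         "_fivetran_end": end, "_fivetran_active": active, "_fivetran_synced": synced}
    if earliest is not None:
        r["_fivetran_earliest"] = earliest
    return r

def apply_replace(dest, batch):
    """Apply a history-mode replace batch to a list of destination rows."""
    earliest = {r["id"]: r["_fivetran_start"] for r in batch if r["_fivetran_earliest"]}
    result = []
    for existing in dest:
        start = earliest.get(existing["id"])
        if start is not None and existing["_fivetran_start"] >= start:
            continue  # delete rows at or after the earliest batch version
        kept = dict(existing)
        if start is not None and kept["_fivetran_active"]:
            # Close out the previously active version.
            kept["_fivetran_end"] = start - 1
            kept["_fivetran_active"] = False
        result.append(kept)
    # Insert the batch rows as is, without the helper column.
    for r in batch:
        result.append({k: v for k, v in r.items() if k != "_fivetran_earliest"})
    return result

dest = [row(1, "abc", 1, T1, T2 - 1, False, 100),
        row(1, "pqr", 2, T2, TMAX, True, 101),
        row(2, "mno", 3, T2, TMAX, True, 103)]
batch = [row(1, "def", 1, T3, T4 - 1, False, 104, earliest=True),
         row(1, "ghi", 1, T4, TMAX, True, 105, earliest=False)]

final = apply_replace(dest, batch)
```

The record with id = 2 is untouched because no batch row shares its primary key.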

#### Example 2

This example describes a case where the `_fivetran_start` column value of the destination row is greater than or equal to the `_fivetran_start` of the matching row in the batch file.

Suppose the existing table in the destination is as follows:

Id(PK) | COL1 | COL2 | _fivetran_start(PK) | _fivetran_end | _fivetran_active | _fivetran_synced
--- |---|--------|---------------------| --- |------------------| ---
1 | xyz | 4 | T1 | T3-1 | FALSE | T100
1 | abc | 1 | T3 | T4-1 | FALSE | T100
1 | pqr | 2 | T4 | TMAX | TRUE | T101
2 | mno | 3 | T4 | TMAX | TRUE | T103

At the source, new records are added:

Id(PK) | COL1 | COL2 | Timestamp | Type
--- | --- | --- | --- | ---
1 | ghi | 1 | T2 | Upserted



The replace batch file will be as follows:

Id(PK) | COL1 | COL2 | _fivetran_start(PK) | _fivetran_end | _fivetran_active | _fivetran_earliest | _fivetran_synced
--- | --- | --- | --- | --- | --- | --- | ---
1 | ghi | 1 | T2 | TMAX | TRUE | TRUE | T104

The final destination table will be as follows:

Id(PK) | COL1 | COL2 | _fivetran_start(PK) | _fivetran_end | _fivetran_active | _fivetran_synced
--- | --- | --- | --- | --- | --- | ---
1 | ghi | 1 | T2 | TMAX | TRUE | T104
1 | xyz | 4 | T1 | T3-1 | FALSE | T100
2 | mno | 3 | T4 | TMAX | TRUE | T103

**Explanation:**
- We got a new record for id = 1.
- We check for the corresponding earliest record (`_fivetran_earliest` = TRUE), and delete existing records from the destination table where the `_fivetran_start` of the destination row is greater than or equal to the `_fivetran_start` of the matching row in the batch file. In this example, two such records exist, so we delete the records with id = 1, `_fivetran_start` = T3 and id = 1, `_fivetran_start` = T4.
- `_fivetran_end` of the active record in the destination table is set to `_fivetran_start`-1 of the `_fivetran_earliest` record of the batch file.
- We set the `_fivetran_active` column value for the above updated record to FALSE.
- We insert new records _as is_ excluding the `_fivetran_earliest` column.
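
The delete step is what distinguishes this example from the previous one. The Python sketch below isolates it: given the `(Id, _fivetran_start)` pairs in the destination and the earliest batch version per key, it lists the destination rows that must be removed. The tuple representation and the `rows_to_delete` name are assumptions made for illustration.

```python
T1, T2, T3, T4 = 1000, 2000, 3000, 4000   # illustrative timestamps

# (Id, _fivetran_start) of every row currently in the destination table.
dest_keys = [(1, T1), (1, T3), (1, T4), (2, T4)]

# Earliest _fivetran_start per Id in the batch file (_fivetran_earliest = TRUE).
earliest_in_batch = {1: T2}

def rows_to_delete(dest_keys, earliest_in_batch):
    """Select destination rows whose start is >= the earliest batch start."""
    return [(id_, start) for id_, start in dest_keys
            if id_ in earliest_in_batch and start >= earliest_in_batch[id_]]

doomed = rows_to_delete(dest_keys, earliest_in_batch)
```

The row with id = 2 also starts at T4 but is kept, because the comparison only applies to rows whose primary key appears in the batch file.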

### Updates

Suppose the existing table in the destination is as follows:

Id(PK) | COL1 | COL2 | _fivetran_start(PK) | _fivetran_end | _fivetran_active | _fivetran_synced
--- | --- | --- | --- | --- | --- | ---
1 | abc | 1 | T1 | T2-1 | FALSE | T100
1 | pqr | 2 | T2 | TMAX | TRUE | T101
2 | mno | 3 | T2 | TMAX | TRUE | T103


At the source, the record with Id = 1 is updated:

Id(PK) | COL1 | Timestamp | Type
--- | --- | --- | ---
1 | xyz | T3 | Updated



Then the record with Id = 2 is updated:

Id(PK) | COL2 | Timestamp | Type
--- | --- | --- | ---
2 | 1000 | T4 | Updated

Lastly, the record with Id = 1 is updated again:

Id(PK) | COL1 | Timestamp | Type
--- | --- | --- | ---
1 | def | T5 | Updated



The update batch file will be as follows:


Id(PK) | COL1 | COL2 | _fivetran_start(PK) | _fivetran_end | _fivetran_active | _fivetran_earliest | _fivetran_synced
--- | --- | --- | --- | --- | --- | --- | ---
1 | xyz | | T3| T5-1 | FALSE | TRUE | T107
2 | | 1000 | T4 | TMAX | TRUE | TRUE | T108
1 | def | | T5 | TMAX | TRUE | FALSE | T109


The final destination table will be as follows:

Id(PK) | COL1 | COL2 | _fivetran_start(PK) | _fivetran_end | _fivetran_active | _fivetran_synced
--- | --- | --- | --- | --- | --- | ---
1 | abc | 1 | T1 | T2-1 | FALSE | T100
1 | pqr | 2 | T2 | T3-1 | FALSE | T101
2 | mno | 3 | T2 | T4-1 | FALSE | T103
1 | def | 2 | T5 | TMAX | TRUE | T109
1 | xyz | 2 | T3 | T5-1 | FALSE | T107
2 | mno | 1000 | T4 | TMAX | TRUE | T108



**Explanation:**
- In the batch file, we got records with id = 1 and id = 2.
- We set unmodified columns' values to the values of the active records. In this example, for id = 2, we didn’t get COL1 value, so we set COL1 to “mno” (COL1 value of the active record).
- We set `_fivetran_end` of the active record in the destination table to `_fivetran_start`-1 of the `_fivetran_earliest` record of the batch file.
- We set the `_fivetran_active` column value for the above updated record to FALSE.
- We copy all other columns from the batch file into the destination table _as is_, excluding the `_fivetran_earliest` column.
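
A Python sketch of these steps, run against the data of this example, might look as follows. It assumes, as in the example, that every batch row is newer than the destination's active record, so the delete step from the points above never fires and is omitted. Unmodified columns are modeled as `None`, and when a key has several versions in the batch, later versions are filled from the version just before them; these choices and the names `apply_update`, `row` and `UNMODIFIED` are illustrative assumptions.

```python
T1, T2, T3, T4, T5 = 1000, 2000, 3000, 4000, 5000   # illustrative timestamps
TMAX = 253402300799999                               # assumed maximum timestamp
UNMODIFIED = None   # stand-in for the unmodified marker in the batch file

def row(id_, col1, col2, start, end, active, synced, earliest=None):
    r = {"id": id_, "col1": col1, "col2": col2, "_fivetran_start": start,
         "_fivetran_end": end, "_fivetran_active": active, "_fivetran_synced": synced}
    if earliest is not None:
        r["_fivetran_earliest"] = earliest
    return r

def apply_update(dest, batch, data_cols=("col1", "col2")):
    result = [dict(r) for r in dest]
    for r in sorted(batch, key=lambda r: r["_fivetran_start"]):
        # Most recent version of this key seen so far (assumed to exist).
        latest = max((d for d in result if d["id"] == r["id"]),
                     key=lambda d: d["_fivetran_start"])
        if r["_fivetran_earliest"]:
            # Close out the active destination record.
            latest["_fivetran_end"] = r["_fivetran_start"] - 1
            latest["_fivetran_active"] = False
        new_row = {k: v for k, v in r.items() if k != "_fivetran_earliest"}
        for col in data_cols:
            if new_row[col] is UNMODIFIED:
                new_row[col] = latest[col]   # carry the previous value forward
        result.append(new_row)
    return result

dest = [row(1, "abc", 1, T1, T2 - 1, False, 100),
        row(1, "pqr", 2, T2, TMAX, True, 101),
        row(2, "mno", 3, T2, TMAX, True, 103)]
batch = [row(1, "xyz", UNMODIFIED, T3, T5 - 1, False, 107, earliest=True),
         row(2, UNMODIFIED, 1000, T4, TMAX, True, 108, earliest=True),
         row(1, "def", UNMODIFIED, T5, TMAX, True, 109, earliest=False)]

final = apply_update(dest, batch)
```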


### Deletes

Suppose the existing table in the destination is as follows:

Id(PK) | COL1 | COL2 | _fivetran_start(PK) | _fivetran_end | _fivetran_active | _fivetran_synced
--- | --- | --- | --- | --- | --- | ---
1 | abc | 1 | T1 | T2-1 | FALSE | T100
1 | pqr | 2 | T2 | TMAX | TRUE | T101
2 | mno | 3 | T2 | TMAX | TRUE | T103



At the source, a record is deleted:


Id(PK) | Timestamp | Type
--- | --- | ---
1 | T3 | Deleted


The delete batch file will be as follows:

Id(PK) | _fivetran_start(PK) | _fivetran_end | _fivetran_active | _fivetran_earliest | _fivetran_synced
--- | --- |---------------|------| --- | ---
1 | | T3-1 | | TRUE | T104


The final destination table will be as follows:

Id(PK) | COL1 | COL2 | _fivetran_start(PK) | _fivetran_end | _fivetran_active | _fivetran_synced
--- | --- | --- | --- | --- |------------------| ---
1 | abc | 1 | T1 | T2-1 | FALSE | T100
1 | pqr | 2 | T2 | T3-1 | FALSE | T101
2 | mno | 3 | T2 | TMAX | TRUE | T103

**Explanation:**
- We set the `_fivetran_active` column value to FALSE for the active record and set the `_fivetran_end` column value to `T3-1`.
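
In Python, the delete handling for this example can be sketched as below. The delete batch row only carries the primary key and the `_fivetran_end` value, so the sketch looks up the active destination row for each key and closes it out. The list-of-dictionaries table and the `apply_delete` name are assumptions made for illustration.

```python
T1, T2, T3 = 1000, 2000, 3000   # illustrative epoch-millisecond timestamps
TMAX = 253402300799999          # assumed maximum timestamp

def apply_delete(dest, batch):
    """Mark the active version of each deleted key as ended and inactive."""
    end_by_key = {r["id"]: r["_fivetran_end"] for r in batch}
    result = []
    for existing in dest:
        updated = dict(existing)
        if updated["id"] in end_by_key and updated["_fivetran_active"]:
            updated["_fivetran_end"] = end_by_key[updated["id"]]
            updated["_fivetran_active"] = False
        result.append(updated)
    return result

dest = [
    {"id": 1, "col1": "abc", "_fivetran_start": T1, "_fivetran_end": T2 - 1,
     "_fivetran_active": False},
    {"id": 1, "col1": "pqr", "_fivetran_start": T2, "_fivetran_end": TMAX,
     "_fivetran_active": True},
    {"id": 2, "col1": "mno", "_fivetran_start": T2, "_fivetran_end": TMAX,
     "_fivetran_active": True},
]
# The delete batch file row: only the key and the end timestamp are populated.
batch = [{"id": 1, "_fivetran_end": T3 - 1, "_fivetran_earliest": True}]

final = apply_delete(dest, batch)
```

No rows are removed: history mode keeps every version and only marks the last one as inactive.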

