Commit 9c374c7
Merge branch 'main' into recover_from_incomplete_multi_part_checkpoints
sebastiantia authored Jan 21, 2025
2 parents 6990a6d + 8494126
Showing 61 changed files with 1,119 additions and 691 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -6,6 +6,7 @@
 .idea/
 .vscode/
 .vim
+.zed

 # Rust
 .cargo/
10 changes: 5 additions & 5 deletions README.md
@@ -15,7 +15,7 @@ Delta-kernel-rs is split into a few different crates:
 - kernel: The actual core kernel crate
 - acceptance: Acceptance tests that validate correctness via the [Delta Acceptance Tests][dat]
 - derive-macros: A crate for our [derive-macros] to live in
-- ffi: Functionallity that enables delta-kernel-rs to be used from `C` or `C++` See the [ffi](ffi)
+- ffi: Functionality that enables delta-kernel-rs to be used from `C` or `C++` See the [ffi](ffi)
   directory for more information.

 ## Building
@@ -66,12 +66,12 @@ are still unstable. We therefore may break APIs within minor releases (that is,
 we will not break APIs in patch releases (`0.1.0` -> `0.1.1`).

 ## Arrow versioning
-If you enable the `default-engine` or `sync-engine` features, you get an implemenation of the
+If you enable the `default-engine` or `sync-engine` features, you get an implementation of the
 `Engine` trait that uses [Arrow] as its data format.

 The [`arrow crate`](https://docs.rs/arrow/latest/arrow/) tends to release new major versions rather
 quickly. To enable engines that already integrate arrow to also integrate kernel and not force them
-to track a specific version of arrow that kernel depends on, we take as broad dependecy on arrow
+to track a specific version of arrow that kernel depends on, we take as broad dependency on arrow
 versions as we can.

 This means you can force kernel to rely on the specific arrow version that your engine already uses,
@@ -96,7 +96,7 @@ arrow-schema = "53.0"
 parquet = "53.0"
 ```

-Note that unfortunatly patching in `cargo` requires that _exactly one_ version matches your
+Note that unfortunately patching in `cargo` requires that _exactly one_ version matches your
 specification. If only arrow "53.0.0" had been released the above will work, but if "53.0.1" where
 to be released, the specification will break and you will need to provide a more restrictive
 specification like `"=53.0.0"`.
@@ -111,7 +111,7 @@ and then checking what version of `object_store` it depends on.
 ## Documentation

 - [API Docs](https://docs.rs/delta_kernel/latest/delta_kernel/)
-- [arcitecture.md](doc/architecture.md) document describing the kernel architecture (currently wip)
+- [architecture.md](doc/architecture.md) document describing the kernel architecture (currently wip)

 ## Examples

2 changes: 1 addition & 1 deletion ffi/README.md
@@ -8,7 +8,7 @@ This crate provides a c foreign function internface (ffi) for delta-kernel-rs.
 You can build static and shared-libraries, as well as the include headers by simply running:

 ```sh
-cargo build [--release] [--features default-engine]
+cargo build [--release]
 ```

 This will place libraries in the root `target` dir (`../target/[debug,release]` from the directory containing this README), and headers in `../target/ffi-headers`. In that directory there will be a `delta_kernel_ffi.h` file, which is the C header, and a `delta_kernel_ffi.hpp` which is the C++ header.
4 changes: 2 additions & 2 deletions ffi/examples/read-table/README.md
@@ -10,9 +10,9 @@ This example is built with [cmake]. Instructions below assume you start in the d
 Note that prior to building these examples you must build `delta_kernel_ffi` (see [the FFI readme] for details). TLDR:
 ```bash
 # from repo root
-$ cargo build -p delta_kernel_ffi [--release] [--features default-engine, tracing]
+$ cargo build -p delta_kernel_ffi [--release] --features tracing
 # from ffi/ dir
-$ cargo build [--release] [--features default-engine, tracing]
+$ cargo build [--release] --features tracing
 ```

 There are two configurations that can currently be configured in cmake:
4 changes: 2 additions & 2 deletions ffi/examples/read-table/arrow.c
@@ -97,7 +97,7 @@ static GArrowRecordBatch* add_partition_columns(
   }

   GArrowArray* partition_col = garrow_array_builder_finish((GArrowArrayBuilder*)builder, &error);
-  if (report_g_error("Can't build string array for parition column", error)) {
+  if (report_g_error("Can't build string array for partition column", error)) {
     printf("Giving up on column %s\n", col);
     g_error_free(error);
     g_object_unref(builder);
@@ -144,7 +144,7 @@ static void add_batch_to_context(
   }
   record_batch = add_partition_columns(record_batch, partition_cols, partition_values);
   if (record_batch == NULL) {
-    printf("Failed to add parition columns, not adding batch\n");
+    printf("Failed to add partition columns, not adding batch\n");
     return;
   }
   context->batches = g_list_append(context->batches, record_batch);
2 changes: 1 addition & 1 deletion ffi/examples/read-table/read_table.c
@@ -43,7 +43,7 @@ void print_partition_info(struct EngineContext* context, const CStringMap* parti
 }

 // Kernel will call this function for each file that should be scanned. The arguments include enough
-// context to constuct the correct logical data from the physically read parquet
+// context to construct the correct logical data from the physically read parquet
 void scan_row_callback(
   void* engine_context,
   KernelStringSlice path,
2 changes: 1 addition & 1 deletion ffi/src/engine_funcs.rs
@@ -42,7 +42,7 @@ impl Drop for FileReadResultIterator {
     }
 }

-/// Call the engine back with the next `EngingeData` batch read by Parquet/Json handler. The
+/// Call the engine back with the next `EngineData` batch read by Parquet/Json handler. The
 /// _engine_ "owns" the data that is passed into the `engine_visitor`, since it is allocated by the
 /// `Engine` being used for log-replay. If the engine wants the kernel to free this data, it _must_
 /// call [`free_engine_data`] on it.
2 changes: 1 addition & 1 deletion ffi/src/expressions/kernel.rs
@@ -83,7 +83,7 @@ pub struct EngineExpressionVisitor {
     /// Visit a 64bit timestamp belonging to the list identified by `sibling_list_id`.
     /// The timestamp is microsecond precision with no timezone.
     pub visit_literal_timestamp_ntz: VisitLiteralFn<i64>,
-    /// Visit a 32bit intger `date` representing days since UNIX epoch 1970-01-01. The `date` belongs
+    /// Visit a 32bit integer `date` representing days since UNIX epoch 1970-01-01. The `date` belongs
     /// to the list identified by `sibling_list_id`.
     pub visit_literal_date: VisitLiteralFn<i32>,
     /// Visit binary data at the `buffer` with length `len` belonging to the list identified by
4 changes: 2 additions & 2 deletions ffi/src/handle.rs
@@ -2,8 +2,8 @@
 //! boundary.
 //!
 //! Creating a [`Handle<T>`] always implies some kind of ownership transfer. A mutable handle takes
-//! ownership of the object itself (analagous to [`Box<T>`]), while a non-mutable (shared) handle
-//! takes ownership of a shared reference to the object (analagous to [`std::sync::Arc<T>`]). Thus, a created
+//! ownership of the object itself (analogous to [`Box<T>`]), while a non-mutable (shared) handle
+//! takes ownership of a shared reference to the object (analogous to [`std::sync::Arc<T>`]). Thus, a created
 //! handle remains [valid][Handle#Validity], and its underlying object remains accessible, until the
 //! handle is explicitly dropped or consumed. Dropping a mutable handle always drops the underlying
 //! object as well; dropping a shared handle only drops the underlying object if the handle was the
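The `Handle<T>` ownership rules described in the doc comment above follow the `Box`/`Arc` analogy it cites. A plain-Rust illustration of that analogy, using only standard-library types (not the actual `ffi/src/handle.rs` API):

```rust
use std::sync::Arc;

fn main() {
    // Mutable-handle analogue: sole ownership; dropping it frees the object.
    let mutable_like: Box<Vec<u8>> = Box::new(vec![1, 2, 3]);
    drop(mutable_like); // underlying Vec is freed here

    // Shared-handle analogue: owns one reference; the object survives until
    // the last reference is dropped.
    let shared_like: Arc<Vec<u8>> = Arc::new(vec![4, 5, 6]);
    let second_ref = Arc::clone(&shared_like);
    drop(shared_like); // object still alive...
    assert_eq!(second_ref[0], 4); // ...and accessible via the remaining reference
}
```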
2 changes: 1 addition & 1 deletion ffi/src/scan.rs
@@ -383,7 +383,7 @@ struct ContextWrapper {
 /// data which provides the data handle and selection vector as each element in the iterator.
 ///
 /// # Safety
-/// engine is responsbile for passing a valid [`ExclusiveEngineData`] and selection vector.
+/// engine is responsible for passing a valid [`ExclusiveEngineData`] and selection vector.
 #[no_mangle]
 pub unsafe extern "C" fn visit_scan_data(
     data: Handle<ExclusiveEngineData>,
9 changes: 4 additions & 5 deletions ffi/src/test_ffi.rs
@@ -12,7 +12,7 @@ use delta_kernel::{
 /// output expression can be found in `ffi/tests/test_expression_visitor/expected.txt`.
 ///
 /// # Safety
-/// The caller is responsible for freeing the retured memory, either by calling
+/// The caller is responsible for freeing the returned memory, either by calling
 /// [`free_kernel_predicate`], or [`Handle::drop_handle`]
 #[no_mangle]
 pub unsafe extern "C" fn get_testing_kernel_expression() -> Handle<SharedExpression> {
@@ -25,18 +25,17 @@ pub unsafe extern "C" fn get_testing_kernel_expression() -> Handle<SharedExpress
     let array_data = ArrayData::new(array_type.clone(), vec![Scalar::Short(5), Scalar::Short(0)]);

     let nested_fields = vec![
-        StructField::new("a", DataType::INTEGER, false),
-        StructField::new("b", array_type, false),
+        StructField::not_null("a", DataType::INTEGER),
+        StructField::not_null("b", array_type),
     ];
     let nested_values = vec![Scalar::Integer(500), Scalar::Array(array_data.clone())];
     let nested_struct = StructData::try_new(nested_fields.clone(), nested_values).unwrap();
     let nested_struct_type = StructType::new(nested_fields);

     let top_level_struct = StructData::try_new(
-        vec![StructField::new(
+        vec![StructField::nullable(
             "top",
             DataType::Struct(Box::new(nested_struct_type)),
-            true,
         )],
         vec![Scalar::Struct(nested_struct)],
     )
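The `StructField` changes above are part of a wider migration in this merge from a trailing nullability boolean to named constructors. A minimal before/after sketch, assuming (per the diff) that `not_null` and `nullable` take just a field name and a `DataType`:

```rust
use delta_kernel::schema::{DataType, StructField};

fn main() {
    // Before: nullability passed as a trailing bool, easy to misread at the call site:
    //     StructField::new("a", DataType::INTEGER, false)
    // After: the constructor name carries the intent.
    let required = StructField::not_null("a", DataType::INTEGER);
    let optional = StructField::nullable("b", DataType::LONG);
    let _ = (required, optional);
}
```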
8 changes: 4 additions & 4 deletions integration-tests/src/main.rs
@@ -7,16 +7,16 @@ fn create_arrow_schema() -> arrow::datatypes::Schema {

 fn create_kernel_schema() -> delta_kernel::schema::Schema {
     use delta_kernel::schema::{DataType, Schema, StructField};
-    let field_a = StructField::new("a", DataType::LONG, false);
-    let field_b = StructField::new("b", DataType::BOOLEAN, false);
+    let field_a = StructField::not_null("a", DataType::LONG);
+    let field_b = StructField::not_null("b", DataType::BOOLEAN);
     Schema::new(vec![field_a, field_b])
 }

 fn main() {
     let arrow_schema = create_arrow_schema();
     let kernel_schema = create_kernel_schema();
-    let convereted: delta_kernel::schema::Schema =
+    let converted: delta_kernel::schema::Schema =
         delta_kernel::schema::Schema::try_from(&arrow_schema).expect("couldn't convert");
-    assert!(kernel_schema == convereted);
+    assert!(kernel_schema == converted);
     println!("Okay, made it");
 }
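The arrow side of this test is elided by the hunk above. For the equality assertion to pass, `create_arrow_schema` presumably mirrors the kernel schema field for field; a hypothetical body consistent with that (an assumption, not the file's actual contents):

```rust
use arrow::datatypes::{DataType, Field, Schema};

fn create_arrow_schema() -> Schema {
    // Mirrors the kernel schema above: non-nullable `a: LONG` and `b: BOOLEAN`
    // correspond to non-nullable Int64 and Boolean arrow fields.
    let field_a = Field::new("a", DataType::Int64, false);
    let field_b = Field::new("b", DataType::Boolean, false);
    Schema::new(vec![field_a, field_b])
}
```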
2 changes: 1 addition & 1 deletion kernel/examples/inspect-table/src/main.rs
@@ -184,7 +184,7 @@ fn print_scan_file(
 fn try_main() -> DeltaResult<()> {
     let cli = Cli::parse();

-    // build a table and get the lastest snapshot from it
+    // build a table and get the latest snapshot from it
     let table = Table::try_from_uri(&cli.path)?;

     let engine = DefaultEngine::try_new(
6 changes: 3 additions & 3 deletions kernel/examples/read-table-multi-threaded/src/main.rs
@@ -104,7 +104,7 @@ fn truncate_batch(batch: RecordBatch, rows: usize) -> RecordBatch {
     RecordBatch::try_new(batch.schema(), cols).unwrap()
 }

-// This is the callback that will be called fo each valid scan row
+// This is the callback that will be called for each valid scan row
 fn send_scan_file(
     scan_tx: &mut spmc::Sender<ScanFile>,
     path: &str,
@@ -125,7 +125,7 @@ fn send_scan_file(
 fn try_main() -> DeltaResult<()> {
     let cli = Cli::parse();

-    // build a table and get the lastest snapshot from it
+    // build a table and get the latest snapshot from it
     let table = Table::try_from_uri(&cli.path)?;
     println!("Reading {}", table.location());

@@ -279,7 +279,7 @@ fn do_work(

     // this example uses the parquet_handler from the engine, but an engine could
     // choose to use whatever method it might want to read a parquet file. The reader
-    // could, for example, fill in the parition columns, or apply deletion vectors. Here
+    // could, for example, fill in the partition columns, or apply deletion vectors. Here
     // we assume a more naive parquet reader and fix the data up after the fact.
     // further parallelism would also be possible here as we could read the parquet file
     // in chunks where each thread reads one chunk. The engine would need to ensure
2 changes: 1 addition & 1 deletion kernel/examples/read-table-single-threaded/src/main.rs
@@ -69,7 +69,7 @@ fn main() -> ExitCode {
 fn try_main() -> DeltaResult<()> {
     let cli = Cli::parse();

-    // build a table and get the lastest snapshot from it
+    // build a table and get the latest snapshot from it
     let table = Table::try_from_uri(&cli.path)?;
     println!("Reading {}", table.location());
