Data Layers #17

Closed
@c42f c42f commented Jun 1, 2021

Data layers allow data of different formats to be mapped into a program through a decoder and presented with a uniform API such that the main program logic can avoid dealing with data format decoding. Instead, the data format can be defined in the Data.toml.

A challenge here is dealing with world age issues that come up from dynamically `require`ing Julia packages. For now, we include a bit of judicious Base.invokelatest to make things "just work" in the REPL, but also warn the user that they should add a top-level import.
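
The world age problem can be reproduced without any packages. The following is a minimal sketch, not code from this patch: the functions `decode_direct` and `decode_latest` are made up for illustration, and defining a method at run time stands in for dynamically loading a package. A direct call from code that was already running fails, while the same call through Base.invokelatest succeeds.

```julia
# Stand-in for dynamically loading a package: define a brand-new method
# while a function is already running. All names here are illustrative.
function call_directly()
    f = @eval decode_direct(x) = x + 1   # @eval returns the new function object
    return f(1)                          # runs in the caller's older world age
end

function call_with_invokelatest()
    f = @eval decode_latest(x) = x + 1
    return Base.invokelatest(f, 1)       # dispatches in the latest world age
end

# The direct call throws a MethodError because the method is too new
# to be visible from the world age in which call_directly started.
direct_call_failed = try
    call_directly()
    false
catch err
    err isa MethodError
end

latest_result = call_with_invokelatest()
```

A top-level import avoids the problem entirely because the package's methods then exist before any user code starts running.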

With this patch and the Data.toml from the tests, we can open several tabular data formats, without the user needing to know much about the data storage.

Here's an example of loading data in .tsv, gzipped .csv and .arrow formats (any of which could then be converted to a DataFrame thanks to the Tables.jl interface):

julia> @! open(dataset("table_tsv"))
┌ Warning: The package CSV [336ed68f-0bac-5ca0-87d4-7b16caf5d00b] is required to load your dataset. DataSets will import this module for you, but this may not always work as
│ expected.
│ 
│ To silence this message, add import CSV at the top of your code somewhere.
└                                                            @ DataSets /home/chris/.julia/dev/DataSets/src/layers.jl:32
2-element CSV.File{false}:
 CSV.Row: (Name = "Aaron", Age = 23)
 CSV.Row: (Name = "Harry", Age = 42)

julia> @! open(dataset("table_gzip"))
┌ Warning: The package CodecZlib [944b1d66-785c-5afd-91f1-9de20f533193] is required to load your dataset. DataSets will import this module for you, but this may not always
│ work as expected.
│ 
│ To silence this message, add import CodecZlib at the top of your code somewhere.
└                                                            @ DataSets /home/chris/.julia/dev/DataSets/src/layers.jl:32
2-element CSV.File{false}:
 CSV.Row: (Name = "Aaron", Age = 23)
 CSV.Row: (Name = "Harry", Age = 42)

julia> @! open(dataset("table_arrow"))
┌ Warning: The package Arrow [69666777-d1a9-59fb-9406-91d4454c9d45] is required to load your dataset. DataSets will import this module for you, but this may not always work
│ as expected.
│ 
│ To silence this message, add import Arrow at the top of your code somewhere.
└                                                            @ DataSets /home/chris/.julia/dev/DataSets/src/layers.jl:32
Arrow.Table: (Name = ["Aaron", "Harry"], Age = [23, 42])

Excerpt from Data.toml, showing the configuration required for the system to understand these various formats:

[[datasets]]
description="Simple TSV example"
name="table_tsv"
uuid="efde65c3-a898-4ba9-97c1-45dba64b8d46"

    [datasets.storage]
    driver="FileSystem"
    type="Blob"
    path="@__DIR__/data/people.tsv"

    [[datasets.layers]]
    type = "csv"
    [datasets.layers.parameters]
        delim="\t"

[[datasets]]
description="Gzipped CSV example"
name="table_gzip"
uuid="2d126588-5f76-4e53-8245-87dc91625bf4"

    [datasets.storage]
    driver="FileSystem"
    type="Blob"
    path="@__DIR__/data/people.csv.gz"

    [[datasets.layers]]
    type = "gzip"

    [[datasets.layers]]
    type = "csv"

[[datasets]]
description="Arrow example"
name="table_arrow"
uuid="e964d100-fef2-45c4-85de-9d8e142f4084"

    [datasets.storage]
    driver="FileSystem"
    type="Blob"
    path="@__DIR__/data/people.arrow"

    [[datasets.layers]]
    type = "arrow"
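
To make the layer ordering concrete, here is a small sketch that parses the gzipped CSV entry with Julia's standard TOML library and lists its layers in file order. It only inspects the configuration and does not use DataSets.jl; the variable names are illustrative.

```julia
using TOML

config = TOML.parse("""
[[datasets]]
description="Gzipped CSV example"
name="table_gzip"
uuid="2d126588-5f76-4e53-8245-87dc91625bf4"

    [datasets.storage]
    driver="FileSystem"
    type="Blob"
    path="@__DIR__/data/people.csv.gz"

    [[datasets.layers]]
    type = "gzip"

    [[datasets.layers]]
    type = "csv"
""")

entry = only(config["datasets"])

# Layers form an array of tables, so their order in the file is preserved:
# the gzip layer comes first, then the csv layer.
layer_types = [layer["type"] for layer in entry["layers"]]
storage_driver = entry["storage"]["driver"]
```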

Beyond tabular data, here are some further examples of data that come encoded in many forms but that we'd like to treat through the same data loader API:

Byte streams:

  • raw
  • gzip
  • xz
  • zstd
  • ...

Images

  • png
  • jpeg
  • tiff
  • ...

Data trees

  • directories
  • zip
  • hdf5
  • ...

@StefanKarpinski
Member

I thought we'd discussed not using @! and making the context explicit instead.

@c42f
Contributor Author

c42f commented Jun 2, 2021

> I thought we'd discussed not using @! and making the context explicit instead.

Yes, but then we decided to use finalizers instead, where possible, and not expose the context to users at all. That's what was implemented in #12 for Blob and BlobTree (which needed to become mutable as a result).

You'll note that #12 contains no mention of ResourceContexts.jl in the documentation update.

Also, the above is a purely optional use of @!; explicit context passing is fine too:

ctx = ResourceContext()

data = open(ctx, dataset("table_tsv"))
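
For readers unfamiliar with the pattern, explicit context passing boils down to something like the sketch below. `SimpleContext`, `defer!` and `open_resource` are hypothetical stand-ins written for this illustration and are not the ResourceContexts.jl API. The idea is that the context records a cleanup action for each opened resource, so the returned value itself can be immutable.

```julia
# A toy context that collects cleanup actions and runs them in reverse order.
struct SimpleContext
    cleanups::Vector{Function}
end
SimpleContext() = SimpleContext(Function[])

# Register a cleanup action with the context.
defer!(f::Function, ctx::SimpleContext) = push!(ctx.cleanups, f)

function Base.close(ctx::SimpleContext)
    while !isempty(ctx.cleanups)
        pop!(ctx.cleanups)()   # last registered, first cleaned up
    end
end

events = String[]

# Hypothetical opener: the context owns the cleanup, not the returned value.
function open_resource(ctx::SimpleContext, name::AbstractString)
    push!(events, "open $name")
    defer!(() -> push!(events, "close $name"), ctx)
    return name
end

ctx = SimpleContext()
a = open_resource(ctx, "table_tsv")
b = open_resource(ctx, "table_arrow")
close(ctx)   # releases both resources deterministically
```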

@c42f
Contributor Author

c42f commented Jun 2, 2021

> That's what was implemented in #12 for Blob and BlobTree (which needed to become mutable as a result).

Of course, the issue with the finalizer approach is that it doesn't work with some third-party types such as CSV.File, which are immutable and can't have finalizers attached. Ideas?
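
The limitation is easy to see in isolation. In this sketch, `ImmutableTable` and `MutableHandle` are made-up stand-ins (for an immutable third-party type and for a type that was made mutable, respectively), and `can_attach_finalizer` is a helper written only for this example. Julia's `finalizer` raises an error when given an immutable object.

```julia
struct ImmutableTable          # stand-in for an immutable third-party type
    rows::Vector{Int}
end

mutable struct MutableHandle   # stand-in for a type that was made mutable
    rows::Vector{Int}
end

# Returns true if a finalizer could be attached, false if Julia refused.
function can_attach_finalizer(x)
    try
        finalizer(_ -> nothing, x)
        return true
    catch err
        err isa ErrorException || rethrow()
        return false
    end
end

mutable_ok = can_attach_finalizer(MutableHandle([1, 2]))
immutable_ok = can_attach_finalizer(ImmutableTable([1, 2]))
```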

c42f added 2 commits June 2, 2021 10:57
Data layers allow data of different formats to be mapped into a program
through a decoder and presented with a uniform API such that the main
program logic can avoid dealing with data format decoding. Instead, the
data format can be defined in the Data.toml.

A challenge here is dealing with world age issues which come up from
dynamically `require`ing Julia packages. For now, we include a bit of
judicious Base.invokelatest to make things "just work" in the REPL, but
also warn the user that they should add a top-level import.

Here are some examples of data that come encoded in many forms, but which
we'd like to treat through the same API:

Tabular data:

* csv
* gzip.csv
* tsv
* arrow
* parquet
* ...

Byte streams:

* raw
* gzip
* xz
* zstd
* ...

Images

* png
* jpeg
* tiff
* ...

Data trees

* directories
* zip
* hdf5
@c42f c42f force-pushed the cjf/data-layers branch from 0c7fc0b to 6cd6b46 June 2, 2021 00:58
@StefanKarpinski
Member

> Ideas?

Return a mutable wrapper object, perhaps? Either that or if the object is immutable, throw an error and require the caller to use the explicit context form (or the @! shorthand).

@c42f
Contributor Author

c42f commented Jun 9, 2021

> Return a mutable wrapper object, perhaps? Either that or if the object is immutable, throw an error and require the caller to use the explicit context form (or the @! shorthand).

Thanks, I think these are the options. I've been mulling it over but haven't come up with anything else yet.

With wrappers, there seem to be two alternatives:

  • Return something very generic like Ref{T}.
    • Pro: Works for all types
    • Con: Doesn't have a useful API; must be unwrapped to do anything. Quite clumsy and not similar to API for types which happen to be mutable and don't need wrapping.
    • Con: After unwrapping, users will want to drop the wrapper in which case their resources will be closed
  • Return a wrapper with the right API, for example a hypothetical WrappedTable for tabular data
    • Pro: User friendly
    • Con: Lots of wrappers to implement, doesn't easily scale to many disparate packages
    • Con: Correct API for wrappers may be unclear. In the extreme, just an exact duplicate of the wrapped object.

Altogether, wrappers don't seem very appealing. I'm inclined to just error and direct the user to the explicit context-based API for the generic code path.

As a hybrid, we could implement a few wrappers for APIs which are relatively well defined and commonly used, e.g., tables.
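
To illustrate why the generic option feels clumsy, here is a rough sketch of such a wrapper. `Managed` and `unwrap` are hypothetical names invented for this example, and a named tuple stands in for a table. The wrapper is mutable so it can carry a finalizer, but every use has to go through `unwrap`, and the resource is released once the wrapper itself is finalized.

```julia
# Hypothetical generic wrapper: mutable, so a finalizer can be attached.
mutable struct Managed{T}
    value::T
    function Managed(value::T, cleanup) where {T}
        w = new{T}(value)
        finalizer(_ -> cleanup(), w)   # cleanup runs when the wrapper is finalized
        return w
    end
end

unwrap(w::Managed) = w.value

closed = Ref(false)
table = (Name = ["Aaron", "Harry"], Age = [23, 42])   # an immutable named tuple
w = Managed(table, () -> (closed[] = true))

ages = unwrap(w).Age   # the wrapper has no table API of its own
finalize(w)            # force the finalizer to run now rather than waiting for GC
```

Note that the cleanup is tied to the lifetime of `w`, not of the unwrapped value, which is the second "Con" listed above.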

@StefanKarpinski
Member

Honestly it seems most appealing to me to just always require the context object. Once people learn to do this it will always work.

@layne-sadler

layne-sadler commented Aug 16, 2021

Hmm. So I could have a file that is encrypted + compressed, and layers would allow the program to peel this back and handle it on the fly? What other types of preprocessing could be layers? User-defined layers?

@c42f
Contributor Author

c42f commented Aug 17, 2021

> so I could have a file that is encrypted + compressed, and layers would allow the program to peel this back to handle that on the fly?

Yes, this should be possible. I think the interesting/tricky thing here is having a way to provide parameters to layers. In particular, how would we inject the decryption keys in a secure way? I suppose these are logically a property of the DataSet, but you also don't want to leave keys lying around in memory.

> what other types of preprocessing could be layers?

Anything that represents a linear pipeline of decoding stages could be represented. (Conversely, more general DAGs cannot be represented as cleanly; the whole DAG would have to be represented as a single non-composable layer.)
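
As a rough illustration of the "linear pipeline" idea, the sketch below applies a list of decoders as a left fold, where each stage consumes the previous stage's output. The decoder table and `decode_all` are invented for this example and use only the standard library; none of these are actual DataSets.jl layers.

```julia
using Base64

# Illustrative decoders keyed by layer type.
decoders = Dict{String,Function}(
    "base64" => bytes -> base64decode(String(copy(bytes))),
    "text"   => bytes -> String(copy(bytes)),
    "lines"  => text -> split(text, '\n'; keepempty=false),
)

# A linear pipeline is a left fold over the layer list.
decode_all(raw, layer_types) =
    foldl((data, layer) -> decoders[layer](data), layer_types; init=raw)

# Encode some sample bytes, then peel the layers back in order.
raw = Vector{UInt8}(base64encode("Aaron,23\nHarry,42\n"))
rows = decode_all(raw, ["base64", "text", "lines"])
```

A DAG with branching or merging stages has no such single ordering, which is why it would not fit this shape.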

> user-defined layers?

Yes, in this PR the user should be able to define their own layer by calling DataSets.register_layer! in their third-party module (probably as part of the module's __init__ function) and defining a method with the signature open(layer::DataLayer{:users_custom_tag}, blob::Blob).
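
The exact signature of DataSets.register_layer! isn't shown in this thread, so the following is only a self-contained mock of the general mechanism (register a tag, then add a method that dispatches on it). `ToyLayer`, `toy_registry`, `register_toy_layer!`, `make_layer` and `open_layer` are all hypothetical names and are not part of DataSets.jl.

```julia
# Toy stand-ins for the layer type and registry; not the PR's real definitions.
struct ToyLayer{Tag}
    parameters::Dict{String,Any}
end

toy_registry = Set{Symbol}()
register_toy_layer!(tag::Symbol) = push!(toy_registry, tag)

# Turn the `type` string from a config file into a tagged layer value.
function make_layer(type_name::AbstractString, parameters=Dict{String,Any}())
    tag = Symbol(type_name)
    tag in toy_registry || error("No layer registered for type \"$type_name\"")
    return ToyLayer{tag}(parameters)
end

# A "third-party" layer: register the tag, then add a method dispatching on it.
register_toy_layer!(:rot13)
function open_layer(::ToyLayer{:rot13}, text::AbstractString)
    rot(c) = 'a' <= c <= 'z' ? 'a' + (c - 'a' + 13) % 26 :
             'A' <= c <= 'Z' ? 'A' + (c - 'A' + 13) % 26 : c
    return map(rot, text)
end

decoded = open_layer(make_layer("rot13"), "Uryyb")
```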

@mortenpi
Member

I'll go ahead and close this PR, since I don't think we'll merge it. But the branch and discussion will stay around for future reference.

@mortenpi mortenpi closed this Nov 30, 2023