Updated links to point to the main branch #181

Merged
merged 1 commit into from
Oct 22, 2023
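
Every hunk in this change is the same mechanical substitution: `tree/master` becomes `tree/main` in links to the repository. A minimal sketch of that rewrite follows, in case the same update is needed elsewhere. The helper name `retarget_links` and its parameters are hypothetical and are not part of this pull request; nothing here claims to be how the change was actually produced.

```python
import re


# Hypothetical helper (an assumption for illustration, not part of the PR):
# retarget GitHub "tree" links for one repository from an old branch to a new one.
def retarget_links(text, repo="acowley/Frames", old="master", new="main"):
    # Match only when the old branch name is a whole path segment
    # directly after /tree/ in a link to the given repository.
    pattern = re.compile(
        r"(https://github\.com/" + re.escape(repo) + r"/tree/)"
        + re.escape(old) + r"(?=/)"
    )
    # A function replacement avoids backslash handling in the new branch name.
    return pattern.sub(lambda m: m.group(1) + new, text)


before = "[demos](https://github.com/acowley/Frames/tree/master/demo)"
print(retarget_links(before))
```

Links to other repositories, and occurrences of the word `master` outside a `/tree/` path segment, are left untouched by this sketch.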
12 changes: 6 additions & 6 deletions README.md
@@ -17,7 +17,7 @@ For a running example, we will use variations of the [prestige.csv](http://vince

If you have a CSV data where the values of each column may be classified by a single type, and ideally you have a header row giving each column a name, you may simply want to avoid writing out the Haskell type corresponding to each row. `Frames` provides `TemplateHaskell` machinery to infer a Haskell type for each row of your data set, thus preventing the situation where your code quietly diverges from your data.

-We generate a collection of definitions generated by inspecting the data file at compile time (using `tableTypes`), then, at runtime, load that data into column-oriented storage in memory with a row-oriented interface (an **in-core** array of structures (AoS)). We're going to compute the average ratio of two columns, so we'll use the `foldl` library. Our fold will project the columns we want, and apply a function that divides one by the other after appropriate numeric type conversions. Here is the entirety of that [program](https://github.com/acowley/Frames/tree/master/test/UncurryFold.hs).
+We generate a collection of definitions generated by inspecting the data file at compile time (using `tableTypes`), then, at runtime, load that data into column-oriented storage in memory with a row-oriented interface (an **in-core** array of structures (AoS)). We're going to compute the average ratio of two columns, so we'll use the `foldl` library. Our fold will project the columns we want, and apply a function that divides one by the other after appropriate numeric type conversions. Here is the entirety of that [program](https://github.com/acowley/Frames/tree/main/test/UncurryFold.hs).

```haskell
{-# LANGUAGE DataKinds, FlexibleContexts, QuasiQuotes, TemplateHaskell, TypeApplications #-}
@@ -45,7 +45,7 @@ averageRatio = L.fold (L.premap (ratio . rcast) avg) <$> loadRows

### Missing Header Row

-Now consider a case where our data file lacks a header row (I deleted the first row from \`prestige.csv\`). We will provide our own name for the generated row type, our own column names, and, for the sake of demonstration, we will also specify a prefix to be added to every column-based identifier (particularly useful if the column names **do** come from a header row, and you want to work with multiple CSV files some of whose column names coincide). We customize behavior by updating whichever fields of the record produced by `rowGen` we care to change, passing the result to `tableTypes'`. [Link to code.](https://github.com/acowley/Frames/tree/master/test/UncurryFoldNoHeader.hs)
+Now consider a case where our data file lacks a header row (I deleted the first row from \`prestige.csv\`). We will provide our own name for the generated row type, our own column names, and, for the sake of demonstration, we will also specify a prefix to be added to every column-based identifier (particularly useful if the column names **do** come from a header row, and you want to work with multiple CSV files some of whose column names coincide). We customize behavior by updating whichever fields of the record produced by `rowGen` we care to change, passing the result to `tableTypes'`. [Link to code.](https://github.com/acowley/Frames/tree/main/test/UncurryFoldNoHeader.hs)

```haskell
{-# LANGUAGE DataKinds, FlexibleContexts, QuasiQuotes, TemplateHaskell, TypeApplications #-}
@@ -84,7 +84,7 @@ Sometimes not every row has a value for every column. I went ahead and blanked t

"athletes",11.44,8206,8.13,,3373,NA

-We can no longer parse a `Double` for that row, so we will work with row types parameterized by a `Maybe` type constructor. We are substantially filtering our data, so we will perform this operation in a streaming fashion without ever loading the entire table into memory. Our process will be to check if the `prestige` column was parsed, only keeping those rows for which it was not, then project the `income` column from those rows, and finally throw away `Nothing` elements. [Link to code](https://github.com/acowley/Frames/tree/master/test/UncurryFoldPartialData.hs).
+We can no longer parse a `Double` for that row, so we will work with row types parameterized by a `Maybe` type constructor. We are substantially filtering our data, so we will perform this operation in a streaming fashion without ever loading the entire table into memory. Our process will be to check if the `prestige` column was parsed, only keeping those rows for which it was not, then project the `income` column from those rows, and finally throw away `Nothing` elements. [Link to code](https://github.com/acowley/Frames/tree/main/test/UncurryFoldPartialData.hs).

```haskell
{-# LANGUAGE DataKinds, FlexibleContexts, QuasiQuotes, TemplateHaskell, TypeApplications, TypeOperators #-}
@@ -127,7 +127,7 @@ For comparison to working with data frames in other languages, see the [tutorial

## Demos

-There are various [demos](https://github.com/acowley/Frames/tree/master/demo) in the repository. Be sure to run the `getdata` build target to download the data files used by the demos! You can also download the data files manually and put them in a `data` directory in the directory from which you will be running the executables.
+There are various [demos](https://github.com/acowley/Frames/tree/main/demo) in the repository. Be sure to run the `getdata` build target to download the data files used by the demos! You can also download the data files manually and put them in a `data` directory in the directory from which you will be running the executables.


## Contribute
@@ -146,9 +146,9 @@ To get just ghc and cabal in your shell, a simple `nix develop` will do.

## Benchmarks

-The [benchmark](https://github.com/acowley/Frames/tree/master/benchmarks/InsuranceBench.hs) shows several ways of dealing with data when you want to perform multiple traversals.
+The [benchmark](https://github.com/acowley/Frames/tree/main/benchmarks/InsuranceBench.hs) shows several ways of dealing with data when you want to perform multiple traversals.

-Another [demo](https://github.com/acowley/Frames/tree/master/benchmarks/BenchDemo.hs) shows how to fuse multiple passes into one so that the full data set is never resident in memory. A [Pandas version](https://github.com/acowley/Frames/tree/master/benchmarks/panda.py) of a similar program is also provided for comparison.
+Another [demo](https://github.com/acowley/Frames/tree/main/benchmarks/BenchDemo.hs) shows how to fuse multiple passes into one so that the full data set is never resident in memory. A [Pandas version](https://github.com/acowley/Frames/tree/main/benchmarks/panda.py) of a similar program is also provided for comparison.

This is a trivial program, but shows that performance is comparable to Pandas, and the memory savings of a compiled program are substantial.

14 changes: 7 additions & 7 deletions README.org
@@ -21,12 +21,12 @@ For a running example, we will use variations of the
*** Clean Data
If you have a CSV data where the values of each column may be classified by a single type, and ideally you have a header row giving each column a name, you may simply want to avoid writing out the Haskell type corresponding to each row. =Frames= provides =TemplateHaskell= machinery to infer a Haskell type for each row of your data set, thus preventing the situation where your code quietly diverges from your data.

-We generate a collection of definitions generated by inspecting the data file at compile time (using ~tableTypes~), then, at runtime, load that data into column-oriented storage in memory with a row-oriented interface (an *in-core* array of structures (AoS)). We're going to compute the average ratio of two columns, so we'll use the =foldl= library. Our fold will project the columns we want, and apply a function that divides one by the other after appropriate numeric type conversions. Here is the entirety of that [[https://github.com/acowley/Frames/tree/master/test/UncurryFold.hs][program]].
+We generate a collection of definitions generated by inspecting the data file at compile time (using ~tableTypes~), then, at runtime, load that data into column-oriented storage in memory with a row-oriented interface (an *in-core* array of structures (AoS)). We're going to compute the average ratio of two columns, so we'll use the =foldl= library. Our fold will project the columns we want, and apply a function that divides one by the other after appropriate numeric type conversions. Here is the entirety of that [[https://github.com/acowley/Frames/tree/main/test/UncurryFold.hs][program]].

#+INCLUDE: "test/UncurryFold.hs" src haskell

*** Missing Header Row
-Now consider a case where our data file lacks a header row (I deleted the first row from `prestige.csv`). We will provide our own name for the generated row type, our own column names, and, for the sake of demonstration, we will also specify a prefix to be added to every column-based identifier (particularly useful if the column names *do* come from a header row, and you want to work with multiple CSV files some of whose column names coincide). We customize behavior by updating whichever fields of the record produced by ~rowGen~ we care to change, passing the result to ~tableTypes'~. [[https://github.com/acowley/Frames/tree/master/test/UncurryFoldNoHeader.hs][Link to code.]]
+Now consider a case where our data file lacks a header row (I deleted the first row from `prestige.csv`). We will provide our own name for the generated row type, our own column names, and, for the sake of demonstration, we will also specify a prefix to be added to every column-based identifier (particularly useful if the column names *do* come from a header row, and you want to work with multiple CSV files some of whose column names coincide). We customize behavior by updating whichever fields of the record produced by ~rowGen~ we care to change, passing the result to ~tableTypes'~. [[https://github.com/acowley/Frames/tree/main/test/UncurryFoldNoHeader.hs][Link to code.]]

#+INCLUDE: "test/UncurryFoldNoHeader.hs" src haskell

@@ -37,7 +37,7 @@ Sometimes not every row has a value for every column. I went ahead and blanked t
"athletes",11.44,8206,8.13,,3373,NA
#+END_EXAMPLE

-We can no longer parse a ~Double~ for that row, so we will work with row types parameterized by a ~Maybe~ type constructor. We are substantially filtering our data, so we will perform this operation in a streaming fashion without ever loading the entire table into memory. Our process will be to check if the =prestige= column was parsed, only keeping those rows for which it was not, then project the =income= column from those rows, and finally throw away ~Nothing~ elements. [[https://github.com/acowley/Frames/tree/master/test/UncurryFoldPartialData.hs][Link to code]].
+We can no longer parse a ~Double~ for that row, so we will work with row types parameterized by a ~Maybe~ type constructor. We are substantially filtering our data, so we will perform this operation in a streaming fashion without ever loading the entire table into memory. Our process will be to check if the =prestige= column was parsed, only keeping those rows for which it was not, then project the =income= column from those rows, and finally throw away ~Nothing~ elements. [[https://github.com/acowley/Frames/tree/main/test/UncurryFoldPartialData.hs][Link to code]].

#+INCLUDE: "test/UncurryFoldPartialData.hs" src haskell

@@ -47,15 +47,15 @@ For comparison to working with data frames in other languages, see the

** Demos
There are various
-[[https://github.com/acowley/Frames/tree/master/demo][demos]] in the repository. Be sure to run the =getdata= build target to download the data files used by the demos! You can also download the data files manually and put them in a =data= directory in the directory from which you will be running the executables.
+[[https://github.com/acowley/Frames/tree/main/demo][demos]] in the repository. Be sure to run the =getdata= build target to download the data files used by the demos! You can also download the data files manually and put them in a =data= directory in the directory from which you will be running the executables.

** Benchmarks
-The [[https://github.com/acowley/Frames/tree/master/benchmarks/InsuranceBench.hs][benchmark]] shows several ways of
+The [[https://github.com/acowley/Frames/tree/main/benchmarks/InsuranceBench.hs][benchmark]] shows several ways of
dealing with data when you want to perform multiple traversals.

-Another [[https://github.com/acowley/Frames/tree/master/benchmarks/BenchDemo.hs][demo]] shows how to fuse multiple
+Another [[https://github.com/acowley/Frames/tree/main/benchmarks/BenchDemo.hs][demo]] shows how to fuse multiple
passes into one so that the full data set is never resident in
-memory. A [[https://github.com/acowley/Frames/tree/master/benchmarks/panda.py][Pandas version]] of a similar program
+memory. A [[https://github.com/acowley/Frames/tree/main/benchmarks/panda.py][Pandas version]] of a similar program
is also provided for comparison.

This is a trivial program, but shows that performance is comparable to