Commit

Updated links to point to the main branch
acowley committed Oct 22, 2023
1 parent 3d4601d commit ae34cd7
Showing 2 changed files with 13 additions and 13 deletions.
12 changes: 6 additions & 6 deletions README.md
@@ -17,7 +17,7 @@ For a running example, we will use variations of the [prestige.csv](http://vince

If you have a CSV data where the values of each column may be classified by a single type, and ideally you have a header row giving each column a name, you may simply want to avoid writing out the Haskell type corresponding to each row. `Frames` provides `TemplateHaskell` machinery to infer a Haskell type for each row of your data set, thus preventing the situation where your code quietly diverges from your data.

We generate a collection of definitions generated by inspecting the data file at compile time (using `tableTypes`), then, at runtime, load that data into column-oriented storage in memory with a row-oriented interface (an **in-core** array of structures (AoS)). We're going to compute the average ratio of two columns, so we'll use the `foldl` library. Our fold will project the columns we want, and apply a function that divides one by the other after appropriate numeric type conversions. Here is the entirety of that [program](https://github.com/acowley/Frames/tree/master/test/UncurryFold.hs).
We generate a collection of definitions generated by inspecting the data file at compile time (using `tableTypes`), then, at runtime, load that data into column-oriented storage in memory with a row-oriented interface (an **in-core** array of structures (AoS)). We're going to compute the average ratio of two columns, so we'll use the `foldl` library. Our fold will project the columns we want, and apply a function that divides one by the other after appropriate numeric type conversions. Here is the entirety of that [program](https://github.com/acowley/Frames/tree/main/test/UncurryFold.hs).

```haskell
{-# LANGUAGE DataKinds, FlexibleContexts, QuasiQuotes, TemplateHaskell, TypeApplications #-}
@@ -45,7 +45,7 @@ averageRatio = L.fold (L.premap (ratio . rcast) avg) <$> loadRows

### Missing Header Row

Now consider a case where our data file lacks a header row (I deleted the first row from \`prestige.csv\`). We will provide our own name for the generated row type, our own column names, and, for the sake of demonstration, we will also specify a prefix to be added to every column-based identifier (particularly useful if the column names **do** come from a header row, and you want to work with multiple CSV files some of whose column names coincide). We customize behavior by updating whichever fields of the record produced by `rowGen` we care to change, passing the result to `tableTypes'`. [Link to code.](https://github.com/acowley/Frames/tree/master/test/UncurryFoldNoHeader.hs)
Now consider a case where our data file lacks a header row (I deleted the first row from \`prestige.csv\`). We will provide our own name for the generated row type, our own column names, and, for the sake of demonstration, we will also specify a prefix to be added to every column-based identifier (particularly useful if the column names **do** come from a header row, and you want to work with multiple CSV files some of whose column names coincide). We customize behavior by updating whichever fields of the record produced by `rowGen` we care to change, passing the result to `tableTypes'`. [Link to code.](https://github.com/acowley/Frames/tree/main/test/UncurryFoldNoHeader.hs)

```haskell
{-# LANGUAGE DataKinds, FlexibleContexts, QuasiQuotes, TemplateHaskell, TypeApplications #-}
@@ -84,7 +84,7 @@ Sometimes not every row has a value for every column. I went ahead and blanked t

"athletes",11.44,8206,8.13,,3373,NA

We can no longer parse a `Double` for that row, so we will work with row types parameterized by a `Maybe` type constructor. We are substantially filtering our data, so we will perform this operation in a streaming fashion without ever loading the entire table into memory. Our process will be to check if the `prestige` column was parsed, only keeping those rows for which it was not, then project the `income` column from those rows, and finally throw away `Nothing` elements. [Link to code](https://github.com/acowley/Frames/tree/master/test/UncurryFoldPartialData.hs).
We can no longer parse a `Double` for that row, so we will work with row types parameterized by a `Maybe` type constructor. We are substantially filtering our data, so we will perform this operation in a streaming fashion without ever loading the entire table into memory. Our process will be to check if the `prestige` column was parsed, only keeping those rows for which it was not, then project the `income` column from those rows, and finally throw away `Nothing` elements. [Link to code](https://github.com/acowley/Frames/tree/main/test/UncurryFoldPartialData.hs).

```haskell
{-# LANGUAGE DataKinds, FlexibleContexts, QuasiQuotes, TemplateHaskell, TypeApplications, TypeOperators #-}
@@ -127,7 +127,7 @@ For comparison to working with data frames in other languages, see the [tutorial

## Demos

There are various [demos](https://github.com/acowley/Frames/tree/master/demo) in the repository. Be sure to run the `getdata` build target to download the data files used by the demos! You can also download the data files manually and put them in a `data` directory in the directory from which you will be running the executables.
There are various [demos](https://github.com/acowley/Frames/tree/main/demo) in the repository. Be sure to run the `getdata` build target to download the data files used by the demos! You can also download the data files manually and put them in a `data` directory in the directory from which you will be running the executables.


## Contribute
@@ -146,9 +146,9 @@ To get just ghc and cabal in your shell, a simple `nix develop` will do.

## Benchmarks

The [benchmark](https://github.com/acowley/Frames/tree/master/benchmarks/InsuranceBench.hs) shows several ways of dealing with data when you want to perform multiple traversals.
The [benchmark](https://github.com/acowley/Frames/tree/main/benchmarks/InsuranceBench.hs) shows several ways of dealing with data when you want to perform multiple traversals.

Another [demo](https://github.com/acowley/Frames/tree/master/benchmarks/BenchDemo.hs) shows how to fuse multiple passes into one so that the full data set is never resident in memory. A [Pandas version](https://github.com/acowley/Frames/tree/master/benchmarks/panda.py) of a similar program is also provided for comparison.
Another [demo](https://github.com/acowley/Frames/tree/main/benchmarks/BenchDemo.hs) shows how to fuse multiple passes into one so that the full data set is never resident in memory. A [Pandas version](https://github.com/acowley/Frames/tree/main/benchmarks/panda.py) of a similar program is also provided for comparison.

This is a trivial program, but shows that performance is comparable to Pandas, and the memory savings of a compiled program are substantial.

14 changes: 7 additions & 7 deletions README.org
@@ -21,12 +21,12 @@ For a running example, we will use variations of the
*** Clean Data
If you have a CSV data where the values of each column may be classified by a single type, and ideally you have a header row giving each column a name, you may simply want to avoid writing out the Haskell type corresponding to each row. =Frames= provides =TemplateHaskell= machinery to infer a Haskell type for each row of your data set, thus preventing the situation where your code quietly diverges from your data.

We generate a collection of definitions generated by inspecting the data file at compile time (using ~tableTypes~), then, at runtime, load that data into column-oriented storage in memory with a row-oriented interface (an *in-core* array of structures (AoS)). We're going to compute the average ratio of two columns, so we'll use the =foldl= library. Our fold will project the columns we want, and apply a function that divides one by the other after appropriate numeric type conversions. Here is the entirety of that [[https://github.com/acowley/Frames/tree/master/test/UncurryFold.hs][program]].
We generate a collection of definitions generated by inspecting the data file at compile time (using ~tableTypes~), then, at runtime, load that data into column-oriented storage in memory with a row-oriented interface (an *in-core* array of structures (AoS)). We're going to compute the average ratio of two columns, so we'll use the =foldl= library. Our fold will project the columns we want, and apply a function that divides one by the other after appropriate numeric type conversions. Here is the entirety of that [[https://github.com/acowley/Frames/tree/main/test/UncurryFold.hs][program]].

#+INCLUDE: "test/UncurryFold.hs" src haskell

*** Missing Header Row
Now consider a case where our data file lacks a header row (I deleted the first row from `prestige.csv`). We will provide our own name for the generated row type, our own column names, and, for the sake of demonstration, we will also specify a prefix to be added to every column-based identifier (particularly useful if the column names *do* come from a header row, and you want to work with multiple CSV files some of whose column names coincide). We customize behavior by updating whichever fields of the record produced by ~rowGen~ we care to change, passing the result to ~tableTypes'~. [[https://github.com/acowley/Frames/tree/master/test/UncurryFoldNoHeader.hs][Link to code.]]
Now consider a case where our data file lacks a header row (I deleted the first row from `prestige.csv`). We will provide our own name for the generated row type, our own column names, and, for the sake of demonstration, we will also specify a prefix to be added to every column-based identifier (particularly useful if the column names *do* come from a header row, and you want to work with multiple CSV files some of whose column names coincide). We customize behavior by updating whichever fields of the record produced by ~rowGen~ we care to change, passing the result to ~tableTypes'~. [[https://github.com/acowley/Frames/tree/main/test/UncurryFoldNoHeader.hs][Link to code.]]

#+INCLUDE: "test/UncurryFoldNoHeader.hs" src haskell

@@ -37,7 +37,7 @@ Sometimes not every row has a value for every column. I went ahead and blanked t
"athletes",11.44,8206,8.13,,3373,NA
#+END_EXAMPLE

We can no longer parse a ~Double~ for that row, so we will work with row types parameterized by a ~Maybe~ type constructor. We are substantially filtering our data, so we will perform this operation in a streaming fashion without ever loading the entire table into memory. Our process will be to check if the =prestige= column was parsed, only keeping those rows for which it was not, then project the =income= column from those rows, and finally throw away ~Nothing~ elements. [[https://github.com/acowley/Frames/tree/master/test/UncurryFoldPartialData.hs][Link to code]].
We can no longer parse a ~Double~ for that row, so we will work with row types parameterized by a ~Maybe~ type constructor. We are substantially filtering our data, so we will perform this operation in a streaming fashion without ever loading the entire table into memory. Our process will be to check if the =prestige= column was parsed, only keeping those rows for which it was not, then project the =income= column from those rows, and finally throw away ~Nothing~ elements. [[https://github.com/acowley/Frames/tree/main/test/UncurryFoldPartialData.hs][Link to code]].

#+INCLUDE: "test/UncurryFoldPartialData.hs" src haskell

@@ -47,15 +47,15 @@ For comparison to working with data frames in other languages, see the

** Demos
There are various
[[https://github.com/acowley/Frames/tree/master/demo][demos]] in the repository. Be sure to run the =getdata= build target to download the data files used by the demos! You can also download the data files manually and put them in a =data= directory in the directory from which you will be running the executables.
[[https://github.com/acowley/Frames/tree/main/demo][demos]] in the repository. Be sure to run the =getdata= build target to download the data files used by the demos! You can also download the data files manually and put them in a =data= directory in the directory from which you will be running the executables.

** Benchmarks
The [[https://github.com/acowley/Frames/tree/master/benchmarks/InsuranceBench.hs][benchmark]] shows several ways of
The [[https://github.com/acowley/Frames/tree/main/benchmarks/InsuranceBench.hs][benchmark]] shows several ways of
dealing with data when you want to perform multiple traversals.

Another [[https://github.com/acowley/Frames/tree/master/benchmarks/BenchDemo.hs][demo]] shows how to fuse multiple
Another [[https://github.com/acowley/Frames/tree/main/benchmarks/BenchDemo.hs][demo]] shows how to fuse multiple
passes into one so that the full data set is never resident in
memory. A [[https://github.com/acowley/Frames/tree/master/benchmarks/panda.py][Pandas version]] of a similar program
memory. A [[https://github.com/acowley/Frames/tree/main/benchmarks/panda.py][Pandas version]] of a similar program
is also provided for comparison.

This is a trivial program, but shows that performance is comparable to