how does Vinyl fare on B2T2? #172

Open
shriram opened this issue Aug 12, 2024 · 1 comment

shriram commented Aug 12, 2024

Hi folks — this may not be of interest to you, but just popping this in here in case it is.

Vinyl looks like it could be a good fit for properly typing B2T2, a benchmark for typed tabular programming:

https://github.com/brownplt/b2t2/

We'd certainly be very curious to see the result if you're interested in showing how far you get on the benchmark. In turn, because it's an independently-defined benchmark, it may also help you make a case for the strength and flexibility of Vinyl's design (as opposed to a benchmark you design yourself). Finally, it would show that one can have a fully typed, and hence statically safe, solution to the kinds of programs people write in dynamic languages like Python and R.

acowley (Contributor) commented Aug 14, 2024

This is an excellent effort, @shriram, thank you for tagging this repo! I think this would be of relevance to Frames where we augment Vinyl with various functionality to aid with data frame work. That said, both this repo and that are fairly low traffic at the moment as I have been a poor maintainer of late due to time constraints. But see this issue for some reflections on API ergonomics.

To set expectations, Vinyl was an investigation into manipulating records more fluidly in Haskell. Frames was begun to demonstrate that one could have a workflow where Vinyl records are defined based on a CSV data file: field names are drawn from the CSV header row, and the column types are heuristically inferred from the data file itself. The idea is that the compiler then helps keep one-off data analysis programs in sync with the data.

In order to make the data analysis part more efficient, Frames will help with switching between a columnar representation (struct of arrays) and a more naive row-based array of structs representation. That said, the flexibility of filtering and grouping columns in a data frame in Frames does not compete with what is available in the R ecosystem in particular.
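The two representations mentioned above can be sketched in plain Haskell. This is only an illustration, not the Frames API: the types PersonRow and PersonCols and the functions toColumns, toRows, and meanAge are hypothetical names, and ordinary lists stand in for the packed arrays a real implementation would use.

```haskell
module Main where

-- Row-based "array of structs": one record value per row.
data PersonRow = PersonRow { rowName :: String, rowAge :: Int }
  deriving (Eq, Show)

-- Columnar "struct of arrays": one sequence per column.
-- (Lists are an assumption made for brevity.)
data PersonCols = PersonCols { colNames :: [String], colAges :: [Int] }
  deriving (Eq, Show)

-- Split a list of rows into one list per column.
toColumns :: [PersonRow] -> PersonCols
toColumns rows = PersonCols (map rowName rows) (map rowAge rows)

-- Zip the columns back together into rows.
toRows :: PersonCols -> [PersonRow]
toRows (PersonCols names ages) = zipWith PersonRow names ages

-- A column-wise aggregate only needs to traverse the one column it uses.
meanAge :: PersonCols -> Double
meanAge cols = fromIntegral (sum ages) / fromIntegral (length ages)
  where
    ages = colAges cols

main :: IO ()
main = do
  let rows = [PersonRow "Ada" 36, PersonRow "Grace" 45]
      cols = toColumns rows
  print (meanAge cols)        -- 40.5
  print (toRows cols == rows) -- True
```

The row form is convenient for per-record logic, while the columnar form suits aggregates over a single field, which is why being able to switch between them matters.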

I'm interested in seeing how B2T2 can help us gain some insight into which underlying algebraic operations the useful manipulations are built upon. For instance, we added melt to Frames, but, however we might dress it up, it was built to match what R did more than it was derived from any mathematical definition.

Setting aside limitations with type level programming in Haskell, an enduring impression I developed over the years of working with Frames is that it can be hard to avoid clunky types. For instance, beginning with a set of (String, Type) pairs feels like a comfortable way to think about a record or table (accepting String as an identifier). You might then give that set of fields a name as you would a table, e.g. Person. But now you manipulate the structure of the data by removing a column, say, and you end up with something written in Frames roughly as RDel Person Age, indicating a structure isomorphic to Person with its Age column removed. But seeing, much less writing, that type is very high friction. You might come up with a name for it, but naming is famously hard.
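To make that friction concrete, here is a minimal sketch using only GHC extensions and base, not the actual Vinyl or Frames definitions. The record type Rec, the type family Del, and the functions dropAge and getName are hypothetical stand-ins; Del plays the role that something like RDel plays above.

```haskell
{-# LANGUAGE DataKinds, GADTs, KindSignatures, TypeFamilies, TypeOperators #-}
module Main where

import Data.Kind (Type)
import GHC.TypeLits (Symbol)

-- A record indexed by a type-level list of (name, type) pairs.
data Rec (fs :: [(Symbol, Type)]) where
  RNil :: Rec '[]
  (:&) :: a -> Rec fs -> Rec ('(name, a) ': fs)
infixr 5 :&

-- Naming the set of fields, as you would name a table.
type Person = '[ '("name", String), '("age", Int), '("email", String) ]

-- Remove the first field with the given name (hypothetical helper).
type family Del (n :: Symbol) (fs :: [(Symbol, Type)]) :: [(Symbol, Type)] where
  Del n '[]             = '[]
  Del n ('(n, a) ': fs) = fs
  Del n ('(m, a) ': fs) = '(m, a) ': Del n fs

-- The result type has no name of its own; it is spelled as an
-- operation on Person, which is the friction described above.
dropAge :: Rec Person -> Rec (Del "age" Person)
dropAge (n :& _ :& e :& RNil) = n :& e :& RNil

-- Works on any record whose first field is "name".
getName :: Rec ('("name", String) ': fs) -> String
getName (n :& _) = n

alice :: Rec Person
alice = "Alice" :& 30 :& "alice@example.com" :& RNil

main :: IO ()
main = putStrLn (getName (dropAge alice))
```

The type checker reduces Del "age" Person to the two remaining fields, so the program is safe, but every signature that mentions the reduced table has to either repeat the Del expression or introduce yet another synonym.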

In data analysis scripts in R or Python, you can have similar challenges with variable naming, where you might load your data as people <- read.csv('my_data.csv') and then change the people variable to point to a table with the age column removed because you don't care about the old table anymore. If you want to retain a reference to the original data, maybe you name this new thing people2 or people_no_age, but you're rarely happy.

Much like how in Haskell we are able to avoid naming temporary values by leaning heavily on composition aided by the type checker, I found in looking at my own work with R, Python, and SQL, that I tended to either avoid naming the temporary thing by compressing things into a single expression, or I'd have a kind of linearity of naming where once data was consumed, the old name was free to refer to a value of a different type. I haven't gone back to try to demonstrate that kind of fluidity in Haskell, but I'd like to.
