Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Issue 4] Incremental parsing for CSV with Headers #22

Draft
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

CristhianMotoche
Copy link
Contributor

@CristhianMotoche CristhianMotoche commented Oct 4, 2022

It solves part of #4. I'll add the changes 'incrementally' (pun intended).

Changes:

  • Add new module Data.Csv.Parser.Megaparsec.Incremental
  • Expose decodeHeader
  • Add tests for it

TODO:

  • Fix TODO
  • Document code
  • Address some edge cases

@CristhianMotoche CristhianMotoche marked this pull request as ready for review October 7, 2022 17:55
@CristhianMotoche
Copy link
Contributor Author

Hey @cptrodolfox After some thought, I see this doesn't really solve #4. I'm a bit consufed on the state of Megaparsec for incremental parsing. I see in the Changelog it's mentioned that:

Now state returned on failure is the exact state of parser at the moment when it failed, which makes incremental parsing feature much better and opens possibilities for features like “on-the-fly” recovering from parse errors.

This was for version 4.4.0 which was relesed on Feb 2016. However, I later noticed this issue on Megaparsec and it seems:

Megaparsec is not a streaming/incremental parsing library

Therefore, I assume we won't actually have an easy way to solve #4 at the moment.

I'll take a look at it later to be 100% sure if that is an impossible issue to fix for now.

@CristhianMotoche CristhianMotoche marked this pull request as draft October 9, 2022 15:02
@CristhianMotoche
Copy link
Contributor Author

Hey @cptrodolfox I think a possible solution would be what @mrkkrp suggested here:

You could perhaps write a parser for dealing with a single line and then glue everything together with a streaming library.

I think that could be a possible option. Nevertheless, I would prefer to implement it in a separated library (e.g. cassava-megaparsec-incremental) to avoid adding an extra library to cassava-megaparsec. What do you think?

@cptrodolfox
Copy link
Contributor

Hey @cptrodolfox I think a possible solution would be what @mrkkrp suggested here:

You could perhaps write a parser for dealing with a single line and then glue everything together with a streaming library.

I think that could be a possible option. Nevertheless, I would prefer to implement it in a separated library (e.g. cassava-megaparsec-incremental) to avoid adding an extra library to cassava-megaparsec. What do you think?

Hey @CristhianMotoche , sorry for the late reply. I think that a better approach is to see how does cassava implement incremental parsing. From what I can gather from cassava's source code it implements its own type for incremental parsing.

https://hackage.haskell.org/package/cassava-0.5.3.0/docs/src/Data.Csv.Incremental.html#Parser

@CristhianMotoche
Copy link
Contributor Author

Hey @cptrodolfox Sorry for the late reply as well.

Hey @CristhianMotoche , sorry for the late reply. I think that a better approach is to see how does cassava implement incremental parsing. From what I can gather from cassava's source code it implements its own type for incremental parsing.

That was my approach at first. cassava has the HeaderParser data type which has three constructors: FailH, PartialH and DoneH. It implements incremental parsing by pattern matching the results of attoparsec which include one constructor (called Partial in the IResult data type) to continue with the parsing.

Therefore, we cannot have an incremental parsing like the one of cassava since megaparsec doesn't provide that. Does it make sense?

@CristhianMotoche
Copy link
Contributor Author

Hey @cptrodolfox I've been trying to replicate something similar to decodeWithP from cassava but I don't think it will be possible since the only possible results of Text.Megaparsec.parse is either an error or a record result. I was trying to consume some input until a breakline but there could be breaklines in the middle of text (e.g. foo,bar,"hello\nworld",123). I'm running out of ideas to solve this in an incremental way. Please, let me know if you have something in mind.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants