
ignore unused fields from CSV file. #131

Open
Ankur-deDev opened this issue Feb 15, 2019 · 7 comments

Comments

@Ankur-deDev

Hi,

Let's imagine I am working with a tabular data file whose header might look like this: "field1,field2,field3".

I am only using "field1" and "field2". "field3" may or may not be there, and a new field "field4" might appear in the future. I do not have control over the number of columns; I just know that the fields I need are there.

Is there an easy way for me to define a type that contains only the fields I need and parse a CSV into a Frame, ignoring all the other unused fields or simply parsing them as text?
If possible, I would like to avoid needing an example CSV file for types at compile time.

@acowley
Owner

acowley commented Feb 15, 2019

The first part of your scenario is along the lines of rcast from vinyl, but the parsing part is a tricky fit. The issue is that the parsing code in Frames is specifically designed to match static types in the code with the data from a file, whereas in your situation you want something a bit less static as a parse result.

If it were possible, a good choice here would be to treat this as two pieces of code: the processing part, which just wants field1, field2, etc., and the parsing part, which is specialized to a particular data file at compile time. A glue part then loads some inferred row type from a CSV file and passes it to the processing code via rcast to drop the fields you don't care about.
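As a rough sketch of that glue (assuming Frames' tableTypes / readTable / inCoreAoS API; the file path is hypothetical, and the symbols in Needed must match the ones the inference actually produces for your header):

```haskell
{-# LANGUAGE DataKinds, TemplateHaskell #-}
import Frames
import Data.Text (Text)

-- Row type inferred at compile time from a sample file (hypothetical path):
tableTypes "Row" "data/sample.csv"

-- The subset of columns the processing code actually needs
-- (these symbols must match what the inference produced):
type Needed = '[ "field1" :-> Int, "field2" :-> Text ]

-- Glue: load rows at the full inferred type, then rcast the rest away.
loadNeeded :: IO (Frame (Record Needed))
loadNeeded = do
  rows <- inCoreAoS (readTable "data/sample.csv") :: IO (Frame Row)
  pure (fmap rcast rows)
```

The point of the split is that only loadNeeded (and the tableTypes splice) needs recompiling when the file format changes; any code consuming Frame (Record Needed) is untouched.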

But if what you want is to write a program that will handle arbitrary CSV files whose only constraint is that they contain the fields you want, then this becomes much more dynamic. Let's make sure we agree on the desired types before writing code. Frames has a ReadRec type class that assumes a list of Text pieces corresponds exactly to the wanted fields of a record. That does not hold in your situation, so we will need to use the Parseable class directly. What we want is to fit a function with a type like [Text] -> [Text] -> Record ts into the pipes part of the parsing code. The first [Text] is the list of column names found in the CSV file, while the second list is a row of separated values. We use the return type to figure out which column names we want to extract, find their indices in the list of column names, then call Frames.ColumnTypeable.parse' on each piece to build up the Record ts for each row. This gives streaming parsing and fits into the column-based storage pieces of Frames.
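The index-finding step, for instance, can be sketched with plain lists (using String here for brevity; the real code would use Text, and wantedIndices is a hypothetical name):

```haskell
import Data.List (elemIndex)
import Data.Maybe (mapMaybe)

-- Map each wanted column name to its position in the file's header.
-- Missing columns are silently dropped here; real code should report them.
wantedIndices :: [String]  -- header row of the data file
              -> [String]  -- column names the record type wants
              -> [Int]
wantedIndices header = mapMaybe (`elemIndex` header)

-- wantedIndices ["field1","field2","field3"] ["field1","field3"] == [0,2]
```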

To be very specific, what is needed is a tweak of this function so that it does not insist on the data file matching up with the wanted type. The Producer [Text] m () can come from Frames itself or the Frames.Dsv module. We would then read off the first line to get the column names present in the data file, then map extractRecord columnNames over the remaining rows of the Producer.

Does this all sound like it addresses your needs?

@Ankur-deDev
Author

Thanks for your quick answer. I think you summarized the issue very well, and as far as I understand, your conclusion is correct. Being able to parse the data is the important point.

I am in favor of the second, more dynamic, approach.
The reason is to avoid compiling a new executable at each format change or having to keep older versions of all executables to be able to run a backfill on historical files.

If you think the required changes are feasible, there is no rush; I can go ahead without them for now and further check that Frames is a good fit for my workflow. I am still very new to Frames and Haskell, and getting started is not as easy as with some other frameworks.

@acowley
Owner

acowley commented Feb 18, 2019

I think the README needs to provide better motivation. The original point of the library was to couple the program and data file so you get compile-time errors when your analysis code diverges from the realities of your data file. Once you have extracted richer type information, you can do things like make more efficient use of memory, too.

That focus on a particular trick -- the code matches the data -- is fine as far as it goes, but better supporting custom row parsing as you need is a great way to empower the programmer to deal with varied data. I'll put some time into it as soon as I can.

@Ankur-deDev
Author

Sounds great, I will add this project in my watch list. Thanks.

@Ankur-deDev
Author

I have been working with Frames in the past month.
This is really a great library, easy to work with and it is the only one I found to fulfill all my requirements.
The only problem I still have is the one mentioned above.

Here is a related, simple practical example I have faced:
Imagine a program compiled against a CSV containing an unused Double column like "1.234". It would work fine until, one day, a line contains an unexpected value like ".45", which does not parse as a Double, so the whole line is dropped.
This is not a big issue, since I can just override the types of all the columns I am not using to Text, but this workaround is tedious when there are many unused columns, and it does not fully answer the underlying need of adapting dynamically to the data (allowing for a different number of columns or changes in unused column names).
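The failure mode can be reproduced with GHC's standard Read instance for Double, which, like the inferred parser in the scenario above, rejects a value with no digit before the decimal point:

```haskell
import Text.Read (readMaybe)

-- "1.234" parses fine:
ok :: Maybe Double
ok = readMaybe "1.234"   -- Just 1.234

-- Haskell's floating-point syntax requires a digit before the decimal
-- point, so ".45" fails to parse and the whole row gets dropped:
bad :: Maybe Double
bad = readMaybe ".45"    -- Nothing
```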

So after using Frames for my workflow, I think the approach you detailed in your previous comment would make it able to deal with more varied data and solve all the issues I could face.

@Ankur-deDev
Author

Hi,

I tried to follow your instructions above and worked on a modified version of the parsing code.
This allows parsing only a subset of columns while ignoring the others.
Below is an attempt to do so; it is rough code.

The idea is to define a record type manually:

type MyRecord = '[ "FIELD1" :-> Int
                 , "FIELD3" :-> Double
                 , "FIELD4" :-> Text ]

There needs to be a bit of code to parse the data file header and get the position of each needed field:

-- Get the index of each element of the second list within the first list.
getFieldIndexes :: [Text] -> [Text] -> [Int]
getFieldIndexes hcols rcols = Foldl.fold (Foldl.premap get_idx Foldl.list) rcols
  where err_msg = Printf.printf "mismatch header[%s] record[%s]." (show hcols) (show rcols)
        get_val = maybe (error err_msg) id
        get_idx = get_val . (`List.elemIndex` hcols)

-- Parse the header line and get the indexes of the given fields.
parseHeader :: ParserOptions -> FilePath -> [Text] -> IO [Int]
parseHeader opts mypath mycols =
  -- withFile closes the handle even if reading fails.
  SIO.withFile mypath SIO.ReadMode $ \handle -> do
    -- Read the first line.
    -- We assume this is a header.
    header_line <- TIO.hGetLine handle

    -- Split the header into column names.
    let header_list = tokenizeRow opts header_line

    -- Return the record column indexes.
    return (getFieldIndexes header_list mycols)

This function is then fed the record's column names:
columnHeaders (Proxy::Proxy (Record rs))

The parsing 'readRec' function then needs a small change to accept an [Int] argument giving the positions of the needed fields, as obtained above:

readRecDyn (ih:it) dt =
  let val = dt !! ih
  in maybe (Compose (Left (Text.copy val)))
           (Compose . Right . Field)
           (parse' val)
     :& readRecDyn it dt

I did not look at performance, and the above code is unlikely to work in the general case, as well as unlikely to be the solution you had in mind.

But for my usage it is already helpful; hopefully this sort of feature can be integrated nicely into Frames!

@adamConnerSax
Contributor

Hi! I wonder if the work already in https://github.com/adamConnerSax/Frames-streamly would be of help? You can see some examples in: https://github.com/adamConnerSax/Frames-streamly/blob/master/test/FramesStreamlySpec.hs.

This addresses some versions of the issues you have mentioned here: skipping columns at load time as well as some enhanced value parsing.
