
ignore unused fields from CSV file. #131

Open
Ankur-deDev opened this issue Feb 15, 2019 · 7 comments

Comments

@Ankur-deDev

Hi,

Let's imagine I am working with a tabular data file whose header might look like this: "field1,field2,field3".

I am only using "field1" and "field2". "field3" may or may not be there, and a new field "field4" might appear in the future. I do not have control over the number of columns; I just know that the fields I need are there.

Is there an easy way for me to define a type that contains only the fields I need and parse a CSV into a Frame, ignoring all the other unused fields or simply parsing them as text?
If possible, I would like to avoid needing an example CSV file for types at compile time.

@acowley
Owner

acowley commented Feb 15, 2019

The first part of your scenario is along the lines of rcast from vinyl, but the parsing part is a tricky fit. The issue is that the parsing code in Frames is specifically designed to match static types in the code with the data from a file, whereas in your situation you want something a bit less static as a parse result.

If it were possible, a good choice here would be to treat this as two pieces of code: the processing part, which just wants field1, field2, etc., and the parsing part, which is specialized to a particular data file at compile time. A glue part then loads some inferred row type from a CSV file and passes it to the processing code via rcast to drop the fields you don't care about.
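As a rough sketch of that glue (assuming Frames' tableTypes / readTable / inCoreAoS API; the file path is hypothetical, and the symbols in Needed must match the ones the inference actually produces for your header):

```haskell
{-# LANGUAGE DataKinds, TemplateHaskell #-}
import Frames
import Data.Text (Text)

-- Row type inferred at compile time from a sample file (hypothetical path):
tableTypes "Row" "data/sample.csv"

-- The subset of columns the processing code actually needs
-- (these symbols must match what the inference produced):
type Needed = '[ "field1" :-> Int, "field2" :-> Text ]

-- Glue: load rows at the full inferred type, then rcast the rest away.
loadNeeded :: IO (Frame (Record Needed))
loadNeeded = do
  rows <- inCoreAoS (readTable "data/sample.csv") :: IO (Frame Row)
  pure (fmap rcast rows)
```

The point of the split is that only loadNeeded (and the tableTypes splice) needs recompiling when the file format changes; any code consuming Frame (Record Needed) is untouched.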

But if what you want is to write a program that will handle arbitrary CSV files whose only constraint is that they contain the fields you want, then this becomes much more dynamic. Let's make sure we agree on the desired types before writing code. Frames has a ReadRec type class that assumes a list of Text pieces corresponds exactly to the wanted fields of a record. That does not hold in your situation, so we will need to use the Parseable class directly. What we want is to fit a function with a type like [Text] -> [Text] -> Record ts into the pipes part of the parsing code. The first [Text] is the list of column names found in the CSV file, while the second list is a row of separated values. We use the return type to figure out which column names we want to extract, find their indices in the list of column names, then call Frames.ColumnTypeable.parse' on each piece to build up the Record ts for each row. This gives streaming parsing and fits into the column-based storage pieces of Frames.
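The index-finding step, for instance, can be sketched with plain lists (using String here for brevity; the real code would use Text, and wantedIndices is a hypothetical name):

```haskell
import Data.List (elemIndex)
import Data.Maybe (mapMaybe)

-- Map each wanted column name to its position in the file's header.
-- Missing columns are silently dropped here; real code should report them.
wantedIndices :: [String]  -- header row of the data file
              -> [String]  -- column names the record type wants
              -> [Int]
wantedIndices header = mapMaybe (`elemIndex` header)

-- wantedIndices ["field1","field2","field3"] ["field1","field3"] == [0,2]
```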

To be very specific, what is needed is a tweak of this function so that it does not insist on the data file matching up with the wanted type. The Producer [Text] m () can come from Frames itself or the Frames.Dsv module. We would then read off the first line to get the column names present in the data file, then map extractRecord columnNames over the remaining rows of the Producer.

Does this all sound like it addresses your needs?

@Ankur-deDev
Author

Thanks for your quick answer. I think you summarized the issue very well, and as far as I understand, your conclusion is correct. Being able to parse the data is the important point.

I am in favor of the second, more dynamic, approach.
The reason is to avoid compiling a new executable at each format change or having to keep older versions of all executables to be able to run a backfill on historical files.

If you think the required changes are feasible, there is no rush; I can go ahead without them for now and further check that Frames is a good fit for my workflow. I am still very new to Frames and Haskell, and getting started is not as easy as with some other frameworks.

@acowley
Owner

acowley commented Feb 18, 2019

I think the README needs to provide better motivation. The original point of the library was to couple the program and data file so you get compile-time errors when your analysis code diverges from the realities of your data file. Once you have extracted richer type information, you can do things like make more efficient use of memory, too.

That focus on a particular trick -- the code matches the data -- is fine as far as it goes, but better supporting custom row parsing as you need is a great way to empower the programmer to deal with varied data. I'll put some time into it as soon as I can.

@Ankur-deDev
Author

Sounds great, I will add this project in my watch list. Thanks.

@Ankur-deDev
Author

I have been working with Frames in the past month.
This is really a great library, easy to work with and it is the only one I found to fulfill all my requirements.
The only problem I still have is the one mentioned above.

Here is a related, simple practical example I have faced:
Imagine a program compiled against a CSV containing an unused Double column like "1.234". It would work fine until, one day, a line contains an unexpected value like ".45", which does not parse as a Double, so the whole line is dropped.
This is not a big issue, since I can just override the types of all the columns I am not using to Text, but this workaround is tedious when there are many unused columns, and it does not fully answer the underlying need of adapting dynamically to the data (allowing for a different number of columns or changes in unused column names).
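The failure mode can be reproduced with GHC's standard Read instance for Double, which, like the inferred parser in the scenario above, rejects a value with no digit before the decimal point:

```haskell
import Text.Read (readMaybe)

-- "1.234" parses fine:
ok :: Maybe Double
ok = readMaybe "1.234"   -- Just 1.234

-- Haskell's floating-point syntax requires a digit before the decimal
-- point, so ".45" fails to parse and the whole row gets dropped:
bad :: Maybe Double
bad = readMaybe ".45"    -- Nothing
```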

So after using Frames for my workflow, I think the approach you detailed in your previous comment would make it able to deal with more varied data and solve all the issues I could face.

@Ankur-deDev
Author

Hi,

I tried to follow your instructions above and worked on a modified version of the parsing code.
This allows parsing only a subset of columns while ignoring the others.
Below is an attempt to do so; it is rough code.

The idea is to define a record type manually:

type MyRecord = '[ "FIELD1" :-> Int
                 , "FIELD3" :-> Double
                 , "FIELD4" :-> Text ]

There needs to be a bit of code to parse the data file header and get the position of each needed field:

-- Get the index of each element of the second list within the first list.
getFieldIndexes :: [Text] -> [Text] -> [Int]
getFieldIndexes hcols rcols = Foldl.fold (Foldl.premap get_idx Foldl.list) rcols
  where err_msg = Printf.printf "mismatch header[%s] record[%s]." (show hcols) (show rcols)
        get_val = maybe (error err_msg) id
        get_idx = get_val . (`List.elemIndex` hcols)

-- Parse the header line and get the indexes of the given fields.
parseHeader :: ParserOptions -> FilePath -> [Text] -> IO [Int]
parseHeader opts mypath mycols =
  -- withFile closes the handle even if reading fails.
  SIO.withFile mypath SIO.ReadMode $ \handle -> do
    -- Read the first line.
    -- We assume this is a header.
    header_line <- TIO.hGetLine handle

    -- Split the header into column names.
    let header_list = tokenizeRow opts header_line

    -- Return the record column indexes.
    return (getFieldIndexes header_list mycols)

This function is then fed the record's column names:
columnHeaders (Proxy::Proxy (Record rs))

The parsing 'readRec' function then needs a small change to accept an [Int] argument giving the positions of the needed fields, as obtained above:

readRecDyn (ih:it) dt =
  let val = dt !! ih
  in maybe (Compose (Left (Text.copy val)))
           (Compose . Right . Field)
           (parse' val)
     :& readRecDyn it dt

I did not look at performance, and the above code is unlikely to work in the general case, as well as unlikely to be the solution you had in mind.

But for my usage it is already helpful; hopefully this sort of feature can be integrated nicely into Frames!

@adamConnerSax
Contributor

Hi! I wonder if the work already in https://github.com/adamConnerSax/Frames-streamly would be of help? You can see some examples in: https://github.com/adamConnerSax/Frames-streamly/blob/master/test/FramesStreamlySpec.hs.

This addresses some versions of the issues you have mentioned here: skipping columns at load time as well as some enhanced value parsing.
