Best way to split a Frame into two #126

miguelfdag · 2018-12-26T13:39:23Z

What is the most efficient way to split a frame when given the number of elements one of the subframes should have?

acowley · 2018-12-26T15:32:37Z

As it happens, there is a branch with such functions: https://github.com/acowley/Frames/blob/chunks/src/Frames/InCore.hs. I was waiting to hear from another requester of that feature to see if these pieces worked, but haven’t heard back.

The main design question is whether you are streaming your data or already have it in memory. If streaming, then we allocate distinct blocks of memory for each chunk so that they can easily be individually serialized or garbage collected. If you have the data in memory, the chunks are offsets into a shared block of memory.
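
To make the in-memory case concrete, here is a minimal sketch of a split whose two halves are offsets into the same underlying rows. It is not the code from the branch: `splitFrameAt` is a hypothetical name, and the `Frame` declaration is a local stand-in that mirrors the shape of the library's record (a length plus an indexing function) so the example runs without extra packages. With Frames installed, `import Frames (Frame(..))` would take its place.

```haskell
-- Local stand-in for the Frames library's Frame record (assumption:
-- same shape, a row count plus an index-to-row function).
data Frame r = Frame { frameLength :: !Int, frameRow :: Int -> r }

-- Hypothetical helper: split a frame after the first n rows.
-- Neither half copies any rows; the second half just shifts the index.
splitFrameAt :: Int -> Frame r -> (Frame r, Frame r)
splitFrameAt n (Frame len row) =
  (Frame k row, Frame (len - k) (row . (+ k)))
  where
    k = max 0 (min n len)  -- clamp n to the valid range

-- Collect all rows of a frame, for display only.
frameRows :: Frame r -> [r]
frameRows f = map (frameRow f) [0 .. frameLength f - 1]

main :: IO ()
main = do
  let fr     = Frame 5 (* 10) :: Frame Int  -- rows 0,10,20,30,40
      (a, b) = splitFrameAt 2 fr
  print (frameRows a)  -- [0,10]
  print (frameRows b)  -- [20,30,40]
```

Because both halves share the original indexing function, the split itself is constant time, which suits an algorithm that makes several passes over data already held in memory.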

miguelfdag · 2018-12-26T16:11:44Z

I intend to use it for a train/test split for machine learning. I am guessing the stream approach fits better, but I am not entirely sure. How do you suggest I do it?

acowley · 2018-12-26T17:03:13Z

I would first use the in-memory split since the training algorithm will make multiple passes over the data.

miguelfdag · 2018-12-26T19:39:46Z

I also tried shuffling the frame records, but I fear my implementation won't be very efficient: fmap (frameRow df) (shuffled [0..(len-1)]), and then converting the result to a frame. Is there another way of doing it?

acowley · 2018-12-27T03:08:32Z

This depends on how large your samples are. What you are doing there is not at all bad: you are lazily computing a list of integer array indices. I would start with that approach, too.

miguelfdag · 2018-12-27T09:06:08Z

So by applying toFrame, it doesn't immediately evaluate the whole thing?

acowley · 2018-12-27T15:27:39Z

toFrame does evaluate things, but your partial application of frameRow to the df value shares that data across the shuffled indices.

miguelfdag · 2019-02-05T16:18:18Z

Sorry for the long lapse in time.

Just to make sure I understood it correctly, the most efficient way to deal with the dataset is to have it stored in a Frame, but then, when performing operations with it, to use it as a list?

For example, my train/test split function is this:

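import Frames (Frame(..))
import System.Random (mkStdGen)
import System.Random.Shuffle (shuffle')

-- Rows of the frame in a pseudo-random order determined by the seed.
frShuffle :: Frame a -> Int -> [a]
frShuffle fr seed = fmap (frameRow fr) randList
  where
    randList = shuffle' [0 .. len - 1] len (mkStdGen seed)
    len      = frameLength fr

-- Split the shuffled rows into (train, test); ratio is the training fraction.
trainTestSplit :: Frame a -> Double -> Int -> ([a], [a])
trainTestSplit fr ratio seed = splitAt trainSize $ frShuffle fr seed
  where
    trainSize = floor $ fromIntegral (frameLength fr) * ratio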

So if I want to perform any later operation, is it better to convert the lists returned by trainTestSplit back into frames using toFrame, or should I simply use them as lists?
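
One option that sits between the two, offered only as a sketch: keep the result as a Frame by routing row lookups through a permutation of the indices, so no rows are copied and no intermediate list of rows is built. `permuteFrame` is a hypothetical helper, not part of Frames, and the `Frame` declaration below is a local stand-in mirroring the library's record (with Frames installed, `import Frames (Frame(..))` would replace it). The permutation is passed in, so it can come from `shuffle'` as in the code above; the example uses a fixed one.

```haskell
import Data.Array (Array, listArray, (!))

-- Local stand-in for the Frames library's Frame record (assumption:
-- same shape, a row count plus an index-to-row function).
data Frame r = Frame { frameLength :: !Int, frameRow :: Int -> r }

-- Hypothetical helper: view a frame through a list of row indices.
-- The indices go into an array for O(1) lookup; the rows themselves
-- are shared with the original frame.
permuteFrame :: [Int] -> Frame r -> Frame r
permuteFrame perm fr = Frame n (frameRow fr . (ixs !))
  where
    n   = length perm
    ixs = listArray (0, n - 1) perm :: Array Int Int

-- Collect all rows of a frame, for display only.
frameRows :: Frame r -> [r]
frameRows f = map (frameRow f) [0 .. frameLength f - 1]

main :: IO ()
main = do
  let fr       = Frame 5 (* 10) :: Frame Int  -- rows 0,10,20,30,40
      shuffled = permuteFrame [3, 0, 4, 1, 2] fr
  print (frameRows shuffled)  -- [30,0,40,10,20]
```

Every access to the permuted frame goes through one extra array lookup. Since toFrame evaluates its input, as noted earlier in the thread, materializing with toFrame may still be the better choice if the rows will be read many times.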
