Best way to split a Frame into two #126

Open
miguelfdag opened this issue Dec 26, 2018 · 8 comments

@miguelfdag

What is the most efficient way to split a frame when given the number of elements one of the subframes should have?

@acowley
Owner

acowley commented Dec 26, 2018

As it happens, there is a branch with such functions. I was waiting to hear from another requester of that feature to see if these pieces worked, but haven’t heard back.

The main design question is whether you are streaming your data or already have it in memory. If streaming, we allocate a distinct block of memory for each chunk so that chunks can easily be serialized or garbage collected individually. If the data is already in memory, the chunks are offsets into a shared block of memory.
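To make the in-memory case concrete, here is a minimal sketch of splitting a frame into two views that share the original rows. It is not the code from the branch mentioned above: `splitFrameAt` and `frameToList` are hypothetical names, and the local `Frame` type is a stand-in with the same two fields as the `Frame` record in Frames, so that the example compiles with base alone. With the real library you would import `Frame(..)` from `Frames` instead.

```haskell
-- Stand-in mirroring the Frame record from the Frames library (assumption:
-- a length plus an index-to-row function). Replace with `import Frames (Frame(..))`.
data Frame r = Frame { frameLength :: !Int, frameRow :: Int -> r }

-- | Split a frame so the first part has (at most) n rows.
-- Both halves reuse the original row function, so no rows are copied;
-- the second half only shifts the index by the size of the first.
splitFrameAt :: Int -> Frame r -> (Frame r, Frame r)
splitFrameAt n fr = (Frame k (frameRow fr), Frame (len - k) (frameRow fr . (+ k)))
  where
    len = frameLength fr
    k   = max 0 (min n len)  -- clamp n into [0, len]

-- | Helper for inspection: list the rows of a frame in index order.
frameToList :: Frame r -> [r]
frameToList fr = map (frameRow fr) [0 .. frameLength fr - 1]

main :: IO ()
main = do
  let fr     = Frame 5 (* 10) :: Frame Int
      (a, b) = splitFrameAt 2 fr
  print (frameToList a)  -- [0,10]
  print (frameToList b)  -- [20,30,40]
```

Because each half is only an offset into the same underlying data, this suits repeated passes over data already in memory. It does not give each chunk its own block of memory, which is what the streaming case calls for.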

@miguelfdag
Author

miguelfdag commented Dec 26, 2018 via email

@acowley
Owner

acowley commented Dec 26, 2018

I would first use the in-memory split since the training algorithm will make multiple passes over the data.

@miguelfdag
Author

miguelfdag commented Dec 26, 2018 via email

@acowley
Owner

acowley commented Dec 27, 2018

This depends on how large your samples are. What you are doing there is not at all bad: you are lazily computing a list of integer array indices. I would start with that approach, too.

@miguelfdag
Author

miguelfdag commented Dec 27, 2018 via email

@acowley
Owner

acowley commented Dec 27, 2018

toFrame does evaluate things, but your partial application of frameRow to the df value shares that data across the shuffled indices.
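The sharing described here can be sketched as a frame that is viewed through a table of row indices. This is an illustration under assumptions, not part of the Frames API: `reindex` is a hypothetical helper, the local `Frame` type is a stand-in with the same two fields as the record in Frames, and the index list is hard-coded where a real program would pass in shuffled indices.

```haskell
import Data.Array (Array, listArray, (!))

-- Stand-in mirroring the Frame record from the Frames library (assumption).
-- With the real library, use `import Frames (Frame(..))` instead.
data Frame r = Frame { frameLength :: !Int, frameRow :: Int -> r }

-- | View a frame through a list of row indices without copying any rows.
-- Only the index table is allocated; every row lookup goes back to the
-- original frame, so the row data is shared between the two frames.
reindex :: [Int] -> Frame r -> Frame r
reindex ixs fr = Frame n (\i -> frameRow fr (table ! i))
  where
    n     = length ixs
    table = listArray (0, n - 1) ixs :: Array Int Int

main :: IO ()
main = do
  let fr       = Frame 4 (* 10) :: Frame Int
      shuffled = reindex [2, 0, 3, 1] fr
  print (map (frameRow shuffled) [0 .. frameLength shuffled - 1])  -- [20,0,30,10]
```

The result is still a `Frame`, so frame-level operations remain available, while the cost of the shuffle is one array of `Int` indices rather than a second copy of the rows.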

@miguelfdag
Author

Sorry for the long lapse in time.

Just to make sure I understood correctly: the most efficient way to deal with the dataset is to store it in a Frame, but to treat it as a list when performing operations on it?

For example, my train/test split function is this:

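import Frames (Frame(..))
import System.Random (mkStdGen)
import System.Random.Shuffle (shuffle')

-- Rows of the frame in a pseudo-random order determined by the seed.
frShuffle :: Frame a -> Int -> [a]
frShuffle fr seed = fmap (frameRow fr) randList
  where
    len      = frameLength fr
    randList = shuffle' [0 .. len - 1] len (mkStdGen seed)

-- Split the shuffled rows into (train, test); ratio is the training fraction.
trainTestSplit :: Frame a -> Double -> Int -> ([a], [a])
trainTestSplit fr ratio seed = splitAt trainSize (frShuffle fr seed)
  where
    trainSize = floor (fromIntegral (frameLength fr) * ratio)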

So if I want to perform any later operation, is it better to convert the lists returned by trainTestSplit back into frames using toFrame, or should I simply use them as lists?
