Best way to split a Frame into two #126
Comments
As it happens, there is a branch with such functions (https://github.com/acowley/Frames/blob/chunks/src/Frames/InCore.hs). I was waiting to hear from another requester of that feature to see if these pieces worked, but haven't heard back. The main design question is whether you are streaming your data or already have it in memory. If streaming, we allocate a distinct block of memory for each chunk so that chunks can easily be serialized or garbage collected individually. If the data is already in memory, the chunks are offsets into a shared block of memory. |
I intend to use it for a train/test split for machine learning. I am
guessing the stream approach fits better, but I am not entirely sure. How
do you suggest I do it?
|
I would first use the in-memory split since the training algorithm will make multiple passes over the data. |
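For readers following along, here is a minimal sketch of what an in-memory split by offset could look like. It is not the code from the branch mentioned above. To keep it compilable on its own, it declares a local stand-in for the library's `Frame` type (a row count plus an index-to-row function); `splitFrameAt` and `frameToList` are hypothetical names made up for this illustration. With the real library you would import the type from Frames instead, assuming its constructor is exported.

```haskell
-- Local stand-in mirroring the shape of the library's Frame type:
-- a row count plus a function from row index to row.
data Frame r = Frame { frameLength :: Int, frameRow :: Int -> r }

-- Hypothetical helper: the first n rows go to the left frame, the rest
-- to the right. No rows are copied; the right frame only shifts the
-- index before delegating to the original lookup function.
splitFrameAt :: Int -> Frame r -> (Frame r, Frame r)
splitFrameAt n (Frame len row) =
  (Frame k row, Frame (len - k) (row . (+ k)))
  where
    -- Clamp n so that out-of-range requests yield an empty side.
    k = max 0 (min n len)

-- Hypothetical helper for inspecting a frame as a list of rows.
frameToList :: Frame r -> [r]
frameToList (Frame len row) = map row [0 .. len - 1]
```

Both halves share the storage behind the original `frameRow`, so the split itself costs the same regardless of frame size, and repeated passes over either half read from the same underlying data.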
I also tried shuffling the frame records, but I fear my implementation
won't be very efficient:
fmap (frameRow df) (shuffled [0 .. len - 1])
And then converting it to a frame. Is there another way of doing it?
|
This depends on how large your samples are. What you are doing there is not at all bad: you are lazily computing a list of integer array indices. I would start with that approach, too. |
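A sketch of the index-based idea, under some assumptions: the permutation is passed in as a plain list of row indices (for instance the output of whatever `shuffled` function you already have), and `Frame` is again a local stand-in with the same shape as the library type so the snippet compiles by itself. `permuteRows`, `trainTestSplit` and `frameToList` are hypothetical names, not part of Frames.

```haskell
import Data.Array (listArray, (!))

-- Local stand-in mirroring the shape of the library's Frame type.
data Frame r = Frame { frameLength :: Int, frameRow :: Int -> r }

-- Hypothetical helper: reorder rows according to a list of row indices.
-- The indices are stored in an array for constant-time lookup; the rows
-- themselves are only fetched when the resulting frame is indexed.
permuteRows :: [Int] -> Frame r -> Frame r
permuteRows idxs (Frame _ row) = Frame n (row . (table !))
  where
    n = length idxs
    table = listArray (0, n - 1) idxs

-- Hypothetical helper: permute, then give the first nTrain rows to the
-- training frame and the remainder to the test frame.
trainTestSplit :: Int -> [Int] -> Frame r -> (Frame r, Frame r)
trainTestSplit nTrain idxs df =
  (Frame k row, Frame (n - k) (row . (+ k)))
  where
    Frame n row = permuteRows idxs df
    k = max 0 (min nTrain n)

-- Hypothetical helper for inspecting a frame as a list of rows.
frameToList :: Frame r -> [r]
frameToList (Frame len row) = map row [0 .. len - 1]
```

Calling `trainTestSplit 3 [2, 0, 3, 1] df` on a four-row frame puts rows 2, 0 and 3 in the training frame and row 1 in the test frame. The only extra storage is the index array, so this stays cheap even when individual rows are large.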
So by applying toFrame, it doesn't immediately evaluate the whole thing?
|
|
Sorry for the long lapse in time. Just to make sure I understood it correctly, the most efficient way to deal with the dataset is to have it stored in a For example, my train/test split function is this:
So if I want to perform any later operation, is it better to convert those lists returned by |
What is the most efficient way to split a frame when given the number of elements one of the subframes should have?