Filtering w/o Pandas? #407
Measured with this function from StackOverflow (code below), the dataframe for a mixed data set is about twice the size of the data set itself (the data set in question is deduplicated ThermoML, so 704447 properties).
Code:
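The original snippet didn't survive here; the StackOverflow helpers for this are typically a variant of the following sketch (the name `mem_usage_mb` is illustrative, not the exact function used):

```python
import pandas as pd


def mem_usage_mb(df: pd.DataFrame) -> float:
    """Deep memory footprint of a DataFrame in mebibytes.

    deep=True forces pandas to count the Python objects behind
    object-dtype columns (e.g. strings), not just the pointers.
    """
    return df.memory_usage(deep=True).sum() / 1024 ** 2
```

The `deep=True` flag matters for data sets like ThermoML, where substance and source columns are stored as Python strings and dominate the footprint.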
I thought the disparity would be smaller with a one-property-type data set, but the dataframe number is suspiciously close to the one for all of ThermoML. There are only 2912 properties in this data set. Hmmmm.
Ref #506 (though I'm not caught up enough to understand this one)
Have you considered filtering data sets without converting to Pandas under the hood? It can be difficult to hold both in memory at the same time, especially since the dataframe is in wide format. This would also allow for progress bars. My intuition is that creating an index mask array from the properties directly, and then building a new data set from that mask, will be substantially faster than converting to and from pandas.
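The mask approach could look something like the following minimal sketch. The dict-based property records and field names are placeholders, not the real `PhysicalProperty` objects:

```python
import numpy as np

# Hypothetical stand-ins for physical property records; the real objects
# carry substances, uncertainties, sources, etc.
properties = [
    {"type": "Density", "temperature_K": 298.15},
    {"type": "EnthalpyOfMixing", "temperature_K": 310.0},
    {"type": "Density", "temperature_K": 320.0},
]

# Build a boolean mask directly from the property list -- no DataFrame
# round-trip, and a natural place to hook in a progress bar.
mask = np.fromiter(
    (p["type"] == "Density" and p["temperature_K"] < 300.0 for p in properties),
    dtype=bool,
    count=len(properties),
)

# A new data set is just the properties the mask keeps.
filtered = [p for p, keep in zip(properties, mask) if keep]
```

Because the mask is built in a single pass over the existing objects, peak memory stays at one copy of the data set plus one boolean per property, rather than the data set plus a wide-format dataframe.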
Edit: happy to implement this and benchmark it, but I'm not sure how many users are relying on using dataframes for filtering.
Edit: It's also kind of odd that the filters don't take units, given the emphasis OpenFF places on units elsewhere. This is clearly a consequence of the units the dataframe imposes, but the expected units aren't clear at all if you start the filtering with a dataset. e.g.
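The example this trails off into wasn't preserved; a hypothetical illustration of the ambiguity (this is not the actual filter API):

```python
def filter_by_temperature(properties, min_temperature, max_temperature):
    """Keep properties whose temperature lies within the given bounds.

    Are the bounds in kelvin or Celsius? Nothing in the signature says --
    the caller has to guess, or read the dataframe conversion internals.
    Unit-aware bounds (e.g. pint / openff-units Quantity objects) would
    make the contract explicit.
    """
    return [
        p for p in properties
        if min_temperature <= p["temperature"] <= max_temperature
    ]
```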