Improved memory efficiency in read_frame #96
Conversation
Also, be specific enough about units for np.dtype('datetime64...'). 'D' is for days and 'us' is for microseconds.
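A minimal illustration of those unit codes (the example dates are arbitrary):

```python
import numpy as np

# 'D' is day resolution; 'us' is microsecond resolution.
days = np.array(['2018-03-01', '2018-03-02'], dtype='datetime64[D]')
micros = days.astype('datetime64[us]')
print(days.dtype, micros.dtype)  # datetime64[D] datetime64[us]
```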
A closer examination of the test coverage report reveals that all the code I’ve added has 100% test coverage. Since I removed no tests, the decline in coverage is unrelated to this PR. That said, there is no test yet that the query set’s …
Fixed a crash when the query set was the result of qs.values_list()
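For context, the shape that used to crash looks something like this (`Order` is a hypothetical model in a configured Django project; `read_frame` is django-pandas’s public loader):

```python
from django_pandas.io import read_frame

# values_list() yields plain tuples instead of model instances,
# which is the case this fix has to handle.
qs = Order.objects.values_list('id', 'total')
df = read_frame(qs)
```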
@wkschwartz thanks for the PR 👍. I think you are on the right track here! We had a PR a few years ago to do it on the Django side of things, but your approach is definitely much better. … more detail, but I think we probably need to use some conditionals to keep the tests from failing on the older versions. I was a bit surprised that the coverage fell so much, as the tests that you added were reasonable. Once again, thanks for your hard work on this.
cfd45c8 adds the test I mentioned in my last comment for testing that … I had to replace one of the two usages of …
Looking at the coverage results reveals that nearly all the missed lines and branches are due to backward-compatibility blocks of code.
I feel that the … Edit: after a while I realized that letting pandas figure out the dtypes should indeed be the default.
I have done no testing to see whether the smaller dtypes are free. At least in theory, on some platforms, dtypes narrower than the platform’s pointer size (e.g., 64 bits on x86_64) are slower to process. I also don’t know how …
TO DO
I can't figure out how to test alignment-related padding space waste. Any thoughts?
It appears that loading GeoDjango without GDAL installed is failing on Django 1.11 but not on Django 1.10. I wonder if this is a regression in Django 1.11, or am I missing something?
If it doesn't load, then the user doesn't need it. On some versions of Django, trying to import GeoDjango without having GDAL installed is an error.
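A sketch of the guarded import this implies (which exception gets raised varies by Django version, so catching broadly is this sketch's assumption):

```python
try:
    from django.contrib.gis.db.models import GeometryField
except Exception:  # ImportError, or ImproperlyConfigured when GDAL is absent
    # GeoDjango isn't usable here, so geo fields simply get no dtype mapping.
    GeometryField = None
```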
@chrisdev I can't reproduce the Travis test failures locally. Any hints?
pd.isna was added in Pandas 0.21, but the Travis builds still use Pandas 0.20. np.isnan already works in these tests, so switching to it should fix the build failure.
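If supporting both pandas versions were ever needed, an alias shim is one option (a sketch; the fix actually applied here is just using np.isnan):

```python
import numpy as np

try:
    from pandas import isna  # pandas >= 0.21
except ImportError:
    from pandas import isnull as isna  # older pandas spelling

# Unlike np.isnan, isna also handles None and non-float objects.
print(isna(np.nan), isna(None))  # True True
```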
@wkschwartz sorry for taking so long to get back to you. Had some problems with my machine. Yes, Tox is sometimes flaky, but I've checked out your "memory" branch locally and it passes tests with Python 3.6, Django 2.0.2, and Pandas 0.20.1 (OK, there is a deprecation warning and an expected failure? but so far so good). For the older versions of Python I suggest we do something like this:

```python
try:
    from collections.abc import Mapping  # Python 3.3+
except ImportError:
    Mapping = dict  # fallback on older Pythons
```
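To show what the shim buys us, a self-contained check (assuming the mapping-typed `compress` argument from this PR is what gets type-tested):

```python
try:
    from collections.abc import Mapping  # Python 3.3+
except ImportError:
    Mapping = dict  # older-Python fallback, as suggested above

# Deciding whether `compress` is a {field-type: dtype} mapping or a boolean:
compress = {'SmallIntegerField': 'int16'}
print(isinstance(compress, Mapping))  # True
print(isinstance(True, Mapping))      # False
```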
Per request of @chrisdev
932852d and 28e9a97 take care of your request. The deprecation warning appears to be from … The expected failure is documented under "Known Issues" in the docstring for …
NumPy's fromiter lets us skip allocating an intermediate list when creating the NumPy ndarray. However, it only works if none of the dtypes are Python-object dtypes, numpy.dtype('O').
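A sketch of the difference for a non-object column (the generator and sizes are illustrative):

```python
import numpy as np

# No intermediate Python list is materialized; rows stream into the array.
col = np.fromiter((i % 128 for i in range(1_000_000)), dtype=np.int16)
print(col.nbytes)  # 2000000 (two bytes per row)

# By contrast, dtype=object raises ValueError on the NumPy versions
# contemporary with this PR, hence the restriction above.
```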
My guess about the failing test is that it has to do with running on Linux instead of macOS, or maybe on hardware that isn't x86_64. One of those things might make the alignment rules different and cause the …
@chrisdev Get a chance to look at this yet?
Very interested in this for some of my more memory-constrained projects.
Hey @wkschwartz, the project is removing Django<1.11 compat, …
1.5 years later and I am also hitting memory issues when constructing large dataframes with … @chrisdev Any chance this PR could get some attention? I'm using: …
@odoublewen I agree with you. But as I recall, the memory efficiency test failed on some platforms, so the PR … On the other hand, it was definitely more efficient on my local MBP on OS X. I was mystified; I think @wkschwartz had the same experience. Disappointing! But maybe it's time to try again…
Sorry for my slow response. I haven't used django-pandas in a couple of years now. Someone else should take the reins of this PR.
This PR is an attempt to abate severe memory pressure from `read_frame` when trying to load a very large table. It is a two-pronged approach.

1. Read the query set through its `iterator` method. This avoids populating the query set's cache. The typical use case of `read_frame`, I think, causes a query set to be read once and discarded. The cache then just slows things down and takes up memory.
2. Add a `compress` argument to `read_frame` in the spirit of Stata's `compress` command. This infers the NumPy data types for the data frame's columns from the Django fields being read. For example, `compress=True` avoids loading a `SmallIntegerField` database column into an `int64` data frame column, instead using NumPy's `numpy.int16` dtype, saving six bytes per row. These types of savings are most useful for integer types, but my implementation at least cursorily supports all the built-in Django fields. You can also override the defaults by passing a {django-field-type: numpy-dtype} mapping to `compress` (see the sketch after this list).
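A hedged sketch of that API (`Sale` is a hypothetical model, and whether the override mapping is keyed by field class or field name is this sketch's assumption):

```python
import numpy as np
from django.db import models
from django_pandas.io import read_frame

# Default inference: e.g. SmallIntegerField -> int16 instead of int64.
df = read_frame(Sale.objects.all(), compress=True)

# Override the defaults with a {django-field-type: numpy-dtype} mapping.
df = read_frame(
    Sale.objects.all(),
    compress={models.IntegerField: np.dtype('int32')},
)
```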
My colleague @shreyasravi may want to review this pull request for usefulness to our project, but I thought I'd offer it upstream too.
@chrisdev: If you like this PR, we can help you get it in shape for Python 3.6 and Django 2.0, after which you're free to edit it for backward compatibility (I'm giving you access to edit the PR). I just ask that you check with us before removing features or changing the API (both of which I am open to discussing) prior to merging, because even if you ultimately reject the PR, we still need the code and I'd like to point our environment setup code to this branch.
I'm leaving commits unsquashed for now. I'll squash them before merging if you want.