Improved memory efficiency in read_frame #96

Open
wkschwartz wants to merge 20 commits into chrisdev:master from wkschwartz:memory
Conversation

wkschwartz

This PR is an attempt to abate severe memory pressure from read_frame when trying to load a very large table. It is a two-pronged approach.

  • Fix Memory-efficient iteration #63 by using QuerySet's iterator method. This avoids populating the query set's cache. The typical use case of read_frame, I think, causes a query set to be read once and discarded. The cache then just slows things down and takes up memory.
  • Add a compress argument to read_frame in the spirit of Stata's compress command. This infers the NumPy data types for the data frame's columns from the Django fields being read. For example, compress=True avoids loading a SmallIntegerField database column into an int64 data frame column, instead using NumPy's numpy.int16 dtype, saving six bytes per row. These savings matter most for integer types, but my implementation at least cursorily supports all the built-in Django fields. You can also override the defaults by passing a {django-field-type: numpy-dtype} mapping to compress. (A rough sketch of both ideas follows this list.)
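Very roughly, the two pieces look like the sketch below. The names, the default field-to-dtype table, and the helper functions are illustrative only, not the actual code in this branch:

    import numpy as np
    from django.db import models

    # Illustrative defaults: map narrow Django field types to narrow NumPy dtypes.
    DEFAULT_DTYPES = {
        models.SmallIntegerField: np.int16,
        models.IntegerField: np.int32,
        models.BigIntegerField: np.int64,
        models.BooleanField: np.bool_,
        models.FloatField: np.float64,
    }

    def infer_dtypes(qs, compress=True):
        """Pick a dtype per column from the model's fields, falling back to object.

        `compress` may be True (use the defaults) or a {field type: dtype}
        mapping that overrides them.
        """
        mapping = dict(DEFAULT_DTYPES)
        if isinstance(compress, dict):
            mapping.update(compress)
        return {f.name: mapping.get(type(f), np.dtype('O'))
                for f in qs.model._meta.concrete_fields}

    def iter_rows(qs):
        # .iterator() streams results without populating the QuerySet cache,
        # so each row can be garbage-collected as soon as it has been consumed.
        return qs.values_list().iterator()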

My colleague @shreyasravi may want to review this pull request for usefulness to our project, but I thought I'd offer it upstream too.

@chrisdev — If you like this PR, we can help you get it in shape for Python 3.6 and Django 2.0, after which you're free to edit it for backward compatibility (I'm giving you access to edit the PR). I just ask that you check with us before removing features or changing the API prior to merging (both of which I am open to discussing), because even if you ultimately reject the PR, we still need the code, and I'd like to point our environment setup code at this branch.

I'm leaving commits unsquashed for now. I'll squash them before merging if you want.

@coveralls

coveralls commented Mar 28, 2018

Coverage Status

Coverage decreased (-13.4%) to 80.258% when pulling 59319fc on wkschwartz:memory into 498e355 on chrisdev:master.

@wkschwartz
Author

A closer examination of the test coverage report reveals that all the code I’ve added has 100% test coverage. Since I removed no tests, the decline in coverage is unrelated to the PR.

That said, there is no test yet that the query set’s iterator method is used to avoid populating the cache. There are also no benchmarks to ensure that it actually takes less memory. I’ll try to add something for the former test. Open to suggestions on the latter.
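One possible shape for the former, sketched with a hypothetical test model (MyModel) and without claiming this is what I'll commit:

    from unittest import mock

    from django.db.models.query import QuerySet
    from django.test import TestCase

    from django_pandas.io import read_frame
    from .models import MyModel  # hypothetical test model


    class IteratorUsedTest(TestCase):
        def test_read_frame_uses_iterator(self):
            qs = MyModel.objects.all()
            calls = []
            real_iterator = QuerySet.iterator

            def spy(self, *args, **kwargs):
                # Record the call, then defer to the real iterator.
                calls.append(self)
                return real_iterator(self, *args, **kwargs)

            with mock.patch.object(QuerySet, 'iterator', new=spy):
                read_frame(qs)

            self.assertTrue(calls)  # iterator() was used at least once
            # The original query set's result cache should stay empty.
            self.assertIsNone(qs._result_cache)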

Fixed a crash when the query set was the result of qs.values_list()
@chrisdev
Owner

@wkschwartz thanks for the PR 👍. I think you are on the right track here! We had a PR a few years ago to do this on the Django side of things, but your approach is definitely much better. I still need to look at it in more detail, but I think we probably need some conditionals to keep the tests from failing on the older versions. I was a bit surprised that the coverage fell so much, as the tests you added were reasonable. Once again, thanks for your hard work on this.

@wkschwartz
Author

cfd45c8 adds the test I mentioned in my last comment for testing that iterator is used instead of populating the query set's cache. I also fixed a crash when the query set was the result of qs.values_list().

I had to replace one of the two usages of is_values_queryset. I'm not sure I understand the other one well enough, but I wonder if it suffers from the same problem as the one I replaced.

@wkschwartz
Author

Looking at the coverage results reveals that nearly all the missed lines and branches are due to backward compatibility blocks of code.

@heliomeiralins
Contributor

heliomeiralins commented Mar 28, 2018

I feel that the compress option should be the default. Who wouldn't want sane dtypes for free?

edit: after a while I realized that pandas figuring out the dtypes should indeed be the default.

@wkschwartz
Author

I have done no testing to see if the smaller dtypes are free. At least in theory, on some platforms, dtypes narrower than the platform's pointer size (e.g., 64 bits on x86_64) are slower to process. I also don't know how compress interacts with detecting dtypes from unknown field types.

@wkschwartz
Author

wkschwartz commented Mar 28, 2018

TO DO

  • If I understand correctly, NumPy can handle missing data for floats as NaNs, but doesn't have a clean way of dealing with it for integers. I'll add something to give users control over how to handle that, probably just casting integers to floats (int32 -> float64, int16 -> float32, and bool -> float16 involve no loss of precision); see the sketch after this list.
  • Test that names come out correctly when loading more complicated columns names such as after a JOIN.
  • Re-order columns so as not to waste space on alignment (sort columns from wide to narrow)
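For the first item, the widening could look roughly like this (an illustrative mapping and helper, not committed code):

    import numpy as np

    # Lossless widening: each float type's significand is wider than the integer
    # type it replaces, so no values are rounded and NaN becomes representable.
    WIDEN_FOR_NAN = {
        np.dtype(np.bool_): np.dtype(np.float16),   # 11-bit significand
        np.dtype(np.int16): np.dtype(np.float32),   # 24-bit significand
        np.dtype(np.int32): np.dtype(np.float64),   # 53-bit significand
    }

    def widen_for_nan(dtype):
        """Return a dtype that can hold the same values plus NaN."""
        dtype = np.dtype(dtype)
        if np.issubdtype(dtype, np.floating):
            return dtype  # floats already represent missing data as NaN
        # Fallback; note that float64 cannot hold every int64 exactly.
        return WIDEN_FOR_NAN.get(dtype, np.dtype(np.float64))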

@wkschwartz
Author

I can't figure out how to test alignment-related padding space waste. Any thoughts?

@wkschwartz
Author

It appears that loading GeoDjango without GDAL installed is failing on Django 1.11 but not on Django 1.10. I wonder if this is a regression in Django 1.11, or am I missing something?

If it doesn't load, then the user doesn't need it. On some versions of
Django, trying to import GeoDjango without having GDAL installed is an
error.
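A guard of roughly this shape matches what that commit message describes (illustrative; treating any import-time failure as "GeoDjango not available" is an assumption here):

    # Only map GeoDjango fields when GeoDjango can actually be imported.
    # On some Django versions this import raises (e.g. ImproperlyConfigured)
    # when GDAL is not installed, instead of failing lazily.
    try:
        from django.contrib.gis.db import models as geo_models
    except Exception:
        geo_models = None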
@wkschwartz
Author

@chrisdev I can't reproduce locally the failures of the tests in Travis. Any hints?

pd.isna was added in version 0.21, but the Travis builds still use
Pandas 0.20. np.isnan is already working in these tests, so that should
fix the build failure.
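If a single helper were wanted instead of switching call sites (the commit itself just uses np.isnan in the tests), one option would be:

    import numpy as np

    try:
        from pandas import isna  # pandas >= 0.21
    except ImportError:
        isna = np.isnan  # pandas 0.20: sufficient for the float arrays in these tests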
@chrisdev
Owner

@wkschwartz sorry for taking so long to get back to you. Had some problems with my machine. Yes, Tox is sometimes flaky, but I've checked out your "memory" branch locally and it passes tests with Python 3.6, Django 2.0.2 and Pandas 0.20.1 (OK, there is a deprecation warning and an expected failure? But so far so good). For the older versions of Python I suggest we do something like this:

    try:
        from collections.abc import Mapping
    except ImportError:
        Mapping = dict
@wkschwartz
Author

932852d and 28e9a97 take care of your request.

The deprecation warning appears to be from managers.py, which this PR doesn't touch; I also get the same deprecation warning on master.

The expected failure is documented under "Known Issues" in the docstring for read_frame. If you have an idea for a fix, I'm all ears.

NumPy's fromiter can allow us to skip allocating an intermediate list to
create the NumPy NDArray. However, it only works if none of the dtypes
are for Python objects, numpy.dtype('O').
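A small illustration of that constraint (not code from this branch):

    import numpy as np

    # Streaming a numeric column straight into an array, with no intermediate list:
    squares = (i * i for i in range(1000))
    col = np.fromiter(squares, dtype=np.int32, count=1000)

    # An object dtype cannot be filled this way; NumPy raises
    # "cannot create object arrays from iterator", so object columns
    # still have to go through a regular list first.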
@wkschwartz
Author

My guess about the test that is failing

======================================================================
FAIL: test_compress_custom_field (django_pandas.tests.test_io.IOTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/chrisdev/django-pandas/django_pandas/tests/test_io.py", line 121, in test_compress_custom_field
    self.assertLess(df2.memory_usage(deep=True).sum(), read_frame(qs).memory_usage(deep=True).sum())
AssertionError: 947 not less than 929

is that it has to do with running on Linux instead of macOS, or perhaps on something other than x86_64. Either of those could change the alignment rules and cause df2 to use more memory.

@wkschwartz
Author

@chrisdev Get a chance to look at this yet?

@Safrone

Safrone commented Jun 27, 2018

Very interested in this for some of my more memory-constrained projects.

@ZuluPro
Contributor

ZuluPro commented Jan 7, 2019

Hey @wkschwartz, the project is removing Django<1.11 compat.

@odoublewen

1.5 years later and I am also hitting memory issues when constructing large dataframes with read_frame. This PR sounds very smart and sensible.

@chrisdev Any chance this PR could get some attention?
@wkschwartz if there is anything I could do to help, testing wise, let me know?

I'm using:
django-pandas-0.6.2
Django-3.1.1
pandas-1.1.1
python 3.7.6

@chrisdev
Owner

@odoublewen I agree with you. But as I recall, the memory efficiency test failed on some platforms: the PR's read_frame actually used more memory on AMD64/Trusty.

  File "/home/travis/build/chrisdev/django-pandas/django_pandas/tests/test_io.py", line 121, in test_compress_custom_field
    self.assertLess(df2.memory_usage(deep=True).sum(), read_frame(qs).memory_usage(deep=True).sum())
AssertionError: 947 not less than 929

On the other hand, it was definitely more efficient on my local MBP on OS X. I was mystified. I think @wkschwartz had the same experience.

Disappointing! But maybe it's time to try again....

@wkschwartz
Author

Sorry for my slow response. I haven't used django-pandas in a couple years now. Someone else should take the reins of this PR.
