Improved memory efficiency in read_frame #96

Open
wkschwartz wants to merge 20 commits into chrisdev:master from wkschwartz:memory
Conversation

wkschwartz

This PR is an attempt to abate severe memory pressure from read_frame when trying to load a very large table. It is a two-pronged approach.

  • Fix Memory-efficient iteration #63 by using QuerySet's iterator method. This avoids populating the query set's cache. The typical use case of read_frame, I think, causes a query set to be read once and discarded. The cache then just slows things down and takes up memory.
  • Add a compress argument to read_frame in the spirit of Stata's compress command. This infers the NumPy data types for the data frame's columns from the Django fields being read. For example, compress=True avoids loading a SmallIntegerField database column into an int64 data frame column, instead using NumPy's numpy.int16 dtype, saving six bytes per row. These savings matter most for integer types, but my implementation at least cursorily supports all the built-in Django fields. You can also override the defaults by passing a {django-field-type: numpy-dtype} mapping to compress. (A rough sketch of both ideas follows this list.)
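Very roughly, the two pieces look like the sketch below. The names, the default field-to-dtype table, and the helper functions are illustrative only, not the actual code in this branch:

    import numpy as np
    from django.db import models

    # Illustrative defaults: map narrow Django field types to narrow NumPy dtypes.
    DEFAULT_DTYPES = {
        models.SmallIntegerField: np.int16,
        models.IntegerField: np.int32,
        models.BigIntegerField: np.int64,
        models.BooleanField: np.bool_,
        models.FloatField: np.float64,
    }

    def infer_dtypes(qs, compress=True):
        """Pick a dtype per column from the model's fields, falling back to object.

        `compress` may be True (use the defaults) or a {field type: dtype}
        mapping that overrides them.
        """
        mapping = dict(DEFAULT_DTYPES)
        if isinstance(compress, dict):
            mapping.update(compress)
        return {f.name: mapping.get(type(f), np.dtype('O'))
                for f in qs.model._meta.concrete_fields}

    def iter_rows(qs):
        # .iterator() streams results without populating the QuerySet cache,
        # so each row can be garbage-collected as soon as it has been consumed.
        return qs.values_list().iterator()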

My colleague @shreyasravi may want to review this pull request for usefulness to our project, but I thought I'd offer it upstream too.

@chrisdev — If you like this PR, we can help you get it in shape for Python 3.6 and Django 2.0, after which you're free to edit it for backward compatibility (I'm giving you access to edit the PR). I just ask that you check with us before removing features or changing the API prior to merging (both of which I am open to discussing), because even if you ultimately reject the PR, we still need the code, and I'd like to point our environment setup code at this branch.

I'm leaving commits unsquashed for now. I'll squash them before merging if you want.

@coveralls

coveralls commented Mar 28, 2018

Coverage Status

Coverage decreased (-13.4%) to 80.258% when pulling 59319fc on wkschwartz:memory into 498e355 on chrisdev:master.

@wkschwartz
Author

A closer examination of the test coverage report reveals that all the code I’ve added has 100% test coverage. Since I removed no tests, the decline in coverage is unrelated to the PR.

That said, there is no test yet that the query set’s iterator method is used to avoid populating the cache. There are also no benchmarks to ensure that it actually takes less memory. I’ll try to add something for the former test. Open to suggestions on the latter.
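One possible shape for the former, sketched with a hypothetical test model (MyModel) and without claiming this is what I'll commit:

    from unittest import mock

    from django.db.models.query import QuerySet
    from django.test import TestCase

    from django_pandas.io import read_frame
    from .models import MyModel  # hypothetical test model


    class IteratorUsedTest(TestCase):
        def test_read_frame_uses_iterator(self):
            qs = MyModel.objects.all()
            calls = []
            real_iterator = QuerySet.iterator

            def spy(self, *args, **kwargs):
                # Record the call, then defer to the real iterator.
                calls.append(self)
                return real_iterator(self, *args, **kwargs)

            with mock.patch.object(QuerySet, 'iterator', new=spy):
                read_frame(qs)

            self.assertTrue(calls)  # iterator() was used at least once
            # The original query set's result cache should stay empty.
            self.assertIsNone(qs._result_cache)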

Fixed a crash when the query set was the result of qs.values_list()
@chrisdev
Owner

@wkschwartz thanks for the PR 👍. I think you are on the right track here! We had a PR a few years ago to do this on the Django side of things, but your approach is definitely much better. I still need to look at it in more detail, but I think we probably need some conditionals to keep the tests from failing on the older versions. I was a bit surprised that the coverage fell so much, as the tests you added were reasonable. Once again, thanks for your hard work on this.

@wkschwartz
Author

cfd45c8 adds the test I mentioned in my last comment for testing that iterator is used instead of populating the query set's cache. I also fixed a crash when the query set was the result of qs.values_list().

I had to replace one of the two usages of is_values_queryset. I'm not sure I understand the other one well enough, but I wonder if it suffers from the same problem as the one I replaced.

@wkschwartz
Author

Looking at the coverage results reveals that nearly all the missed lines and branches are due to backward compatibility blocks of code.

@heliomeiralins
Contributor

heliomeiralins commented Mar 28, 2018

I feel that the compress option should be the default. Who wouldn't want sane dtypes for free?

edit: after a while I realized that pandas figuring out the dtypes should indeed be the default.

@wkschwartz
Author

I have done no testing to see if the smaller dtypes are free. At least in theory, on some platforms, dtypes narrower than the platform's pointer size (e.g., 64 bits on x86_64) are slower to process. I also don't know how compress interacts with detecting dtypes from unknown field types.

@wkschwartz
Author

wkschwartz commented Mar 28, 2018

TO DO

  • If I understand correctly, NumPy can handle missing data for floats as NaNs, but doesn't have a clean way of dealing with it for integers. I'll add something to give users control over how to handle that, probably just casting integers to floats (int32 -> float64, int16 -> float32, and bool -> float16 involve no loss of precision); see the sketch after this list.
  • Test that names come out correctly when loading more complicated columns names such as after a JOIN.
  • Re-order columns so as not to waste space on alignment (sort columns from wide to narrow)
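For the first item, the widening could look roughly like this (an illustrative mapping and helper, not committed code):

    import numpy as np

    # Lossless widening: each float type's significand is wider than the integer
    # type it replaces, so no values are rounded and NaN becomes representable.
    WIDEN_FOR_NAN = {
        np.dtype(np.bool_): np.dtype(np.float16),   # 11-bit significand
        np.dtype(np.int16): np.dtype(np.float32),   # 24-bit significand
        np.dtype(np.int32): np.dtype(np.float64),   # 53-bit significand
    }

    def widen_for_nan(dtype):
        """Return a dtype that can hold the same values plus NaN."""
        dtype = np.dtype(dtype)
        if np.issubdtype(dtype, np.floating):
            return dtype  # floats already represent missing data as NaN
        # Fallback; note that float64 cannot hold every int64 exactly.
        return WIDEN_FOR_NAN.get(dtype, np.dtype(np.float64))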

@wkschwartz
Author

I can't figure out how to test alignment-related padding space waste. Any thoughts?

@wkschwartz
Author

It appears that loading GeoDjango without GDAL installed is failing on Django 1.11 but not on Django 1.10. I wonder if this is a regression in Django 1.11, or am I missing something?

If it doesn't load, then the user doesn't need it. On some versions of
Django, trying to import GeoDjango without having GDAL installed is an
error.
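A guard of roughly this shape matches what that commit message describes (illustrative; treating any import-time failure as "GeoDjango not available" is an assumption here):

    # Only map GeoDjango fields when GeoDjango can actually be imported.
    # On some Django versions this import raises (e.g. ImproperlyConfigured)
    # when GDAL is not installed, instead of failing lazily.
    try:
        from django.contrib.gis.db import models as geo_models
    except Exception:
        geo_models = None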
@wkschwartz
Author

@chrisdev I can't reproduce locally the failures of the tests in Travis. Any hints?

pd.isna was added in version 0.21, but the Travis builds still use
Pandas 0.20. np.isnan is already working in these tests, so that should
fix the build failure.
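If a single helper were wanted instead of switching call sites (the commit itself just uses np.isnan in the tests), one option would be:

    import numpy as np

    try:
        from pandas import isna  # pandas >= 0.21
    except ImportError:
        isna = np.isnan  # pandas 0.20: sufficient for the float arrays in these tests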
@chrisdev
Owner

@wkschwartz sorry for taking so long to get back to you. Had some problems with my machine. Yes, Tox is sometimes flaky, but I've checked out your "memory" branch locally and it passes tests with Python 3.6, Django 2.0.2 and Pandas 0.20.1 (OK, there is a deprecation warning and an expected failure? But so far so good). For the older versions of Python I suggest we do something like this:

    try:
        from collections.abc import Mapping
    except ImportError:
        Mapping = dict
@wkschwartz
Author

932852d and 28e9a97 take care of your request.

The deprecation warning appears to be from managers.py, which this PR doesn't touch; I also get the same deprecation warning on master.

The expected failure is documented under "Known Issues" in the docstring for read_frame. If you have an idea for a fix, I'm all ears.

NumPy's fromiter can allow us to skip allocating an intermediate list to
create the NumPy NDArray. However, it only works if none of the dtypes
are for Python objects, numpy.dtype('O').
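A small illustration of that constraint (not code from this branch):

    import numpy as np

    # Streaming a numeric column straight into an array, with no intermediate list:
    squares = (i * i for i in range(1000))
    col = np.fromiter(squares, dtype=np.int32, count=1000)

    # An object dtype cannot be filled this way; NumPy raises
    # "cannot create object arrays from iterator", so object columns
    # still have to go through a regular list first.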
@wkschwartz
Author

My guess about the test that is failing

======================================================================
FAIL: test_compress_custom_field (django_pandas.tests.test_io.IOTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/chrisdev/django-pandas/django_pandas/tests/test_io.py", line 121, in test_compress_custom_field
    self.assertLess(df2.memory_usage(deep=True).sum(), read_frame(qs).memory_usage(deep=True).sum())
AssertionError: 947 not less than 929

is that it has to do with running on Linux instead of macOS, or perhaps on something other than x86_64. Either of those could change the alignment rules and cause df2 to use more memory.

@wkschwartz
Author

@chrisdev Get a chance to look at this yet?

@Safrone

Safrone commented Jun 27, 2018

Very interested in this for some of my more memory-constrained projects.

@ZuluPro
Contributor

ZuluPro commented Jan 7, 2019

Hey @wkschwartz, the project is removing Django<1.11 compat.

@odoublewen

1.5 years later and I am also hitting memory issues when constructing large dataframes with read_frame. This PR sounds very smart and sensible.

@chrisdev Any chance this PR could get some attention?
@wkschwartz if there is anything I could do to help, testing wise, let me know?

I'm using:
django-pandas-0.6.2
Django-3.1.1
pandas-1.1.1
python 3.7.6

@chrisdev
Owner

@odoublewen I agree with you. But as I recall, the memory efficiency test failed on some platforms: the PR's read_frame actually used more memory on AMD64/Trusty.

  File "/home/travis/build/chrisdev/django-pandas/django_pandas/tests/test_io.py", line 121, in test_compress_custom_field
    self.assertLess(df2.memory_usage(deep=True).sum(), read_frame(qs).memory_usage(deep=True).sum())
AssertionError: 947 not less than 929

On the other hand, it was definitely more efficient on my local MBP on OS X. I was mystified. I think @wkschwartz had the same experience.

Disappointing! But maybe it's time to try again....

@wkschwartz
Author

Sorry for my slow response. I haven't used django-pandas in a couple years now. Someone else should take the reins of this PR.
