Large pivot tables get bogged down and can crash the server #2

Large pivot tables get bogged down.
The HTML is really large. In part, this seems to be because the JavaScript gets handed some absurdly precise values (more than ten digits). We don't need this level of precision. Can we force scientific notation (like 123.45e6) instead?
Sometimes it's bad enough to crash the RStudio server.
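For illustration, one way to shrink the payload would be to round numeric columns to a few significant digits before the data frame is handed to the rendering widget. This is only a sketch: the helper name and the example data below are made up, and whether rounding alone fixes the size problem would need testing.

```r
# Hypothetical helper (not part of tbltools): round every numeric column to a
# few significant digits before the table is serialized to HTML/JSON, so the
# payload doesn't carry 15-digit doubles that nobody will read.
trim_precision <- function(df, digits = 4) {
  df[] <- lapply(df, function(x) if (is.numeric(x)) signif(x, digits) else x)
  df
}

# Toy example: long doubles collapse to compact scientific notation.
emissions <- data.frame(facility = c("A", "B"),
                        tons = c(123456789.123456, 0.000123456789))
format(trim_precision(emissions)$tons, scientific = TRUE)
#> e.g. "1.235e+08" "1.235e-04"  (exact printing depends on local options)
```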
Comments
As a guardrail, there could be a check for the maximum number of rows that can be viewed before a warning is issued. For examples of this, see:

Related, but not quite the same: nicolaskruchten/pivottable#559
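For what it's worth, a guardrail along those lines might look roughly like the sketch below. The function name and the 100,000-cell threshold are made up here, not anything that exists in tbltools today.

```r
# Hypothetical guardrail: refuse to render (without an explicit override)
# when the widget would have to draw more than `max_cells` cells.
check_table_size <- function(data, max_cells = 1e5) {
  n_cells <- nrow(data) * ncol(data)
  if (n_cells > max_cells) {
    stop("Table has ", format(n_cells, big.mark = ",", scientific = FALSE),
         " cells, which exceeds the limit of ",
         format(max_cells, big.mark = ",", scientific = FALSE),
         ". Filter or aggregate first, or raise max_cells deliberately.",
         call. = FALSE)
  }
  invisible(data)
}
```

Erroring early (or downgrading to a warning plus a truncated preview) would at least keep a runaway table from taking the whole R session down with it.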
Here is a sample R script from @Emsinv9Michael which always crashed his R session.
Problem: Tens of millions of cells → HTML rendering engine

The code in the comment above tries to render a very large table.

I've heard that Excel can handle sheets of a billion cells or more, but it's able to limit rendering to just the part you can see on a screen, and it has constraints that make that tractable (like row heights and column widths that are known ahead of rendering).

I'm not sure if modern HTML rendering engines can pull the same trick of limiting what's shown. Given all the things you could in theory do with CSS inside any arbitrary cell, and the nesting of content inside cells, my guess is they can't take shortcuts based on any assumed constraints. Maybe there is a way to tell the rendering engine that it's handling a table with the kinds of constraints Excel can assume. My guess is that 100,000 cells (≈ 300x300) is probably a reasonable upper limit for what we could expect to render.

Suggestions for this particular table

Even if that table could be rendered, over 99.9% of the cells would still be zero. That suggests that a different table structure might help.

Use a "tidy" format (don't spread by columns)

Maybe this "equivalent" that gets rid of all the empty cells is better suited to the purpose at hand (?). It is sortable, and the name of the pivot table object in the original code was …
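To make the "tidy" suggestion concrete, here is a sketch with invented data and column names; the point is that the long version only carries the combinations that actually occur, instead of a mostly-zero grid.

```r
library(data.table)

# Invented stand-in for the original data: one row per record, with two
# categorical fields that the original code spread into a wide grid.
dt <- data.table(facility  = sample(paste0("F", 1:3000), 1e6, replace = TRUE),
                 pollutant = sample(paste0("P", 1:300),  1e6, replace = TRUE))

# Wide crosstab: 3,000 x 300 = 900,000 cells, almost all of them zero.
# wide <- dcast(dt, facility ~ pollutant, fun.aggregate = length)

# Tidy equivalent: one row per combination that actually occurs.
tidy_counts <- dt[, .(n = .N), by = .(facility, pollutant)]
setorder(tidy_counts, -n)   # sortable, and far fewer cells to render
```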
Precompute counts

Precomputing the counts is an option, but the limiting factor is most likely the number of cells displayed, and indeed the rendering speed doesn't seem to change much if we do that.
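For reference, a rough version of that experiment (invented names again): precomputing moves the counting out of the widget, but the wide result still has the same number of cells to draw.

```r
library(data.table)

# Same invented example data as in the previous sketch.
dt <- data.table(facility  = sample(paste0("F", 1:3000), 1e6, replace = TRUE),
                 pollutant = sample(paste0("P", 1:300),  1e6, replace = TRUE))

# Count once in R instead of letting the pivot widget aggregate raw records.
counts <- dt[, .(n = .N), by = .(facility, pollutant)]

# Re-cast to the same wide layout: the aggregation is now cheap, but the
# browser still has to draw the same ~900,000 (mostly zero) cells.
wide <- dcast(counts, facility ~ pollutant, value.var = "n", fill = 0)
```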
This is just a fix for this particular case, but it does help to clarify that sometimes re-casting a table with millions of cells can dodge the issue (and such a table might not even be that helpful ... but of course that depends on the context 🤷‍♂️).
Two thoughts:
Sorry - I just realized that this is in tbltools. It looks like I have 4 tasks assigned to me in this repo; this is the first one that is user-facing, though. Can you tell me how high a priority I should treat this as? If you look at my staff plan, I put it at the bottom, but that's just because I've been working on the other 3 items and would prefer to keep working on them until they're finished. Do you want me to make this top priority?
Thanks @arilamstein and @dholstius! We have a quick workaround to handle the issue if it comes up. Unless it's causing other problems, I'll consider it low priority.
This Issue

@arilamstein: no, this is low priority for you! The first round of intervention here should be user education, not coding.

Project Management / Prioritization Tools & Protocols

Agreed that tools like Pivotal can really help prioritize across multiple repos. In theory we could be using a … @songbai-BAAQMD is starting (I think?) to use a GitHub project, and stacks of cards, to help staff prioritize issues within the …
@arilamstein and @dholstius: Thanks Ari and David for the communication! Yes, we still prefer to use the Paper Doc with Ari in the staff plan. For the GitHub project, we are going to focus on the point-source categories QA under BY2015; for other calculation approaches (e.g., area, special, etc.) we are using another Paper Doc table to track and clarify the sequence for engineers to follow.
Approaches to solving #36 (may be more general, so posting here):

Short term. Split/subset, and/or aggregate.
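As a rough illustration of the short-term approach, with an entirely invented data frame standing in for the real inventory:

```r
# Invented stand-in data: one row per record.
emissions_long <- data.frame(
  year      = sample(c(2014, 2015), 1e5, replace = TRUE),
  county    = sample(c("Alameda", "Contra Costa", "Santa Clara"), 1e5, replace = TRUE),
  facility  = sample(paste0("F", 1:500), 1e5, replace = TRUE),
  pollutant = sample(c("NOx", "ROG", "PM2.5"), 1e5, replace = TRUE),
  tons      = runif(1e5)
)

# Subset: keep only the slice the reader needs right now.
smaller <- subset(emissions_long, year == 2015 & county == "Alameda")

# Aggregate: collapse to a coarser grouping before pivoting.
by_facility <- aggregate(tons ~ facility + pollutant, data = smaller, FUN = sum)

# Split: render one manageable table per county instead of one huge one.
per_county <- split(emissions_long, emissions_long$county)
```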
Longer term. Developers could: