
Large pivot tables get bogged down and can crash the server #2

Open
dholstius opened this issue Sep 7, 2019 · 10 comments
Labels: enhancement (New feature or request) · help wanted (Extra attention is needed)

Comments

@dholstius (Member) commented Sep 7, 2019

Large pivot tables get bogged down.

  • The HTML is really large. In part this seems to be because the JavaScript gets handed absurdly precise values of more than ten digits. We don't need this level of precision. Can we force scientific notation (like 123.45e6) instead? (See the sketch after this list.)

  • Sometimes it's bad enough to crash the RStudio server.
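A minimal sketch of the digit-trimming idea from the first bullet, assuming we round numeric columns before they reach the JavaScript side (trim_precision() is a hypothetical helper; signif() and format() are base R):

# Hypothetical helper: round all numeric columns to 4 significant
# digits before the data is serialized into the widget's HTML.
trim_precision <- function(data, digits = 4) {
  dplyr::mutate(data, dplyr::across(dplyr::where(is.numeric), ~ signif(.x, digits)))
}

# Forcing scientific notation at format time is also easy:
format(123456789, digits = 4, scientific = TRUE)  # "1.235e+08"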

@dholstius dholstius added the enhancement New feature or request label Sep 7, 2019
@dholstius dholstius added the help wanted Extra attention is needed label Mar 6, 2020
@dholstius (Member, Author) commented Apr 20, 2021

As a guardrail, there could be a maximum number of rows that can be viewed before a warning is issued. (A sketch follows the examples below.)

For examples of this, see:

  • tbltools:::view() (not exported); and
  • tibble::view()
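A minimal sketch of such a guardrail, as a hypothetical wrapper around tbltools::pivot_table() (the name safe_pivot_table, the max_cells argument, and the 1e5 threshold are all assumptions):

# Hypothetical wrapper: warn before rendering a table that is likely
# to be too large for the HTML rendering engine.
safe_pivot_table <- function(data, ..., max_cells = 1e5) {
  n_cells <- prod(dim(data))  # rough proxy for the rendered size
  if (n_cells > max_cells) {
    warning("About to render ~", format(n_cells, big.mark = ","),
            " cells; this may be slow or may crash the session. ",
            "Pass max_cells = Inf to override.")
  }
  tbltools::pivot_table(data, ...)
}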

@dholstius (Member, Author) commented Apr 20, 2021

Related but not quite: nicolaskruchten/pivottable#559

@dholstius dholstius changed the title Large pivot tables get bogged down Large pivot tables get bogged down and can crash the server Apr 20, 2021
@ydu0316 (Contributor) commented Apr 24, 2021

Here is a sample R script from @Emsinv9Michael that reliably crashes his R session.

# QA_Request\QA_CEIDARS_RY2019\Killed_QA_SCC_SIC_RY2019.R, Apr 14, 2021
library(inventory)
# library(BY2015)


# Read the process and facility tables as submitted to CARB.
BAP19PRO <- read_csv("Dropbox (BAAQMD)/CEIDARS-RY2019/SendToCARB/Submitted20201029/BAP19PRO.csv")

BAP19FAC <- read_csv("Dropbox (BAAQMD)/CEIDARS-RY2019/SendToCARB/Submitted20201029/BAP19FAC.csv") %>%
  select(FACID, FNAME, FSTREET, FCITY, FZIP, FSIC, FNAICS, fac_UTM_E = X_USERCOORD, fac_UTM_N = Y_USERCOORD) %>%
  unite("fac_addr", FSTREET, FCITY, FZIP, sep = ", ")


# Combine SCC and SIC into a single numeric key, then join in facility info.
BAP19PRO_SCC <- BAP19PRO %>%
  select(FACID, DEV, PROID, PRDESC, SCC, SIC, NAICS, PROD1, MEMO_PR) %>%
  mutate(SCC1 = SCC, SIC1 = SIC) %>%
  unite("SCC_SIC", SCC1, SIC1, sep = "") %>%
  mutate(SCC_SIC = as.numeric(SCC_SIC)) %>%
  inner_join(BAP19FAC, by = "FACID") %>%
  select(FACID, FNAME, fac_addr, DEV, src_code = MEMO_PR, PROID:NAICS, cat_id = PROD1, SCC_SIC)


# Pivot Table
Top_pivot_table <- BAP19PRO_SCC %>%
  pivot_table(renderer = "Heatmap",
              aggregator = "Count",
              values = c("SCC_SIC"),
              #    rows = c("fac_id","fac_name","addr_street","addr_city","addr_state","addr_zip","contact_name",
              #             "contact_title","contact_phone","contact_email","pmt_staff_now"),
              rows = c("FACID","FNAME"),
              columns = c("SCC_SIC"),
              height = 1000,
              menuLimit = 3000,
              inclusions = list(
                pol_category = list("Top250tpy","EMS100_250tpy")),
              exclusions = list(is_duplicate = list("true")),
              autoSortUnusedAttrs = TRUE)

Top_pivot_table


@dholstius (Member, Author) commented Apr 26, 2021

Problem: Tens of millions of cells → HTML rendering engine

The code in the comment above tries to render a very large table (a rough count is sketched below):

  • almost 10,000 rows (determined by FACID and FNAME)
  • almost 4,000 columns (determined by SCC_SIC)
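A quick way to confirm the scale (a rough, hypothetical check; the column names come from the script above, and the counts are approximate):

# Roughly how many cells would the pivot table have to render?
n_rows <- dplyr::n_distinct(BAP19PRO_SCC$FACID, BAP19PRO_SCC$FNAME)  # ~10,000
n_cols <- dplyr::n_distinct(BAP19PRO_SCC$SCC_SIC)                    # ~4,000
n_rows * n_cols                                                      # ~40 million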

I've heard that Excel can handle sheets of a billion cells or more, but it limits rendering to just the part you can see on screen, and it has constraints that make that tractable (like row heights and column widths that are known ahead of rendering).

pivot_table() has to generate HTML for every possible cell, to hand off to a separate rendering engine. So that part takes time, but my guess is that it's the rendering engine that is really choking.

I'm not sure if modern HTML rendering engines can pull the same trick of limiting what's shown. Given all the things you could in theory do with CSS inside any arbitrary cell, and the nesting of content inside cells, my guess is they can't take shortcuts based on assumed constraints. Maybe there is a way to tell the rendering engine that it's handling a table with the kinds of constraints that pivot_table()'s JavaScript engine enforces (for example, that all HTML in cells is escaped), but I don't know how.

My guess is that 100,000 cells (≈ 300x300) is probably a reasonable upper limit for what we could expect pivot_table() to render comfortably.

Suggestions for this particular table

Even if that table could be rendered, over 99.9% of the cells would be zero. That suggests that a different table structure might help.

Use a "tidy" format (don't spread by columns)

Maybe this "equivalent", which gets rid of all the empty cells, is better suited to the purpose at hand(?). It is sortable, and the name of the pivot table object in the original code was Top_pivot_table. That's all I can really infer without seeing comments in the code, though.

BAP19PRO_SCC %>%
  pivot_table(
    aggregator = "Count",
    rows = c("FACID", "FNAME", "SIC", "SCC"))

[Screenshot: the resulting tidy-format pivot table]

Precompute counts

Precomputing the counts is an option, but the limiting factor is most likely the number of cells displayed, and indeed the rendering speed doesn't seem to change much if we do that:

# Try precomputing the counts for each non-empty cell.
pivot_table_data <-
  BAP19PRO_SCC %>%
  count(
    FACID, FNAME, SCC, SIC) 

# Instead of using `aggregator = "Count"`, we can use `aggregator = "Integer Sum"` 
# and ask it to just sum the precomputed counts (`values = "n"`).
pivot_table_data %>%
  pivot_table(
    values = "n",
    aggregator = "Integer Sum",
    rows = c("FACID", "FNAME", "SIC", "SCC"))

This is just a fix for this particular case, but it does help to clarify that sometimes re-casting a table with millions of cells can dodge the issue (and such a table might not even be that helpful ... but of course that depends on the context 🤷‍♂️).

@arilamstein (Contributor)
Two thoughts:

  1. @dholstius I have not looked at this yet because I am not sure how to prioritize it. A tool like Pivotal is nice because it lets you prioritize things across repos; this task seems to have gotten lost in the shuffle last week between my work on tbltools and my work on data versioning. If you'd like me to work on it this week, can you add it to my Staff Plan?
  2. My second thought is that this might not be "crashing the server," as the title says. Rather, it might be crashing the end user's browser. I believe that rPivotTable just creates JS that runs in the client's browser. Modern browsers might just not be designed to handle that much data.

@arilamstein (Contributor)

Sorry - I just realized that this is in tbltools. It looks like I have 4 tasks assigned to me in this repo, though this is the first one that is user-facing. Can you tell me how high a priority I should give it?

If you look at my staff plan, I put it at the bottom. But that's just because I've been working on the other 3 items and would prefer to continue working on them until they're finished.

Do you want me to make this top priority?

@ydu0316 (Contributor) commented Apr 27, 2021

Thanks @arilamstein and @dholstius! We have a quick workaround to handle the issue if it comes up. Unless it's causing other problems, I'll consider it low priority.

@dholstius (Member, Author) commented Apr 27, 2021

This Issue

@arilamstein: no, this is low priority for you! The first round of intervention here should be user education, not coding.

Project Management / Prioritization Tools & Protocols

Agreed that tools like Pivotal can really help prioritize across multiple repos.

In theory we could be using a BY2015 GitHub Project—not belonging to the BY2015-methodology repo, but rather having several of our (BAAQMD) GitHub repos associated with it—and doing some sort of kanban with the cards.

@songbai-BAAQMD is starting (I think?) to use a GitHub project, and stacks of cards, to help staff prioritize issues within the BY2015-methodology repo. But I think @abhinavguha and @songbai-BAAQMD are more oriented towards working with Paper for task management with you. I am platform-agnostic, but I do strongly prefer to keep the number of platforms down.

@songbai-BAAQMD

@arilamstein and @dholstius: Thanks Ari and David for the communication! Yes, we still prefer to use a Paper Doc with Ari in the staff plan. For the GitHub project, we are going to focus on the point-source category QA under BY2015; for other calculation approaches (e.g., area, special, etc.) we are using another Paper Doc table to track and clarify the sequence for engineers to follow.

@dholstius (Member, Author) commented May 24, 2021

Approaches to solving #36 (they may be more general, so posting here):

Short term. Split/subset, and/or aggregate. (A sketch follows this list.)

  • Split by some dimension(s). For example, split into several tables, one for each pollutant. Or split by cat_h1 or cat_h2. Or both at once. Or history vs future.
  • Subset. Maybe you only want every 5th year, if that's good enough for QA.
  • Aggregate. Aggregate to cat_h4 (maybe?) rather than cat_id. If that's good enough.
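A minimal sketch of those three moves, assuming a hypothetical data frame emissions with columns pollutant, year, cat_h4, and emissions_tpy (all of these names are assumptions):

# Hypothetical sketches; the data frame and column names are assumptions.
library(dplyr)

# Split: one smaller table per pollutant.
by_pollutant <- split(emissions, emissions$pollutant)

# Subset: keep only every 5th year, if that's good enough for QA.
every_5th_year <- filter(emissions, year %% 5 == 0)

# Aggregate: roll categories up to cat_h4 before pivoting.
by_cat_h4 <- emissions %>%
  group_by(cat_h4, pollutant, year) %>%
  summarise(emissions_tpy = sum(emissions_tpy), .groups = "drop")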

Longer term. Developers could:

  • Find and use a more efficient JavaScript pivot table library as the backend for tbltools::pivot_table(). A new and better one may have been invented in the past 3-4 years.
  • Save on HTML size by truncating the digits used to encode floating-point numbers in the HTML. We usually aren't handling numbers with better than 0.1% (four-digit) precision. If 20 digits get written by default, that might be worth trimming. OTOH, most of the HTML might be other stuff, like strings. (A quick way to check is sketched below.)
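One way to check that guess empirically (a sketch, assuming pivot_table() returns an htmlwidget, which rPivotTable-based widgets do):

# Sketch: save the widget to disk and see how large the HTML actually is.
w <- tbltools::pivot_table(BAP19PRO_SCC, rows = c("FACID", "FNAME"))
htmlwidgets::saveWidget(w, "pivot.html", selfcontained = TRUE)
file.size("pivot.html") / 1e6  # size in megabytes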

@arilamstein arilamstein removed their assignment Aug 15, 2022