
Large pivot tables get bogged down and can crash the server #2

Open
dholstius opened this issue Sep 7, 2019 · 10 comments
Labels: enhancement (New feature or request) · help wanted (Extra attention is needed)

Comments

@dholstius (Member) commented Sep 7, 2019

Large pivot tables get bogged down.

  • The HTML is really large. In part this seems to be because the JavaScript gets handed absurdly precise values of more than ten digits. We don't need this level of precision. Can we force scientific notation (like 123.45e6) instead? (See the sketch after this list.)

  • Sometimes it's bad enough to crash the RStudio server.
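A minimal sketch of the digit-trimming idea from the first bullet, assuming we round numeric columns before they reach the JavaScript side (trim_precision() is a hypothetical helper; signif() and format() are base R):

# Hypothetical helper: round all numeric columns to 4 significant
# digits before the data is serialized into the widget's HTML.
trim_precision <- function(data, digits = 4) {
  dplyr::mutate(data, dplyr::across(dplyr::where(is.numeric), ~ signif(.x, digits)))
}

# Forcing scientific notation at format time is also easy:
format(123456789, digits = 4, scientific = TRUE)  # "1.235e+08"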

@dholstius dholstius added the enhancement New feature or request label Sep 7, 2019
@dholstius dholstius added the help wanted Extra attention is needed label Mar 6, 2020
@dholstius (Member, Author) commented Apr 20, 2021

As a guardrail, there could be a maximum number of rows that can be viewed before a warning is issued. (A sketch follows the examples below.)

For examples of this, see:

  • tbltools:::view() (not exported); and
  • tibble::view()
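A minimal sketch of such a guardrail, as a hypothetical wrapper around tbltools::pivot_table() (the name safe_pivot_table, the max_cells argument, and the 1e5 threshold are all assumptions):

# Hypothetical wrapper: warn before rendering a table that is likely
# to be too large for the HTML rendering engine.
safe_pivot_table <- function(data, ..., max_cells = 1e5) {
  n_cells <- prod(dim(data))  # rough proxy for the rendered size
  if (n_cells > max_cells) {
    warning("About to render ~", format(n_cells, big.mark = ","),
            " cells; this may be slow or may crash the session. ",
            "Pass max_cells = Inf to override.")
  }
  tbltools::pivot_table(data, ...)
}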

@dholstius (Member, Author) commented Apr 20, 2021

Related but not quite: nicolaskruchten/pivottable#559

@dholstius dholstius changed the title Large pivot tables get bogged down Large pivot tables get bogged down and can crash the server Apr 20, 2021
@ydu0316 (Contributor) commented Apr 24, 2021

Here is a sample R script from @Emsinv9Michael that reliably crashes his R session.

# QA_Request\QA_CEIDARS_RY2019\Killed_QA_SCC_SIC_RY2019.R, Apr 14, 2021
library(inventory)
# library(BY2015)


# Read the process and facility tables as submitted to CARB.
BAP19PRO <- read_csv("Dropbox (BAAQMD)/CEIDARS-RY2019/SendToCARB/Submitted20201029/BAP19PRO.csv")

BAP19FAC <- read_csv("Dropbox (BAAQMD)/CEIDARS-RY2019/SendToCARB/Submitted20201029/BAP19FAC.csv") %>%
  select(FACID, FNAME, FSTREET, FCITY, FZIP, FSIC, FNAICS, fac_UTM_E = X_USERCOORD, fac_UTM_N = Y_USERCOORD) %>%
  unite("fac_addr", FSTREET, FCITY, FZIP, sep = ", ")


# Combine SCC and SIC into a single numeric key, then join in facility info.
BAP19PRO_SCC <- BAP19PRO %>%
  select(FACID, DEV, PROID, PRDESC, SCC, SIC, NAICS, PROD1, MEMO_PR) %>%
  mutate(SCC1 = SCC, SIC1 = SIC) %>%
  unite("SCC_SIC", SCC1, SIC1, sep = "") %>%
  mutate(SCC_SIC = as.numeric(SCC_SIC)) %>%
  inner_join(BAP19FAC, by = "FACID") %>%
  select(FACID, FNAME, fac_addr, DEV, src_code = MEMO_PR, PROID:NAICS, cat_id = PROD1, SCC_SIC)


# Pivot Table
Top_pivot_table <- BAP19PRO_SCC %>%
  pivot_table(renderer = "Heatmap",
              aggregator = "Count",
              values = c("SCC_SIC"),
              #    rows = c("fac_id","fac_name","addr_street","addr_city","addr_state","addr_zip","contact_name",
              #             "contact_title","contact_phone","contact_email","pmt_staff_now"),
              rows = c("FACID","FNAME"),
              columns = c("SCC_SIC"),
              height = 1000,
              menuLimit = 3000,
              inclusions = list(
                pol_category = list("Top250tpy","EMS100_250tpy")),
              exclusions = list(is_duplicate = list("true")),
              autoSortUnusedAttrs = TRUE)

Top_pivot_table


@dholstius (Member, Author) commented Apr 26, 2021

Problem: Tens of millions of cells → HTML rendering engine

The code in the comment above tries to render a very large table (a rough count is sketched below):

  • almost 10,000 rows (determined by FACID and FNAME)
  • almost 4,000 columns (determined by SCC_SIC)
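A quick way to confirm the scale (a rough, hypothetical check; the column names come from the script above, and the counts are approximate):

# Roughly how many cells would the pivot table have to render?
n_rows <- dplyr::n_distinct(BAP19PRO_SCC$FACID, BAP19PRO_SCC$FNAME)  # ~10,000
n_cols <- dplyr::n_distinct(BAP19PRO_SCC$SCC_SIC)                    # ~4,000
n_rows * n_cols                                                      # ~40 million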

I've heard that Excel can handle sheets of a billion cells or more, but it limits rendering to just the part you can see on screen, and it has constraints that make that tractable (like row heights and column widths that are known ahead of rendering).

pivot_table() has to generate HTML for every possible cell, to hand off to a separate rendering engine. So that part takes time, but my guess is that it's the rendering engine that is really choking.

I'm not sure if modern HTML rendering engines can pull the same trick of limiting what's shown. Given all the things you could in theory do with CSS inside any arbitrary cell, and the nesting of content inside cells, my guess is they can't take shortcuts based on assumed constraints. Maybe there is a way to tell the rendering engine that it's handling a table with the kinds of constraints that pivot_table()'s JavaScript engine enforces (for example, that all HTML in cells is escaped), but I don't know how.

My guess is that 100,000 cells (≈ 300x300) is probably a reasonable upper limit for what we could expect pivot_table() to render comfortably.

Suggestions for this particular table

Even if that table could be rendered, over 99.9% of the cells would be zero. That suggests that a different table structure might help.

Use a "tidy" format (don't spread by columns)

Maybe this "equivalent", which gets rid of all the empty cells, is better suited to the purpose at hand(?). It is sortable, and the name of the pivot table object in the original code was Top_pivot_table. That's all I can really infer without seeing comments in the code, though.

BAP19PRO_SCC %>%
  pivot_table(
    aggregator = "Count",
    rows = c("FACID", "FNAME", "SIC", "SCC"))

[Screenshot: the resulting tidy-format pivot table]

Precompute counts

Precomputing the counts is an option, but the limiting factor is most likely the number of cells displayed, and indeed the rendering speed doesn't seem to change much if we do that:

# Try precomputing the counts for each non-empty cell.
pivot_table_data <-
  BAP19PRO_SCC %>%
  count(
    FACID, FNAME, SCC, SIC) 

# Instead of using `aggregator = "Count"`, we can use `aggregator = "Integer Sum"` 
# and ask it to just sum the precomputed counts (`values = "n"`).
pivot_table_data %>%
  pivot_table(
    values = "n",
    aggregator = "Integer Sum",
    rows = c("FACID", "FNAME", "SIC", "SCC"))

This is just a fix for this particular case, but it does help to clarify that sometimes re-casting a table with millions of cells can dodge the issue (and such a table might not even be that helpful ... but of course that depends on the context 🤷‍♂️).

@arilamstein (Contributor)
Two thoughts:

  1. @dholstius I have not looked at this yet because I am not sure how to prioritize it. A tool like Pivotal is nice because it lets you prioritize things across repos; this task seems to have gotten lost in the shuffle last week between my work on tbltools and my work on data versioning. If you'd like me to work on it this week, can you add it to my Staff Plan?
  2. My second thought is that this might not be "crashing the server," as the title says. Rather, it might be crashing the end user's browser. I believe that rPivotTable just creates JS that runs in the client's browser. Modern browsers might just not be designed to handle that much data.

@arilamstein (Contributor)

Sorry - I just realized that this is in tbltools. It looks like I have 4 tasks assigned to me in this repo, though this is the first one that is user-facing. Can you tell me how high a priority I should give it?

If you look at my staff plan, I put it at the bottom. But that's just because I've been working on the other 3 items and would prefer to continue working on them until they're finished.

Do you want me to make this top priority?

@ydu0316 (Contributor) commented Apr 27, 2021

Thanks @arilamstein and @dholstius! We have a quick workaround to handle the issue if it comes up. Unless it's causing other problems, I'll consider it low priority.

@dholstius (Member, Author) commented Apr 27, 2021

This Issue

@arilamstein: no, this is low priority for you! The first round of intervention here should be user education, not coding.

Project Management / Prioritization Tools & Protocols

Agreed that tools like Pivotal can really help prioritize across multiple repos.

In theory we could be using a BY2015 GitHub Project—not belonging to the BY2015-methodology repo, but rather having several of our (BAAQMD) GitHub repos associated with it—and doing some sort of kanban with the cards.

@songbai-BAAQMD is starting (I think?) to use a GitHub project, and stacks of cards, to help staff prioritize issues within the BY2015-methodology repo. But I think @abhinavguha and @songbai-BAAQMD are more oriented towards working with Paper for task management with you. I am platform-agnostic, but I do strongly prefer to keep the number of platforms down.

@songbai-BAAQMD

@arilamstein and @dholstius: Thanks Ari and David for the communication! Yes, we still prefer to use a Paper Doc with Ari in the staff plan. For the GitHub project, we are going to focus on the point-source category QA under BY2015; for other calculation approaches (e.g., area, special, etc.) we are using another Paper Doc table to track and clarify the sequence for engineers to follow.

@dholstius (Member, Author) commented May 24, 2021

Approaches to solving #36 (they may be more general, so posting here):

Short term. Split/subset, and/or aggregate. (A sketch follows this list.)

  • Split by some dimension(s). For example, split into several tables, one for each pollutant. Or split by cat_h1 or cat_h2. Or both at once. Or history vs future.
  • Subset. Maybe you only want every 5th year, if that's good enough for QA.
  • Aggregate. Aggregate to cat_h4 (maybe?) rather than cat_id. If that's good enough.
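A minimal sketch of those three moves, assuming a hypothetical data frame emissions with columns pollutant, year, cat_h4, and emissions_tpy (all of these names are assumptions):

# Hypothetical sketches; the data frame and column names are assumptions.
library(dplyr)

# Split: one smaller table per pollutant.
by_pollutant <- split(emissions, emissions$pollutant)

# Subset: keep only every 5th year, if that's good enough for QA.
every_5th_year <- filter(emissions, year %% 5 == 0)

# Aggregate: roll categories up to cat_h4 before pivoting.
by_cat_h4 <- emissions %>%
  group_by(cat_h4, pollutant, year) %>%
  summarise(emissions_tpy = sum(emissions_tpy), .groups = "drop")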

Longer term. Developers could:

  • Find and use a more efficient JavaScript pivot table library as the backend for tbltools::pivot_table(). A new and better one may have been invented in the past 3-4 years.
  • Save on HTML size by truncating the digits used to encode floating-point numbers in the HTML. We usually aren't handling numbers with better than 0.1% (four-digit) precision. If 20 digits get written by default, that might be worth trimming. OTOH, most of the HTML might be other stuff, like strings. (A quick way to check is sketched below.)
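One way to check that guess empirically (a sketch, assuming pivot_table() returns an htmlwidget, which rPivotTable-based widgets do):

# Sketch: save the widget to disk and see how large the HTML actually is.
w <- tbltools::pivot_table(BAP19PRO_SCC, rows = c("FACID", "FNAME"))
htmlwidgets::saveWidget(w, "pivot.html", selfcontained = TRUE)
file.size("pivot.html") / 1e6  # size in megabytes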

@arilamstein arilamstein removed their assignment Aug 15, 2022