how to deal with large (many observations) datasets #44

Open

atn38 opened this issue Jun 24, 2019 · 8 comments
Labels: enhancement (New feature or request)

Comments

@atn38
Member

atn38 commented Jun 24, 2019

With a data table of ~400k observations and 60 variables, the full static report takes upwards of 10 minutes to generate. Does the dynamic plotting functionality face the same challenge? Could we do something with large datasets to reduce the load, e.g., randomly sample the dataset and then generate the report from that sample?

atn38 changed the title from "how to deal with large (many observations)" to "how to deal with large (many observations) datasets" on Jun 24, 2019
@CoastalPlainSoils
Collaborator

Hmmm, good question. I have no idea. However, if the report takes that long to complete, there should definitely be something that lets the user know the site is processing the request, and if possible the app could give an estimated time frame for completion.

sheilasaia added the "duplicate (This issue or pull request already exists)" label on Jun 25, 2019
@clnsmth
Member

clnsmth commented Jun 25, 2019

I'm moving the conversation from #11 here.

@clnsmth
Member

clnsmth commented Jun 25, 2019

I suggest this be an optional argument rather than an arbitrary limit on the number of rows that can be read in. If performance and wait times are a concern for users, we could address this by supplying the user with a status bar (which has proven difficult to do) or by informing the user of limitations through an expectation matrix (suggested here) in the package documentation.
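
Roughly what I have in mind, just as a sketch (the function and argument names here are made up, not datapie's actual API): by default everything is read, and a cap is applied only if the caller asks for one.

```r
# Sketch of an optional row cap; `read_data_table` and `max_rows` are hypothetical names.
read_data_table <- function(path, max_rows = NULL) {
  df <- utils::read.csv(path, stringsAsFactors = FALSE)
  if (!is.null(max_rows) && nrow(df) > max_rows) {
    message("Table has ", nrow(df), " rows; keeping the first ", max_rows, ".")
    df <- df[seq_len(max_rows), , drop = FALSE]
  }
  df
}
```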

clnsmth added the "enhancement (New feature or request)" label and removed the "duplicate (This issue or pull request already exists)" label on Jun 25, 2019
@clnsmth
Member

clnsmth commented Jun 25, 2019

@atn38, since the UI team has figured out how to return messages from a function to the GUI, you could add messages to each static report function to inform the user of status.

Alternatively, as @CoastalPlainSoils suggests, you may be able to create a progress bar using the progress package.
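
For example, something along these lines, where the bar ticks once per variable summarized (the loop body is only a stand-in for whatever the report function actually does):

```r
library(progress)

# Simulated per-variable summary work with a progress bar.
vars <- paste0("var_", 1:60)
pb <- progress_bar$new(
  format = "  summarizing [:bar] :percent eta: :eta",
  total = length(vars)
)
for (v in vars) {
  Sys.sleep(0.05)  # placeholder for the real per-variable summary
  pb$tick()
}
```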

@wetlandscapes
Collaborator

I kind of like the idea of being able to randomly sample a large data set. In that context, some useful options would be (a rough sketch follows this list):

  1. Indicate the % of the dataset (rows) to be explored, with an indicator of how many rows the sample returns.
  2. Set a seed, so someone can generate the same report twice.
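
A minimal base-R sketch of both options; the function name, defaults, and message wording are only illustrative:

```r
# Sample a fraction of rows, optionally with a seed for reproducible reports.
sample_rows <- function(df, fraction = 0.1, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)          # option 2: same report twice
  n <- max(1L, floor(nrow(df) * fraction))    # option 1: % of the dataset
  message("Sampling ", n, " of ", nrow(df), " rows (", round(100 * fraction), "%).")
  df[sample.int(nrow(df), n), , drop = FALSE]
}

# e.g. sample_rows(big_table, fraction = 0.05, seed = 42)
```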

@sheilasaia
Collaborator

sheilasaia commented Jun 27, 2019

Add printing to the console for report status on the data summary tab. @wetlandscapes will give this a go!
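
A tiny self-contained example of what that could look like with shiny::withProgress(), assuming the summary is rendered server-side; I don't know exactly how the data summary tab is wired up, so treat this as a sketch:

```r
library(shiny)

ui <- fluidPage(verbatimTextOutput("data_summary"))

server <- function(input, output, session) {
  output$data_summary <- renderPrint({
    withProgress(message = "Building data summary...", value = 0, {
      message("Starting data summary report")  # status printed to the R console
      Sys.sleep(1)                              # stand-in for the real work
      incProgress(1, detail = "done")
      summary(mtcars)                           # stand-in for the real summary
    })
  })
}

# shinyApp(ui, server)
```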

@sheilasaia
Collaborator

Can I also add that we might want to limit the size of downloads to a user's computer too? For example, warn them (and maybe stop the download) if they're about to download a huge .shp file.
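
For example, something like this could check the remote file size before the download starts. The httr dependency and the 100 MB threshold are assumptions on my part, not anything datapie currently uses:

```r
# Hypothetical guard: warn and skip the download if the remote file looks too big.
check_download_size <- function(url, max_mb = 100) {
  cl <- httr::headers(httr::HEAD(url))[["content-length"]]
  if (is.null(cl)) {
    warning("Could not determine file size; proceeding anyway.")
    return(TRUE)
  }
  size_mb <- as.numeric(cl) / 1e6
  if (size_mb > max_mb) {
    warning(sprintf("File is ~%.0f MB (limit %d MB); skipping download.", size_mb, max_mb))
    return(FALSE)
  }
  TRUE
}
```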

@clnsmth
Member

clnsmth commented Jul 12, 2019

I suggest the random sampling and warnings become enhancements to be implemented after the production release. Until then, file size issues can be communicated through GUI messages and the project docs. Note: a user will first have to find a data package to use with datapie in DataONE, where the file size information is clearly presented.
