Reading "wide" t-route flow velocity depth csv's has high performance penalty #204

Open
aaraney opened this issue Oct 2, 2024 · 3 comments
Assignees
Labels
ngen.cal Related to ngen.cal package performance Something is slow

Comments

@aaraney
Member

aaraney commented Oct 2, 2024

df = pd.read_csv(filepath, index_col=0)

ngen.cal supports reading t-route output in a variety of formats (see #153). One supported format is csv_output. This format contains simulated flow, velocity, and depth values for each waterbody for each t-route timestep. For example:

,"(0, 'q')","(0, 'v')","(0, 'd')","(1, 'q')","(1, 'v')","(1, 'd')"
2420800,0.0,0.0,0.0,0.0,0.0,0.0
t-route csv_output configuration
output_parameters:
  csv_output:
    csv_output_folder: output/

Crucially, each timestep adds three columns (q, v, and d), so the longer the simulation time the wider each row will be.
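To make the layout concrete, here is a minimal sketch that reads the example row above and turns the string headers into a two-level column index. It assumes the first tuple element is the timestep index and that `q`, `v`, and `d` stand for flow, velocity, and depth; the level names `timestep` and `variable` are made up for illustration and are not something ngen.cal or t-route defines.

```python
import ast
import io

import pandas as pd

# The example header and row from above, inlined so the sketch is self-contained.
wide_text = (
    ''',"(0, 'q')","(0, 'v')","(0, 'd')","(1, 'q')","(1, 'v')","(1, 'd')"\n'''
    "2420800,0.0,0.0,0.0,0.0,0.0,0.0\n"
)

df = pd.read_csv(io.StringIO(wide_text), index_col=0)

# Each header is the string form of a tuple, e.g. "(0, 'q')".
# literal_eval parses it safely; the level names are hypothetical.
df.columns = pd.MultiIndex.from_tuples(
    [ast.literal_eval(col) for col in df.columns],
    names=["timestep", "variable"],
)

# Select only the flow ("q") columns: one column per timestep.
flow = df.xs("q", axis=1, level="variable")

# Three years of 5-minute timesteps gives three columns per timestep.
n_timesteps = 3 * 365 * 24 * 60 // 5  # 315360
n_columns = 3 * n_timesteps  # 946080 data columns, plus the index column
```

At that width a single row carries close to a million fields, which is the case the parsers are not tuned for.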

csv parsers like pandas' C parser or Arrow's csv parser optimize for reading long csv files rather than wide csv files. Both of these parsers use a "chunking" approach where they allocate a buffer, read rows from the csv file into the buffer until it's full, and then process the data. However, when a row is sufficiently long it cannot fit fully into the buffer. Because of this and other implementation-specific details, parsing and deserializing these wide csv files into a pandas.DataFrame can take on the order of minutes. In a local test I found that a csv file with 3 years of 5-minute timestep data (315360 timesteps) took roughly 3.5 minutes to deserialize into a pandas DataFrame on an M2 Pro MacBook.

One potential solution to this is to disable pd.read_csv's low_memory flag:

df = pd.read_csv(filepath, index_col=0, engine="c", low_memory=False)

In local testing it took ~9 seconds to read and deserialize the same file.
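If you want to check this on your own machine, the following is a rough sketch that builds a small synthetic wide csv in memory and times both calls. The helpers `make_wide_csv` and `timed_read` are hypothetical names written for this example, the synthetic file is far narrower than a real 3-year run, and the timings will vary by machine and pandas version, so treat the numbers as indicative only.

```python
import io
import time

import pandas as pd


def make_wide_csv(n_timesteps: int, n_rows: int = 2) -> str:
    """Build a synthetic csv shaped like the wide csv_output example (hypothetical helper)."""
    # str((0, "q")) is "(0, 'q')", matching the header style shown above.
    header = "," + ",".join(
        f'"{(t, var)}"' for t in range(n_timesteps) for var in ("q", "v", "d")
    )
    values = ",".join(["0.0"] * (3 * n_timesteps))
    rows = [f"{2420800 + i},{values}" for i in range(n_rows)]
    return "\n".join([header, *rows]) + "\n"


def timed_read(text: str, **kwargs):
    """Read the csv text with pd.read_csv and return (DataFrame, elapsed seconds)."""
    start = time.perf_counter()
    frame = pd.read_csv(io.StringIO(text), index_col=0, **kwargs)
    return frame, time.perf_counter() - start


# Kept small so the sketch runs quickly; increase n_timesteps to see the gap grow.
text = make_wide_csv(n_timesteps=1000)

default_df, default_seconds = timed_read(text)
fast_df, fast_seconds = timed_read(text, engine="c", low_memory=False)

print(f"default read_csv:  {default_seconds:.3f} s")
print(f"low_memory=False:  {fast_seconds:.3f} s")
```

Both calls should produce the same DataFrame; only the parsing strategy differs.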

For now, my general recommendation is to use t-route's stream_output instead of csv_output if possible. stream_output still supports csv, but it uses a long format rather than a wide format, so it does not suffer the same performance penalty. See the most up-to-date examples of this on the t-route repo or in #153.
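To illustrate the difference between the two layouts, here is a small sketch that reshapes the wide example from earlier in this issue into a long table with one row per (waterbody, timestep, variable). The column names `feature_id`, `timestep`, `variable`, and `value` are assumptions chosen for this example; they are not the actual stream_output schema, so check the t-route examples for the real column names.

```python
import ast
import io

import pandas as pd

# The wide csv_output example, inlined so the sketch is self-contained.
wide_text = (
    ''',"(0, 'q')","(0, 'v')","(0, 'd')","(1, 'q')","(1, 'v')","(1, 'd')"\n'''
    "2420800,0.0,0.0,0.0,0.0,0.0,0.0\n"
)

wide = pd.read_csv(io.StringIO(wide_text), index_col=0)
wide.index.name = "feature_id"  # hypothetical name for the waterbody id

# melt turns every (timestep, variable) column into its own row.
long = wide.reset_index().melt(
    id_vars="feature_id", var_name="key", value_name="value"
)

# Split the "(0, 'q')" style keys into separate columns.
parsed = [ast.literal_eval(key) for key in long["key"]]
long["timestep"] = [timestep for timestep, _ in parsed]
long["variable"] = [variable for _, variable in parsed]
long = long.drop(columns="key")
```

In the long layout a longer simulation adds rows instead of columns, which is the shape chunked csv parsers are optimized for.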

@aaraney aaraney added the ngen.cal Related to ngen.cal package label Oct 2, 2024
@aaraney aaraney self-assigned this Oct 2, 2024
@aaraney aaraney added the performance Something is slow label Oct 2, 2024
@aaraney
Member Author

aaraney commented Oct 2, 2024

Credit @ajkhattak for reporting this! Thanks 🎉

@hellkite500
Contributor

I think we can add this to the t-route config settings as a user flag.

@aaraney
Member Author

aaraney commented Oct 2, 2024

@hellkite500, the more I work with the csv flow, depth, velocity output, the more inclined I am to drop support for it and move it to an example plugin.
