transfer records to another file without decompress/recompress #3070

dkokron · 2025-01-08T20:57:16Z

dkokron
Jan 8, 2025

We have a workflow that copies the last 120 of 121 records from one file to another. The data are compressed and chunked. Each record is a chunk. Profiling shows the vast majority of time is spent decompressing then recompressing the data. A faster way would avoid decompressing and recompressing in the first place. I can see needing to decompress the data if the user wants to get at the real values, but I just want copy from one file to another. I was thinking of a low level block copy. Something like 'dd' command would do. Is that possible with NetCDF?

I'm using nco-5.2.4 built using spack and running on a zen2 chip.
Example usage:
ncrcat -7 -d time,1,120 -L 4 file.in file.out
ncrcat -7 -d time,1,120 --cmp='shf|zst,4' file.in file.out

DennisHeimbigner · 2025-01-09T00:00:08Z

DennisHeimbigner
Jan 9, 2025
Collaborator

Since you are already using NCO, I assume that does not solve your problem.
Assuming the file is stored as a netcdf-4/HDF5 file, you might check the HDF group
website to see if there is some HDF5 tool that can solve the problem.

1 reply

dkokron Jan 9, 2025
Author

Charlie suggested I contact HDF5 too.
https://sourceforge.net/p/nco/discussion/9830/thread/4a675c8d79/

I'll open a discussion with HDF5.

DennisHeimbigner · 2025-01-09T00:31:49Z

DennisHeimbigner
Jan 9, 2025
Collaborator

There is also a python package that wraps HDF5. You might see what it can do.

0 replies

dkokron · 2025-01-09T14:45:28Z

dkokron
Jan 9, 2025
Author

The original workflow is using the python package. I'm exploring other options. I've opened to discussion at https://forum.hdfgroup.org/c/hdf5/8

0 replies

dkokron · 2025-01-13T15:28:25Z

dkokron
Jan 13, 2025
Author

HDF5 responded at https://forum.hdfgroup.org/t/transfer-records-to-another-file-without-decompress-recompress/12960

"H5Dread_chunk() and H5Dwrite_chunk() allow raw chunk data to be accessed and written while bypassing some or all compression filters."

0 replies

edwardhartnett · 2025-01-14T14:43:35Z

edwardhartnett
Jan 14, 2025

OK, if I understand correctly, the idea is to copy one dataset to another (in a different file or the same file or both?) without decompressing/recompressing. Sure, why not.

The problem is, we try to keep the netcdf-c API small, and this would require an extra function. Just to game this out, what would that function look like?

Are we going to always copy the whole variable/dataset? (And subsetting would require decompression, since I don't know where the data are within the compressed chunks). In that case:

int
nc_copy_var(ncid1, varid1, ncid2, varid2);

Is that what we are talking about here?

@czender if such a function existed, could NCO make good use of it? Would it be useful? Presumably NCO allows subsetting when copying, and this only would work when there is no subsetting, right?

Another alternative would be to write a straight-up HDF5 program to do this. As we all know, netCDF-4 just writes regular HDF5 datasets, which can be opened are read with HDF5 as well as netCDF4. Also files created with HDF5 can be opened by netCDF-4. So perhaps that's the easiest path to a working implementation?

Even if we agreed that the above function prototype was correct, there is still the step of convincing the netCDF developers that this is worthy of a new function in the API. It's not clear to me that it is worth it, so the case has to be made.

0 replies

dkokron · 2025-01-14T15:13:06Z

dkokron
Jan 14, 2025
Author

My motivation is helping NCEP improve their resource utilization by speeding up various workflows. One of those workflows spends ~3 hours on the task described in the first posting (copies the last 120 of 121 time records from one file to another). I can only suspect anyone using nccopy would appreciate the speedup too.

What other data/information would make the case?

1 reply

edwardhartnett Jan 14, 2025

What is this were a special feature of nccopy? In other words, an internal function which was not part of the netcdf-c API but which nccopy used when appropriate?

czender · 2025-01-14T15:40:31Z

czender
Jan 14, 2025

If this function
int nc_copy_var(ncid1, varid1, ncid2, varid2);
existed then I would modify NCO to make use of it where possible. The resulting speedup could justify the development effort for this since copying entire variables is common yet unnecessarily time-consuming due to the unavoidable compression/decompression.

However, @dkokron's specific use case above involves hyperslabs so the minimum required prototype to (potentially) eliminate compression/uncompression would be
int nc_copy_vara(ncid1, varid1, ncid2, varid2,start,count);
and I agree with @edwardhartnett that this is problematic since the chunking of the stored variable would make it necessary, in general, to de-/re-compress any chunks that crossed the hyperslab boundaries, and figuring that out would not be a cakewalk.

0 replies

dkokron · 2025-01-14T16:18:47Z

dkokron
Jan 14, 2025
Author

I've attached an ncdump of one of the files that is the subject of my optimization efforts.
dump.txt

0 replies

DennisHeimbigner · 2025-01-14T18:36:31Z

DennisHeimbigner
Jan 14, 2025
Collaborator

Some time ago, I wrote a program to show the chunking layout for HDF5 datasets.
I was using it to test my Zarr implementation. That program is currently part of the
netcdf distribution in the nczarr_test/ncdumpchunks.c program. I have also attached
a copy to this message. It might help inform this discussion. Unfortunately, it is not well documented.
ncdumpchunks.c.txt

0 replies

dkokron · 2025-01-16T18:00:52Z

dkokron
Jan 16, 2025
Author

I put together a prototype code (see attached. pardon the mess) for testing the performance benefit from using H5Dread_chunk/H5Dwrite_chunk. To transfer one variable (wspd in the ncdump output attached above) takes
~22m:30s using the normal H5Dread/H5Dwrite approach (read gzipped, write zstd). ncrcat (version 5.2.4) also took about 22 minutes for the same variable and compression swap.

The H5Dread_chunk/H5Dwrite_chunk approach took 58s. We can't change the compression strategy with this approach.

h5ex_d_chunk.c.txt

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

transfer records to another file without decompress/recompress #3070

{{title}}

Replies: 10 comments 2 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

transfer records to another file without decompress/recompress #3070

dkokron Jan 8, 2025

Replies: 10 comments · 2 replies

DennisHeimbigner Jan 9, 2025 Collaborator

dkokron Jan 9, 2025 Author

DennisHeimbigner Jan 9, 2025 Collaborator

dkokron Jan 9, 2025 Author

dkokron Jan 13, 2025 Author

edwardhartnett Jan 14, 2025

dkokron Jan 14, 2025 Author

edwardhartnett Jan 14, 2025

czender Jan 14, 2025

dkokron Jan 14, 2025 Author

DennisHeimbigner Jan 14, 2025 Collaborator

dkokron Jan 16, 2025 Author

dkokron
Jan 8, 2025

Replies: 10 comments 2 replies

DennisHeimbigner
Jan 9, 2025
Collaborator

dkokron Jan 9, 2025
Author

DennisHeimbigner
Jan 9, 2025
Collaborator

dkokron
Jan 9, 2025
Author

dkokron
Jan 13, 2025
Author

edwardhartnett
Jan 14, 2025

dkokron
Jan 14, 2025
Author

czender
Jan 14, 2025

dkokron
Jan 14, 2025
Author

DennisHeimbigner
Jan 14, 2025
Collaborator

dkokron
Jan 16, 2025
Author