A quick append mode for already existing hdf5 files #146

sherjeelshabih · 2023-08-23T13:45:09Z

This is in line with what @mkuehbach brought up. I just cooked up some basic draft to share ideas on this.

One of the questions we had was what to do if we are adding new data to an existing hdf5 path in the file. We decided to offer a simple (y/n) prompt for the overwrite. But this can be discussed.

If there is anything else you would like to add please feel free to work on this branch or leave comments.

@domna @RubelMozumder @sanbrock @lukaspie

coveralls · 2023-08-23T13:48:10Z

Pull Request Test Coverage Report for Build 5952173941

18 of 18 (100.0%) changed or added relevant lines in 3 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage increased (+0.07%) to 49.569%

Totals
Change from base Build 5715296537:	0.07%
Covered Lines:	5405
Relevant Lines:	10904

💛 - Coveralls

lukaspie · 2023-09-19T09:31:20Z

I fully support having the possibility to append data. Maybe this could also work nicely with the ideas of Area A that were discussed yesterday, i.e., allowing for appending ELN data individually (e.g., from a python dict)? In any way, the proposed changes lgtm.

Overwriting old data seems a bit more tricky since you wouldn't want to accidentally overwrite existing data (especially if the writing is automated, e.g. during exporting of experimental data), A y/n prompt may be a good compromise.

mkuehbach · 2023-09-19T10:50:33Z

I am strongly against adding an option to overwrite existent data especially if there is not a rock-solid mechanism in-place which keeps hashes and provenance of each piece of information or groupings of pieces of information.

sherjeelshabih · 2023-09-19T11:13:54Z

I am strongly against adding an option to overwrite existent data especially if there is not a rock-solid mechanism in-place which keeps hashes and provenance of each piece of information or groupings of pieces of information.

I understand. But the problem is if a user wants to replace this, do we block them from doing so and ignore the replacing keys for now? So no asking (y/n) and just warning them about it?

They could always go around our code and replace anything they like. in the h5 files. So I feel like giving the user more power to outright replace what they like using our framework is better.

domna · 2023-09-19T12:02:55Z

I am strongly against adding an option to overwrite existent data especially if there is not a rock-solid mechanism in-place which keeps hashes and provenance of each piece of information or groupings of pieces of information.

I understand. But the problem is if a user wants to replace this, do we block them from doing so and ignore the replacing keys for now? So no asking (y/n) and just warning them about it?

They could always go around our code and replace anything they like. in the h5 files. So I feel like giving the user more power to outright replace what they like using our framework is better.

I see the point. People will do it if they really want to. Maybe we could do something like the --i-am-really-sure flag we have in the nomad cli? Something which you actively need to add, instead of maybe accidentally hitting y out of habit?

sherjeelshabih · 2023-09-19T12:11:35Z

I am strongly against adding an option to overwrite existent data especially if there is not a rock-solid mechanism in-place which keeps hashes and provenance of each piece of information or groupings of pieces of information.

I understand. But the problem is if a user wants to replace this, do we block them from doing so and ignore the replacing keys for now? So no asking (y/n) and just warning them about it?
They could always go around our code and replace anything they like. in the h5 files. So I feel like giving the user more power to outright replace what they like using our framework is better.

I see the point. People will do it if they really want to. Maybe we could do something like the --i-am-really-sure flag we have in the nomad cli? Something which you actively need to add, instead of maybe accidentally hitting y out of habit?

I'm on board with that. Do we make it work just like Nomad where the user has to run the program again with this flag or do we let the user type in "I am really sure!". Note I capitalized the "I" and added an exclamation mark there to make for a stronger approval.

RubelMozumder · 2023-09-19T12:34:00Z

I am strongly against adding an option to overwrite existent data especially if there is not a rock-solid mechanism in-place which keeps hashes and provenance of each piece of information or groupings of pieces of information.

I understand. But the problem is if a user wants to replace this, do we block them from doing so and ignore the replacing keys for now? So no asking (y/n) and just warning them about it?
They could always go around our code and replace anything they like. in the h5 files. So I feel like giving the user more power to outright replace what they like using our framework is better.

I see the point. People will do it if they really want to. Maybe we could do something like the --i-am-really-sure flag we have in the nomad cli? Something which you actively need to add, instead of maybe accidentally hitting y out of habit?

I'm on board with that. Do we make it work just like Nomad where the user has to run the program again with this flag or do we let the user type in "I am really sure!". Note I capitalized the "I" and added an exclamation mark there to make for a stronger approval.

Though we want to get strong consent from users either by typing or a flag, I would say, it is still important to keep track of the old data at the same time. So that nobody can exploit this functionality for any unpleasant intention.

But, Appending new data regarding new fields can be considered and nice to have.

sherjeelshabih · 2023-09-19T12:41:17Z

I am strongly against adding an option to overwrite existent data especially if there is not a rock-solid mechanism in-place which keeps hashes and provenance of each piece of information or groupings of pieces of information.

I understand. But the problem is if a user wants to replace this, do we block them from doing so and ignore the replacing keys for now? So no asking (y/n) and just warning them about it?
They could always go around our code and replace anything they like. in the h5 files. So I feel like giving the user more power to outright replace what they like using our framework is better.

I see the point. People will do it if they really want to. Maybe we could do something like the --i-am-really-sure flag we have in the nomad cli? Something which you actively need to add, instead of maybe accidentally hitting y out of habit?

I'm on board with that. Do we make it work just like Nomad where the user has to run the program again with this flag or do we let the user type in "I am really sure!". Note I capitalized the "I" and added an exclamation mark there to make for a stronger approval.

Though we want to get strong consent from users either by typing or a flag, I would say, it is still important to keep track of the old data at the same time. So that nobody can exploit this functionality for any unpleasant intention.

But, Appending new data can be considered and nice to have.

The issue is we cannot prevent anyone from doing anything nefarious. It will be good to provide an option to keep old data. But in most cases, we will only append /nexus/paths. We will most likely never overwrite "raw" h5 data paths. Where I could see this overwrite conflict is from entries added by one of our readers. In that case, the overwrite will provide a more recent either bug fixed version of the /nexus/path in question or a more feature rich version. For example, we gain some new plot functionality and a user wants to quickly update their NXS files to gain this new feature. I want to accommodate these users and not make it a hassle for them ~~in the name of~~ behind the curtain of trying to prevent nefarious acts. Those folks will find an even simpler way to manipulate the data.

We can, in the future, provide an option to preserve old data under certain NeXus concepts with pynxtool/reader versions, etc.

rettigl · 2024-08-20T21:07:35Z

I would find such an option very important, e.g. to add evaluation results to a file. What is the status on this?

ka-sarthak · 2025-01-14T15:53:01Z

I find this quite important too. I would like to test the the append mode for writing/overwriting the nexus files. Are there plans to merge this soon? @sherjeelshabih @lukaspie @RubelMozumder

sanbrock · 2025-01-14T17:56:06Z

To open an existing file for adding/modifying its data one can open it in 'r+' mode. 'a' mode enables adding content to a non-existing file by automatically creating it as a new file.
Note that the command line flag '--append' (also in its documentation) suggests that it is appending data to the file and not modifying its already contained datasets.
If we do not ensure in our implementation that the new information added to the file will not modify existing datasets/attributes, I suggest to call the new feature, '--merge' or '--update'. Otherwise, it could be '--add', or '--append'.

Note that the data converter does not know about the old file and does not check it, so cannot guarantee data consistency in it. Until know, it was the duty of the reader to guarantee that, but it only guarantees it for the content to be added/updated/merged. Any data left untouched, and their consistency (also semantic meaning) may be broken.
E.g.
/myentry/oldprocessing (which may stored the data processing results of the content in /enty) can become completely detached from an updated /myentry
This can lead to a complete misinterpretation. Hence, it must be stressed that the data consistency must be checked and guaranteed by the person/code running the data converter in this mode. I would probably chose an even stronger flag to reflect this (e.g. '--manipulate').

sanbrock · 2025-01-14T18:09:43Z

A related question is about the root attributes. Their semantic meaning (e.g. file creation, or which software has produced the file, or which version of the NeXus definitions were used, etc.) should cover the whole file which is not possible in this mode since we manipulate only part of it.
A better (or at least cleaner) approach is to create separate files and link their contents. All pieces should be consistent in themselves, and whoever is linking them using a new master file shall guarantee the consistency of the new tree.

sanbrock · 2025-01-14T18:12:43Z

Note that this approach in an RDM also solves the problem of versioning, because here, we always create a new (version of the) file, so the RDM can easily connect the different versions (e.g. using the experiment identifier).

RubelMozumder · 2025-01-15T13:05:01Z

In this whole discussion there are pros and cons, whatever we decide to implement:
Let's not confine ourselves to a hdf5 file as a standalone object,

merge/add (In Append mode):
Pros:

One can add different concepts (quantities/fields, sections/groups) in a hdf5 file, disregarding what file already contains. (Cons: File will end up as a dump storage, not anymore relevant with any specific technical application definitions and version. Which is completely misleading and not suggested by nomad.) We can think it only when pynxtools is a standalone tools.
Do we also want to append the value of a concept? I think, This would be a complete messed up.

Overwrite (In Append mode):

Partial:

This would check each existing concept in the hdf5 file without any modification. At the same time, it also add new fields/concepts. (Cons: Existing metadata and data might be absolute and irrelavent at all. Which we do not want.)

Full:

It would overwrite existing concepts, and add new concepts, which is similar to w mode.
Having a mode, it allows to reading and writing of file at the same time. (Cons: There also resource contention might happen. If reading and writing of a concept occurs at the same time though might be very rare.)

We could think Overwrite (In Append mode) Full if it helps in nexus entry processing and writing in nomad.

rettigl · 2025-01-16T14:05:41Z

So what is the 0th version we can finalize quickly? We could have an --append option that only allows to write non-existing keys, and an --manipulate option that allows also overwriting, with warning.

A proper semantic checking of the result would require finalizing #333, I guess

A quick append mode for already existing hdf5 files

d2c72ca

rettigl mentioned this pull request Aug 20, 2024

Workflow for adding analysis results to an existing Nexus file #366

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

A quick append mode for already existing hdf5 files #146

A quick append mode for already existing hdf5 files #146

sherjeelshabih commented Aug 23, 2023

coveralls commented Aug 23, 2023 •

edited

Loading

lukaspie commented Sep 19, 2023

mkuehbach commented Sep 19, 2023

sherjeelshabih commented Sep 19, 2023

domna commented Sep 19, 2023 •

edited

Loading

sherjeelshabih commented Sep 19, 2023

RubelMozumder commented Sep 19, 2023 •

edited

Loading

sherjeelshabih commented Sep 19, 2023 •

edited

Loading

rettigl commented Aug 20, 2024

ka-sarthak commented Jan 14, 2025

sanbrock commented Jan 14, 2025

sanbrock commented Jan 14, 2025

sanbrock commented Jan 14, 2025

RubelMozumder commented Jan 15, 2025 •

edited

Loading

rettigl commented Jan 16, 2025

A quick append mode for already existing hdf5 files #146

Are you sure you want to change the base?

A quick append mode for already existing hdf5 files #146

Conversation

sherjeelshabih commented Aug 23, 2023

coveralls commented Aug 23, 2023 • edited Loading

Pull Request Test Coverage Report for Build 5952173941

💛 - Coveralls

lukaspie commented Sep 19, 2023

mkuehbach commented Sep 19, 2023

sherjeelshabih commented Sep 19, 2023

domna commented Sep 19, 2023 • edited Loading

sherjeelshabih commented Sep 19, 2023

RubelMozumder commented Sep 19, 2023 • edited Loading

sherjeelshabih commented Sep 19, 2023 • edited Loading

rettigl commented Aug 20, 2024

ka-sarthak commented Jan 14, 2025

sanbrock commented Jan 14, 2025

sanbrock commented Jan 14, 2025

sanbrock commented Jan 14, 2025

RubelMozumder commented Jan 15, 2025 • edited Loading

rettigl commented Jan 16, 2025

coveralls commented Aug 23, 2023 •

edited

Loading

domna commented Sep 19, 2023 •

edited

Loading

RubelMozumder commented Sep 19, 2023 •

edited

Loading

sherjeelshabih commented Sep 19, 2023 •

edited

Loading

RubelMozumder commented Jan 15, 2025 •

edited

Loading