Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A quick append mode for already existing hdf5 files #146

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

sherjeelshabih
Copy link
Collaborator

This is in line with what @mkuehbach brought up. I just cooked up some basic draft to share ideas on this.

One of the questions we had was what to do if we are adding new data to an existing hdf5 path in the file. We decided to offer a simple (y/n) prompt for the overwrite. But this can be discussed.

If there is anything else you would like to add please feel free to work on this branch or leave comments.

@domna @RubelMozumder @sanbrock @lukaspie

@coveralls
Copy link

coveralls commented Aug 23, 2023

Pull Request Test Coverage Report for Build 5952173941

  • 18 of 18 (100.0%) changed or added relevant lines in 3 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.07%) to 49.569%

Totals Coverage Status
Change from base Build 5715296537: 0.07%
Covered Lines: 5405
Relevant Lines: 10904

💛 - Coveralls

@lukaspie
Copy link
Collaborator

I fully support having the possibility to append data. Maybe this could also work nicely with the ideas of Area A that were discussed yesterday, i.e., allowing for appending ELN data individually (e.g., from a python dict)? In any way, the proposed changes lgtm.

Overwriting old data seems a bit more tricky since you wouldn't want to accidentally overwrite existing data (especially if the writing is automated, e.g. during exporting of experimental data), A y/n prompt may be a good compromise.

@mkuehbach
Copy link
Collaborator

I am strongly against adding an option to overwrite existent data especially if there is not a rock-solid mechanism in-place which keeps hashes and provenance of each piece of information or groupings of pieces of information.

@sherjeelshabih
Copy link
Collaborator Author

I am strongly against adding an option to overwrite existent data especially if there is not a rock-solid mechanism in-place which keeps hashes and provenance of each piece of information or groupings of pieces of information.

I understand. But the problem is if a user wants to replace this, do we block them from doing so and ignore the replacing keys for now? So no asking (y/n) and just warning them about it?

They could always go around our code and replace anything they like. in the h5 files. So I feel like giving the user more power to outright replace what they like using our framework is better.

@domna
Copy link
Collaborator

domna commented Sep 19, 2023

I am strongly against adding an option to overwrite existent data especially if there is not a rock-solid mechanism in-place which keeps hashes and provenance of each piece of information or groupings of pieces of information.

I understand. But the problem is if a user wants to replace this, do we block them from doing so and ignore the replacing keys for now? So no asking (y/n) and just warning them about it?

They could always go around our code and replace anything they like. in the h5 files. So I feel like giving the user more power to outright replace what they like using our framework is better.

I see the point. People will do it if they really want to. Maybe we could do something like the --i-am-really-sure flag we have in the nomad cli? Something which you actively need to add, instead of maybe accidentally hitting y out of habit?

@sherjeelshabih
Copy link
Collaborator Author

I am strongly against adding an option to overwrite existent data especially if there is not a rock-solid mechanism in-place which keeps hashes and provenance of each piece of information or groupings of pieces of information.

I understand. But the problem is if a user wants to replace this, do we block them from doing so and ignore the replacing keys for now? So no asking (y/n) and just warning them about it?
They could always go around our code and replace anything they like. in the h5 files. So I feel like giving the user more power to outright replace what they like using our framework is better.

I see the point. People will do it if they really want to. Maybe we could do something like the --i-am-really-sure flag we have in the nomad cli? Something which you actively need to add, instead of maybe accidentally hitting y out of habit?

I'm on board with that. Do we make it work just like Nomad where the user has to run the program again with this flag or do we let the user type in "I am really sure!". Note I capitalized the "I" and added an exclamation mark there to make for a stronger approval.

@RubelMozumder
Copy link
Collaborator

RubelMozumder commented Sep 19, 2023

I am strongly against adding an option to overwrite existent data especially if there is not a rock-solid mechanism in-place which keeps hashes and provenance of each piece of information or groupings of pieces of information.

I understand. But the problem is if a user wants to replace this, do we block them from doing so and ignore the replacing keys for now? So no asking (y/n) and just warning them about it?
They could always go around our code and replace anything they like. in the h5 files. So I feel like giving the user more power to outright replace what they like using our framework is better.

I see the point. People will do it if they really want to. Maybe we could do something like the --i-am-really-sure flag we have in the nomad cli? Something which you actively need to add, instead of maybe accidentally hitting y out of habit?

I'm on board with that. Do we make it work just like Nomad where the user has to run the program again with this flag or do we let the user type in "I am really sure!". Note I capitalized the "I" and added an exclamation mark there to make for a stronger approval.

Though we want to get strong consent from users either by typing or a flag, I would say, it is still important to keep track of the old data at the same time. So that nobody can exploit this functionality for any unpleasant intention.

But, Appending new data regarding new fields can be considered and nice to have.

@sherjeelshabih
Copy link
Collaborator Author

sherjeelshabih commented Sep 19, 2023

I am strongly against adding an option to overwrite existent data especially if there is not a rock-solid mechanism in-place which keeps hashes and provenance of each piece of information or groupings of pieces of information.

I understand. But the problem is if a user wants to replace this, do we block them from doing so and ignore the replacing keys for now? So no asking (y/n) and just warning them about it?
They could always go around our code and replace anything they like. in the h5 files. So I feel like giving the user more power to outright replace what they like using our framework is better.

I see the point. People will do it if they really want to. Maybe we could do something like the --i-am-really-sure flag we have in the nomad cli? Something which you actively need to add, instead of maybe accidentally hitting y out of habit?

I'm on board with that. Do we make it work just like Nomad where the user has to run the program again with this flag or do we let the user type in "I am really sure!". Note I capitalized the "I" and added an exclamation mark there to make for a stronger approval.

Though we want to get strong consent from users either by typing or a flag, I would say, it is still important to keep track of the old data at the same time. So that nobody can exploit this functionality for any unpleasant intention.

But, Appending new data can be considered and nice to have.

The issue is we cannot prevent anyone from doing anything nefarious. It will be good to provide an option to keep old data. But in most cases, we will only append /nexus/paths. We will most likely never overwrite "raw" h5 data paths. Where I could see this overwrite conflict is from entries added by one of our readers. In that case, the overwrite will provide a more recent either bug fixed version of the /nexus/path in question or a more feature rich version. For example, we gain some new plot functionality and a user wants to quickly update their NXS files to gain this new feature. I want to accommodate these users and not make it a hassle for them in the name of behind the curtain of trying to prevent nefarious acts. Those folks will find an even simpler way to manipulate the data.

We can, in the future, provide an option to preserve old data under certain NeXus concepts with pynxtool/reader versions, etc.

@rettigl
Copy link
Collaborator

rettigl commented Aug 20, 2024

I would find such an option very important, e.g. to add evaluation results to a file. What is the status on this?

@ka-sarthak
Copy link

I find this quite important too. I would like to test the the append mode for writing/overwriting the nexus files. Are there plans to merge this soon? @sherjeelshabih @lukaspie @RubelMozumder

@sanbrock
Copy link
Collaborator

To open an existing file for adding/modifying its data one can open it in 'r+' mode. 'a' mode enables adding content to a non-existing file by automatically creating it as a new file.
Note that the command line flag '--append' (also in its documentation) suggests that it is appending data to the file and not modifying its already contained datasets.
If we do not ensure in our implementation that the new information added to the file will not modify existing datasets/attributes, I suggest to call the new feature, '--merge' or '--update'. Otherwise, it could be '--add', or '--append'.

Note that the data converter does not know about the old file and does not check it, so cannot guarantee data consistency in it. Until know, it was the duty of the reader to guarantee that, but it only guarantees it for the content to be added/updated/merged. Any data left untouched, and their consistency (also semantic meaning) may be broken.
E.g.
/myentry/oldprocessing (which may stored the data processing results of the content in /enty) can become completely detached from an updated /myentry
This can lead to a complete misinterpretation. Hence, it must be stressed that the data consistency must be checked and guaranteed by the person/code running the data converter in this mode. I would probably chose an even stronger flag to reflect this (e.g. '--manipulate').

@sanbrock
Copy link
Collaborator

A related question is about the root attributes. Their semantic meaning (e.g. file creation, or which software has produced the file, or which version of the NeXus definitions were used, etc.) should cover the whole file which is not possible in this mode since we manipulate only part of it.
A better (or at least cleaner) approach is to create separate files and link their contents. All pieces should be consistent in themselves, and whoever is linking them using a new master file shall guarantee the consistency of the new tree.

@sanbrock
Copy link
Collaborator

Note that this approach in an RDM also solves the problem of versioning, because here, we always create a new (version of the) file, so the RDM can easily connect the different versions (e.g. using the experiment identifier).

@RubelMozumder
Copy link
Collaborator

RubelMozumder commented Jan 15, 2025

In this whole discussion there are pros and cons, whatever we decide to implement:
Let's not confine ourselves to a hdf5 file as a standalone object,

merge/add (In Append mode):
Pros:

  1. One can add different concepts (quantities/fields, sections/groups) in a hdf5 file, disregarding what file already contains. (Cons: File will end up as a dump storage, not anymore relevant with any specific technical application definitions and version. Which is completely misleading and not suggested by nomad.) We can think it only when pynxtools is a standalone tools.
    Do we also want to append the value of a concept? I think, This would be a complete messed up.

Overwrite (In Append mode):

Partial:

  1. This would check each existing concept in the hdf5 file without any modification. At the same time, it also add new fields/concepts. (Cons: Existing metadata and data might be absolute and irrelavent at all. Which we do not want.)

Full:

  1. It would overwrite existing concepts, and add new concepts, which is similar to w mode.
    Having a mode, it allows to reading and writing of file at the same time. (Cons: There also resource contention might happen. If reading and writing of a concept occurs at the same time though might be very rare.)

We could think Overwrite (In Append mode) Full if it helps in nexus entry processing and writing in nomad.

@rettigl
Copy link
Collaborator

rettigl commented Jan 16, 2025

So what is the 0th version we can finalize quickly? We could have an --append option that only allows to write non-existing keys, and an --manipulate option that allows also overwriting, with warning.

A proper semantic checking of the result would require finalizing #333, I guess

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants