
Add ability to delete datasets #94

Merged

Conversation

@stuartmcalpine (Collaborator) commented Jan 10, 2024

Datasets can now be deleted using Registrar.dataset.delete(<dataset_id>).

The function:

  • Checks that the dataset exists and has a "valid" (0 or 1) status
  • Changes the status to 3 and sets the delete_date and delete_uid fields
  • Deletes the data in the root_dir

Unit tests have been added to verify each of these steps.

(I had to modify the previous_dataset function so it can also search on dataset_id, which this function needs.)
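The delete flow in those bullets can be sketched as a minimal illustration. Everything here (the dict-backed store, the names `delete_dataset` and `VALID_STATUSES`) is an assumption for the sketch, not the actual dataregistry API; the real function operates on the registry database:

```python
import shutil
from datetime import datetime, timezone

# Assumed status codes, per the description: 0/1 = valid, 3 = deleted.
VALID_STATUSES = {0, 1}
DELETED_STATUS = 3

def delete_dataset(datasets, dataset_id, uid, remove_data=shutil.rmtree):
    """Sketch of the delete steps: validate, mark deleted, remove data.

    `datasets` maps dataset_id -> metadata dict; `remove_data` is injected
    so the sketch can be exercised without touching the filesystem.
    """
    # 1) Check the dataset exists and has a "valid" (0 or 1) status
    entry = datasets.get(dataset_id)
    if entry is None:
        raise ValueError(f"Dataset {dataset_id} does not exist")
    if entry["status"] not in VALID_STATUSES:
        raise ValueError(f"Dataset {dataset_id} does not have a valid status")

    # 2) Change the status to 3 and set the delete_date and delete_uid fields
    entry["status"] = DELETED_STATUS
    entry["delete_date"] = datetime.now(timezone.utc)
    entry["delete_uid"] = uid

    # 3) Delete the data in the root_dir
    remove_data(entry["path"])
    return entry
```

Deleting twice fails the status check, which matches the "valid status only" rule above.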

@stuartmcalpine stuartmcalpine changed the base branch from main to u/stuart/dataset_status January 10, 2024 14:01
Base automatically changed from u/stuart/dataset_status to u/stuart/reformat_registrar February 9, 2024 13:51
@stuartmcalpine (Collaborator, Author)

I've rebased this to work with the Registrar refactoring.

Right now only datasets can be deleted; they have their own local delete() function in the DatasetTable class. I think this one still needs to be separate, as the dataset delete function is the only one that also removes data on disk.

I can add deletion of entries for the other tables (just their database entries) if we want that functionality, or leave it for another PR. Maybe there is no need to delete entries of other types, only the ability to modify them?

@JoanneBogart (Collaborator) commented Feb 15, 2024

General comment: for some reason (maybe because of the change of base?) the diffs shown for dataset.py in the PR are much more extensive than the actual changes to the code. To review, I had to do my own manual diff by looking at the two versions of the code side by side. I'm not confident I've seen everything I should have seen.

@stuartmcalpine (Collaborator, Author) commented Feb 15, 2024

I'm not sure why the diff covers the whole DatasetTable class (it has the right target branch).

In dataset.py, the delete() function is new. What was find_previous is now find_entry(), which locates a record. There will also be a generic version of this in the base class, which is why I gave it a more generic name, but as with other dataset-specific code it has some extra functionality (it can search on the file path rather than just the ID). I think the rest of the class is unchanged.

The rest of the diff looks fine: just the new CI tests.

@JoanneBogart (Collaborator) left a comment:


There are a couple of items to look at in the specific comments.

root_dir=self._root_dir,
)
print(f"Deleting data {data_path}")
os.remove(data_path)
@JoanneBogart (Collaborator):

This only removes regular files. If the path is a directory you'll need to do something else; shutil.rmtree looks like it does the right thing.

@stuartmcalpine (Collaborator, Author):

Yes, well spotted.

I have changed it to rmtree for directories, and added deleting a directory to the unit tests.
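A minimal sketch of the fix being discussed, handling both regular files and directories (path handling only; the real method also updates the database):

```python
import os
import shutil

def remove_data(data_path):
    """Delete a dataset's data, whether it is a single file or a directory."""
    print(f"Deleting data {data_path}")
    if os.path.isdir(data_path):
        shutil.rmtree(data_path)  # os.remove() fails on directories
    else:
        os.remove(data_path)      # regular file (or symlink)
```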


return dataset_organization, num_files, total_size, ds_creation_date

def _find_entry(
@JoanneBogart (Collaborator):

I don't see any advantage to combining the find-by-id and find-by-path functions. What happens if someone supplies all arguments and they conflict? The way you have coded it, find-by-id would win, but no warning is issued. Better to avoid the problem altogether by keeping the old _find_previous function and adding a new function, e.g. _find_by_id, which only accepts an id argument.

You say in the docstring "only one dataset should ever be found", but that's not true if the routine is called using the relative_path, owner_type, and owner arguments. There can be multiple entries in the result set if the dataset has been overwritten: old entries remain in the db, marked as overwritten, and there could be more than one. Under normal circumstances, all but one should be labeled as overwritten. In the original _find_previous routine, my thinking was that if some glitch (e.g., a bug in an earlier version of the code, or a database failure in a previous register operation before the field could be updated) left more than one old entry not marked as overwritten, it might as well be fixed this time around.

@stuartmcalpine (Collaborator, Author):

I have split it back into two functions. There is now a find_entry method in the base table class, which searches for an entry by its ID (usable for any table), and find_previous as before, which finds all datasets with the combination of relative_path, owner, and owner_type and checks whether the latest entry is overwritable.
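The split might look roughly like this (a dict-backed stand-in for illustration; the real find_entry issues an SQLAlchemy query against the database):

```python
class BaseTable:
    """Stand-in base table; `rows` maps primary-key id -> row dict."""

    def __init__(self, rows):
        self._rows = rows

    def find_entry(self, entry_id):
        """Generic find-by-id, usable for any table; None if not found."""
        return self._rows.get(entry_id)
```

find_previous then stays dataset-specific, searching on the relative_path/owner/owner_type combination as before, so the two lookups no longer share one argument list.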

@JoanneBogart (Collaborator) left a comment:

Still some work to be done on _find_previous.
The delete function looks good.

result = conn.execute(stmt)
conn.commit()

# Pull out the single result
@JoanneBogart (Collaborator):

As I explained in my earlier comment, there is not necessarily a single result. There can be multiple entries with the same path which have been previously overwritten (and are marked as overwritten). There should be only a single dataset which has the same path, is overwritable, and has not yet been marked as overwritten, but I believe it's possible, although very unlikely, that there could be more than one; this case should be handled as well.

Since this routine is now again single-purpose, the return can be the same as before: a list of ids which a) have the same path as the one specified by the arguments and b) are overwritable but have not yet been marked as overwritten. A return of None means an id was found for a dataset which is not overwritable, which is an error.

See the code at the comment "# If the datasets are overwritable, log their ID, else return None" on the main branch.

registrar.dataset.register will also need to be changed where it calls _find_previous, to accept this form of return. It should be more or less the way it is at lines 165 and 238 in the old code on the main branch.

@stuartmcalpine (Collaborator, Author):

I think it's back to the desired behavior now.

find_previous looks for all datasets with a given path combination. If any of them has is_overwritable=False, None is returned and an error is raised; otherwise it returns all the found datasets that have is_overwritten=False.
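That behavior can be captured in a small sketch (plain dicts stand in for database rows; field names follow the discussion above):

```python
def find_previous(rows, relative_path, owner, owner_type):
    """Return ids of matching datasets not yet marked as overwritten,
    or None if any matching dataset is not overwritable (an error)."""
    matches = [
        r for r in rows
        if r["relative_path"] == relative_path
        and r["owner"] == owner
        and r["owner_type"] == owner_type
    ]
    if any(not r["is_overwritable"] for r in matches):
        return None  # caller raises an error in this case
    return [r["dataset_id"] for r in matches if not r["is_overwritten"]]
```

Note the list return naturally handles the rare case of more than one entry not yet marked as overwritten, which the single-result version could not.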

@JoanneBogart (Collaborator) left a comment:

I left a couple of lesser comments. Perhaps you could implement the one about the select in _find_previous this round. The other comment, about the base class, could be addressed some other time; there may be other things that should go in the base class.

if self.which_table == "dataset":
stmt = select(my_table).where(my_table.c.dataset_id == entry_id)
else:
raise ValueError("Can only perform `find_entry` on dataset table for now")
@JoanneBogart (Collaborator):

This is OK for now, but we should get rid of this restriction.
For example, add arguments table_name and primary_key to BaseTable.__init__() and save those values as self.table_name and self.primary_key (or maybe one or both could be looked up using SQLAlchemy functions). Then find_entry works for all tables.
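The suggestion could be sketched like this, with sqlite3 standing in for the project's SQLAlchemy engine (table and column names here are illustrative, not the real schema):

```python
import sqlite3

class BaseTable:
    """Base class that knows its own table name and primary key,
    so find_entry works for every table, not just datasets."""

    def __init__(self, conn, table_name, primary_key):
        self._conn = conn
        self.table_name = table_name
        self.primary_key = primary_key

    def find_entry(self, entry_id):
        # Table/column names come from trusted class config, not user input,
        # so interpolating them is safe; the value itself is parameterized.
        stmt = f"SELECT * FROM {self.table_name} WHERE {self.primary_key} = ?"
        return self._conn.execute(stmt, (entry_id,)).fetchone()
```

Each concrete table class would then pass its own table name and primary-key column to the base constructor, removing the dataset-only branch.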

@stuartmcalpine (Collaborator, Author):

Yes, I agree. I was leaving it until the next PR so I don't forget to test it properly.

src/dataregistry/registrar/dataset.py (outdated; resolved)
@stuartmcalpine stuartmcalpine merged commit 0fea219 into u/stuart/reformat_registrar Mar 7, 2024
8 checks passed
@stuartmcalpine stuartmcalpine deleted the u/stuart/delete_datasets branch March 7, 2024 11:11
2 participants