
Add ability to delete datasets #94

Merged

Conversation

@stuartmcalpine (Collaborator) commented Jan 10, 2024

Datasets can now be deleted using Registrar.dataset.delete(<dataset_id>).

The function:

  • Checks that the dataset exists and has a "valid" (0 or 1) status
  • Changes the status to 3 and sets the delete_date and delete_uid fields
  • Deletes the data in the root_dir

Unit tests have been added to verify each of these steps.

(I had to modify the previous_dataset function so it can also search on dataset_id, which this function needs.)
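The delete flow in those bullets can be sketched as a minimal illustration. Everything here (the dict-backed store, the names `delete_dataset` and `VALID_STATUSES`) is an assumption for the sketch, not the actual dataregistry API; the real function operates on the registry database:

```python
import shutil
from datetime import datetime, timezone

# Assumed status codes, per the description: 0/1 = valid, 3 = deleted.
VALID_STATUSES = {0, 1}
DELETED_STATUS = 3

def delete_dataset(datasets, dataset_id, uid, remove_data=shutil.rmtree):
    """Sketch of the delete steps: validate, mark deleted, remove data.

    `datasets` maps dataset_id -> metadata dict; `remove_data` is injected
    so the sketch can be exercised without touching the filesystem.
    """
    # 1) Check the dataset exists and has a "valid" (0 or 1) status
    entry = datasets.get(dataset_id)
    if entry is None:
        raise ValueError(f"Dataset {dataset_id} does not exist")
    if entry["status"] not in VALID_STATUSES:
        raise ValueError(f"Dataset {dataset_id} does not have a valid status")

    # 2) Change the status to 3 and set the delete_date and delete_uid fields
    entry["status"] = DELETED_STATUS
    entry["delete_date"] = datetime.now(timezone.utc)
    entry["delete_uid"] = uid

    # 3) Delete the data in the root_dir
    remove_data(entry["path"])
    return entry
```

Deleting twice fails the status check, which matches the "valid status only" rule above.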

@stuartmcalpine stuartmcalpine changed the base branch from main to u/stuart/dataset_status January 10, 2024 14:01
Base automatically changed from u/stuart/dataset_status to u/stuart/reformat_registrar February 9, 2024 13:51
@stuartmcalpine (Collaborator, Author)

I've rebased this to work with the Registrar refactoring.

Right now only datasets can be deleted; they have their own local delete() function in the DatasetTable class. I think this one still needs to be separate, as the dataset delete function is the only one that also removes data on disk.

I can add deletion of entries for the other tables (just their database entries) if we want that functionality, or leave it for another PR. Maybe there is no need to delete entries of other types, only the ability to modify them?

@JoanneBogart (Collaborator) commented Feb 15, 2024

General comment: for some reason (maybe because of the change of base?) the diffs shown for dataset.py in the PR are much more extensive than the actual changes to the code. To review, I had to do my own manual diff by looking at the two versions of the code side by side. I'm not confident I've seen everything I should have seen.

@stuartmcalpine (Collaborator, Author) commented Feb 15, 2024

I'm not sure why the diff covers the whole DatasetTable class (it has the right target branch).

In dataset.py, the delete() function is new. What was find_previous is now find_entry(), which locates a record. There will also be a generic version of this in the base class, which is why I gave it a more generic name, but as with other dataset-specific code it has some extra functionality (it can search on the file path rather than just the ID). I think the rest of the class is unchanged.

The rest of the diff looks fine: just the new CI tests.

@JoanneBogart (Collaborator) left a comment:


There are a couple of items to look at in the specific comments.

root_dir=self._root_dir,
)
print(f"Deleting data {data_path}")
os.remove(data_path)
@JoanneBogart (Collaborator):

This only removes regular files. If the path is a directory you'll need to do something else; shutil.rmtree looks like it does the right thing.

@stuartmcalpine (Collaborator, Author):

Yes, well spotted.

I have changed it to rmtree for directories, and added deleting a directory to the unit tests.
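A minimal sketch of the fix being discussed, handling both regular files and directories (path handling only; the real method also updates the database):

```python
import os
import shutil

def remove_data(data_path):
    """Delete a dataset's data, whether it is a single file or a directory."""
    print(f"Deleting data {data_path}")
    if os.path.isdir(data_path):
        shutil.rmtree(data_path)  # os.remove() fails on directories
    else:
        os.remove(data_path)      # regular file (or symlink)
```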


return dataset_organization, num_files, total_size, ds_creation_date

def _find_entry(
@JoanneBogart (Collaborator):

I don't see any advantage to combining the find-by-id and find-by-path functions. What happens if someone supplies all arguments and they conflict? The way you have coded it, find-by-id would win, but no warning is issued. Better to avoid the problem altogether by keeping the old _find_previous function and adding a new function, e.g. _find_by_id, which only accepts an id argument.

You say in the docstring "only one dataset should ever be found", but that's not true if the routine is called using the relative_path, owner_type, and owner arguments. There can be multiple entries in the result set if the dataset has been overwritten: old entries remain in the db, marked as overwritten, and there could be more than one. Under normal circumstances, all but one should be labeled as overwritten. In the original _find_previous routine, my thinking was that if some glitch (e.g., a bug in an earlier version of the code, or a database failure in a previous register operation before the field could be updated) left more than one old entry not marked as overwritten, it might as well be fixed this time around.

@stuartmcalpine (Collaborator, Author):

I have split it back into two functions. There is now a find_entry method in the base table class, which searches for an entry by its ID (usable for any table), and find_previous as before, which finds all datasets with the combination of relative_path, owner, and owner_type and checks whether the latest entry is overwritable.
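The split might look roughly like this (a dict-backed stand-in for illustration; the real find_entry issues an SQLAlchemy query against the database):

```python
class BaseTable:
    """Stand-in base table; `rows` maps primary-key id -> row dict."""

    def __init__(self, rows):
        self._rows = rows

    def find_entry(self, entry_id):
        """Generic find-by-id, usable for any table; None if not found."""
        return self._rows.get(entry_id)
```

find_previous then stays dataset-specific, searching on the relative_path/owner/owner_type combination as before, so the two lookups no longer share one argument list.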

@JoanneBogart (Collaborator) left a comment:

Still some work to be done on _find_previous.
The delete function looks good.

result = conn.execute(stmt)
conn.commit()

# Pull out the single result
@JoanneBogart (Collaborator):

As I explained in my earlier comment, there is not necessarily a single result. There can be multiple entries with the same path which have been previously overwritten (and are marked as overwritten). There should be only a single dataset which has the same path, is overwritable, and has not yet been marked as overwritten, but I believe it's possible, although very unlikely, that there could be more than one; this case should be handled as well.

Since this routine is now again single-purpose, the return can be the same as before: a list of ids which a) have the same path as the one specified by the arguments and b) are overwritable but have not yet been marked as overwritten. A return of None means an id was found for a dataset which is not overwritable, which is an error.

See the code at the comment "# If the datasets are overwritable, log their ID, else return None" on the main branch.

registrar.dataset.register will also need to be changed where it calls _find_previous, to accept this form of return. It should be more or less the way it is at lines 165 and 238 in the old code on the main branch.

@stuartmcalpine (Collaborator, Author):

I think it's back to the desired behavior now.

find_previous looks for all datasets with a given path combination. If any of them has is_overwritable=False, None is returned and an error is raised; otherwise it returns all the found datasets that have is_overwritten=False.
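That behavior can be captured in a small sketch (plain dicts stand in for database rows; field names follow the discussion above):

```python
def find_previous(rows, relative_path, owner, owner_type):
    """Return ids of matching datasets not yet marked as overwritten,
    or None if any matching dataset is not overwritable (an error)."""
    matches = [
        r for r in rows
        if r["relative_path"] == relative_path
        and r["owner"] == owner
        and r["owner_type"] == owner_type
    ]
    if any(not r["is_overwritable"] for r in matches):
        return None  # caller raises an error in this case
    return [r["dataset_id"] for r in matches if not r["is_overwritten"]]
```

Note the list return naturally handles the rare case of more than one entry not yet marked as overwritten, which the single-result version could not.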

@JoanneBogart (Collaborator) left a comment:

I left a couple of lesser comments. Perhaps you could implement the one about the select in _find_previous this round. The other comment, about the base class, could be addressed some other time; there may be other things that should go in the base class.

if self.which_table == "dataset":
stmt = select(my_table).where(my_table.c.dataset_id == entry_id)
else:
raise ValueError("Can only perform `find_entry` on dataset table for now")
@JoanneBogart (Collaborator):

This is OK for now, but we should get rid of this restriction.
For example, add arguments table_name and primary_key to BaseTable.__init__() and save those values as self.table_name and self.primary_key (or maybe one or both could be looked up using SQLAlchemy functions). Then find_entry works for all tables.
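The suggestion could be sketched like this, with sqlite3 standing in for the project's SQLAlchemy engine (table and column names here are illustrative, not the real schema):

```python
import sqlite3

class BaseTable:
    """Base class that knows its own table name and primary key,
    so find_entry works for every table, not just datasets."""

    def __init__(self, conn, table_name, primary_key):
        self._conn = conn
        self.table_name = table_name
        self.primary_key = primary_key

    def find_entry(self, entry_id):
        # Table/column names come from trusted class config, not user input,
        # so interpolating them is safe; the value itself is parameterized.
        stmt = f"SELECT * FROM {self.table_name} WHERE {self.primary_key} = ?"
        return self._conn.execute(stmt, (entry_id,)).fetchone()
```

Each concrete table class would then pass its own table name and primary-key column to the base constructor, removing the dataset-only branch.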

@stuartmcalpine (Collaborator, Author):

Yes, I agree. I was leaving it until the next PR so I don't forget to test it properly.

src/dataregistry/registrar/dataset.py (outdated; resolved)
@stuartmcalpine stuartmcalpine merged commit 0fea219 into u/stuart/reformat_registrar Mar 7, 2024
8 checks passed
@stuartmcalpine stuartmcalpine deleted the u/stuart/delete_datasets branch March 7, 2024 11:11
2 participants