Add ability to delete datasets #94
Conversation
I've rebased this to work with the base branch. Right now only datasets can be deleted; I can add deleting entries for the other tables (just their database entries) if we want that functionality. Could also leave that for another PR. Maybe there is no need to delete entries of other types, just the ability to modify them?
General comment: for some reason (maybe because of the change of base?) the diffs shown for some files cover more than the actual changes.
I'm not sure why the diff has diffed the whole file. Other than that, the rest seems OK, just the new CI tests.
There are a couple of items to look at in the specific comments.
    root_dir=self._root_dir,
)
print(f"Deleting data {data_path}")
os.remove(data_path)
This only removes regular files. If the path is a directory you'll need to do something else; shutil.rmtree looks like it does the right thing.
Yes, well spotted.
I have changed to rmtree for directories, and added deleting a dir into the unit tests.
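For reference, a minimal sketch of that fix, assuming a deletion helper shaped like the snippet above (the function name _delete_data is hypothetical):

import os
import shutil

def _delete_data(data_path):
    # Remove the dataset's data from disk: regular files via os.remove,
    # directories (and their contents) via shutil.rmtree.
    print(f"Deleting data {data_path}")
    if os.path.isdir(data_path):
        shutil.rmtree(data_path)
    else:
        os.remove(data_path)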
    return dataset_organization, num_files, total_size, ds_creation_date
def _find_entry( |
I don't see any advantage to combining the functions find-by-id and find-by-path. What happens if someone supplies all arguments and they conflict? The way you have coded it, find-by-id would win, but no warning is issued. Better to avoid the problem altogether by keeping the old _find_previous function and adding a new function, e.g. _find_by_id, which only accepts an id argument.
You say in the docstring "only one dataset should ever be found", but that's not true if the routine is called using relative_path, owner_type and owner arguments. There can be multiple entries in your result set if the dataset has been overwritten: old entries will still be in the db, marked as overwritten, and there could be more than one. Under normal circumstances, all but one should be labeled as overwritten. In the original _find_previous routine my thinking was that if some glitch (e.g., a bug in an earlier version of the code, or a database failure in a previous register operation before that field could be updated) had caused more than one old entry to be left unmarked as overwritten, it might as well be fixed this time around.
I have split this back into two functions. There is now a find_entry method in the base table class, which can search for an entry based on an ID (and can be used for any table). And find_previous as before, which finds all datasets with the combination of relative_path, owner, and owner_type and sees if the latest entry is overwritable.
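As a rough illustration of the ID-based lookup described here, a sketch assuming the table and engine live on the class as self._table and self._engine (those attribute names are assumptions, not the actual code):

from sqlalchemy import select

def find_entry(self, entry_id, id_column="dataset_id"):
    # Fetch the single row whose ID column matches entry_id; the column
    # name is parameterized since each table has its own primary key.
    stmt = select(self._table).where(self._table.c[id_column] == entry_id)
    with self._engine.connect() as conn:
        return conn.execute(stmt).fetchone()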
Still some work to be done on _find_previous. The delete function looks good.
result = conn.execute(stmt)
conn.commit()

# Pull out the single result
As I explained in my earlier comment, there is not necessarily a single result. There can be multiple entries with the same path which have been previously overwritten (and are marked as overwritten). There should be only a single dataset which has the same path, is overwritable, and has not yet been marked as overwritten, but I believe it's possible, although very unlikely, that there could be more than one, so this case should be handled as well. Since this routine is now again single-purpose, the return can be the same as before: a list of ids which a) have the same path as the one specified by the arguments and b) are overwritable but have not yet been marked as overwritten. A return of None means an id was found for a dataset which is not overwritable, which is an error.
See the code at dataregistry/src/dataregistry/registrar.py, line 125 in bb23167:

# If the datasets are overwritable, log their ID, else return None
registrar.dataset.register will also need to be changed where it calls _find_previous to accept this form of return. It should be more or less the way it is at lines 165 and 238 in the old code on the main branch.
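A hedged sketch of how a caller could consume that contract, based only on the return behavior described above (the helper name and dataset_table are assumptions):

from sqlalchemy import update

def _mark_previous_overwritten(registrar, dataset_table, conn,
                               relative_path, owner, owner_type):
    # None means an existing, non-overwritable dataset occupies this path.
    previous = registrar._find_previous(relative_path, owner, owner_type)
    if previous is None:
        raise ValueError(f"Dataset {relative_path} exists and is not overwritable")
    # Otherwise flag every returned id as overwritten.
    for old_id in previous:
        stmt = (
            update(dataset_table)
            .where(dataset_table.c.dataset_id == old_id)
            .values(is_overwritten=True)
        )
        conn.execute(stmt)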
I think it's back to the desired behavior now. find_previous looks for all datasets with a given path combination. If any of those has is_overwritable=False, None is returned and an error is raised. Otherwise it returns all the found datasets which have is_overwritten=False.
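A minimal sketch of that logic, assuming SQLAlchemy and the column names mentioned in this thread (everything else here is hypothetical):

from sqlalchemy import select

def find_previous(dataset_table, conn, relative_path, owner, owner_type):
    # Fetch every dataset registered under this path/owner combination.
    stmt = select(
        dataset_table.c.dataset_id,
        dataset_table.c.is_overwritable,
        dataset_table.c.is_overwritten,
    ).where(
        dataset_table.c.relative_path == relative_path,
        dataset_table.c.owner == owner,
        dataset_table.c.owner_type == owner_type,
    )
    previous = []
    for row in conn.execute(stmt):
        if not row.is_overwritable:
            return None  # a non-overwritable dataset blocks this path
        if not row.is_overwritten:
            previous.append(row.dataset_id)
    return previous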
I left a couple of lesser comments. Perhaps you could implement the one about the select in _find_previous this round. The other comment about the base class could be addressed some other time. There may be other things that should go in the base class.
if self.which_table == "dataset":
    stmt = select(my_table).where(my_table.c.dataset_id == entry_id)
else:
    raise ValueError("Can only perform `find_entry` on dataset table for now")
This is ok for now, but we should get rid of this restriction.
For example, add arguments table_name, primary_key to BaseTable.__init__() and save those values as self.table_name, self.primary_key. (Or maybe one or both of these could be looked up using sqlalchemy functions.) Then find_entry works for all tables.
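A sketch of that suggestion; the constructor arguments shown beyond table_name and primary_key are assumptions about the surrounding code:

from sqlalchemy import select

class BaseTable:
    def __init__(self, engine, metadata, table_name, primary_key):
        # Store the table name and primary-key column once, so lookups
        # need no table-specific branching.
        self._engine = engine
        self.table_name = table_name
        self.primary_key = primary_key
        self._table = metadata.tables[table_name]

    def find_entry(self, entry_id):
        # Works for any table: the key column is looked up by name.
        stmt = select(self._table).where(
            self._table.c[self.primary_key] == entry_id
        )
        with self._engine.connect() as conn:
            return conn.execute(stmt).fetchone()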
Yes, I agree; I was leaving it until the next PR so I didn't forget to test it properly.
Can now delete datasets using Registrar.dataset.delete(<dataset_id>).

The function:
- deletes the data within the root_dir
- sets the delete_date and delete_uid fields

Have added unit tests to make sure these things are happening.

(Had to modify the previous_dataset function so it can also search on dataset_id, for this function to use.)
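A hedged sketch of what such a unit test might check, mirroring the description above (fixture names and row attributes are assumptions):

import os

def test_delete_dataset(registrar, dataset_id, data_path):
    # Delete, then confirm the database records it and the data is gone.
    registrar.dataset.delete(dataset_id)
    entry = registrar.dataset.find_entry(dataset_id)
    assert entry.delete_date is not None
    assert entry.delete_uid is not None
    assert not os.path.exists(data_path)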