-
-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
1046 - Add Dataset model #1052
1046 - Add Dataset model #1052
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You did a great job here, Nozomi! I have a couple items for feedback, but overall it looks really good. Ping me if you'd like to discuss any of my feedback items.
One other general note is that it would be good to add comments above each group of attributes describing the attribute group (see the OriginalFile model for an example).
PR https://github.com/AlexsLemonade/egg/pull/20 might also be helpful to look at (this PR discusses conventions for Django models ).
Cheers!
class Migration(migrations.Migration): | ||
|
||
dependencies = [ | ||
("scpca_portal", "0054_tokendownload"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like updates from dev
weren't pulled in. dev
is currently at migration 0056
.
Steps would be:
- rollback your local db to 0054
- delete this migration
- pull in dev
- rerun
sportal makemigrations
- run
sportal migrate
from there, and you should be good
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@davidsmejia I've reapplied the migration as it was just a fix. Thank you!
(SINGLE_CELL_EXPERIMENT, "Single cell experiment"), | ||
) | ||
|
||
id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This field must be auto-incrementing as well.
Honestly, the id
field is added by default in Django, I think it would best fit with our conventions to let Django take care of it internally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So for this and for computed files in the future we will be using UUIDs so they can't be randomly guessed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the above, we'll keep it as a randomly generated UUID value.
class FileFormats: | ||
ANN_DATA = "ANN_DATA" | ||
SINGLE_CELL_EXPERIMENT = "SINGLE_CELL_EXPERIMENT" | ||
|
||
CHOICES = ( | ||
(ANN_DATA, "AnnData"), | ||
(SINGLE_CELL_EXPERIMENT, "Single cell experiment"), | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the same code that we have with OutputFileModalities
in ComputedFile
. I feel like we should pull it out of ComputedFile
, put it into common
, and have both models reference the same code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great! We will be addressing this in the next sprint (added to the next sprint doc) 👍
data = models.JSONField(default=dict) | ||
format = models.TextField(choices=FileFormats.CHOICES, null=True) | ||
email = models.EmailField(max_length=254, null=True) | ||
regenerated_from = models.ForeignKey( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels more like a OneToOneField
than a ForeignKey
. As I understand it, in the case of multiple regenerations, we'll be regenerating not from the original but from the previously generated dataset (think linked list instead of a parent node with multiple child nodes).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll keep this as a ForeignKey
. A processed and expired dataset can be the source for multiple new datasets, and regenerated_from
will simply point to the dataset it was generated from. I also applied your feedback on on_delete
and updated related_name
to regenerated_datasets.
api/scpca_portal/models/dataset.py
Outdated
regenerated_from = models.ForeignKey( | ||
"self", | ||
null=True, | ||
on_delete=models.CASCADE, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should only utilize cascade deletion of related models if the models have a parent-child relationship (think project-sample). For models that exist either as standalone or as siblings, we should be setting their on_delete
fields to models.SET_NULL
.
on_delete=models.CASCADE, | |
on_delete=models.SET_NULL, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great insight, thank you!
api/scpca_portal/models/dataset.py
Outdated
format = models.TextField(choices=FileFormats.CHOICES, null=True) | ||
email = models.EmailField(max_length=254, null=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I understand it, neither of these fields can be nullable. Also the default email max length is already 254, so no need to set it explicitly.
format = models.TextField(choices=FileFormats.CHOICES, null=True) | |
email = models.EmailField(max_length=254, null=True) | |
format = models.TextField(choices=FileFormats.CHOICES) | |
email = models.EmailField() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great catch on max_length
, I’ve removed it! As for email
, it can be null
initially before the dataset starts processing, so we’ll keep it as is. This field will be used for notifications and is user-editable. We’ve grouped the fields as editable and non-editable using comments for clarity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
About format
, I ran into this error during the migration,You are trying to add a non-nullable field 'format' to dataset without a default; we can't do that
. So to work around it, I made the field nullable to avoid the database constraint issue.
Should we set the default to a placeholder string (e.g., default='None'
) for the field? Or, would it make sense to define a default placeholder in common.py
for consistency, so we can standardize the default placeholder value for cases like this across the codebase (perhaps can we use NA
)?
Let me know what you think!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is because you already have created the table. You need to roll back your migration and re-run makemigrations and delete the old migration file so that it is a single file. You can have non-nullable columns on new tables. If you already have a table the ORM doesnt know if there will be rows which would become invalid by the migration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've reapplied the migration again and now the format
field is no longer nullable. Thank you all!
api/scpca_portal/models/dataset.py
Outdated
token = models.OneToOneField( | ||
APIToken, | ||
null=True, | ||
on_delete=models.CASCADE, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
on_delete=models.CASCADE, | |
on_delete=models.SET_NULL, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be foreignKey as well because multiple datasets can be started with any given token.
api/scpca_portal/models/dataset.py
Outdated
related_name="dataset_token", | ||
help_text="Token used to process the dataset.", | ||
) | ||
download_tokens = models.ManyToManyField( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we think this is a ManyToManyField and not a ForeignKey (on APIToken
)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is the correct relationship as a token can be used to download multiple datasets and every download will be tracked via the token.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We kept it as a ManyToManyField
, but updated related_name
to downloaded_datasets
👍
api/scpca_portal/models/dataset.py
Outdated
) | ||
start = models.BooleanField( | ||
null=True, | ||
help_text="Indicates if the dataset process has started.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm feeling that if we add help_text
for one BooleanField, then we should add it for the others. It's a conventions question.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That’s a great point and makes sense! David and I chatted, and we’re thinking of going over the codebase later to add help_text
where needed. But for now, we’ll hold off since it’s not part of the current pattern - TBD for later.
api/scpca_portal/models/dataset.py
Outdated
computed_file = models.OneToOneField( | ||
ComputedFile, | ||
null=True, | ||
on_delete=models.CASCADE, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one's more subtle. There is a parent-child relationship here, but in order to represent reality we'd have to also actually delete the file from s3. Until we have that set up, we should resort to SET_NULL.
on_delete=models.CASCADE, | |
on_delete=models.SET_NULL, |
api/scpca_portal/models/dataset.py
Outdated
token = models.OneToOneField( | ||
APIToken, | ||
null=True, | ||
on_delete=models.CASCADE, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be foreignKey as well because multiple datasets can be started with any given token.
api/scpca_portal/models/dataset.py
Outdated
related_name="dataset_token", | ||
help_text="Token used to process the dataset.", | ||
) | ||
download_tokens = models.ManyToManyField( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is the correct relationship as a token can be used to download multiple datasets and every download will be tracked via the token.
api/scpca_portal/models/dataset.py
Outdated
processed_at = models.DateTimeField(null=True) | ||
is_processed = models.BooleanField(default=False) | ||
expires_at = models.DateTimeField(null=True) | ||
is_expired = models.BooleanField(default=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a comment there saying that this will be evaluated / set in a cron?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great insights, thank you!
I've applied both of your feedback, and this PR is ready for another look. Thank you again for the valuable insights! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great Nozomi! 🚀
Issue Number
Closes #1046
Purpose/Implementation Notes
Changes include:
Dataset
modelrelated_name
withdataset_
for the following fields:computed_file
download_tokens
token
related_name
tooriginal_dataset
forregenerated_from
help_text
to fields that are not fully self-explanatoryTypes of changes
Functional tests
N/A
Checklist
Screenshots
N/A