Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

1046 - Add Dataset model #1052

Merged
merged 8 commits into from
Jan 17, 2025
Merged

1046 - Add Dataset model #1052

merged 8 commits into from
Jan 17, 2025

Conversation

nozomione
Copy link
Member

@nozomione nozomione commented Jan 14, 2025

Issue Number

Closes #1046

Purpose/Implementation Notes

Changes include:

  • Implemented the initial Dataset model
  • Applied the migration to update the database schema
  • To avoid reverse query name clashes:
    • Prefixed related_name with dataset_ for the following fields:
      • computed_file
      • download_tokens
      • token
    • Set related_name to original_dataset for regenerated_from
  • Added help_text to fields that are not fully self-explanatory

Types of changes

  • New feature (non-breaking change which adds functionality)

Functional tests

N/A

Checklist

  • Lint and unit tests pass locally with my changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)
  • Any dependent changes have been merged and published in downstream modules

Screenshots

N/A

@nozomione nozomione self-assigned this Jan 14, 2025
Copy link
Contributor

@avrohomgottlieb avrohomgottlieb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You did a great job here, Nozomi! I have a couple items for feedback, but overall it looks really good. Ping me if you'd like to discuss any of my feedback items.

One other general note is that it would be good to add comments above each group of attributes describing the attribute group (see the OriginalFile model for an example).

PR https://github.com/AlexsLemonade/egg/pull/20 might also be helpful to look at (this PR discusses conventions for Django models ).

Cheers!

class Migration(migrations.Migration):

dependencies = [
("scpca_portal", "0054_tokendownload"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like updates from dev weren't pulled in. dev is currently at migration 0056.

Steps would be:

  • rollback your local db to 0054
  • delete this migration
  • pull in dev
  • rerun sportal makemigrations
  • run sportal migrate from there, and you should be good

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davidsmejia I've reapplied the migration as it was just a fix. Thank you!

(SINGLE_CELL_EXPERIMENT, "Single cell experiment"),
)

id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This field must be auto-incrementing as well.

Honestly, the id field is added by default in Django, I think it would best fit with our conventions to let Django take care of it internally.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So for this and for computed files in the future we will be using UUIDs so they can't be randomly guessed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the above, we'll keep it as a randomly generated UUID value.

Comment on lines +15 to +22
class FileFormats:
ANN_DATA = "ANN_DATA"
SINGLE_CELL_EXPERIMENT = "SINGLE_CELL_EXPERIMENT"

CHOICES = (
(ANN_DATA, "AnnData"),
(SINGLE_CELL_EXPERIMENT, "Single cell experiment"),
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the same code that we have with OutputFileModalities in ComputedFile. I feel like we should pull it out of ComputedFile, put it into common, and have both models reference the same code.

Copy link
Member Author

@nozomione nozomione Jan 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great! We will be addressing this in the next sprint (added to the next sprint doc) 👍

data = models.JSONField(default=dict)
format = models.TextField(choices=FileFormats.CHOICES, null=True)
email = models.EmailField(max_length=254, null=True)
regenerated_from = models.ForeignKey(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels more like a OneToOneField than a ForeignKey. As I understand it, in the case of multiple regenerations, we'll be regenerating not from the original but from the previously generated dataset (think linked list instead of a parent node with multiple child nodes).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll keep this as a ForeignKey. A processed and expired dataset can be the source for multiple new datasets, and regenerated_from will simply point to the dataset it was generated from. I also applied your feedback on on_delete and updated related_name to regenerated_datasets.

regenerated_from = models.ForeignKey(
"self",
null=True,
on_delete=models.CASCADE,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should only utilize cascade deletion of related models if the models have a parent-child relationship (think project-sample). For models that exist either as standalone or as siblings, we should be setting their on_delete fields to models.SET_NULL.

Suggested change
on_delete=models.CASCADE,
on_delete=models.SET_NULL,

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great insight, thank you!

Comment on lines 26 to 27
format = models.TextField(choices=FileFormats.CHOICES, null=True)
email = models.EmailField(max_length=254, null=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I understand it, neither of these fields can be nullable. Also the default email max length is already 254, so no need to set it explicitly.

Suggested change
format = models.TextField(choices=FileFormats.CHOICES, null=True)
email = models.EmailField(max_length=254, null=True)
format = models.TextField(choices=FileFormats.CHOICES)
email = models.EmailField()

Copy link
Member Author

@nozomione nozomione Jan 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch on max_length, I’ve removed it! As for email, it can be null initially before the dataset starts processing, so we’ll keep it as is. This field will be used for notifications and is user-editable. We’ve grouped the fields as editable and non-editable using comments for clarity.

Copy link
Member Author

@nozomione nozomione Jan 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About format, I ran into this error during the migration,You are trying to add a non-nullable field 'format' to dataset without a default; we can't do that. So to work around it, I made the field nullable to avoid the database constraint issue.

Should we set the default to a placeholder string (e.g., default='None') for the field? Or, would it make sense to define a default placeholder in common.py for consistency, so we can standardize the default placeholder value for cases like this across the codebase (perhaps can we use NA)?

Let me know what you think!

Copy link
Contributor

@davidsmejia davidsmejia Jan 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is because you already have created the table. You need to roll back your migration and re-run makemigrations and delete the old migration file so that it is a single file. You can have non-nullable columns on new tables. If you already have a table the ORM doesnt know if there will be rows which would become invalid by the migration.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reapplied the migration again and now the format field is no longer nullable. Thank you all!

token = models.OneToOneField(
APIToken,
null=True,
on_delete=models.CASCADE,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
on_delete=models.CASCADE,
on_delete=models.SET_NULL,

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be foreignKey as well because multiple datasets can be started with any given token.

related_name="dataset_token",
help_text="Token used to process the dataset.",
)
download_tokens = models.ManyToManyField(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we think this is a ManyToManyField and not a ForeignKey (on APIToken)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the correct relationship as a token can be used to download multiple datasets and every download will be tracked via the token.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We kept it as a ManyToManyField, but updated related_name to downloaded_datasets 👍

)
start = models.BooleanField(
null=True,
help_text="Indicates if the dataset process has started.",
Copy link
Contributor

@avrohomgottlieb avrohomgottlieb Jan 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm feeling that if we add help_text for one BooleanField, then we should add it for the others. It's a conventions question.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That’s a great point and makes sense! David and I chatted, and we’re thinking of going over the codebase later to add help_text where needed. But for now, we’ll hold off since it’s not part of the current pattern - TBD for later.

computed_file = models.OneToOneField(
ComputedFile,
null=True,
on_delete=models.CASCADE,
Copy link
Contributor

@avrohomgottlieb avrohomgottlieb Jan 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one's more subtle. There is a parent-child relationship here, but in order to represent reality we'd have to also actually delete the file from s3. Until we have that set up, we should resort to SET_NULL.

Suggested change
on_delete=models.CASCADE,
on_delete=models.SET_NULL,

token = models.OneToOneField(
APIToken,
null=True,
on_delete=models.CASCADE,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be foreignKey as well because multiple datasets can be started with any given token.

related_name="dataset_token",
help_text="Token used to process the dataset.",
)
download_tokens = models.ManyToManyField(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the correct relationship as a token can be used to download multiple datasets and every download will be tracked via the token.

processed_at = models.DateTimeField(null=True)
is_processed = models.BooleanField(default=False)
expires_at = models.DateTimeField(null=True)
is_expired = models.BooleanField(default=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a comment there saying that this will be evaluated / set in a cron?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great insights, thank you!

@nozomione
Copy link
Member Author

I've applied both of your feedback, and this PR is ready for another look. Thank you again for the valuable insights!

Copy link
Contributor

@davidsmejia davidsmejia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@avrohomgottlieb avrohomgottlieb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great Nozomi! 🚀

@nozomione nozomione merged commit 010edaf into dev Jan 17, 2025
5 checks passed
@nozomione nozomione deleted the nozomione/1046-add-dataset-model branch January 17, 2025 17:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add Dataset model
3 participants