1046 - Add Dataset model #1052

nozomione · 2025-01-14T20:15:37Z

Issue Number

Closes #1046

Purpose/Implementation Notes

Changes include:

- Implemented the initial Dataset model
- Applied the migration to update the database schema
- To avoid reverse query name clashes:
  - Prefixed related_name with dataset_ for the following fields:
    - computed_file
    - download_tokens
    - token
  - Set related_name to original_dataset for regenerated_from
- Added help_text to fields that are not fully self-explanatory

Types of changes

New feature (non-breaking change which adds functionality)

Functional tests

N/A

Checklist

Lint and unit tests pass locally with my changes
I have added tests that prove my fix is effective or that my feature works
I have added necessary documentation (if appropriate)
Any dependent changes have been merged and published in downstream modules

Screenshots

N/A

avrohomgottlieb

You did a great job here, Nozomi! I have a couple items for feedback, but overall it looks really good. Ping me if you'd like to discuss any of my feedback items.

One other general note is that it would be good to add comments above each group of attributes describing the attribute group (see the OriginalFile model for an example).

PR https://github.com/AlexsLemonade/egg/pull/20 might also be helpful to look at (it discusses conventions for Django models).

Cheers!

avrohomgottlieb · 2025-01-14T21:24:44Z

api/scpca_portal/migrations/0055_dataset.py

+class Migration(migrations.Migration):
+
+    dependencies = [
+        ("scpca_portal", "0054_tokendownload"),


Looks like updates from dev weren't pulled in. dev is currently at migration 0056.

Steps would be:

1. Roll back your local db to 0054.
2. Delete this migration.
3. Pull in dev.
4. Rerun sportal makemigrations.
5. Run sportal migrate from there, and you should be good.

@davidsmejia I've reapplied the migration as it was just a fix. Thank you!

avrohomgottlieb · 2025-01-14T21:27:36Z

api/scpca_portal/models/dataset.py

+            (SINGLE_CELL_EXPERIMENT, "Single cell experiment"),
+        )
+
+    id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)


This field must be auto-incrementing as well.

Honestly, the id field is added by default in Django, so I think it would best fit our conventions to let Django take care of it internally.

So for this and for computed files in the future we will be using UUIDs so they can't be randomly guessed.

Based on the above, we'll keep it as a randomly generated UUID value.
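To illustrate the guessability point, here is a minimal sketch in plain Python (it is not the model code from this PR). It contrasts sequential integer ids with values from uuid.uuid4, the callable used as the default for the id field above. The helper names next_sequential_id and new_dataset_id are hypothetical.

```python
import itertools
import uuid

# Hypothetical stand-in for an auto-incrementing primary key.
_counter = itertools.count(start=1)


def next_sequential_id() -> int:
    """Return the next integer id; the one after it is trivially predictable."""
    return next(_counter)


def new_dataset_id() -> uuid.UUID:
    """Return a random version 4 UUID, as uuid.uuid4 does for the field default."""
    return uuid.uuid4()


sequential_ids = [next_sequential_id() for _ in range(3)]
random_ids = [new_dataset_id() for _ in range(3)]

print(sequential_ids)  # consecutive integers
print(random_ids)      # three unrelated UUIDs
```

With sequential ids, knowing one dataset id reveals its neighbors; with random UUIDs it does not.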

avrohomgottlieb · 2025-01-14T21:34:33Z

api/scpca_portal/models/dataset.py

+    class FileFormats:
+        ANN_DATA = "ANN_DATA"
+        SINGLE_CELL_EXPERIMENT = "SINGLE_CELL_EXPERIMENT"
+
+        CHOICES = (
+            (ANN_DATA, "AnnData"),
+            (SINGLE_CELL_EXPERIMENT, "Single cell experiment"),
+        )


This is the same code that we have with OutputFileModalities in ComputedFile. I feel like we should pull it out of ComputedFile, put it into common, and have both models reference the same code.

This is great! We will be addressing this in the next sprint (added to the next sprint doc) 👍
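One possible shape for that follow-up, sketched in plain Python: the constants live in a single shared class that both models could import. The module location and the label_for helper are assumptions for illustration; the values are copied from the diff above.

```python
# Hypothetical contents of a shared module that both models could import.
class FileFormats:
    ANN_DATA = "ANN_DATA"
    SINGLE_CELL_EXPERIMENT = "SINGLE_CELL_EXPERIMENT"

    CHOICES = (
        (ANN_DATA, "AnnData"),
        (SINGLE_CELL_EXPERIMENT, "Single cell experiment"),
    )


def label_for(value: str) -> str:
    """Look up the human-readable label for a stored format value."""
    return dict(FileFormats.CHOICES)[value]


print(label_for(FileFormats.ANN_DATA))
```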

avrohomgottlieb · 2025-01-14T21:45:44Z

api/scpca_portal/models/dataset.py

+    data = models.JSONField(default=dict)
+    format = models.TextField(choices=FileFormats.CHOICES, null=True)
+    email = models.EmailField(max_length=254, null=True)
+    regenerated_from = models.ForeignKey(


This feels more like a OneToOneField than a ForeignKey. As I understand it, in the case of multiple regenerations, we'll be regenerating not from the original but from the previously generated dataset (think linked list instead of a parent node with multiple child nodes).

We'll keep this as a ForeignKey. A processed and expired dataset can be the source for multiple new datasets, and regenerated_from will simply point to the dataset it was generated from. I also applied your feedback on on_delete and updated related_name to regenerated_datasets.
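The difference between the two shapes can be sketched with a plain-Python stand-in (not the Django model): with a ForeignKey-style reference, one source dataset can have several regenerated datasets, which a one-to-one link would not allow. The class below and its list attribute are illustrative only; the list mimics the reverse accessor named by related_name.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Dataset:
    """Plain-Python stand-in; only the self-reference is modeled."""

    name: str
    regenerated_from: Optional[Dataset] = None
    # Mimics the reverse accessor named by related_name="regenerated_datasets".
    regenerated_datasets: list[Dataset] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Register this dataset with the one it was regenerated from.
        if self.regenerated_from is not None:
            self.regenerated_from.regenerated_datasets.append(self)


original = Dataset("original")
first_copy = Dataset("first copy", regenerated_from=original)
second_copy = Dataset("second copy", regenerated_from=original)

print([d.name for d in original.regenerated_datasets])
```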

avrohomgottlieb · 2025-01-14T21:46:11Z

api/scpca_portal/models/dataset.py

+    regenerated_from = models.ForeignKey(
+        "self",
+        null=True,
+        on_delete=models.CASCADE,


We should only utilize cascade deletion of related models if the models have a parent-child relationship (think project-sample). For models that exist either as standalone or as siblings, we should be setting their on_delete fields to models.SET_NULL.

Suggested change

-        on_delete=models.CASCADE,
+        on_delete=models.SET_NULL,

Great insight, thank you!
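For readers less familiar with the two behaviors, here is a minimal sketch using the standard-library sqlite3 module rather than Django. Django applies on_delete in Python instead of creating a database constraint, so this is only an analogy for the resulting data; the parent and child table names are hypothetical.

```python
import sqlite3


def surviving_children(on_delete: str) -> list:
    """Delete the parent row and return the child rows that remain."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
    conn.execute("CREATE TABLE parent (id INTEGER PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE child ("
        "id INTEGER PRIMARY KEY, "
        f"parent_id INTEGER REFERENCES parent(id) ON DELETE {on_delete})"
    )
    conn.execute("INSERT INTO parent (id) VALUES (1)")
    conn.execute("INSERT INTO child (id, parent_id) VALUES (10, 1)")
    conn.execute("DELETE FROM parent WHERE id = 1")
    rows = conn.execute("SELECT id, parent_id FROM child").fetchall()
    conn.close()
    return rows


print(surviving_children("CASCADE"))   # the child row is deleted with its parent
print(surviving_children("SET NULL"))  # the child row survives with a NULL reference
```

SET NULL requires the referencing column to be nullable, which matches the null=True already present on these fields.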

avrohomgottlieb · 2025-01-14T21:50:04Z

api/scpca_portal/models/dataset.py

+    format = models.TextField(choices=FileFormats.CHOICES, null=True)
+    email = models.EmailField(max_length=254, null=True)


As I understand it, neither of these fields can be nullable. Also the default email max length is already 254, so no need to set it explicitly.

Suggested change

-    format = models.TextField(choices=FileFormats.CHOICES, null=True)
-    email = models.EmailField(max_length=254, null=True)
+    format = models.TextField(choices=FileFormats.CHOICES)
+    email = models.EmailField()

Great catch on max_length, I’ve removed it! As for email, it can be null initially before the dataset starts processing, so we’ll keep it as is. This field will be used for notifications and is user-editable. We’ve grouped the fields as editable and non-editable using comments for clarity.

About format, I ran into this error during the migration: "You are trying to add a non-nullable field 'format' to dataset without a default; we can't do that." To work around it, I made the field nullable to avoid the database constraint issue.

Should we set the default to a placeholder string (e.g., default='None') for the field? Or would it make sense to define a default placeholder in common.py, so we can standardize the placeholder value for cases like this across the codebase (perhaps we could use NA)?

Let me know what you think!

That is because you have already created the table. You need to roll back your migration, delete the old migration file, and re-run makemigrations so that there is a single migration file. You can have non-nullable columns on new tables. If the table already exists, the ORM doesn't know whether there are rows that the migration would make invalid.

I've reapplied the migration again and now the format field is no longer nullable. Thank you all!
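The underlying database constraint can be reproduced outside Django. The sketch below uses the standard-library sqlite3 module with hypothetical table names: adding a NOT NULL column without a default to a table that already holds a row is rejected, while creating the table with that column from the start works. It illustrates the constraint only, not how makemigrations decides to prompt.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Case 1: the table already exists and holds a row.
conn.execute("CREATE TABLE dataset_existing (id INTEGER PRIMARY KEY)")
conn.execute("INSERT INTO dataset_existing (id) VALUES (1)")
conn.commit()
try:
    conn.execute("ALTER TABLE dataset_existing ADD COLUMN format TEXT NOT NULL")
    add_column_failed = False
except sqlite3.Error as error:
    # The existing row would have no valid value for the new column.
    add_column_failed = True
    print(f"ALTER TABLE rejected: {error}")

# Case 2: the column is part of the table from the start, so no row is ever invalid.
conn.execute("CREATE TABLE dataset_new (id INTEGER PRIMARY KEY, format TEXT NOT NULL)")
conn.execute("INSERT INTO dataset_new (id, format) VALUES (1, 'ANN_DATA')")
new_rows = conn.execute("SELECT id, format FROM dataset_new").fetchall()
print(new_rows)
conn.close()
```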

avrohomgottlieb · 2025-01-14T22:00:35Z

api/scpca_portal/models/dataset.py

+    token = models.OneToOneField(
+        APIToken,
+        null=True,
+        on_delete=models.CASCADE,


Suggested change

-        on_delete=models.CASCADE,
+        on_delete=models.SET_NULL,

This should be a ForeignKey as well, because multiple datasets can be started with any given token.
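A OneToOneField behaves like a ForeignKey with a uniqueness constraint, which is why it cannot express "many datasets per token". The sketch below shows that difference at the database level with the standard-library sqlite3 module; the table and column names are hypothetical, not the ones Django would generate.

```python
import sqlite3


def can_share_token(unique_token: bool) -> bool:
    """Return True if two datasets may reference the same token row."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute("CREATE TABLE api_token (id INTEGER PRIMARY KEY)")
    unique = "UNIQUE " if unique_token else ""
    conn.execute(
        "CREATE TABLE dataset ("
        "id INTEGER PRIMARY KEY, "
        f"token_id INTEGER {unique}REFERENCES api_token(id))"
    )
    conn.execute("INSERT INTO api_token (id) VALUES (1)")
    conn.execute("INSERT INTO dataset (id, token_id) VALUES (1, 1)")
    try:
        # A second dataset started with the same token.
        conn.execute("INSERT INTO dataset (id, token_id) VALUES (2, 1)")
        return True
    except sqlite3.IntegrityError:
        return False
    finally:
        conn.close()


print(can_share_token(unique_token=False))  # plain foreign key
print(can_share_token(unique_token=True))   # one-to-one style unique foreign key
```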

avrohomgottlieb · 2025-01-14T22:01:30Z

api/scpca_portal/models/dataset.py

+        related_name="dataset_token",
+        help_text="Token used to process the dataset.",
+    )
+    download_tokens = models.ManyToManyField(


Why do we think this is a ManyToManyField and not a ForeignKey (on APIToken)?

I think this is the correct relationship as a token can be used to download multiple datasets and every download will be tracked via the token.

We kept it as a ManyToManyField, but updated related_name to downloaded_datasets 👍
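A many-to-many relationship is stored in an intermediate join table, which is what lets one token download many datasets and one dataset be downloaded with many tokens. Here is a minimal sketch with the standard-library sqlite3 module; the table names and helper functions are hypothetical and do not reflect the schema Django generates for this model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE api_token (id INTEGER PRIMARY KEY);
    CREATE TABLE dataset (id INTEGER PRIMARY KEY);
    -- Join table: one row per (dataset, token) download pairing.
    CREATE TABLE dataset_download_tokens (
        dataset_id INTEGER REFERENCES dataset(id),
        token_id INTEGER REFERENCES api_token(id),
        UNIQUE (dataset_id, token_id)
    );
    INSERT INTO api_token (id) VALUES (1), (2);
    INSERT INTO dataset (id) VALUES (10), (20);
    INSERT INTO dataset_download_tokens (dataset_id, token_id)
    VALUES (10, 1), (20, 1), (10, 2);
    """
)


def datasets_downloaded_with(token_id: int) -> list:
    """Dataset ids downloaded with the given token (the reverse direction)."""
    rows = conn.execute(
        "SELECT dataset_id FROM dataset_download_tokens "
        "WHERE token_id = ? ORDER BY dataset_id",
        (token_id,),
    )
    return [row[0] for row in rows]


def tokens_that_downloaded(dataset_id: int) -> list:
    """Token ids used to download the given dataset (the forward direction)."""
    rows = conn.execute(
        "SELECT token_id FROM dataset_download_tokens "
        "WHERE dataset_id = ? ORDER BY token_id",
        (dataset_id,),
    )
    return [row[0] for row in rows]


print(datasets_downloaded_with(1))
print(tokens_that_downloaded(10))
```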

avrohomgottlieb · 2025-01-14T22:08:15Z

api/scpca_portal/models/dataset.py

+    )
+    start = models.BooleanField(
+        null=True,
+        help_text="Indicates if the dataset process has started.",


I'm feeling that if we add help_text for one BooleanField, then we should add it for the others. It's a conventions question.

That’s a great point and makes sense! David and I chatted, and we’re thinking of going over the codebase later to add help_text where needed. But for now, we’ll hold off since it’s not part of the current pattern - TBD for later.

avrohomgottlieb · 2025-01-14T22:10:30Z

api/scpca_portal/models/dataset.py

+    computed_file = models.OneToOneField(
+        ComputedFile,
+        null=True,
+        on_delete=models.CASCADE,


This one's more subtle. There is a parent-child relationship here, but in order to represent reality we'd have to also actually delete the file from s3. Until we have that set up, we should resort to SET_NULL.

Suggested change

-        on_delete=models.CASCADE,
+        on_delete=models.SET_NULL,


davidsmejia · 2025-01-15T21:22:18Z

api/scpca_portal/models/dataset.py

+    token = models.OneToOneField(
+        APIToken,
+        null=True,
+        on_delete=models.CASCADE,


This should be a ForeignKey as well, because multiple datasets can be started with any given token.

davidsmejia · 2025-01-15T21:22:52Z

api/scpca_portal/models/dataset.py

+        related_name="dataset_token",
+        help_text="Token used to process the dataset.",
+    )
+    download_tokens = models.ManyToManyField(


I think this is the correct relationship as a token can be used to download multiple datasets and every download will be tracked via the token.

davidsmejia · 2025-01-15T21:24:00Z

api/scpca_portal/models/dataset.py

+    processed_at = models.DateTimeField(null=True)
+    is_processed = models.BooleanField(default=False)
+    expires_at = models.DateTimeField(null=True)
+    is_expired = models.BooleanField(default=False)


Can we add a comment there saying that this will be evaluated / set in a cron?

Great insights, thank you!
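As a rough illustration of what such a periodic job might do, here is a plain-Python sketch that flips is_expired once expires_at has passed. The DatasetRecord class and the mark_expired function are hypothetical; the PR does not show the actual cron logic, and which of the four fields it sets is an assumption here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class DatasetRecord:
    """Hypothetical stand-in carrying only the expiry-related fields."""

    expires_at: Optional[datetime] = None
    is_expired: bool = False


def mark_expired(datasets: list, now: datetime) -> int:
    """Set is_expired on datasets whose expires_at has passed; return the count."""
    updated = 0
    for dataset in datasets:
        # Skip records that are already expired or have no expiry set yet.
        if dataset.is_expired or dataset.expires_at is None:
            continue
        if dataset.expires_at <= now:
            dataset.is_expired = True
            updated += 1
    return updated


now = datetime(2025, 1, 15, tzinfo=timezone.utc)
records = [
    DatasetRecord(expires_at=now - timedelta(days=1)),  # already past expiry
    DatasetRecord(expires_at=now + timedelta(days=1)),  # still valid
    DatasetRecord(),                                    # no expiry set
]
updated_count = mark_expired(records, now)
print(updated_count, [r.is_expired for r in records])
```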

nozomione · 2025-01-15T23:04:46Z

I've applied the feedback from both of you, and this PR is ready for another look. Thank you again for the valuable insights!

davidsmejia

LGTM

avrohomgottlieb

Looks great Nozomi! 🚀

add an initial Dataset model and migrate

6645415

nozomione self-assigned this Jan 14, 2025

nozomione requested a review from davidsmejia as a code owner January 14, 2025 20:15

davidsmejia requested a review from avrohomgottlieb January 14, 2025 21:12

avrohomgottlieb requested changes Jan 14, 2025

View reviewed changes

nozomione added 4 commits January 14, 2025 19:44

rollback migration from 0055 to 0054

97194e2

Merge branch 'dev' of https://github.com/AlexsLemonade/scpca-portal i…

c4ede04

…nto dev

Merge branch 'dev' into nozomione/1046-add-dataset-model

2cd227a

re-apply the dataset model migration

4d103d3

vercel bot deployed to Preview January 15, 2025 00:54 View deployment

davidsmejia reviewed Jan 15, 2025

View reviewed changes

nozomione added 2 commits January 15, 2025 17:50

Merge branch 'dev' into nozomione/1046-add-dataset-model

f4aa89a

(edit) apply team PR feedback

6f89ee0

vercel bot deployed to Preview January 15, 2025 22:57 View deployment

nozomione requested review from davidsmejia and avrohomgottlieb January 15, 2025 23:04

(edit) apply feedback

270d2ce

vercel bot deployed to Preview January 16, 2025 21:38 View deployment

davidsmejia approved these changes Jan 17, 2025

View reviewed changes

avrohomgottlieb approved these changes Jan 17, 2025

View reviewed changes

nozomione merged commit 010edaf into dev Jan 17, 2025
5 checks passed

nozomione deleted the nozomione/1046-add-dataset-model branch January 17, 2025 17:47
