Provide regular dumps of Trac database #231
Comments
Hey @bmispelon, thanks for moving this to an issue. I was looking for a way to get a dump from Trac to run data analyses on it. I wanted to measure a couple of things, but my big goal was to understand whether something like https://clickpy.clickhouse.com/ is doable for Django issues and find ways to improve the dashboard.
Depends what you mean by "alive" 😁. It is technically still working, but it's barely used (I think the fellows update the roadmap page, but that's all I know of).
Unfortunately the script (if there ever was one) is not available anymore. I tried to reach out to the person who I thought was uploading the dumps, but never got a reply.
Personally I think we should share as much of the data as possible, while preserving security (session ids for example) and users' privacy (email addresses). Basically, if the information is available publicly in some form, it should be included in the dump. I think that's where the hard part of this issue resides: figuring out which tables/columns are safe to share. Not sure if you've seen it already, but we share a (very limited) dump of the Trac database with mostly just the tables and no data: https://github.com/django/djangoproject.com/blob/main/tracdb/trac.sql. If you come up with some scripts or queries you'd like to try out, don't hesitate to get in touch with me and I can run them on the live data.
I’d find it useful personally as part of searching for things in Trac (have lots of trouble with the default experience), and checking people’s contribution history (for example as part of the Steering Council elections). Re hosting – is it a question of file size / bandwidth, or automation, or discoverability? 🤔 Depending on the answer, could be some of the DSF’s infrastructure and platforms, or something meant for analysts like Kaggle perhaps. For cleaning the data – I’d guess a table allowlist, and then anonymization steps for specific tables? See for example django-birdbath, not sure it’s the right fit here but that’s what my employer uses to help share database dumps without sharing the personal data that’s meant to stay in production only.
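To make the "table allowlist, then anonymization steps" idea concrete, here is a minimal Python sketch. It is not django-birdbath's API and not an agreed design: the `ALLOWLIST` contents, the `sanitize_row` and `pseudonymize` helpers, and the choice of which `ticket` columns to keep are all hypothetical and purely illustrative. Deciding which tables/columns are actually safe is still the open question.

```python
import hashlib

KEEP = "keep"
PSEUDONYMIZE = "pseudonymize"

# Illustrative allowlist only (an assumption, not a vetted decision).
# Tables and columns that are not listed here are dropped from the dump.
ALLOWLIST = {
    "ticket": {
        "id": KEEP,
        "summary": KEEP,
        "reporter": KEEP,
        "cc": PSEUDONYMIZE,  # may contain email addresses
    },
}


def pseudonymize(value, salt):
    """Replace a value with a stable pseudonym derived from a salted SHA-256."""
    digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
    return "anon-" + digest[:12]


def sanitize_row(table, row, salt):
    """Return a cleaned copy of `row`, or None if `table` is not allowlisted."""
    rules = ALLOWLIST.get(table)
    if rules is None:
        return None  # whole table is excluded (e.g. sessions, auth cookies)
    cleaned = {}
    for column, action in rules.items():
        if column not in row:
            continue
        value = row[column]
        if action == PSEUDONYMIZE and value is not None:
            value = pseudonymize(value, salt)
        cleaned[column] = value
    return cleaned
```

Because anything unlisted is dropped by default, a new table or column added to Trac later would stay out of the dump until someone explicitly reviews it. The salt would need to stay private, otherwise short values could be recovered by brute force.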
At this point it's purely a technical issue about how to clean the data in a reproducible and automated way. I'll take a look at birdbath, thanks 👍🏻
Hey @bmispelon
a. Remove or Anonymize Sensitive User Data:
A long time ago, database dumps of the Trac tables used to be provided for public consumption, but that practice stopped at some point.
I think we should start doing this again (inspired by a discussion I had on Discord with @ulgens today). It would be useful both for people trying to work on code.djangoproject.com locally and for those who'd like to extract statistics from Trac.
Some of the technical challenges to figure out: