feat: sgm-gharchive (gharchive -> parquet), local dev env: jupyterlab, postgres, pgadmin #119
Merged
Signed-off-by: Matt Young <[email protected]>
GitHub's markdown renderer ignores HTML comments, while JupyterLab does not. Kind of a hack, but it seems to work. Signed-off-by: Matt Young <[email protected]>
Remaining for sunbursts and treemaps:
* better colors
* per-tag reports
* activity aggregates (issues, pr, releases, etc) from gharchive.
Signed-off-by: Matt Young <[email protected]>
…tebooks (cncf-landscape-nn-description.ipynb Signed-off-by: Matt Young <[email protected]>
Remaining for sunbursts and treemaps:
* better colors
* per-tag reports
* activity aggregates (issues, pr, releases, etc) from gharchive.
Signed-off-by: Matt Young <[email protected]>
consolidate .gz files into larger, per-event type .gz's. Covering: 2015-01-01 thru 2024-03-16
---
46M CommitCommentEvent-consolidated.gz
430M CreateEvent-consolidated.gz
235M DeleteEvent-consolidated.gz
2.1G ForkEvent-consolidated.gz
15M GollumEvent-consolidated.gz
18G IssueCommentEvent-consolidated.gz
3.5G IssuesEvent-consolidated.gz
26M MemberEvent-consolidated.gz
2.0M PublicEvent-consolidated.gz
18G PullRequestEvent-consolidated.gz
27G PullRequestReviewCommentEvent-consolidated.gz
19G PullRequestReviewEvent-consolidated.gz
3.0G PushEvent-consolidated.gz
170M ReleaseEvent-consolidated.gz
1.1G WatchEvent-consolidated.gz
---
~/gh/cncf/landscape-graph/db/scm/sgm-gharchive (my-kceu24 ✘)✖✹✭ ᐅ ./consolidate-gz.sh -s ~/gharchive-cncf/cncf.all -t ~/gharchive-cncf/cncf.byrepo -v | tee cncf-consolidate.log
====================================
GZ File Consolidation Script
====================================
Source: /Users/matt/gharchive-cncf/cncf.all
Target: /Users/matt/gharchive-cncf/cncf.byrepo
Dry Run: 0
Verbose: 1
Processing directory: /Users/matt/gharchive-cncf/cncf.all/CommitCommentEvent
dirName: CommitCommentEvent
outputFile: /Users/matt/gharchive-cncf/cncf.byrepo/CommitCommentEvent-consolidated.gz
Concatenating files from /Users/matt/gharchive-cncf/cncf.all/CommitCommentEvent into /Users/matt/gharchive-cncf/cncf.byrepo/CommitCommentEvent-consolidated.gz...
Processing directory: /Users/matt/gharchive-cncf/cncf.all/CreateEvent
dirName: CreateEvent
outputFile: /Users/matt/gharchive-cncf/cncf.byrepo/CreateEvent-consolidated.gz
Concatenating files from /Users/matt/gharchive-cncf/cncf.all/CreateEvent into /Users/matt/gharchive-cncf/cncf.byrepo/CreateEvent-consolidated.gz...
Processing directory: /Users/matt/gharchive-cncf/cncf.all/DeleteEvent
dirName: DeleteEvent
outputFile: /Users/matt/gharchive-cncf/cncf.byrepo/DeleteEvent-consolidated.gz
Concatenating files from /Users/matt/gharchive-cncf/cncf.all/DeleteEvent into /Users/matt/gharchive-cncf/cncf.byrepo/DeleteEvent-consolidated.gz...
Processing directory: /Users/matt/gharchive-cncf/cncf.all/ForkEvent
dirName: ForkEvent
outputFile: /Users/matt/gharchive-cncf/cncf.byrepo/ForkEvent-consolidated.gz
Concatenating files from /Users/matt/gharchive-cncf/cncf.all/ForkEvent into /Users/matt/gharchive-cncf/cncf.byrepo/ForkEvent-consolidated.gz...
Processing directory: /Users/matt/gharchive-cncf/cncf.all/GollumEvent
dirName: GollumEvent
outputFile: /Users/matt/gharchive-cncf/cncf.byrepo/GollumEvent-consolidated.gz
Concatenating files from /Users/matt/gharchive-cncf/cncf.all/GollumEvent into /Users/matt/gharchive-cncf/cncf.byrepo/GollumEvent-consolidated.gz...
Processing directory: /Users/matt/gharchive-cncf/cncf.all/IssueCommentEvent
dirName: IssueCommentEvent
outputFile: /Users/matt/gharchive-cncf/cncf.byrepo/IssueCommentEvent-consolidated.gz
Concatenating files from /Users/matt/gharchive-cncf/cncf.all/IssueCommentEvent into /Users/matt/gharchive-cncf/cncf.byrepo/IssueCommentEvent-consolidated.gz...
Processing directory: /Users/matt/gharchive-cncf/cncf.all/IssuesEvent
dirName: IssuesEvent
outputFile: /Users/matt/gharchive-cncf/cncf.byrepo/IssuesEvent-consolidated.gz
Concatenating files from /Users/matt/gharchive-cncf/cncf.all/IssuesEvent into /Users/matt/gharchive-cncf/cncf.byrepo/IssuesEvent-consolidated.gz...
Processing directory: /Users/matt/gharchive-cncf/cncf.all/MemberEvent
dirName: MemberEvent
outputFile: /Users/matt/gharchive-cncf/cncf.byrepo/MemberEvent-consolidated.gz
Concatenating files from /Users/matt/gharchive-cncf/cncf.all/MemberEvent into /Users/matt/gharchive-cncf/cncf.byrepo/MemberEvent-consolidated.gz...
Processing directory: /Users/matt/gharchive-cncf/cncf.all/PublicEvent
dirName: PublicEvent
outputFile: /Users/matt/gharchive-cncf/cncf.byrepo/PublicEvent-consolidated.gz
Concatenating files from /Users/matt/gharchive-cncf/cncf.all/PublicEvent into /Users/matt/gharchive-cncf/cncf.byrepo/PublicEvent-consolidated.gz...
Processing directory: /Users/matt/gharchive-cncf/cncf.all/PullRequestEvent
dirName: PullRequestEvent
outputFile: /Users/matt/gharchive-cncf/cncf.byrepo/PullRequestEvent-consolidated.gz
Concatenating files from /Users/matt/gharchive-cncf/cncf.all/PullRequestEvent into /Users/matt/gharchive-cncf/cncf.byrepo/PullRequestEvent-consolidated.gz...
Processing directory: /Users/matt/gharchive-cncf/cncf.all/PullRequestReviewCommentEvent
dirName: PullRequestReviewCommentEvent
Signed-off-by: Matt Young <[email protected]>
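The contents of consolidate-gz.sh are not shown in this log, so the following is only a rough Python sketch of the same idea, assuming one subdirectory per event type under the source directory. It relies on the fact that concatenated gzip members form a valid gzip stream, so the parts can be appended byte for byte without recompressing. The function name `consolidate_gz` and its parameters are hypothetical.

```python
import gzip
import shutil
from pathlib import Path


def consolidate_gz(source: Path, target: Path, dry_run: bool = False) -> list[Path]:
    """Append every .gz under each subdirectory of `source` into
    `<target>/<subdir>-consolidated.gz` (hypothetical helper, not the PR's script)."""
    outputs = []
    target.mkdir(parents=True, exist_ok=True)
    for event_dir in sorted(p for p in source.iterdir() if p.is_dir()):
        output_file = target / f"{event_dir.name}-consolidated.gz"
        outputs.append(output_file)
        if dry_run:
            continue  # report what would be written without touching disk
        with output_file.open("wb") as out:
            for part in sorted(event_dir.glob("*.gz")):
                # Raw byte copy: each part stays its own gzip member.
                with part.open("rb") as src:
                    shutil.copyfileobj(src, out)
    return outputs
```

Reading a consolidated file back with `gzip.open(path, "rt")` yields the lines of all parts in sorted filename order, since Python's gzip module decodes multi-member streams transparently.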
(.venv-ipynb) ~/gh/cncf/landscape-graph/util/system-perf (my-kceu24 ✘)✖✹✚✭ ᐅ ./test-filesystem.sh
1 | (w) 2.27 MB/s (r) | (r) 7.69 GB/s | (w) 2325.58 MB/s | (r) 7874.02 GB/s
10 | (w) 6.59 MB/s (r) | (r) 12.24 GB/s | (w) 6743.09 MB/s | (r) 12531.30 GB/s
100 | (w) 7.15 MB/s (r) | (r) 13.47 GB/s | (w) 7317.43 MB/s | (r) 13796.90 GB/s
1000 | (w) 7.32 MB/s (r) | (r) 13.67 GB/s | (w) 7496.59 MB/s | (r) 13996.40 GB/s
2500 | (w) 6.69 MB/s (r) | (r) 13.57 GB/s | (w) 6848.30 MB/s | (r) 13900.30 GB/s
5000 | (w) 5.95 MB/s (r) | (r) 13.64 GB/s | (w) 6088.29 MB/s | (r) 13969.80 GB/s
7500 | (w) 6.21 MB/s (r) | (r) 13.70 GB/s | (w) 6357.82 MB/s | (r) 14028.30 GB/s
10000 | (w) 6.21 MB/s (r) | (r) 5.58 GB/s | (w) 6361.99 MB/s | (r) 5719.03 GB/s
12500 | (w) 6.12 MB/s (r) | (r) 5.47 GB/s | (w) 6267.29 MB/s | (r) 5598.75 GB/s
Test completed. Results saved to Darwin_DiskType_Apple_Model__Size_2.0TB_filesystem_performance.csv.
Signed-off-by: Matt Young <[email protected]>
… metadata for all types. generated from org-list.txt, 2015-01-01 - 2024-03-16
~/gharchive-cncf/cncf-parquet ᐅ du -hs *.parquet | sort -h
392K PublicEvent-consolidated.parquet
4.3M GollumEvent-consolidated.parquet
5.9M MemberEvent-consolidated.parquet
21M CommitCommentEvent-consolidated.parquet
32M DeleteEvent-consolidated.parquet
47M CreateEvent-consolidated.parquet
62M ReleaseEvent-consolidated.parquet
329M WatchEvent-consolidated.parquet
480M PullRequestReviewEvent-consolidated.parquet
1.3G ForkEvent-consolidated.parquet
1.8G IssuesEvent-consolidated.parquet
2.1G PushEvent-consolidated.parquet
3.9G PullRequestEvent-consolidated.parquet
4.6G PullRequestReviewCommentEvent-consolidated.parquet
8.9G IssueCommentEvent-consolidated.parquet
Signed-off-by: Matt Young <[email protected]>
…pact .schema files. Also add boilerplate for postgrest database ingest. Signed-off-by: Matt Young <[email protected]>
…gres, ...) Signed-off-by: Matt Young <[email protected]>
…ocessing 2015-2024 (thru March) Signed-off-by: Matt Young <[email protected]>
feat: Enhance data processing and artifact generation
- Implement signal handling for graceful shutdown
- Introduce pandas optimizations for large datasets
- Add data cleaning and type optimization functions
- Implement JSON and Parquet artifact generation with partitioning
- Add multiprocessing support for parallel processing of data files
- Introduce error handling and logging for robustness
- Optimize memory usage and processing time for large-scale data
- Include utility functions for data type conversion and normalization
- Add command-line interface and environment variable support for configuration
- Ensure compatibility with large datasets exceeding memory capacity
add gharchive-gz-to-parquet.[py,sh,bat]
Signed-off-by: Matt Young <[email protected]>
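Two of the items above, graceful shutdown and handling files larger than memory, can be illustrated with a stdlib-only sketch. This is not the code in gharchive-gz-to-parquet.py, which is not shown here. All names (`request_shutdown`, `iter_chunks`, `process_file`, `install_handlers`) are hypothetical, and the per-chunk work is a placeholder where cleaning and Parquet writing would go.

```python
import gzip
import json
import signal
from itertools import islice
from typing import Iterator

shutdown_requested = False


def request_shutdown(signum, frame):
    """Signal handler: only record the request; the loop below checks the flag."""
    global shutdown_requested
    shutdown_requested = True


def install_handlers() -> None:
    """Route Ctrl-C and SIGTERM to the flag instead of an abrupt exit."""
    signal.signal(signal.SIGINT, request_shutdown)
    signal.signal(signal.SIGTERM, request_shutdown)


def iter_chunks(path: str, chunk_size: int) -> Iterator[list[dict]]:
    """Yield lists of parsed JSON lines, reading at most `chunk_size` lines at a time."""
    with gzip.open(path, "rt", encoding="utf-8") as lines:
        while True:
            raw = list(islice(lines, chunk_size))
            if not raw:
                return
            yield [json.loads(line) for line in raw if line.strip()]


def process_file(path: str, chunk_size: int = 10_000) -> int:
    """Process one file chunk by chunk; stop between chunks once shutdown is requested."""
    processed = 0
    for chunk in iter_chunks(path, chunk_size):
        if shutdown_requested:
            break
        processed += len(chunk)  # placeholder for cleaning + writing one partition
    return processed
```

Calling `install_handlers()` once at startup and then `process_file(...)` per input means an interrupt finishes the current chunk and returns, rather than leaving a half-written artifact. Only one chunk of events is held in memory at a time.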
also various nits and .gitignore bits. Signed-off-by: Matt Young <[email protected]>
No description provided.