feat: sgm-gharchive (gharchive -> parquet), local dev env: jupyterlab, postgres, pgadmin #119

Merged: halcyondude merged 40 commits into main from my-kceu24 on Apr 6, 2024


halcyondude
Collaborator

No description provided.

Matt Young and others added 30 commits April 6, 2024 02:29
Signed-off-by: Matt Young <[email protected]>
Signed-off-by: Matt Young <[email protected]>
Signed-off-by: Matt Young <[email protected]>
GitHub's Markdown renderer ignores HTML comments, while JupyterLab does not.
Kind of a hack, but it seems to work.

Signed-off-by: Matt Young <[email protected]>
Remaining for sunbursts and treemaps:

* better colors
* per-tag reports
* activity aggregates (issues, pr, releases, etc) from gharchive.

Signed-off-by: Matt Young <[email protected]>
…tebooks (cncf-landscape-nn-description.ipynb

Signed-off-by: Matt Young <[email protected]>
Signed-off-by: Matt Young <[email protected]>
Remaining for sunbursts and treemaps:

* better colors
* per-tag reports
* activity aggregates (issues, pr, releases, etc) from gharchive.

Signed-off-by: Matt Young <[email protected]>
Signed-off-by: Matt Young <[email protected]>
Consolidate .gz files into larger, per-event-type .gz files.

Covering:
 2015-01-01 through 2024-03-16

---

 46M	CommitCommentEvent-consolidated.gz
430M	CreateEvent-consolidated.gz
235M	DeleteEvent-consolidated.gz
2.1G	ForkEvent-consolidated.gz
 15M	GollumEvent-consolidated.gz
 18G	IssueCommentEvent-consolidated.gz
3.5G	IssuesEvent-consolidated.gz
 26M	MemberEvent-consolidated.gz
2.0M	PublicEvent-consolidated.gz
 18G	PullRequestEvent-consolidated.gz
 27G	PullRequestReviewCommentEvent-consolidated.gz
 19G	PullRequestReviewEvent-consolidated.gz
3.0G	PushEvent-consolidated.gz
170M	ReleaseEvent-consolidated.gz
1.1G	WatchEvent-consolidated.gz

---

~/gh/cncf/landscape-graph/db/scm/sgm-gharchive (my-kceu24 ✘)✖✹✭ ᐅ ./consolidate-gz.sh -s ~/gharchive-cncf/cncf.all -t ~/gharchive-cncf/cncf.byrepo  -v  | tee cncf-consolidate.log
====================================
  GZ File Consolidation Script
====================================

Source: /Users/matt/gharchive-cncf/cncf.all
Target: /Users/matt/gharchive-cncf/cncf.byrepo
Dry Run: 0
Verbose: 1
Processing directory: /Users/matt/gharchive-cncf/cncf.all/CommitCommentEvent
    dirName: CommitCommentEvent
    outputFile: /Users/matt/gharchive-cncf/cncf.byrepo/CommitCommentEvent-consolidated.gz
Concatenating files from /Users/matt/gharchive-cncf/cncf.all/CommitCommentEvent into /Users/matt/gharchive-cncf/cncf.byrepo/CommitCommentEvent-consolidated.gz...
Processing directory: /Users/matt/gharchive-cncf/cncf.all/CreateEvent
    dirName: CreateEvent
    outputFile: /Users/matt/gharchive-cncf/cncf.byrepo/CreateEvent-consolidated.gz
Concatenating files from /Users/matt/gharchive-cncf/cncf.all/CreateEvent into /Users/matt/gharchive-cncf/cncf.byrepo/CreateEvent-consolidated.gz...
Processing directory: /Users/matt/gharchive-cncf/cncf.all/DeleteEvent
    dirName: DeleteEvent
    outputFile: /Users/matt/gharchive-cncf/cncf.byrepo/DeleteEvent-consolidated.gz
Concatenating files from /Users/matt/gharchive-cncf/cncf.all/DeleteEvent into /Users/matt/gharchive-cncf/cncf.byrepo/DeleteEvent-consolidated.gz...
Processing directory: /Users/matt/gharchive-cncf/cncf.all/ForkEvent
    dirName: ForkEvent
    outputFile: /Users/matt/gharchive-cncf/cncf.byrepo/ForkEvent-consolidated.gz
Concatenating files from /Users/matt/gharchive-cncf/cncf.all/ForkEvent into /Users/matt/gharchive-cncf/cncf.byrepo/ForkEvent-consolidated.gz...
Processing directory: /Users/matt/gharchive-cncf/cncf.all/GollumEvent
    dirName: GollumEvent
    outputFile: /Users/matt/gharchive-cncf/cncf.byrepo/GollumEvent-consolidated.gz
Concatenating files from /Users/matt/gharchive-cncf/cncf.all/GollumEvent into /Users/matt/gharchive-cncf/cncf.byrepo/GollumEvent-consolidated.gz...
Processing directory: /Users/matt/gharchive-cncf/cncf.all/IssueCommentEvent
    dirName: IssueCommentEvent
    outputFile: /Users/matt/gharchive-cncf/cncf.byrepo/IssueCommentEvent-consolidated.gz
Concatenating files from /Users/matt/gharchive-cncf/cncf.all/IssueCommentEvent into /Users/matt/gharchive-cncf/cncf.byrepo/IssueCommentEvent-consolidated.gz...
Processing directory: /Users/matt/gharchive-cncf/cncf.all/IssuesEvent
    dirName: IssuesEvent
    outputFile: /Users/matt/gharchive-cncf/cncf.byrepo/IssuesEvent-consolidated.gz
Concatenating files from /Users/matt/gharchive-cncf/cncf.all/IssuesEvent into /Users/matt/gharchive-cncf/cncf.byrepo/IssuesEvent-consolidated.gz...
Processing directory: /Users/matt/gharchive-cncf/cncf.all/MemberEvent
    dirName: MemberEvent
    outputFile: /Users/matt/gharchive-cncf/cncf.byrepo/MemberEvent-consolidated.gz
Concatenating files from /Users/matt/gharchive-cncf/cncf.all/MemberEvent into /Users/matt/gharchive-cncf/cncf.byrepo/MemberEvent-consolidated.gz...
Processing directory: /Users/matt/gharchive-cncf/cncf.all/PublicEvent
    dirName: PublicEvent
    outputFile: /Users/matt/gharchive-cncf/cncf.byrepo/PublicEvent-consolidated.gz
Concatenating files from /Users/matt/gharchive-cncf/cncf.all/PublicEvent into /Users/matt/gharchive-cncf/cncf.byrepo/PublicEvent-consolidated.gz...
Processing directory: /Users/matt/gharchive-cncf/cncf.all/PullRequestEvent
    dirName: PullRequestEvent
    outputFile: /Users/matt/gharchive-cncf/cncf.byrepo/PullRequestEvent-consolidated.gz
Concatenating files from /Users/matt/gharchive-cncf/cncf.all/PullRequestEvent into /Users/matt/gharchive-cncf/cncf.byrepo/PullRequestEvent-consolidated.gz...
Processing directory: /Users/matt/gharchive-cncf/cncf.all/PullRequestReviewCommentEvent
    dirName: PullRequestReviewCommentEvent

Signed-off-by: Matt Young <[email protected]>
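
For readers following along: the consolidation step logged above can be approximated in Python. This is a minimal sketch, not the contents of `consolidate-gz.sh` (which is not shown in this conversation). It relies on the fact that a gzip file may consist of several members back to back, so byte-level concatenation of valid `.gz` files yields a valid `.gz` file. The names `consolidate_gz`, `source_dir`, and `output_file` are hypothetical.

```python
import shutil
from pathlib import Path


def consolidate_gz(source_dir: Path, output_file: Path) -> int:
    """Concatenate every .gz file in source_dir into output_file.

    The inputs are copied byte for byte, without decompressing them.
    Returns the number of files that were concatenated.
    """
    # Sort so the output order is deterministic across runs.
    parts = sorted(source_dir.glob("*.gz"))
    with output_file.open("wb") as out:
        for part in parts:
            with part.open("rb") as src:
                # Stream in chunks so large archives never sit fully in memory.
                shutil.copyfileobj(src, out)
    return len(parts)
```

The output path should live outside `source_dir`, as it does in the log above (`cncf.all` versus `cncf.byrepo`); otherwise the glob could pick up the output file itself.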
(.venv-ipynb) ~/gh/cncf/landscape-graph/util/system-perf (my-kceu24 ✘)✖✹✚✭ ᐅ ./test-filesystem.sh
         1 | (w) 2.27    MB/s (r) | (r) 7.69    GB/s | (w) 2325.58 MB/s | (r) 7874.02 GB/s
        10 | (w) 6.59    MB/s (r) | (r) 12.24   GB/s | (w) 6743.09 MB/s | (r) 12531.30 GB/s
       100 | (w) 7.15    MB/s (r) | (r) 13.47   GB/s | (w) 7317.43 MB/s | (r) 13796.90 GB/s
      1000 | (w) 7.32    MB/s (r) | (r) 13.67   GB/s | (w) 7496.59 MB/s | (r) 13996.40 GB/s
      2500 | (w) 6.69    MB/s (r) | (r) 13.57   GB/s | (w) 6848.30 MB/s | (r) 13900.30 GB/s
      5000 | (w) 5.95    MB/s (r) | (r) 13.64   GB/s | (w) 6088.29 MB/s | (r) 13969.80 GB/s
      7500 | (w) 6.21    MB/s (r) | (r) 13.70   GB/s | (w) 6357.82 MB/s | (r) 14028.30 GB/s
     10000 | (w) 6.21    MB/s (r) | (r) 5.58    GB/s | (w) 6361.99 MB/s | (r) 5719.03 GB/s
     12500 | (w) 6.12    MB/s (r) | (r) 5.47    GB/s | (w) 6267.29 MB/s | (r) 5598.75 GB/s
Test completed. Results saved to Darwin_DiskType_Apple_Model__Size_2.0TB_filesystem_performance.csv.

Signed-off-by: Matt Young <[email protected]>
… metadata for all types.

generated from org-list.txt, 2015-01-01 - 2024-03-16

~/gharchive-cncf/cncf-parquet ᐅ du -hs *.parquet | sort -h
392K	PublicEvent-consolidated.parquet
4.3M	GollumEvent-consolidated.parquet
5.9M	MemberEvent-consolidated.parquet
 21M	CommitCommentEvent-consolidated.parquet
 32M	DeleteEvent-consolidated.parquet
 47M	CreateEvent-consolidated.parquet
 62M	ReleaseEvent-consolidated.parquet
329M	WatchEvent-consolidated.parquet
480M	PullRequestReviewEvent-consolidated.parquet
1.3G	ForkEvent-consolidated.parquet
1.8G	IssuesEvent-consolidated.parquet
2.1G	PushEvent-consolidated.parquet
3.9G	PullRequestEvent-consolidated.parquet
4.6G	PullRequestReviewCommentEvent-consolidated.parquet
8.9G	IssueCommentEvent-consolidated.parquet

Signed-off-by: Matt Young <[email protected]>
…pact .schema files. Also add boilerplate for postgrest database ingest.

Signed-off-by: Matt Young <[email protected]>
Signed-off-by: Matt Young <[email protected]>
…ocessing 2015-2024 (thru March)

Signed-off-by: Matt Young <[email protected]>
halcyondude and others added 10 commits April 6, 2024 02:29
feat: Enhance data processing and artifact generation

- Implement signal handling for graceful shutdown
- Introduce pandas optimizations for large datasets
- Add data cleaning and type optimization functions
- Implement JSON and Parquet artifact generation with partitioning
- Add multiprocessing support for parallel processing of data files
- Introduce error handling and logging for robustness
- Optimize memory usage and processing time for large-scale data
- Include utility functions for data type conversion and normalization
- Add command-line interface and environment variable support for configuration
- Ensure compatibility with large datasets exceeding memory capacity

add gharchive-gz-to-parquet.[py,sh,bat]

Signed-off-by: Matt Young <[email protected]>
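
As an illustration of the "signal handling for graceful shutdown" item in the commit message above, here is a minimal sketch using only the standard library. It is not the code from `gharchive-gz-to-parquet.py`, which is not shown in this conversation; `request_shutdown`, `install_handlers`, and `process_all` are hypothetical names. The idea is that the handler only sets a flag, and the processing loop checks the flag between files so the file in progress is always finished.

```python
import signal
from typing import Callable, Iterable, List

# Set to True by the signal handler; checked between files.
_shutdown_requested = False


def request_shutdown(signum, frame) -> None:
    """Signal handler: record the request and return immediately."""
    global _shutdown_requested
    _shutdown_requested = True


def install_handlers() -> None:
    """Route Ctrl-C (SIGINT) and SIGTERM to the flag-setting handler.

    Must be called from the main thread.
    """
    signal.signal(signal.SIGINT, request_shutdown)
    signal.signal(signal.SIGTERM, request_shutdown)


def process_all(paths: Iterable[str], process_one: Callable[[str], None]) -> List[str]:
    """Process paths in order, stopping before the next one once shutdown is requested.

    Returns the paths that were fully processed.
    """
    done: List[str] = []
    for path in paths:
        if _shutdown_requested:
            break
        process_one(path)
        done.append(path)
    return done
```

Checking the flag only between files means a shutdown request never leaves a half-written artifact, at the cost of waiting for the current file to finish.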
also various nits and .gitignore bits.

Signed-off-by: Matt Young <[email protected]>
Signed-off-by: Matt Young <[email protected]>
@halcyondude halcyondude merged commit 391d38d into main Apr 6, 2024
2 checks passed
@halcyondude halcyondude deleted the my-kceu24 branch April 6, 2024 06:30