feat: sgm-gharchive (gharchive -> parquet), local dev env: jupyterlab, postgres, pgadmin #119

Merged
merged 40 commits into from
Apr 6, 2024
40 commits
93c544f
KubeCon 2022 --> KubeCon 2023. Basic reports.
Nov 14, 2023
2003899
remove misc file
Nov 16, 2023
0a96d2c
wip: updates and conf talk dump from sched
Nov 20, 2023
e0a2a15
[chore] ignore { .jupyter/, .virtual_documents/ }
Nov 20, 2023
06214cd
kcna23 event export
Nov 20, 2023
1f32a05
Render SVG's on github, but not locally
Nov 20, 2023
b58c97f
sunbursts and treemaps via plotly.
Nov 22, 2023
00e2102
remove .json (accidently added)
Nov 22, 2023
740f01c
updated landscape-2023.ipynb
Nov 22, 2023
8a117d7
Updates (merging multiple laptops/repos)
Nov 27, 2023
343f5ab
more refactoring, started using alair, begun construction of final no…
Nov 29, 2023
6e6324f
updates
Nov 29, 2023
0da23fe
gha --> feather using streaming json parser
halcyondude Nov 30, 2023
8cddbb4
gha --> feather, nits and cleanup
halcyondude Nov 30, 2023
4755d91
add screenshot of gya.py running successfully.
halcyondude Nov 30, 2023
a0883ab
updates: landscape -> generated/cncf-projects.feather
halcyondude Dec 6, 2023
a814aa4
KubeCon 2022 --> KubeCon 2023. Basic reports.
Nov 14, 2023
a2e5bc3
[chore] ignore { .jupyter/, .virtual_documents/ }
Nov 20, 2023
14384f2
sunbursts and treemaps via plotly.
Nov 22, 2023
de52c55
[sgm-gharchive] initial implementation
Mar 19, 2024
a8167fc
[util] fix crlf
Mar 19, 2024
412303c
[sgm-gharchive] consolidate-gz.sh
Mar 20, 2024
66f3c89
[util] filesystem quick perf test
Mar 20, 2024
0c93f30
[sgm-gharchive] Add processing scripts, generated schema, and parquet…
Mar 20, 2024
e22cdc7
KubeCon 2022 --> KubeCon 2023. Basic reports.
Nov 14, 2023
a1b2837
[sgm-gharchive] Add markdown docs for all event types, as well as com…
halcyondude Mar 21, 2024
51896d4
[chore] notebooks/setup-local.sh, fix line endings
halcyondude Apr 2, 2024
0f53421
[jlab] base docker compose and intiial services (pgadmin, pgweb, post…
halcyondude Apr 2, 2024
383f5a5
[chore] .gitignore update.
halcyondude Mar 26, 2024
9c682ed
[sgm-gharchive] add gharchive-gz-hour2day.py w/ notebook capturing pr…
halcyondude Apr 3, 2024
c54e7d6
feat: Process raw gharchive.org archives (.gz) --> parquet w/ filtering
halcyondude Apr 3, 2024
1a41afa
chore: db/scm/sgm-gharchive --> db/sgm-gharchive
halcyondude Apr 3, 2024
d9989c5
feat: add Quarto notebook template
halcyondude Apr 3, 2024
e278f9a
[docs] Apache Arrow
halcyondude Apr 6, 2024
6630651
[docs] python module naming conventions
halcyondude Apr 6, 2024
c6dc83d
KubeCon 2022 --> KubeCon 2023. Basic reports.
Nov 14, 2023
5167f25
KubeCon 2022 --> KubeCon 2023. Basic reports.
Nov 14, 2023
47bad4c
KubeCon 2022 --> KubeCon 2023. Basic reports.
Nov 14, 2023
c4423fd
chore: db/scm/sgm-gharchive --> db/sgm-gharchive
halcyondude Apr 3, 2024
2f302a4
KubeCon 2022 --> KubeCon 2023. Basic reports.
Nov 14, 2023
87 changes: 84 additions & 3 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -1,10 +1,91 @@
# ignore system files
.venv*
.venv
venv
venv-lg
ve-*
venv-ipynb
.venv-ipynb
*.svg
.vscode/
.env
.DS_Store
node_modules
.mvn
.idea
.vscode

.venv-ipynb
.venv


__pycache__

.metals

# JupyterLab
.lsp_symlink
.cache/
.config/

__pycache__

.metals

# JupyterLab
.lsp_symlink
.cache/
.config/
.ipynb_checkpoints/
.ipython/
.jupyter/
.local/
.npm/
.visualpython/
.memestra/
.mypy_cache/
.virtual_documents/
node_modules/
# *.arrow
# *.csv
*.gz

_versions.env

# icon archive
.venv-icons/

*.docker-build.log


# guacsec/guac
guac/guac-compose/container_files/pg

data/sbom/downloaded-sboms/*




.virtual_documents/
node_modules/
# *.arrow
# *.csv
*.gz

_versions.env

# icon archive
.venv-icons/

*.docker-build.log


# guacsec/guac
guac/guac-compose/container_files/pg

data/sbom/downloaded-sboms/*




.venv*
.ipynb_checkpoints/
.env

5 changes: 5 additions & 0 deletions all-down.sh
@@ -0,0 +1,5 @@
#!/usr/bin/env bash
#set -x

. base.sh
docker compose "${FOSSTOOLS_ENV_FILES[@]}" down --remove-orphans # --detach
5 changes: 5 additions & 0 deletions all-up.sh
@@ -0,0 +1,5 @@
#!/usr/bin/env bash
# set -x
. base.sh
docker compose "${FOSSTOOLS_ENV_FILES[@]}" up --detach --force-recreate --remove-orphans
docker compose "${FOSSTOOLS_ENV_FILES[@]}" logs --follow
6 changes: 6 additions & 0 deletions base.sh
@@ -0,0 +1,6 @@
FOSSTOOLS_ENV_FILES=(--env-file versions.default.env --env-file db.default.env)

# construct --build-arg argument(s) from _versions.env (blank lines and comments are skipped)
buildargs_awkout=$(awk '/^[^#]/ {printf " --build-arg %s", $0} END { print "" }' _versions.env)
buildargs="BUILDARGS=( $buildargs_awkout )"
eval "$buildargs"
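base.sh builds the `BUILDARGS` array by generating shell text with awk and handing it to `eval`. As a point of comparison, here is a minimal eval-free sketch of the same idea. The function name `read_build_args` and the sample file are hypothetical and not part of this PR. Unlike the eval approach, it strips one pair of surrounding double quotes itself rather than letting the shell parse them.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the PR): turn KEY=VALUE lines into --build-arg pairs.
read_build_args() {
  local env_file=$1 line key value
  BUILDARGS=()
  while IFS= read -r line || [[ -n $line ]]; do
    # Skip blank lines and comment lines, like the awk pattern /^[^#]/.
    [[ -z $line || $line == \#* ]] && continue
    key=${line%%=*}
    value=${line#*=}
    # Strip one pair of surrounding double quotes, if present.
    if [[ $value == \"*\" ]]; then
      value=${value:1:${#value}-2}
    fi
    BUILDARGS+=(--build-arg "${key}=${value}")
  done < "$env_file"
}

# Demo with a throwaway file standing in for _versions.env.
sample=$(mktemp)
printf '%s\n' '# generated' '' 'VERSION="v0.7.0"' 'COMMENT="hostname: demo, who=me"' > "$sample"
read_build_args "$sample"
printf '%s\n' "${BUILDARGS[@]}"
rm -f "$sample"
```

Each value stays a single array element even when it contains spaces, so `"${BUILDARGS[@]}"` can be passed to a command without further quoting.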
21 changes: 21 additions & 0 deletions build.sh
@@ -0,0 +1,21 @@
#!/usr/bin/env bash

# set -x

# create/update _versions.env
bash build.update-versions.sh

. base.sh

echo "---"
echo "${BUILDARGS[@]}"
echo "---"

docker compose --env-file _versions.env build --no-cache --pull "${BUILDARGS[@]}"

# create list of installed packages w/ versions
docker run --rm ${USER}/fossjlab:dev pip3 list > jlab/requirements.jlab.list

# create pip freeze file
docker run --rm ${USER}/fossjlab:dev pip3 freeze > jlab/requirements.jlab.lock

37 changes: 37 additions & 0 deletions build.update-versions.sh
@@ -0,0 +1,37 @@
#!/usr/bin/env bash

#set -x

VERSION="v0.7.0"
SHORTSHA=$(git rev-parse --short HEAD)
LONGSHA=$(git rev-parse HEAD)

TAG_VERSION=":$VERSION"
TAG_SHORTSHA=":$SHORTSHA"
TAG_LONGSHA=":$LONGSHA"

BUILD_WHEN=$(date)
BUILD_WHEN_UTC=$(date -u +"%Y-%m-%dT%H-%M-%SZ")
COMMENT="hostname: $(hostname), who=$(whoami)"

cat <<EOF_VERSIONS > _versions.env
#
# This file is generated by build.update-versions.sh
#

VERSION="$VERSION"
SHORTSHA="$SHORTSHA"
LONGSHA="$LONGSHA"

# used to tag docker images when built from docker compose build (via ./build.sh)
TAG_VERSION="$TAG_VERSION"
TAG_SHORTSHA="$TAG_SHORTSHA"
TAG_LONGSHA="$TAG_LONGSHA"

# injected as labels into the docker image
BUILD_WHEN="$BUILD_WHEN"
BUILD_WHEN_UTC="$BUILD_WHEN_UTC"
COMMENT="$COMMENT"
EOF_VERSIONS


28 changes: 28 additions & 0 deletions cleanup.sh
@@ -0,0 +1,28 @@
#!/usr/bin/env bash

set -x

. base.sh

docker compose "${FOSSTOOLS_ENV_FILES[@]}" down --remove-orphans --volumes
docker compose "${FOSSTOOLS_ENV_FILES[@]}" rm --stop --volumes --force

docker image rm $(docker image ls --format '{{.Repository}}:{{.Tag}}' | grep 'foss')

rm -rvf _versions.env
rm -rvf .cache
rm -rvf .config
rm -rvf .ipython
rm -rvf .jupyter
rm -rvf .npm
rm -rvf .local
rm -rvf .visualpython
rm -rvf .virtual_documents
rm -rvf .memestra
rm -rvf .mypy_cache

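# .ipynb_checkpoints directories are created whenever a notebook is modified.
# -print0 pairs with xargs -0 so paths containing whitespace are handled safely.
find . -depth -name .ipynb_checkpoints -print0 | xargs -0 rm -rvf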

ls -laF ~/Library/jupyter
rm -rf ~/Library/jupyter
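The checkpoint cleanup near the end of cleanup.sh pipes `find` into `xargs`. The following is a minimal sketch of the NUL-delimited form of that pattern, run against a throwaway directory so it can be tried safely. The scratch paths and file names are hypothetical.

```shell
#!/usr/bin/env bash
# Build a scratch tree with checkpoint directories, one under a path with a space.
scratch=$(mktemp -d)
mkdir -p "$scratch/a/.ipynb_checkpoints" "$scratch/my notebooks/.ipynb_checkpoints"
touch "$scratch/a/keep.ipynb" "$scratch/my notebooks/.ipynb_checkpoints/x.ipynb"

# -print0 and -0 pass each path NUL-terminated, so whitespace in names is safe.
# -depth visits a directory's contents before the directory itself.
find "$scratch" -depth -name .ipynb_checkpoints -print0 | xargs -0 rm -rf

# Count what is left and confirm the regular notebook was not touched.
remaining=$(find "$scratch" -name .ipynb_checkpoints | wc -l)
kept=no
[ -f "$scratch/a/keep.ipynb" ] && kept=yes
echo "checkpoint dirs left: $((remaining)), notebook kept: $kept"
rm -rf "$scratch"
```

Without `-print0`, `xargs -0` receives the whole newline-separated listing as one argument, which is why the two flags are used as a pair.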
1 change: 1 addition & 0 deletions data/logos/.gitignore
@@ -0,0 +1 @@
.venv*
8 changes: 8 additions & 0 deletions db.default.env
@@ -0,0 +1,8 @@
POSTGRES_HOST="host.docker.internal"
POSTGRES_PORT=15432
POSTGRES_DB="cncfgraph"
POSTGRES_SCHEMA="public"
POSTGRES_USER="postgres"
POSTGRES_PASSWORD="password"

CNCFGRAPH_PORT=3000
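The variables in db.default.env map directly onto the parts of a libpq-style connection URI. A minimal sketch follows, assuming a client that accepts such a URI. The helper name `pg_uri` is hypothetical, and whether any service in this PR consumes the settings this way is an assumption.

```shell
#!/usr/bin/env bash
# Values copied from db.default.env; in practice they would be loaded from that file.
POSTGRES_HOST="host.docker.internal"
POSTGRES_PORT=15432
POSTGRES_DB="cncfgraph"
POSTGRES_USER="postgres"
POSTGRES_PASSWORD="password"

# Hypothetical helper: assemble postgresql://user:password@host:port/dbname.
pg_uri() {
  printf 'postgresql://%s:%s@%s:%s/%s\n' \
    "$POSTGRES_USER" "$POSTGRES_PASSWORD" "$POSTGRES_HOST" "$POSTGRES_PORT" "$POSTGRES_DB"
}

pg_uri
```

A password containing characters such as `@` or `/` would need percent-encoding before being placed in the URI; the sketch does not handle that.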
69 changes: 69 additions & 0 deletions db/scm/sgm-gharchive/cncf-consolidate.log
@@ -0,0 +1,69 @@
====================================
GZ File Consolidation Script
====================================

Source: /Users/matt/gharchive-cncf/cncf.all
Target: /Users/matt/gharchive-cncf/cncf.byrepo
Dry Run: 0
Verbose: 1
Processing directory: /Users/matt/gharchive-cncf/cncf.all/CommitCommentEvent
dirName: CommitCommentEvent
outputFile: /Users/matt/gharchive-cncf/cncf.byrepo/CommitCommentEvent-consolidated.gz
Concatenating files from /Users/matt/gharchive-cncf/cncf.all/CommitCommentEvent into /Users/matt/gharchive-cncf/cncf.byrepo/CommitCommentEvent-consolidated.gz...
Processing directory: /Users/matt/gharchive-cncf/cncf.all/CreateEvent
dirName: CreateEvent
outputFile: /Users/matt/gharchive-cncf/cncf.byrepo/CreateEvent-consolidated.gz
Concatenating files from /Users/matt/gharchive-cncf/cncf.all/CreateEvent into /Users/matt/gharchive-cncf/cncf.byrepo/CreateEvent-consolidated.gz...
Processing directory: /Users/matt/gharchive-cncf/cncf.all/DeleteEvent
dirName: DeleteEvent
outputFile: /Users/matt/gharchive-cncf/cncf.byrepo/DeleteEvent-consolidated.gz
Concatenating files from /Users/matt/gharchive-cncf/cncf.all/DeleteEvent into /Users/matt/gharchive-cncf/cncf.byrepo/DeleteEvent-consolidated.gz...
Processing directory: /Users/matt/gharchive-cncf/cncf.all/ForkEvent
dirName: ForkEvent
outputFile: /Users/matt/gharchive-cncf/cncf.byrepo/ForkEvent-consolidated.gz
Concatenating files from /Users/matt/gharchive-cncf/cncf.all/ForkEvent into /Users/matt/gharchive-cncf/cncf.byrepo/ForkEvent-consolidated.gz...
Processing directory: /Users/matt/gharchive-cncf/cncf.all/GollumEvent
dirName: GollumEvent
outputFile: /Users/matt/gharchive-cncf/cncf.byrepo/GollumEvent-consolidated.gz
Concatenating files from /Users/matt/gharchive-cncf/cncf.all/GollumEvent into /Users/matt/gharchive-cncf/cncf.byrepo/GollumEvent-consolidated.gz...
Processing directory: /Users/matt/gharchive-cncf/cncf.all/IssueCommentEvent
dirName: IssueCommentEvent
outputFile: /Users/matt/gharchive-cncf/cncf.byrepo/IssueCommentEvent-consolidated.gz
Concatenating files from /Users/matt/gharchive-cncf/cncf.all/IssueCommentEvent into /Users/matt/gharchive-cncf/cncf.byrepo/IssueCommentEvent-consolidated.gz...
Processing directory: /Users/matt/gharchive-cncf/cncf.all/IssuesEvent
dirName: IssuesEvent
outputFile: /Users/matt/gharchive-cncf/cncf.byrepo/IssuesEvent-consolidated.gz
Concatenating files from /Users/matt/gharchive-cncf/cncf.all/IssuesEvent into /Users/matt/gharchive-cncf/cncf.byrepo/IssuesEvent-consolidated.gz...
Processing directory: /Users/matt/gharchive-cncf/cncf.all/MemberEvent
dirName: MemberEvent
outputFile: /Users/matt/gharchive-cncf/cncf.byrepo/MemberEvent-consolidated.gz
Concatenating files from /Users/matt/gharchive-cncf/cncf.all/MemberEvent into /Users/matt/gharchive-cncf/cncf.byrepo/MemberEvent-consolidated.gz...
Processing directory: /Users/matt/gharchive-cncf/cncf.all/PublicEvent
dirName: PublicEvent
outputFile: /Users/matt/gharchive-cncf/cncf.byrepo/PublicEvent-consolidated.gz
Concatenating files from /Users/matt/gharchive-cncf/cncf.all/PublicEvent into /Users/matt/gharchive-cncf/cncf.byrepo/PublicEvent-consolidated.gz...
Processing directory: /Users/matt/gharchive-cncf/cncf.all/PullRequestEvent
dirName: PullRequestEvent
outputFile: /Users/matt/gharchive-cncf/cncf.byrepo/PullRequestEvent-consolidated.gz
Concatenating files from /Users/matt/gharchive-cncf/cncf.all/PullRequestEvent into /Users/matt/gharchive-cncf/cncf.byrepo/PullRequestEvent-consolidated.gz...
Processing directory: /Users/matt/gharchive-cncf/cncf.all/PullRequestReviewCommentEvent
dirName: PullRequestReviewCommentEvent
outputFile: /Users/matt/gharchive-cncf/cncf.byrepo/PullRequestReviewCommentEvent-consolidated.gz
Concatenating files from /Users/matt/gharchive-cncf/cncf.all/PullRequestReviewCommentEvent into /Users/matt/gharchive-cncf/cncf.byrepo/PullRequestReviewCommentEvent-consolidated.gz...
Processing directory: /Users/matt/gharchive-cncf/cncf.all/PullRequestReviewEvent
dirName: PullRequestReviewEvent
outputFile: /Users/matt/gharchive-cncf/cncf.byrepo/PullRequestReviewEvent-consolidated.gz
Concatenating files from /Users/matt/gharchive-cncf/cncf.all/PullRequestReviewEvent into /Users/matt/gharchive-cncf/cncf.byrepo/PullRequestReviewEvent-consolidated.gz...
Processing directory: /Users/matt/gharchive-cncf/cncf.all/PushEvent
dirName: PushEvent
outputFile: /Users/matt/gharchive-cncf/cncf.byrepo/PushEvent-consolidated.gz
Concatenating files from /Users/matt/gharchive-cncf/cncf.all/PushEvent into /Users/matt/gharchive-cncf/cncf.byrepo/PushEvent-consolidated.gz...
Processing directory: /Users/matt/gharchive-cncf/cncf.all/ReleaseEvent
dirName: ReleaseEvent
outputFile: /Users/matt/gharchive-cncf/cncf.byrepo/ReleaseEvent-consolidated.gz
Concatenating files from /Users/matt/gharchive-cncf/cncf.all/ReleaseEvent into /Users/matt/gharchive-cncf/cncf.byrepo/ReleaseEvent-consolidated.gz...
Processing directory: /Users/matt/gharchive-cncf/cncf.all/WatchEvent
dirName: WatchEvent
outputFile: /Users/matt/gharchive-cncf/cncf.byrepo/WatchEvent-consolidated.gz
Concatenating files from /Users/matt/gharchive-cncf/cncf.all/WatchEvent into /Users/matt/gharchive-cncf/cncf.byrepo/WatchEvent-consolidated.gz...
Concatenation complete.
36 changes: 36 additions & 0 deletions db/scm/sgm-gharchive/cncf-gharchive-concat-daily.sh
@@ -0,0 +1,36 @@
#!/bin/bash

handle_sigint() {
    echo "Caught Ctrl+C, stopping..."
    # Perform any necessary cleanup here
    exit 1
}

# Trap SIGINT and call handle_sigint when it's received
trap 'handle_sigint' SIGINT

set -euox pipefail

# ᐅ ./gharchive-concat-daily.sh --help
# Usage: ./gharchive-concat-daily.sh [options]

# Options:
#   -s, --source <dir>   Source directory (required)
#   -t, --target <dir>   Target directory (required)
#   -d, --dry-run        Perform a dry run without creating files
#   -v, --verbose        Enable verbose output
#   -f, --fast-mode      Use faster but less resilient to mix-match compression, concatenation (cat) method
#   -p, --use-pigz       Use pigz instead of gzip for compression
#   -r, --report         Generate a report with line counts
#   -h, --help           Display this help text


# ./gharchive-concat-daily.sh --source ~/gharchive-cncf/debug.cncf.all \
# --target ~/gharchive-cncf/debug.cncf.byrepo \
# --verbose \
# --fast-mode > gharchive-concat-daily.log

./gharchive-concat-daily.sh --source ~/gharchive-cncf/debug.cncf.all \
--target ~/gharchive-cncf/debug.cncf.byrepo \
--verbose \
--fast-mode
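The help text above describes `--fast-mode` as a concatenation (`cat`) method. The sketch below illustrates the general technique that description suggests: gzip files joined byte-for-byte form a valid multi-member gzip stream. It is an assumption that gharchive-concat-daily.sh works this way, since that script is not shown here, and the file names are hypothetical.

```shell
#!/usr/bin/env bash
# Create two small gzip files standing in for hourly archives.
work=$(mktemp -d)
printf '{"type":"PushEvent","id":1}\n' | gzip > "$work/hour-0.json.gz"
printf '{"type":"PushEvent","id":2}\n' | gzip > "$work/hour-1.json.gz"

# Plain byte concatenation: each input remains a separate gzip member.
cat "$work/hour-0.json.gz" "$work/hour-1.json.gz" > "$work/day.json.gz"

# gzip -dc reads all members in sequence, so both JSON lines come back.
lines=$(gzip -dc "$work/day.json.gz" | wc -l)
echo "lines recovered: $((lines))"
rm -rf "$work"
```

Concatenation avoids decompressing and recompressing the inputs, which is the likely source of the speedup; the trade-off is that the output inherits whatever compression settings each input used.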
17 changes: 17 additions & 0 deletions db/scm/sgm-gharchive/consolidate-gz.debug.log
@@ -0,0 +1,17 @@
====================================
GZ File Consolidation Script
====================================

====================================
GZ File Consolidation Script
====================================

Source: /Users/matt/gharchive-cncf/debug.cncf.all
Target: /Users/matt/gharchive-cncf/debug.cncf.byrepo
Dry Run: 0
Verbose: 1
Processing directory: /Users/matt/gharchive-cncf/debug.cncf.all/CommitCommentEvent
dirName: CommitCommentEvent
outputFile: /Users/matt/gharchive-cncf/debug.cncf.byrepo/CommitCommentEvent-consolidated.gz
Concatenating files from /Users/matt/gharchive-cncf/debug.cncf.all/CommitCommentEvent into /Users/matt/gharchive-cncf/debug.cncf.byrepo/CommitCommentEvent-consolidated.gz...
Concatenation complete.
5 changes: 5 additions & 0 deletions db/scm/sgm-gharchive/gharchive-concat-daily.log
@@ -0,0 +1,5 @@
=============================================================
GitHub Archive: combine daily archives into per repo archives
=============================================================

Creating target directory: /Users/matt/gharchive-cncf/cncf.byrepo