forked from CCBR/spacesavers2
Merge pull request CCBR#100 from CCBR/pdq_db
adding pdq db creation and data ingestion features
Showing 13 changed files with 544 additions and 2 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
redirect |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
redirect |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,70 @@ | ||
## spacesavers2_pdq_create_db | ||
|
||
pdq = Pretty Darn Quick | ||
|
||
[`spacesavers2_pdq`](pdq.md) creates one TSV (or JSON) file per datamount per run (typically one per date). Run daily, this quickly produces many files to keep track of, so it is best to store the data in a SQLite database. This command creates the basic schema for that database, which looks like this: | ||
|
||
![pdq schema](assets/images/pdq_db_schema.png) | ||
|
||
### Inputs | ||
- `--filepath`: where to create the ".db" file. | ||
- `--overwrite`: toggle to overwrite if the ".db" file already exists. | ||
|
||
```bash | ||
usage: spacesavers2_pdq_create_db [-h] -f FILEPATH [-o | --overwrite | --no-overwrite] [-v] | ||
|
||
spacesavers2_pdq_create_db: create a sqlitedb file with the optimized schema. | ||
|
||
options: | ||
-h, --help show this help message and exit | ||
-f FILEPATH, --filepath FILEPATH | ||
spacesavers2_pdq_create_db will create this sqlitedb file | ||
-o, --overwrite, --no-overwrite | ||
overwrite output file if it already exists. Use this with caution as it will delete existing file and its contents!! | ||
-v, --version show program's version number and exit | ||
Version: | ||
v0.13.0-dev | ||
Example: | ||
> spacesavers2_pdq_create_db -f /path/to/sqlitedbfile | ||
``` | ||
### Output | ||
#### db file | ||
A SQLite ".db" file with 4 tables: | ||
```bash | ||
% sqlite3 pdq.db | ||
SQLite version 3.26.0 2018-12-01 12:34:55 | ||
Enter ".help" for usage hints. | ||
sqlite> .table | ||
datamounts datapoints dates users | ||
sqlite> .schema | ||
CREATE TABLE users ( | ||
user_id INTEGER PRIMARY KEY, | ||
username TEXT NOT NULL, | ||
first_name TEXT NOT NULL, | ||
last_name TEXT NOT NULL | ||
); | ||
CREATE TABLE dates ( | ||
date_int INTEGER PRIMARY KEY, | ||
date_text TEXT UNIQUE NOT NULL | ||
); | ||
CREATE TABLE datamounts ( | ||
datamount_id INTEGER PRIMARY KEY, | ||
datamount_name TEXT UNIQUE NOT NULL | ||
); | ||
CREATE TABLE datapoints ( | ||
datapoint_id INTEGER PRIMARY KEY, | ||
date_int INTEGER, | ||
datamount_id INTEGER, | ||
user_id INTEGER, | ||
ninodes INTEGER, | ||
nbytes INTEGER, | ||
FOREIGN KEY (user_id) REFERENCES users(user_id), | ||
FOREIGN KEY (datamount_id) REFERENCES datamounts(datamount_id), | ||
FOREIGN KEY (date_int) REFERENCES dates(date_int) | ||
); | ||
``` |
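To illustrate how the four tables relate, the following is a minimal sketch (not part of this commit) that loads the schema above into an in-memory SQLite database and joins `datapoints` back to `users` and `datamounts`. The helper names `usage_by_user` and `demo`, and all sample rows (users, date, counts), are invented for illustration.

```python
import sqlite3

# Schema as printed by `.schema` in the documentation above.
SCHEMA = """
CREATE TABLE users (
    user_id INTEGER PRIMARY KEY,
    username TEXT NOT NULL,
    first_name TEXT NOT NULL,
    last_name TEXT NOT NULL
);
CREATE TABLE dates (
    date_int INTEGER PRIMARY KEY,
    date_text TEXT UNIQUE NOT NULL
);
CREATE TABLE datamounts (
    datamount_id INTEGER PRIMARY KEY,
    datamount_name TEXT UNIQUE NOT NULL
);
CREATE TABLE datapoints (
    datapoint_id INTEGER PRIMARY KEY,
    date_int INTEGER,
    datamount_id INTEGER,
    user_id INTEGER,
    ninodes INTEGER,
    nbytes INTEGER,
    FOREIGN KEY (user_id) REFERENCES users(user_id),
    FOREIGN KEY (datamount_id) REFERENCES datamounts(datamount_id),
    FOREIGN KEY (date_int) REFERENCES dates(date_int)
);
"""


def usage_by_user(conn, datamount_name, date_int):
    """Return (username, ninodes, nbytes) for one datamount and date, largest first."""
    query = """
        SELECT u.username, p.ninodes, p.nbytes
        FROM datapoints AS p
        JOIN users AS u ON u.user_id = p.user_id
        JOIN datamounts AS m ON m.datamount_id = p.datamount_id
        WHERE m.datamount_name = ? AND p.date_int = ?
        ORDER BY p.nbytes DESC
    """
    return conn.execute(query, (datamount_name, date_int)).fetchall()


def demo():
    """Build a throwaway database with invented sample rows and query it."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO users VALUES (?, ?, ?, ?)",
        [(1, "alice", "Alice", "A"), (2, "bob", "Bob", "B")],
    )
    conn.execute("INSERT INTO dates VALUES (20240101, '2024-01-01')")
    conn.execute("INSERT INTO datamounts VALUES (1, 'CCBR')")
    conn.executemany(
        "INSERT INTO datapoints (date_int, datamount_id, user_id, ninodes, nbytes) "
        "VALUES (?, ?, ?, ?, ?)",
        [(20240101, 1, 1, 10, 1000), (20240101, 1, 2, 20, 2000)],
    )
    conn.commit()
    return usage_by_user(conn, "CCBR", 20240101)
```

Storing users, dates, and datamounts in their own tables keeps each `datapoints` row to a handful of integers, which is presumably why the schema is described as optimized.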
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,47 @@ | ||
## spacesavers2_pdq_update_db | ||
|
||
pdq = Pretty Darn Quick | ||
|
||
[`spacesavers2_pdq`](pdq.md) creates one TSV (or JSON) file per datamount per run (typically one per date). Run daily, this quickly produces many files to keep track of, so it is best to store the data in a SQLite database. The [`spacesavers2_pdq_create_db`](pdq_create_db.md) command creates the basic schema for that database; this command then populates it. | ||
|
||
![pdq schema](assets/images/pdq_db_schema.png) | ||
|
||
### Inputs | ||
- `--tsv`: `.tsv` or `.tsv.gz` created using `spacesavers2_pdq` | ||
- `--database`: `.db` file created using `spacesavers2_pdq_create_db` | ||
- `--datamount`: name of the datamount, e.g. `CCBR` or `CCBR_Pipeliner` | ||
- `--date`: integer date in YYYYMMDD format | ||
|
||
```bash | ||
usage: spacesavers2_pdq_update_db [-h] -t TSV -o DATABASE -m DATAMOUNT -d DATE [-v] | ||
|
||
spacesavers2_pdq_create_db: update/append date from TSV to DB file | ||
|
||
options: | ||
-h, --help show this help message and exit | ||
-t TSV, --tsv TSV spacesavers2_pdq output TSV file | ||
-o DATABASE, --database DATABASE | ||
database file path (use spacesavers2_pdb_create_db to create if it does not exists.) | ||
-m DATAMOUNT, --datamount DATAMOUNT | ||
name of the datamount eg. CCBR or CCBR_Pipeliner | ||
-d DATE, --date DATE date in YYYYMMDD integer format | ||
-v, --version show program's version number and exit | ||
Version: | ||
v0.13.0-dev | ||
Example: | ||
> spacesavers2_pdq_update_db -t /path/to/tsv -o /path/to/db -m datamount_name -d date | ||
``` | ||
### Output | ||
#### updated db file | ||
The existing SQLite ".db" file with 4 tables is updated in place. | ||
> NOTE: | ||
> | ||
> - new users are automatically added to "users" table | ||
> - new datamounts are automatically added to "datamounts" table | ||
> - new dates are automatically added to "dates" table | ||
> - if any datapoints already exist in the ".db" for a (date + datamount) combination, a warning is displayed and no data is appended |
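The behaviour described in the note can be sketched as follows. This is not the source of `spacesavers2_pdq_update_db` (which is not shown in this diff); it is a minimal illustration under several assumptions: the function names `lookup_or_insert` and `append_datapoints` are hypothetical, the input is taken to be a list of `(username, ninodes, nbytes)` tuples rather than a parsed TSV, `first_name` and `last_name` are filled with empty placeholders, and `date_text` is assumed to be `YYYY-MM-DD`.

```python
import sqlite3

# Schema copied from spacesavers2_pdq_create_db.
SCHEMA = """
CREATE TABLE users (user_id INTEGER PRIMARY KEY, username TEXT NOT NULL,
                    first_name TEXT NOT NULL, last_name TEXT NOT NULL);
CREATE TABLE dates (date_int INTEGER PRIMARY KEY, date_text TEXT UNIQUE NOT NULL);
CREATE TABLE datamounts (datamount_id INTEGER PRIMARY KEY,
                         datamount_name TEXT UNIQUE NOT NULL);
CREATE TABLE datapoints (
    datapoint_id INTEGER PRIMARY KEY, date_int INTEGER, datamount_id INTEGER,
    user_id INTEGER, ninodes INTEGER, nbytes INTEGER,
    FOREIGN KEY (user_id) REFERENCES users(user_id),
    FOREIGN KEY (datamount_id) REFERENCES datamounts(datamount_id),
    FOREIGN KEY (date_int) REFERENCES dates(date_int));
"""


def lookup_or_insert(conn, select_sql, insert_sql, params):
    """Return the id matching params, inserting a new row first if none exists."""
    row = conn.execute(select_sql, params).fetchone()
    if row is not None:
        return row[0]
    return conn.execute(insert_sql, params).lastrowid


def append_datapoints(conn, rows, datamount_name, date_int):
    """Append (username, ninodes, nbytes) rows; return how many were added."""
    datamount_id = lookup_or_insert(
        conn,
        "SELECT datamount_id FROM datamounts WHERE datamount_name = ?",
        "INSERT INTO datamounts (datamount_name) VALUES (?)",
        (datamount_name,),
    )
    existing = conn.execute(
        "SELECT COUNT(*) FROM datapoints WHERE date_int = ? AND datamount_id = ?",
        (date_int, datamount_id),
    ).fetchone()[0]
    if existing > 0:
        # Same (date + datamount) combination already loaded: warn and skip.
        print(f"WARNING: {existing} datapoints exist for {datamount_name} "
              f"on {date_int}; nothing appended")
        return 0

    text = str(date_int)
    date_text = f"{text[:4]}-{text[4:6]}-{text[6:]}"  # assumed text format
    conn.execute(
        "INSERT OR IGNORE INTO dates (date_int, date_text) VALUES (?, ?)",
        (date_int, date_text),
    )
    for username, ninodes, nbytes in rows:
        user_id = lookup_or_insert(
            conn,
            "SELECT user_id FROM users WHERE username = ?",
            # Placeholder names: the real tool's source of names is not shown here.
            "INSERT INTO users (username, first_name, last_name) VALUES (?, '', '')",
            (username,),
        )
        conn.execute(
            "INSERT INTO datapoints (date_int, datamount_id, user_id, ninodes, nbytes) "
            "VALUES (?, ?, ?, ?, ?)",
            (date_int, datamount_id, user_id, ninodes, nbytes),
        )
    conn.commit()
    return len(rows)
```

Checking for existing datapoints before inserting makes a rerun over the same TSV harmless, which matters when a wrapper script loops over every file in a directory.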
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
Location to store extra scripts! |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,40 @@ | ||
#!/bin/bash | ||
# This script: | ||
# 1. creates a sqlite3 database using `spacesavers2_pdq_create_db` | ||
# 2. updates the database for "CCBR" mount related datapoints | ||
# 3. updates the database for "CCBR_Pipeliner" mount related datapoints | ||
|
||
module load ccbrpipeliner/6 | ||
BIN="/data/CCBR_Pipeliner/Tools/spacesavers2/pdq_db/bin" | ||
DB="/data/CCBR_Pipeliner/userdata/spacesavers2_pdq/pdq.db" | ||
|
||
if [[ "1" == "0" ]];then | ||
# Step 1. | ||
${BIN}/spacesavers2_pdq_create_db -f $DB | ||
fi | ||
|
||
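# Step 2. | ||
for f in /data/CCBR_Pipeliner/userdata/spacesavers2_pdq/_data_CCBR.*.tsv* | ||
do | ||
    [[ -e "$f" ]] || continue  # skip the unexpanded pattern when nothing matches | ||
    bn=$(basename "$f") | ||
    echo "$bn" | ||
    # the date is the second dot-separated field of the file name | ||
    dt=$(echo "$bn" | awk -F"." '{print $2}') | ||
    dm="CCBR" | ||
    "${BIN}/spacesavers2_pdq_update_db" \ | ||
        --tsv "$f" \ | ||
        --database "$DB" \ | ||
        --datamount "$dm" --date "$dt" | ||
done | ||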
|
||
# Step 3. | ||
for f in /data/CCBR_Pipeliner/userdata/spacesavers2_pdq/_data_CCBR_Pipeliner.*.tsv* | ||
do | ||
    [[ -e "$f" ]] || continue  # skip the unexpanded pattern when nothing matches | ||
    bn=$(basename "$f") | ||
    echo "$bn" | ||
    # the date is the second dot-separated field of the file name | ||
    dt=$(echo "$bn" | awk -F"." '{print $2}') | ||
    dm="CCBR_Pipeliner" | ||
    "${BIN}/spacesavers2_pdq_update_db" \ | ||
        --tsv "$f" \ | ||
        --database "$DB" \ | ||
        --datamount "$dm" --date "$dt" | ||
done |
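The script above takes the date from the second dot-separated field of each file name. A minimal Python sketch of the same parsing, with validation added, might look like this. The function name `parse_pdq_filename` is hypothetical, and deriving the datamount from the `_data_` prefix is an assumption based only on the two glob patterns in the script (the script itself hardcodes the datamount names); the example date is invented.

```python
from datetime import datetime
from pathlib import Path


def parse_pdq_filename(path):
    """Split a name like `_data_CCBR.20240131.tsv.gz` into (datamount, date_int)."""
    fields = Path(path).name.split(".")
    prefix = "_data_"
    if len(fields) < 3 or not fields[0].startswith(prefix):
        raise ValueError(f"unexpected file name: {path}")
    datamount = fields[0][len(prefix):]
    date_field = fields[1]
    if len(date_field) != 8 or not date_field.isdigit():
        raise ValueError(f"date is not in YYYYMMDD format: {date_field}")
    # strptime raises ValueError for impossible dates such as 20241340.
    datetime.strptime(date_field, "%Y%m%d")
    return datamount, int(date_field)
```

Validating the date before calling `spacesavers2_pdq_update_db` would catch stray files whose names only happen to match the glob.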
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,109 @@ | ||
#!/usr/bin/env python3 | ||
# pdq = pretty darn quick | ||
|
||
from src.VersionCheck import version_check | ||
from src.VersionCheck import __version__ | ||
from src.utils import * | ||
|
||
version_check() | ||
|
||
# import required modules | ||
import os  # imported explicitly rather than relying on `from src.utils import *` | ||
import sqlite3 | ||
import textwrap | ||
import argparse | ||
from pathlib import Path | ||
|
||
def main(): | ||
    elog = textwrap.dedent( | ||
        """\ | ||
        Version: | ||
            {} | ||
        Example: | ||
            > spacesavers2_pdq_create_db -f /path/to/sqlitedbfile | ||
        """.format( | ||
            __version__ | ||
        ) | ||
    ) | ||
    parser = argparse.ArgumentParser( | ||
        description="spacesavers2_pdq_create_db: create a sqlitedb file with the optimized schema.", | ||
        epilog=elog, | ||
        formatter_class=argparse.RawDescriptionHelpFormatter, | ||
    ) | ||
    parser.add_argument( | ||
        "-f", | ||
        "--filepath", | ||
        dest="filepath", | ||
        required=True, | ||
        type=str, | ||
        help="spacesavers2_pdq_create_db will create this sqlitedb file", | ||
    ) | ||
    parser.add_argument( | ||
        "-o", | ||
        "--overwrite", | ||
        dest="overwrite", | ||
        required=False, | ||
        action=argparse.BooleanOptionalAction, | ||
        help="overwrite output file if it already exists. Use this with caution as it will delete existing file and its contents!!", | ||
    ) | ||
    parser.add_argument("-v", "--version", action="version", version=__version__) | ||
|
||
    global args | ||
    args = parser.parse_args() | ||
|
||
    filepath = args.filepath | ||
    p = Path(filepath).absolute() | ||
    pp = p.parents[0] | ||
    # The parent folder must be writable before anything is created or removed | ||
    if not os.access(pp, os.W_OK): | ||
        exit("ERROR: {} folder exists but cannot be written to".format(pp)) | ||
    if os.path.exists(p): | ||
        if not args.overwrite: | ||
            exit("ERROR: {} file exists and overwrite argument is not selected!".format(p)) | ||
        if not os.access(p, os.W_OK): | ||
            exit("ERROR: {} file exists but is not writeable/appendable".format(p)) | ||
        if args.overwrite and os.access(p, os.W_OK): | ||
            os.remove(p) | ||
|
||
    # Connect to the SQLite database (or create it if it doesn't exist) | ||
    conn = sqlite3.connect(p) | ||
    cursor = conn.cursor() | ||
|
||
    # Create the "users" table | ||
    cursor.execute('''CREATE TABLE IF NOT EXISTS users ( | ||
        user_id INTEGER PRIMARY KEY, | ||
        username TEXT NOT NULL, | ||
        first_name TEXT NOT NULL, | ||
        last_name TEXT NOT NULL | ||
    )''') | ||
|
||
    # Create the "dates" table | ||
    cursor.execute('''CREATE TABLE IF NOT EXISTS dates ( | ||
        date_int INTEGER PRIMARY KEY, | ||
        date_text TEXT UNIQUE NOT NULL | ||
    )''') | ||
|
||
    # Create the "datamounts" table | ||
    cursor.execute('''CREATE TABLE IF NOT EXISTS datamounts ( | ||
        datamount_id INTEGER PRIMARY KEY, | ||
        datamount_name TEXT UNIQUE NOT NULL | ||
    )''') | ||
|
||
|
||
    # Create the "datapoints" table with foreign keys to the three tables above | ||
    cursor.execute('''CREATE TABLE IF NOT EXISTS datapoints ( | ||
        datapoint_id INTEGER PRIMARY KEY, | ||
        date_int INTEGER, | ||
        datamount_id INTEGER, | ||
        user_id INTEGER, | ||
        ninodes INTEGER, | ||
        nbytes INTEGER, | ||
        FOREIGN KEY (user_id) REFERENCES users(user_id), | ||
        FOREIGN KEY (datamount_id) REFERENCES datamounts(datamount_id), | ||
        FOREIGN KEY (date_int) REFERENCES dates(date_int) | ||
    )''') | ||
|
||
    # Commit changes and close the connection | ||
    conn.commit() | ||
    conn.close() | ||
|
||
if __name__ == "__main__": | ||
    main() |
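One point worth keeping in mind about the `FOREIGN KEY` clauses above: SQLite only enforces them on connections where `PRAGMA foreign_keys` is switched on, and it is off by default in standard builds. The sketch below (not part of this commit) demonstrates the difference on a reduced two-table version of the schema; the function name `orphan_insert_allowed` and the sample values are invented for illustration.

```python
import sqlite3


def orphan_insert_allowed(enforce):
    """Try to insert a datapoint whose user_id has no matching users row."""
    conn = sqlite3.connect(":memory:")
    # Set the pragma explicitly so the result does not depend on build defaults.
    conn.execute(f"PRAGMA foreign_keys = {'ON' if enforce else 'OFF'}")
    conn.execute(
        "CREATE TABLE users (user_id INTEGER PRIMARY KEY, username TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE datapoints ("
        " datapoint_id INTEGER PRIMARY KEY,"
        " user_id INTEGER,"
        " nbytes INTEGER,"
        " FOREIGN KEY (user_id) REFERENCES users(user_id))"
    )
    try:
        # user_id 999 does not exist in the users table.
        conn.execute("INSERT INTO datapoints (user_id, nbytes) VALUES (?, ?)", (999, 1))
        return True
    except sqlite3.IntegrityError:
        return False
    finally:
        conn.close()
```

Because the pragma is a per-connection setting, any tool that writes to the database would need to enable it on its own connection if referential integrity is expected to be checked.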