[DO NOT MERGE] stash: state of SILO executable for DEMO #58

Draft · wants to merge 5 commits into `main`
20 changes: 20 additions & 0 deletions ww_test/README.md
@@ -0,0 +1,20 @@
To generate the input data, run `./generate_silo_input.bash`.

This downloads all data from the Loculus instance wise-seqs.loculus.org, reads the
short-read `s3Link` metadata field of each entry, downloads every S3 object whose
key ends with `.ndjson.bz2`, and merges them into a single `.ndjson` file.
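
A minimal way to sanity-check the merged output after the script has run (this sketch only assumes `jq` from the prerequisites below; the record layout itself is defined by the Loculus/SILO ndjson format and is not shown here):

```bash
# Count the merged records and pretty-print the first one.
wc -l data.ndjson
head -n 1 data.ndjson | jq .
```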

To build the indexes and start the APIs, run `LAPIS_PORT=8080 docker compose up`;
replace `8080` with whatever port LAPIS should listen on.

This builds the SILO indexes (service `siloPreprocessing`), then starts the
SILO API (service `silo`) and the LAPIS API (service `lapisOpen`).

The Swagger UI for the LAPIS API can be accessed at:
`http://localhost:8080/swagger-ui/index.html`
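
Besides the Swagger UI, the API can be queried directly. A hedged example, assuming the usual LAPIS `/sample/aggregated` endpoint and its `fields` parameter are available in the LAPIS version pinned in `docker-compose.yml`:

```bash
# Total number of reads in the index (no filters).
curl 'http://localhost:8080/sample/aggregated'

# Read counts grouped by the location_name field from database_config.yaml.
curl 'http://localhost:8080/sample/aggregated?fields=location_name'
```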

Prerequisites:
- Docker Compose installed
- jq installed (on Ubuntu: `sudo apt-get install jq`)
- AWS CLI installed (on Ubuntu: `sudo apt install awscli`)
- AWS CLI authenticated for the S3 bucket containing the files (`aws configure`); see the quick checks after this list
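
The following sketch verifies the prerequisites before running the script; it only uses standard subcommands of the listed tools, and `aws sts get-caller-identity` is one possible credential check (an assumption, not something the script itself requires):

```bash
# Check that the required tools are installed.
docker compose version
jq --version
aws --version

# Confirm that the AWS CLI has working credentials.
aws sts get-caller-identity
```
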
45 changes: 45 additions & 0 deletions ww_test/database_config.yaml
@@ -0,0 +1,45 @@
schema:
  metadata:
    - name: sample_id
      type: string
      generateIndex: false
    - name: batch_id
      type: string
      generateIndex: false
    - name: sequencing_well_position
      type: string
      generateIndex: false
    - name: location_code
      type: string
      generateIndex: false
    - name: sampling_date
      type: date
      generateIndex: false
    - name: sequencing_date
      type: string
      generateIndex: false
    - name: flow_cell_serial_number
      type: string
      generateIndex: false
    - name: read_length
      type: int
      generateIndex: false
    - name: primer_protocol
      type: string
      generateIndex: false
    - name: location_name
      type: string
      generateIndex: false
    - name: primer_protocol_name
      type: string
      generateIndex: false
    - name: nextclade_reference
      type: string
      generateIndex: false
    - name: read_id
      type: string
      generateIndex: false
  opennessLevel: OPEN
  instanceName: wise-sarsCoV2
  features: []
  primaryKey: read_id
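
Each metadata field declared above becomes a queryable column in LAPIS. A hedged illustration, assuming LAPIS's usual convention that metadata columns can be used directly as query parameters (the `location_code` value `10` is purely hypothetical):

```bash
# Read counts for one (hypothetical) location, grouped by sampling date.
curl 'http://localhost:8080/sample/aggregated?location_code=10&fields=sampling_date'
```
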
50 changes: 50 additions & 0 deletions ww_test/docker-compose.yml
@@ -0,0 +1,50 @@
services:
  lapisOpen:
    image: ghcr.io/genspectrum/lapis:0.3
    ports:
      - ${LAPIS_PORT}:8080
    command: --silo.url=http://silo:8081
    volumes:
      - type: bind
        source: ./database_config.yaml
        target: /workspace/database_config.yaml
        read_only: true
      - type: bind
        source: ./reference_genomes.json
        target: /workspace/reference_genomes.json
        read_only: true
    stop_grace_period: 0s

  silo:
    image: ghcr.io/genspectrum/lapis-silo:0.5
    ports:
      - 8081:8081
    command: api
    volumes:
      - type: bind
        source: ./silo_output
        target: /data
        read_only: true
    depends_on:
      siloPreprocessing:
        condition: service_completed_successfully
    stop_grace_period: 0s

  siloPreprocessing:
    image: ghcr.io/genspectrum/lapis-silo:0.5
    command: preprocessing
    mem_limit: 4g
    volumes:
      - type: bind
        source: ./silo_output
        target: /preprocessing/output
        read_only: false
      - type: bind
        source: ./preprocessing_config.yaml
        target: /app/preprocessing_config.yaml
        read_only: true
      - type: bind
        source: ./
        target: /preprocessing/input
        read_only: false
    stop_grace_period: 0s
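
Because `silo` only starts after `siloPreprocessing` has completed successfully, the preprocessing step can also be run and inspected on its own with standard Docker Compose commands (a sketch, not part of this change):

```bash
# Rebuild the SILO indexes without starting the APIs.
docker compose run --rm siloPreprocessing

# Follow the logs of the preprocessing and API services.
docker compose logs -f siloPreprocessing silo
```
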
47 changes: 47 additions & 0 deletions ww_test/generate_silo_input.bash
@@ -0,0 +1,47 @@
#!/bin/bash

# This script generates the input data for SILO.
# I use it to test how Loculus could extract all the short-read data from S3.

# Input file with metadata
INPUT_FILE="loculus_output.ndjson"

# Output file for the combined ndjson
OUTPUT_FILE="data.ndjson"

# Temporary file for storing S3 links
S3_LINKS_FILE="s3_links.txt"

curl -X 'GET' \
    'https://backend-wise-seqs.loculus.org/test/get-released-data' \
    -H 'accept: application/x-ndjson' \
    -H 'x-request-id: 1747481c-816c-4b60-af20-a61717a35067' > "$INPUT_FILE"

# Extract S3 links from the metadata
jq -r '.metadata.s3Link | select(test("\\.ndjson.bz2$"))' "$INPUT_FILE" > "$S3_LINKS_FILE"

rm "$INPUT_FILE"

# Ensure the output file is empty
rm "$OUTPUT_FILE" 2> /dev/null
touch "$OUTPUT_FILE"

# Loop through each S3 link and append the decompressed content to the output file
while read -r S3_LINK; do
    # Temporary files for the downloaded and the decompressed content
    TEMP_FILE_COMPRESSED=$(mktemp)
    TEMP_FILE_UNCOMPRESSED=$(mktemp)

    # Download the compressed ndjson file from S3
    aws s3 cp "$S3_LINK" "$TEMP_FILE_COMPRESSED"

    bunzip2 -dc "$TEMP_FILE_COMPRESSED" > "$TEMP_FILE_UNCOMPRESSED"

    # Append the content to the output file
    cat "$TEMP_FILE_UNCOMPRESSED" >> "$OUTPUT_FILE"

    # Clean up the temporary files
    rm "$TEMP_FILE_COMPRESSED" "$TEMP_FILE_UNCOMPRESSED"
done < "$S3_LINKS_FILE"

rm "$S3_LINKS_FILE"
1 change: 1 addition & 0 deletions ww_test/preprocessing_config.yaml
@@ -0,0 +1 @@
ndjsonInputFilename: data.ndjson
58 changes: 58 additions & 0 deletions ww_test/reference_genomes.json

Large diffs are not rendered by default.

2 changes: 2 additions & 0 deletions ww_test/silo_output/README.md
@@ -0,0 +1,2 @@
This directory will be filled with the indexes built by SILO.
These will be directories whose names are Unix timestamps as integers (e.g. `1732696600`).
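
Since the index directories are named by their creation timestamp, the most recent one can be found by sorting numerically; a minimal sketch:

```bash
# Print the name of the newest SILO index directory.
ls -1 silo_output | grep -E '^[0-9]+$' | sort -n | tail -n 1
```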