-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initial commit of annotation service. #7
base: master
Are you sure you want to change the base?
Conversation
The pipeline has had a successful end-to-end run of annotating variants and inserting them into BQ. Many components are inflexible, ignore edge cases, or are otherwise brittle. Efforts have been made to document current issues with TODO comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again Jessime for this prototype. I have made a bunch of comments, mostly at a high level without getting into minor issues. I think we can check in this code with enough warnings in the README file; another option is to leave it as an open PR for someone to take over in future (like the gcp-variant-transforms
counterpart PR). Which way do you prefer?
import sys | ||
import uuid | ||
import requests | ||
import apache_beam as beam |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this file belong to variant-annotation repo? I would have expected a pure client code here, i.e., clients_tests
above (e.g., to annotate a single VCF file using the server) not something that uses Beam. If you want to keep it this way maybe it should be moved elsewhere like gcp-variant-transforms.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking of this file as my small integration test for the server, but I agree that it probably makes more sense to move to gcp-variant-transforms.
(I'll leave it in for now, but feel free to move it in the future.)
@@ -0,0 +1,98 @@ | |||
import sys |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please make sure you add the header doc strings before, especially the Apache licence related lines (and a short description for each module).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
except ValueError: | ||
raise ValueError(response.text) | ||
|
||
if __name__ == '__main__': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Please add a usage text and print it when there are not two arguments.
with beam.Pipeline(options=options) as p: | ||
results = (p | beam.io.ReadFromText(infile) | ||
| beam.ParDo(KeyGenFn()) | ||
| beam.GroupByKey() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should avoid this GroupByKey; maybe it is easier to remove this module from this repo for now and discuss this issue in the other PR in gcp-variant-transforms
repo.
'staging_location': 'gs://jessime_test_bucket/staging', | ||
'temp_location': 'gs://jessime_test_bucket/temp', | ||
'job_name': 'vep-as-a-service8', | ||
'setup_file': './setup.py', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is setup.py
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's a specific reason why it makes more sense to keep this in gcp-variant-transforms. This setup.py file refers to Variant Transforms main setup.py file.
print 'VEP CACHE setup.' | ||
|
||
|
||
#Global calls after `setup_vep_cache` is defined but before `app` decorators |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why these need to be placed before app
decorators? Can we protect this block by an if
block such that this code is not executed when the module is imported? If not, is it possible to move these lines to a separate module that depends on this? I am asking because for example writing unit-tests in the current form is problematic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- If you place
app = Flask(__name__)
afterapp
decorators, you'll get aNameError
, sinceapp
isn't defined yet. - It should be fine to put the
setup_vep_cache()
behindif __name__ == '__main__'
. Butapp = Flask(__name__)
might cause issues (with Gunicorn, for example, though I would have to investigate more).
@@ -0,0 +1,3 @@ | |||
# Annotation Server | |||
|
|||
## TODO: add descriptions of architecture, development, deployment, etc. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now, please add a note that this is a prototype that is not productionized yet but has been shown to scale well with Variant Transforms and add a link to your other PR: googlegenomics/gcp-variant-transforms#361
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. I've also added a few lines about how to launch the server.
service/server/main.py
Outdated
vep_file = 'vep_cache_homo_sapiens_GRCh38_91.tar.gz' | ||
gcs_vep_file = os.path.join(gcs_dir, vep_file) | ||
local_vep_file = os.path.join(local_cache, vep_file) | ||
check_call('gsutil cp {} {}'.format(gcs_vep_file, local_cache).split()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a TODO for handling errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
return header_names, annotation_lines | ||
|
||
|
||
def vep_tab_to_json(response): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So it seems response
is both input and output for this function, correct? If yes, can you separate it and do not mutate the input. Also it would be more readable that any response
that is not over the network be renamed to something more descriptive, e.g., annotation_response
or maybe annotation_stdout
? I thought that should make understanding the code easier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I already have a TODO to not mutate response
. I've added another to make the name more descriptive.
service/server/main.py
Outdated
"""Updates response dict to contain annotations grouped by their variants | ||
|
||
Args: | ||
response: Dict containing JSON response to send to client |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add an example for both input and output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Will have to modify slightly when refactoring to avoid mutation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only addressed the points I had time to address and made fixes that affected documentation but not code.
Looking forward to seeing this in production eventually!
@@ -0,0 +1,3 @@ | |||
# Annotation Server | |||
|
|||
## TODO: add descriptions of architecture, development, deployment, etc. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. I've also added a few lines about how to launch the server.
import sys | ||
import uuid | ||
import requests | ||
import apache_beam as beam |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking of this file as my small integration test for the server, but I agree that it probably makes more sense to move to gcp-variant-transforms.
(I'll leave it in for now, but feel free to move it in the future.)
'staging_location': 'gs://jessime_test_bucket/staging', | ||
'temp_location': 'gs://jessime_test_bucket/temp', | ||
'job_name': 'vep-as-a-service8', | ||
'setup_file': './setup.py', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's a specific reason why it makes more sense to keep this in gcp-variant-transforms. This setup.py file refers to Variant Transforms main setup.py file.
RUN perl INSTALL.pl \ | ||
--AUTO a \ | ||
--NO_UPDATE | ||
WORKDIR .. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Absolutely. I initially started using both the Dockerfile
and the run_vep.sh
, but was running into issues. I unfortunately don't remember the specifics, but I was having trouble getting those to run on GAE with Flask, the Dockerfile in particular. The two lines that took the most effort to get right were:
FROM gcr.io/google_appengine/python
CMD gunicorn -b :$PORT -t 600 main:app
I eventually gave up trying to modify the existing files (going "top-down", if you will), and decided to go "bottom-up" by building off of GPC minimalist examples. My intention was to continue building up functionality, using the original files as endpoints and guides.
ENV VEP_SCRIPT ensembl-vep/vep | ||
ENV ASSEMBLY GRCh38 | ||
|
||
# Set virtualenv environment variables. This is equivalent to running |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a couple of deleted lines back.
service/server/main.py
Outdated
vep_file = 'vep_cache_homo_sapiens_GRCh38_91.tar.gz' | ||
gcs_vep_file = os.path.join(gcs_dir, vep_file) | ||
local_vep_file = os.path.join(local_cache, vep_file) | ||
check_call('gsutil cp {} {}'.format(gcs_vep_file, local_cache).split()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
return header_names, annotation_lines | ||
|
||
|
||
def vep_tab_to_json(response): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I already have a TODO to not mutate response
. I've added another to make the name more descriptive.
service/server/main.py
Outdated
"""Updates response dict to contain annotations grouped by their variants | ||
|
||
Args: | ||
response: Dict containing JSON response to send to client |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Will have to modify slightly when refactoring to avoid mutation.
@@ -0,0 +1,98 @@ | |||
import sys |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
print 'VEP CACHE setup.' | ||
|
||
|
||
#Global calls after `setup_vep_cache` is defined but before `app` decorators |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- If you place
app = Flask(__name__)
afterapp
decorators, you'll get aNameError
, sinceapp
isn't defined yet. - It should be fine to put the
setup_vep_cache()
behindif __name__ == '__main__'
. Butapp = Flask(__name__)
might cause issues (with Gunicorn, for example, though I would have to investigate more).
The pipeline has had a successful end-to-end run of annotating variants
and inserting them into BQ. Many components are inflexible, ignore edge
cases, or are otherwise brittle. Efforts have been made to document
current issues with TODO comments.