Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial commit of annotation service. #7

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

Jessime
Copy link

@Jessime Jessime commented Sep 18, 2018

The pipeline has had a successful end-to-end run of annotating variants
and inserting them into BQ. Many components are inflexible, ignore edge
cases, or are otherwise brittle. Efforts have been made to document
current issues with TODO comments.

The pipeline has had a successful end-to-end run of annotating variants
and inserting them into BQ. Many components are inflexible, ignore edge
cases, or are otherwise brittle. Efforts have been made to document
current issues with TODO comments.
@Jessime Jessime requested a review from bashir2 September 18, 2018 13:59
Copy link
Member

@bashir2 bashir2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again Jessime for this prototype. I have made a bunch of comments, mostly at a high level without getting into minor issues. I think we can check in this code with enough warnings in the README file; another option is to leave it as an open PR for someone to take over in future (like the gcp-variant-transforms counterpart PR). Which way do you prefer?

import sys
import uuid
import requests
import apache_beam as beam
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this file belong to variant-annotation repo? I would have expected a pure client code here, i.e., clients_tests above (e.g., to annotate a single VCF file using the server) not something that uses Beam. If you want to keep it this way maybe it should be moved elsewhere like gcp-variant-transforms.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking of this file as my small integration test for the server, but I agree that it probably makes more sense to move to gcp-variant-transforms.

(I'll leave it in for now, but feel free to move it in the future.)

@@ -0,0 +1,98 @@
import sys
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make sure you add the header doc strings before, especially the Apache licence related lines (and a short description for each module).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

except ValueError:
raise ValueError(response.text)

if __name__ == '__main__':
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Please add a usage text and print it when there are not two arguments.

with beam.Pipeline(options=options) as p:
results = (p | beam.io.ReadFromText(infile)
| beam.ParDo(KeyGenFn())
| beam.GroupByKey()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should avoid this GroupByKey; maybe it is easier to remove this module from this repo for now and discuss this issue in the other PR in gcp-variant-transforms repo.

'staging_location': 'gs://jessime_test_bucket/staging',
'temp_location': 'gs://jessime_test_bucket/temp',
'job_name': 'vep-as-a-service8',
'setup_file': './setup.py',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is setup.py?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's a specific reason why it makes more sense to keep this in gcp-variant-transforms. This setup.py file refers to Variant Transforms main setup.py file.

print 'VEP CACHE setup.'


#Global calls after `setup_vep_cache` is defined but before `app` decorators
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why these need to be placed before app decorators? Can we protect this block by an if block such that this code is not executed when the module is imported? If not, is it possible to move these lines to a separate module that depends on this? I am asking because for example writing unit-tests in the current form is problematic.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. If you place app = Flask(__name__) after app decorators, you'll get a NameError, since app isn't defined yet.
  2. It should be fine to put the setup_vep_cache() behind if __name__ == '__main__'. But app = Flask(__name__) might cause issues (with Gunicorn, for example, though I would have to investigate more).

@@ -0,0 +1,3 @@
# Annotation Server

## TODO: add descriptions of architecture, development, deployment, etc.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, please add a note that this is a prototype that is not productionized yet but has been shown to scale well with Variant Transforms and add a link to your other PR: googlegenomics/gcp-variant-transforms#361

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I've also added a few lines about how to launch the server.

vep_file = 'vep_cache_homo_sapiens_GRCh38_91.tar.gz'
gcs_vep_file = os.path.join(gcs_dir, vep_file)
local_vep_file = os.path.join(local_cache, vep_file)
check_call('gsutil cp {} {}'.format(gcs_vep_file, local_cache).split())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a TODO for handling errors.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

return header_names, annotation_lines


def vep_tab_to_json(response):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it seems response is both input and output for this function, correct? If yes, can you separate it and do not mutate the input. Also it would be more readable that any response that is not over the network be renamed to something more descriptive, e.g., annotation_response or maybe annotation_stdout? I thought that should make understanding the code easier.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I already have a TODO to not mutate response. I've added another to make the name more descriptive.

"""Updates response dict to contain annotations grouped by their variants

Args:
response: Dict containing JSON response to send to client
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add an example for both input and output.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Will have to modify slightly when refactoring to avoid mutation.

Copy link
Author

@Jessime Jessime left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only addressed the points I had time to address and made fixes that affected documentation but not code.

Looking forward to seeing this in production eventually!

@@ -0,0 +1,3 @@
# Annotation Server

## TODO: add descriptions of architecture, development, deployment, etc.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I've also added a few lines about how to launch the server.

import sys
import uuid
import requests
import apache_beam as beam
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking of this file as my small integration test for the server, but I agree that it probably makes more sense to move to gcp-variant-transforms.

(I'll leave it in for now, but feel free to move it in the future.)

'staging_location': 'gs://jessime_test_bucket/staging',
'temp_location': 'gs://jessime_test_bucket/temp',
'job_name': 'vep-as-a-service8',
'setup_file': './setup.py',
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's a specific reason why it makes more sense to keep this in gcp-variant-transforms. This setup.py file refers to Variant Transforms main setup.py file.

RUN perl INSTALL.pl \
--AUTO a \
--NO_UPDATE
WORKDIR ..
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely. I initially started using both the Dockerfile and the run_vep.sh, but was running into issues. I unfortunately don't remember the specifics, but I was having trouble getting those to run on GAE with Flask, the Dockerfile in particular. The two lines that took the most effort to get right were:

FROM gcr.io/google_appengine/python
CMD gunicorn -b :$PORT -t 600 main:app

I eventually gave up trying to modify the existing files (going "top-down", if you will), and decided to go "bottom-up" by building off of GPC minimalist examples. My intention was to continue building up functionality, using the original files as endpoints and guides.

ENV VEP_SCRIPT ensembl-vep/vep
ENV ASSEMBLY GRCh38

# Set virtualenv environment variables. This is equivalent to running
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a couple of deleted lines back.

vep_file = 'vep_cache_homo_sapiens_GRCh38_91.tar.gz'
gcs_vep_file = os.path.join(gcs_dir, vep_file)
local_vep_file = os.path.join(local_cache, vep_file)
check_call('gsutil cp {} {}'.format(gcs_vep_file, local_cache).split())
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

return header_names, annotation_lines


def vep_tab_to_json(response):
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I already have a TODO to not mutate response. I've added another to make the name more descriptive.

"""Updates response dict to contain annotations grouped by their variants

Args:
response: Dict containing JSON response to send to client
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Will have to modify slightly when refactoring to avoid mutation.

@@ -0,0 +1,98 @@
import sys
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

print 'VEP CACHE setup.'


#Global calls after `setup_vep_cache` is defined but before `app` decorators
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. If you place app = Flask(__name__) after app decorators, you'll get a NameError, since app isn't defined yet.
  2. It should be fine to put the setup_vep_cache() behind if __name__ == '__main__'. But app = Flask(__name__) might cause issues (with Gunicorn, for example, though I would have to investigate more).

@bashir2 bashir2 mentioned this pull request Oct 25, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants