Merge pull request #118 from jmmut/feature/reannotation

Re-annotation job: - Added parameter annotation.overwrite to delete the current annotation and write it again - Restriction to a single study
EBIvariation · May 19, 2017 · 8c2e586 · 8c2e586
2 parents 0ab20d0 + bfe15c3
commit 8c2e586
Show file tree

Hide file tree

Showing 29 changed files with 678 additions and 59 deletions.
diff --git a/README.md b/README.md
@@ -69,6 +69,14 @@ The contents from the configuration files can be also provided directly as comma
 * `spring.profiles.active`: "production" to keep track of half-executed jobs using a job repository database, "test" to use an in-memory database that will record a single run
 * `app.opencga.path`: Path to the OpenCGA installation folder. An `ls` in that path should show the conf, analysis, bin and libs folders.
 
+Database credentials used to connect to a MongoDB instance. See [MongoDB options documentation](https://docs.mongodb.com/manual/reference/program/mongo/#options). The database and collection names are listed below in the "Database parameters" section.
+
+* `spring.data.mongodb.authentication-database`
+* `spring.data.mongodb.host`
+* `spring.data.mongodb.password`
+* `spring.data.mongodb.port`
+* `spring.data.mongodb.username`
+
 If using a persistent (not in-memory database), the following information needs to be filled in:
 
 * `job.repository.driverClassName`: JDBC-specific argument that points to the database to use for the job repository (PostgreSQL tested and supported, driver name is `org.postgresql.Driver`)
@@ -79,14 +87,16 @@ If using a persistent (not in-memory database), the following information needs
 Other parameters are:
 
 * `config.db.read-preference`: In a distributed Mongo environment, replica to connect to (primary or secondary, default primary).
-* `--logging.level.uk.ac.ebi.eva`: DEBUG, INFO, WARN, ERROR supported among others. Recommended DEBUG.
-* `--logging.level.org.opencb.opencga`: Recommended DEBUG.
-* `--logging.level.org.springframework`: Recommended INFO or WARN.
+* `logging.level.uk.ac.ebi.eva`: DEBUG, INFO, WARN, ERROR supported among others. Recommended DEBUG.
+* `logging.level.org.opencb.opencga`: Recommended DEBUG.
+* `logging.level.org.springframework`: Recommended INFO or WARN.
 
 
-### General job tuning
+### Job parameters
 
-* `--spring.batch.job.names`: The name of the job to run. At the moment it can be `genotyped-vcf-job`, `aggregated-vcf-job`, `annotate-variants-job`, `calculate-statistics-job` or `drop-study-job`
+#### Job configuration
+
+* `spring.batch.job.names`: The name of the job to run. At the moment it can be `genotyped-vcf-job`, `aggregated-vcf-job`, `annotate-variants-job`, `calculate-statistics-job` or `drop-study-job`
 
 Individual steps can be skipped using one of the following. This is not necessary unless they are irrelevant for the data to be processed, or some input data was generated in previous runs of the same job.
 
@@ -95,9 +105,11 @@ Individual steps can be skipped using one of the following. This is not necessar
 
 Other parameters are:
 
+* `config.chunk.size`: Size of batches across the pipeline (recommended from 100 to 5000).
+* `annotation.overwrite`: True to overwrite annotations already associated to variants. False to annotate only variants without an existing annotation. Please note that if the `input.study.id` parameter is specified, annotation will be limited to variants from that study.
 * `force.restart`: When included as command line parameter allows to restart a a job. This will also mark the last execution not finished of the same job / parameters as cancelled in the job database.
 
-### Job run tuning
+#### Job inputs
 
 * `input.vcf`: Path to the VCF to process. May be compressed.
 * `input.vcf.id`: Unique ID for the VCF to process. Could be an analysis in the SRA model (please ignore if you don't know what SRA is).
@@ -110,12 +122,28 @@ Other parameters are:
 * `input.pedigree`: PED file if available, in order to calculate population-based statistics.
 * `input.fasta`: Path to the FASTA file with the reference sequence, in order to generate the VEP annotation.
 
+#### Job outputs
+
 * `output.dir`: Already existing folder to store the transformed VCF and statistics files.
 * `output.dir.annotation`: Already existing folder to store VEP output files.
 * `output.dir.statistics`: Already existing folder to store statistics output files.
 
+#### Database parameters
+
+Database name and collection names can be specified with these parameters. To set the database credentials, use the parameters in the "Environment" section.
+
+* `spring.data.mongodb.database`: Database name, that contain all the collections
+* `db.collections.variants.name`: Main collection. Has variant coordinates, sample information, and some statistics and annotation.
+* `db.collections.files.name`: File (and study) metadata information.
+* `db.collections.stats.name`: Main collection for statistics. The variants collection might contain a subset of this.
+* `db.collections.annotation.metadata.name`: Main collection for annotation. The variants collection might contain a subset of this.
+
+#### Configuration of third party applications
+
 * `app.vep.cache.path`: Path to the VEP cache root folder.
-* `app.vep.version`: Version of the VEP cache.
+* `app.vep.version`: Version of the VEP executable.
+* `app.vep.cache.version`: Version of the VEP cache.
 * `app.vep.species`: Name of the species as stored in the cache folder.
 * `app.vep.path`: Path to the VEP installation folder.
 * `app.vep.num-forks`: Number of processes to run VEP in parallel (recommended 4).
+* `app.vep.timeout`: If VEP doesn't respond in the specified number of seconds, the pipeline will assume that the step failed (recommended 300).
diff --git a/src/main/java/uk/ac/ebi/eva/pipeline/configuration/BeanNames.java b/src/main/java/uk/ac/ebi/eva/pipeline/configuration/BeanNames.java
@@ -21,7 +21,7 @@
 public class BeanNames {
 
     public static final String GENE_READER = "gene-reader";
-    public static final String NON_ANNOTATED_VARIANTS_READER = "non-annotated-variants-reader";
+    public static final String VARIANTS_READER = "variants-reader";
     public static final String VARIANT_ANNOTATION_READER = "variant-annotation-reader";
     public static final String VARIANT_READER = "variant-reader";
 

diff --git a/...atedVariantsMongoReaderConfiguration.java → ...ers/VariantsMongoReaderConfiguration.java b/...atedVariantsMongoReaderConfiguration.java → ...ers/VariantsMongoReaderConfiguration.java
@@ -20,29 +20,35 @@
 import org.springframework.context.annotation.Configuration;
 import org.springframework.data.mongodb.core.MongoOperations;
 
-import uk.ac.ebi.eva.pipeline.io.readers.NonAnnotatedVariantsMongoReader;
+import uk.ac.ebi.eva.pipeline.io.readers.VariantsMongoReader;
+import uk.ac.ebi.eva.pipeline.parameters.AnnotationParameters;
 import uk.ac.ebi.eva.pipeline.parameters.DatabaseParameters;
 import uk.ac.ebi.eva.pipeline.parameters.InputParameters;
 
-import static uk.ac.ebi.eva.pipeline.configuration.BeanNames.NON_ANNOTATED_VARIANTS_READER;
+import static uk.ac.ebi.eva.pipeline.configuration.BeanNames.VARIANTS_READER;
 
 /**
- * Configuration to inject a NonAnnotatedVariantsMongoReader bean that reads from a mongo database in the pipeline
+ * Configuration to inject a VariantsMongoReader bean that reads from a mongo database in the pipeline
  */
 @Configuration
-public class NonAnnotatedVariantsMongoReaderConfiguration {
+public class VariantsMongoReaderConfiguration {
 
-    @Bean(NON_ANNOTATED_VARIANTS_READER)
+    @Bean(VARIANTS_READER)
     @StepScope
-    public NonAnnotatedVariantsMongoReader nonAnnotatedVariantsMongoReader(MongoOperations mongoOperations,
-                                                                           DatabaseParameters databaseParameters,
-                                                                           InputParameters inputParameters) {
-        NonAnnotatedVariantsMongoReader nonAnnotatedVariantsMongoReader = new NonAnnotatedVariantsMongoReader(
+    public VariantsMongoReader variantsMongoReader(MongoOperations mongoOperations,
+                                                   DatabaseParameters databaseParameters,
+                                                   InputParameters inputParameters,
+                                                   AnnotationParameters annotationParameters) {
+        // to overwrite annotation we have to bring all variants (non annotated and annotated)
+        boolean excludeAnnotated = !annotationParameters.getOverwriteAnnotation();
+
+        VariantsMongoReader variantsMongoReader = new VariantsMongoReader(
                 mongoOperations,
                 databaseParameters.getCollectionVariantsName(),
-                inputParameters.getStudyId());
-        nonAnnotatedVariantsMongoReader.setSaveState(false);
-        return nonAnnotatedVariantsMongoReader;
+                inputParameters.getStudyId(),
+                excludeAnnotated);
+        variantsMongoReader.setSaveState(false);
+        return variantsMongoReader;
     }
 
 }
diff --git a/...ders/NonAnnotatedVariantsMongoReader.java → ...eline/io/readers/VariantsMongoReader.java b/...ders/NonAnnotatedVariantsMongoReader.java → ...eline/io/readers/VariantsMongoReader.java
@@ -37,7 +37,7 @@
  * {@link org.springframework.batch.item.data.MongoItemReader} is using
  * pagination and it is slow with large collections
  */
-public class NonAnnotatedVariantsMongoReader
+public class VariantsMongoReader
         extends AbstractItemCountingItemStreamItemReader<VariantWrapper> implements InitializingBean {
 
     private MongoDbCursorItemReader delegateReader;
@@ -50,9 +50,11 @@ public class NonAnnotatedVariantsMongoReader
     /**
      * @param studyId Can be the empty string or null, meaning to bring all non-annotated variants in the collection.
      * If the studyId string is not empty, bring only non-annotated variants from that study.
+     * @param excludeAnnotated bring only non-annotated variants.
      */
-    public NonAnnotatedVariantsMongoReader(MongoOperations template, String collectionsVariantsName, String studyId) {
-        setName(ClassUtils.getShortName(NonAnnotatedVariantsMongoReader.class));
+    public VariantsMongoReader(MongoOperations template, String collectionsVariantsName, String studyId,
+                               boolean excludeAnnotated) {
+        setName(ClassUtils.getShortName(VariantsMongoReader.class));
         delegateReader = new MongoDbCursorItemReader();
         delegateReader.setTemplate(template);
         delegateReader.setCollection(collectionsVariantsName);
@@ -61,8 +63,10 @@ public NonAnnotatedVariantsMongoReader(MongoOperations template, String collecti
         if (studyId != null && !studyId.isEmpty()) {
             queryBuilder.add(STUDY_KEY, studyId);
         }
-        DBObject query = queryBuilder.add("annot.ct.so", new BasicDBObject("$exists", false)).get();
-        delegateReader.setQuery(query);
+        if (excludeAnnotated) {
+            queryBuilder.add("annot.ct.so", new BasicDBObject("$exists", false));
+        }
+        delegateReader.setQuery(queryBuilder.get());
 
         String[] fields = {"chr", "start", "end", "ref", "alt"};
         delegateReader.setFields(fields);

diff --git a/src/main/java/uk/ac/ebi/eva/pipeline/jobs/AnnotationJob.java b/src/main/java/uk/ac/ebi/eva/pipeline/jobs/AnnotationJob.java
@@ -32,18 +32,18 @@
 
 import uk.ac.ebi.eva.pipeline.jobs.flows.AnnotationFlow;
 import uk.ac.ebi.eva.pipeline.parameters.NewJobIncrementer;
+import uk.ac.ebi.eva.pipeline.parameters.validation.job.AnnotationJobParametersValidator;
 
 import static uk.ac.ebi.eva.pipeline.configuration.BeanNames.ANNOTATE_VARIANTS_JOB;
 import static uk.ac.ebi.eva.pipeline.configuration.BeanNames.VEP_ANNOTATION_FLOW;
 
 /**
  * Batch class to wire together:
- * 1) generateVepInputStep - Dump a list of variants without annotations to be used as input for VEP
- * 2) annotationCreate - run VEP
+ * 1) generateVepInputStep - Dump a list of variants without annotations and run VEP with them
  * 3) annotationLoadBatchStep - Load VEP annotations into mongo
  * <p>
- * Optional flow: variantsAnnotGenerateInput --> (annotationCreate --> annotationLoad)
- * annotationCreate and annotationLoad steps are only executed if variantsAnnotGenerateInput is generating a
+ * Optional flow: variantsAnnotGenerateInput --> (annotationLoad)
+ * annotationLoad step is only executed if variantsAnnotGenerateInput is generating a
  * non-empty VEP input file
  *
  * TODO add a new AnnotationJobParametersValidator
@@ -67,7 +67,8 @@ public Job annotateVariantsJob(JobBuilderFactory jobBuilderFactory) {
 
         JobBuilder jobBuilder = jobBuilderFactory
                 .get(ANNOTATE_VARIANTS_JOB)
-                .incrementer(new NewJobIncrementer());
+                .incrementer(new NewJobIncrementer())
+                .validator(new AnnotationJobParametersValidator());
         return jobBuilder.start(annotation).build().build();
     }
 

diff --git a/src/main/java/uk/ac/ebi/eva/pipeline/jobs/steps/GenerateVepAnnotationStep.java b/src/main/java/uk/ac/ebi/eva/pipeline/jobs/steps/GenerateVepAnnotationStep.java
@@ -30,15 +30,15 @@
 import org.springframework.context.annotation.Import;
 
 import uk.ac.ebi.eva.pipeline.configuration.ChunkSizeCompletionPolicyConfiguration;
-import uk.ac.ebi.eva.pipeline.configuration.readers.NonAnnotatedVariantsMongoReaderConfiguration;
+import uk.ac.ebi.eva.pipeline.configuration.readers.VariantsMongoReaderConfiguration;
 import uk.ac.ebi.eva.pipeline.configuration.writers.VepAnnotationFileWriterConfiguration;
 import uk.ac.ebi.eva.pipeline.io.readers.AnnotationFlatFileReader;
 import uk.ac.ebi.eva.pipeline.listeners.StepProgressListener;
 import uk.ac.ebi.eva.pipeline.model.VariantWrapper;
 import uk.ac.ebi.eva.pipeline.parameters.JobOptions;
 
 import static uk.ac.ebi.eva.pipeline.configuration.BeanNames.GENERATE_VEP_ANNOTATION_STEP;
-import static uk.ac.ebi.eva.pipeline.configuration.BeanNames.NON_ANNOTATED_VARIANTS_READER;
+import static uk.ac.ebi.eva.pipeline.configuration.BeanNames.VARIANTS_READER;
 import static uk.ac.ebi.eva.pipeline.configuration.BeanNames.VEP_ANNOTATION_WRITER;
 
 /**
@@ -51,14 +51,14 @@
  */
 @Configuration
 @EnableBatchProcessing
-@Import({NonAnnotatedVariantsMongoReaderConfiguration.class, VepAnnotationFileWriterConfiguration.class,
+@Import({VariantsMongoReaderConfiguration.class, VepAnnotationFileWriterConfiguration.class,
         ChunkSizeCompletionPolicyConfiguration.class})
 public class GenerateVepAnnotationStep {
 
     private static final Logger logger = LoggerFactory.getLogger(GenerateVepAnnotationStep.class);
 
     @Autowired
-    @Qualifier(NON_ANNOTATED_VARIANTS_READER)
+    @Qualifier(VARIANTS_READER)
     private ItemStreamReader<VariantWrapper> nonAnnotatedVariantsReader;
 
     @Autowired

diff --git a/src/main/java/uk/ac/ebi/eva/pipeline/model/VariantWrapper.java b/src/main/java/uk/ac/ebi/eva/pipeline/model/VariantWrapper.java
@@ -25,12 +25,13 @@ public class VariantWrapper {
     private Variant variant;
     private String strand = "+";
 
-    public VariantWrapper(Variant variant) {
-        this.variant = variant.copyInEnsemblFormat();
+    public VariantWrapper(String chromosome, int start, int end, String reference, String alternate) {
+        this(new Variant(chromosome, start, end, reference, alternate));
     }
 
-    public VariantWrapper(String chromosome, int start, int end, String reference, String alternate) {
-        this.variant = new Variant(chromosome, start, end, reference, alternate).copyInEnsemblFormat();
+    public VariantWrapper(Variant variant) {
+        this.variant = variant;
+        transformToEnsemblFormat();
     }
 
     public String getChr() {
@@ -53,4 +54,15 @@ public String getStrand() {
         return strand;
     }
 
+    private void transformToEnsemblFormat() {
+        variant.setEnd(variant.getStart() + variant.getReference().length() - 1);
+
+        if (variant.getReference().equals("")) {
+            variant.setReference("-");
+        }
+
+        if (variant.getAlternate().equals("")) {
+            variant.setAlternate("-");
+        }
+    }
 }
diff --git a/src/main/java/uk/ac/ebi/eva/pipeline/parameters/AnnotationParameters.java b/src/main/java/uk/ac/ebi/eva/pipeline/parameters/AnnotationParameters.java
@@ -69,6 +69,9 @@ public class AnnotationParameters {
     @Value(PARAMETER + JobParametersNames.INPUT_FASTA + END)
     private String inputFasta;
 
+    @Value(PARAMETER + JobParametersNames.ANNOTATION_OVERWRITE + "']?:false}")
+    private Boolean overwriteAnnotation;
+
     public String getVepPath() {
         return vepPath;
     }
@@ -101,6 +104,10 @@ public String getInputFasta() {
         return inputFasta;
     }
 
+    public Boolean getOverwriteAnnotation() {
+        return overwriteAnnotation;
+    }
+
     public String getVepOutput() {
         return URLHelper.resolveVepOutput(outputDirAnnotation, studyId, fileId);
     }

diff --git a/src/main/java/uk/ac/ebi/eva/pipeline/parameters/JobParametersNames.java b/src/main/java/uk/ac/ebi/eva/pipeline/parameters/JobParametersNames.java
@@ -94,7 +94,9 @@ public class JobParametersNames {
 
     public static final String STATISTICS_SKIP = "statistics.skip";
 
-    public static final String STATISTICS_OVERWRITE = "statistics.overwrite";
+    public static final String STATISTICS_OVERWRITE = "statistics.overwrite";   // FIXME this is only used in tests
+
+    public static final String ANNOTATION_OVERWRITE = "annotation.overwrite";
 
 
     /*

diff --git a/src/main/java/uk/ac/ebi/eva/pipeline/parameters/validation/AnnotationOverwriteValidator.java b/src/main/java/uk/ac/ebi/eva/pipeline/parameters/validation/AnnotationOverwriteValidator.java
@@ -0,0 +1,41 @@
+/*
+ * Copyright 2015-2017 EMBL - European Bioinformatics Institute
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package uk.ac.ebi.eva.pipeline.parameters.validation;
+
+import org.springframework.batch.core.JobParameters;
+import org.springframework.batch.core.JobParametersInvalidException;
+import org.springframework.batch.core.JobParametersValidator;
+
+import uk.ac.ebi.eva.pipeline.parameters.JobParametersNames;
+
+/**
+ * Checks that the option to overwrite annotation has been filled in and it is "true" or "false".
+ *
+ * Throws JobParametersInvalidException If the overwrite annotation option is null or empty or any text different
+ * from 'true' or 'false'
+ */
+public class AnnotationOverwriteValidator implements JobParametersValidator {
+
+    @Override
+    public void validate(JobParameters parameters) throws JobParametersInvalidException {
+        String annotationOverwriteValue = parameters.getString(JobParametersNames.ANNOTATION_OVERWRITE);
+
+        ParametersValidatorUtil.checkIsValidString(
+                annotationOverwriteValue, JobParametersNames.ANNOTATION_OVERWRITE);
+        ParametersValidatorUtil.checkIsBoolean(
+                annotationOverwriteValue,JobParametersNames.ANNOTATION_OVERWRITE);
+    }
+}