Merge pull request #118 from jmmut/feature/reannotation
Re-annotation job:
- Added parameter annotation.overwrite to delete the current annotation and write it again
- Restriction to a single study
Cristina Yenyxe Gonzalez Garcia authored May 19, 2017
2 parents 0ab20d0 + bfe15c3 commit 8c2e586
Showing 29 changed files with 678 additions and 59 deletions.
42 changes: 35 additions & 7 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -69,6 +69,14 @@ The contents from the configuration files can be also provided directly as comma
* `spring.profiles.active`: "production" to keep track of half-executed jobs using a job repository database, "test" to use an in-memory database that will record a single run
* `app.opencga.path`: Path to the OpenCGA installation folder. An `ls` in that path should show the conf, analysis, bin and libs folders.

The following database credentials are used to connect to a MongoDB instance. See [MongoDB options documentation](https://docs.mongodb.com/manual/reference/program/mongo/#options). The database and collection names are listed below in the "Database parameters" section.

* `spring.data.mongodb.authentication-database`
* `spring.data.mongodb.host`
* `spring.data.mongodb.password`
* `spring.data.mongodb.port`
* `spring.data.mongodb.username`

If using a persistent (not in-memory) database, the following information needs to be filled in:

* `job.repository.driverClassName`: JDBC-specific argument that points to the database to use for the job repository (PostgreSQL tested and supported, driver name is `org.postgresql.Driver`)
Expand All @@ -79,14 +87,16 @@ If using a persistent (not in-memory database), the following information needs
Other parameters are:

* `config.db.read-preference`: In a distributed Mongo environment, replica to connect to (primary or secondary, default primary).
* `--logging.level.uk.ac.ebi.eva`: DEBUG, INFO, WARN, ERROR supported among others. Recommended DEBUG.
* `--logging.level.org.opencb.opencga`: Recommended DEBUG.
* `--logging.level.org.springframework`: Recommended INFO or WARN.
* `logging.level.uk.ac.ebi.eva`: DEBUG, INFO, WARN, ERROR supported among others. Recommended DEBUG.
* `logging.level.org.opencb.opencga`: Recommended DEBUG.
* `logging.level.org.springframework`: Recommended INFO or WARN.


### General job tuning
### Job parameters

* `--spring.batch.job.names`: The name of the job to run. At the moment it can be `genotyped-vcf-job`, `aggregated-vcf-job`, `annotate-variants-job`, `calculate-statistics-job` or `drop-study-job`
#### Job configuration

* `spring.batch.job.names`: The name of the job to run. At the moment it can be `genotyped-vcf-job`, `aggregated-vcf-job`, `annotate-variants-job`, `calculate-statistics-job` or `drop-study-job`

Individual steps can be skipped using one of the following. This is not necessary unless they are irrelevant for the data to be processed, or some input data was generated in previous runs of the same job.

Expand All @@ -95,9 +105,11 @@ Individual steps can be skipped using one of the following. This is not necessar

Other parameters are:

* `config.chunk.size`: Size of batches across the pipeline (recommended from 100 to 5000).
* `annotation.overwrite`: True to overwrite annotations already associated to variants. False to annotate only variants without an existing annotation. Please note that if the `input.study.id` parameter is specified, annotation will be limited to variants from that study.
* `force.restart`: When included as a command line parameter, allows restarting a job. This also marks the last unfinished execution of the same job and parameters as cancelled in the job database.

### Job run tuning
#### Job inputs

* `input.vcf`: Path to the VCF to process. May be compressed.
* `input.vcf.id`: Unique ID for the VCF to process. Could be an analysis in the SRA model (please ignore if you don't know what SRA is).
Expand All @@ -110,12 +122,28 @@ Other parameters are:
* `input.pedigree`: PED file if available, in order to calculate population-based statistics.
* `input.fasta`: Path to the FASTA file with the reference sequence, in order to generate the VEP annotation.

#### Job outputs

* `output.dir`: Already existing folder to store the transformed VCF and statistics files.
* `output.dir.annotation`: Already existing folder to store VEP output files.
* `output.dir.statistics`: Already existing folder to store statistics output files.

#### Database parameters

Database name and collection names can be specified with these parameters. To set the database credentials, use the parameters in the "Environment" section.

* `spring.data.mongodb.database`: Name of the database that contains all the collections.
* `db.collections.variants.name`: Main collection. Has variant coordinates, sample information, and some statistics and annotation.
* `db.collections.files.name`: File (and study) metadata information.
* `db.collections.stats.name`: Main collection for statistics. The variants collection might contain a subset of this.
* `db.collections.annotation.metadata.name`: Main collection for annotation. The variants collection might contain a subset of this.

#### Configuration of third party applications

* `app.vep.cache.path`: Path to the VEP cache root folder.
* `app.vep.version`: Version of the VEP cache.
* `app.vep.version`: Version of the VEP executable.
* `app.vep.cache.version`: Version of the VEP cache.
* `app.vep.species`: Name of the species as stored in the cache folder.
* `app.vep.path`: Path to the VEP installation folder.
* `app.vep.num-forks`: Number of processes to run VEP in parallel (recommended 4).
* `app.vep.timeout`: If VEP doesn't respond in the specified number of seconds, the pipeline will assume that the step failed (recommended 300).
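
As an illustration of how the parameters above combine for a re-annotation run, the following sketch assembles them as `--name=value` command line arguments, the standard form in which Spring Boot accepts properties. The class name, the helper methods and the study identifier `study1` are hypothetical; only the parameter names and the job name come from the README.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ReannotationArgsSketch {

    /** Parameters for re-annotating a single study; the study ID is a caller-supplied placeholder. */
    static Map<String, String> reannotationParameters(String studyId) {
        Map<String, String> parameters = new LinkedHashMap<>();
        parameters.put("spring.batch.job.names", "annotate-variants-job");
        parameters.put("annotation.overwrite", "true");
        parameters.put("input.study.id", studyId);
        return parameters;
    }

    /** Renders each parameter as a "--name=value" argument, keeping insertion order. */
    static List<String> toArguments(Map<String, String> parameters) {
        List<String> arguments = new ArrayList<>();
        for (Map.Entry<String, String> entry : parameters.entrySet()) {
            arguments.add("--" + entry.getKey() + "=" + entry.getValue());
        }
        return arguments;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", toArguments(reannotationParameters("study1"))));
    }
}
```

A real run would also need the remaining parameters listed above (database, VEP and output settings); they are omitted here to keep the sketch short.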
Original file line number Diff line number Diff line change
Expand Up @@ -21,7 +21,7 @@
public class BeanNames {

public static final String GENE_READER = "gene-reader";
public static final String NON_ANNOTATED_VARIANTS_READER = "non-annotated-variants-reader";
public static final String VARIANTS_READER = "variants-reader";
public static final String VARIANT_ANNOTATION_READER = "variant-annotation-reader";
public static final String VARIANT_READER = "variant-reader";

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -20,29 +20,35 @@
import org.springframework.context.annotation.Configuration;
import org.springframework.data.mongodb.core.MongoOperations;

import uk.ac.ebi.eva.pipeline.io.readers.NonAnnotatedVariantsMongoReader;
import uk.ac.ebi.eva.pipeline.io.readers.VariantsMongoReader;
import uk.ac.ebi.eva.pipeline.parameters.AnnotationParameters;
import uk.ac.ebi.eva.pipeline.parameters.DatabaseParameters;
import uk.ac.ebi.eva.pipeline.parameters.InputParameters;

import static uk.ac.ebi.eva.pipeline.configuration.BeanNames.NON_ANNOTATED_VARIANTS_READER;
import static uk.ac.ebi.eva.pipeline.configuration.BeanNames.VARIANTS_READER;

/**
* Configuration to inject a NonAnnotatedVariantsMongoReader bean that reads from a mongo database in the pipeline
* Configuration to inject a VariantsMongoReader bean that reads from a mongo database in the pipeline
*/
@Configuration
public class NonAnnotatedVariantsMongoReaderConfiguration {
public class VariantsMongoReaderConfiguration {

@Bean(NON_ANNOTATED_VARIANTS_READER)
@Bean(VARIANTS_READER)
@StepScope
public NonAnnotatedVariantsMongoReader nonAnnotatedVariantsMongoReader(MongoOperations mongoOperations,
DatabaseParameters databaseParameters,
InputParameters inputParameters) {
NonAnnotatedVariantsMongoReader nonAnnotatedVariantsMongoReader = new NonAnnotatedVariantsMongoReader(
public VariantsMongoReader variantsMongoReader(MongoOperations mongoOperations,
DatabaseParameters databaseParameters,
InputParameters inputParameters,
AnnotationParameters annotationParameters) {
// to overwrite annotation we have to bring all variants (non annotated and annotated)
boolean excludeAnnotated = !annotationParameters.getOverwriteAnnotation();

VariantsMongoReader variantsMongoReader = new VariantsMongoReader(
mongoOperations,
databaseParameters.getCollectionVariantsName(),
inputParameters.getStudyId());
nonAnnotatedVariantsMongoReader.setSaveState(false);
return nonAnnotatedVariantsMongoReader;
inputParameters.getStudyId(),
excludeAnnotated);
variantsMongoReader.setSaveState(false);
return variantsMongoReader;
}

}
Original file line number Diff line number Diff line change
Expand Up @@ -37,7 +37,7 @@
* {@link org.springframework.batch.item.data.MongoItemReader} is using
* pagination and it is slow with large collections
*/
public class NonAnnotatedVariantsMongoReader
public class VariantsMongoReader
extends AbstractItemCountingItemStreamItemReader<VariantWrapper> implements InitializingBean {

private MongoDbCursorItemReader delegateReader;
Expand All @@ -50,9 +50,11 @@ public class NonAnnotatedVariantsMongoReader
/**
* @param studyId Can be the empty string or null, meaning to bring all non-annotated variants in the collection.
* If the studyId string is not empty, bring only non-annotated variants from that study.
* @param excludeAnnotated bring only non-annotated variants.
*/
public NonAnnotatedVariantsMongoReader(MongoOperations template, String collectionsVariantsName, String studyId) {
setName(ClassUtils.getShortName(NonAnnotatedVariantsMongoReader.class));
public VariantsMongoReader(MongoOperations template, String collectionsVariantsName, String studyId,
boolean excludeAnnotated) {
setName(ClassUtils.getShortName(VariantsMongoReader.class));
delegateReader = new MongoDbCursorItemReader();
delegateReader.setTemplate(template);
delegateReader.setCollection(collectionsVariantsName);
Expand All @@ -61,8 +63,10 @@ public NonAnnotatedVariantsMongoReader(MongoOperations template, String collecti
if (studyId != null && !studyId.isEmpty()) {
queryBuilder.add(STUDY_KEY, studyId);
}
DBObject query = queryBuilder.add("annot.ct.so", new BasicDBObject("$exists", false)).get();
delegateReader.setQuery(query);
if (excludeAnnotated) {
queryBuilder.add("annot.ct.so", new BasicDBObject("$exists", false));
}
delegateReader.setQuery(queryBuilder.get());

String[] fields = {"chr", "start", "end", "ref", "alt"};
delegateReader.setFields(fields);
Expand Down
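
The effect of the new `excludeAnnotated` flag on the Mongo filter can be sketched without a database by building the same conditions into a plain map. This is a simplified stand-in, not the reader itself: it uses `java.util.Map` instead of the driver's query builder, and the value of `STUDY_KEY` is an assumption, since the diff shows only the constant's name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class VariantsQuerySketch {

    // Assumed value for illustration; the diff references STUDY_KEY without showing it.
    static final String STUDY_KEY = "files.sid";

    /** Builds the filter as a plain map, mirroring the conditional logic of the reader's constructor. */
    static Map<String, Object> buildQuery(String studyId, boolean overwriteAnnotation) {
        // Overwriting needs every variant, so annotated ones are excluded only when not overwriting.
        boolean excludeAnnotated = !overwriteAnnotation;

        Map<String, Object> query = new LinkedHashMap<>();
        if (studyId != null && !studyId.isEmpty()) {
            // Restrict the job to a single study.
            query.put(STUDY_KEY, studyId);
        }
        if (excludeAnnotated) {
            // Keep only variants that have no consequence type annotation yet.
            Map<String, Object> notExists = new LinkedHashMap<>();
            notExists.put("$exists", false);
            query.put("annot.ct.so", notExists);
        }
        return query;
    }

    public static void main(String[] args) {
        System.out.println(buildQuery("study1", false));
        System.out.println(buildQuery("study1", true));
        System.out.println(buildQuery(null, true));
    }
}
```

With `annotation.overwrite` set to true and no study ID, the filter is empty, so every variant in the collection would be read.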
11 changes: 6 additions & 5 deletions src/main/java/uk/ac/ebi/eva/pipeline/jobs/AnnotationJob.java
Original file line number Diff line number Diff line change
Expand Up @@ -32,18 +32,18 @@

import uk.ac.ebi.eva.pipeline.jobs.flows.AnnotationFlow;
import uk.ac.ebi.eva.pipeline.parameters.NewJobIncrementer;
import uk.ac.ebi.eva.pipeline.parameters.validation.job.AnnotationJobParametersValidator;

import static uk.ac.ebi.eva.pipeline.configuration.BeanNames.ANNOTATE_VARIANTS_JOB;
import static uk.ac.ebi.eva.pipeline.configuration.BeanNames.VEP_ANNOTATION_FLOW;

/**
* Batch class to wire together:
* 1) generateVepInputStep - Dump a list of variants without annotations to be used as input for VEP
* 2) annotationCreate - run VEP
* 1) generateVepInputStep - Dump a list of variants without annotations and run VEP with them
* 3) annotationLoadBatchStep - Load VEP annotations into mongo
* <p>
* Optional flow: variantsAnnotGenerateInput --> (annotationCreate --> annotationLoad)
* annotationCreate and annotationLoad steps are only executed if variantsAnnotGenerateInput is generating a
* Optional flow: variantsAnnotGenerateInput --> (annotationLoad)
* annotationLoad step is only executed if variantsAnnotGenerateInput is generating a
* non-empty VEP input file
*
* TODO add a new AnnotationJobParametersValidator
Expand All @@ -67,7 +67,8 @@ public Job annotateVariantsJob(JobBuilderFactory jobBuilderFactory) {

JobBuilder jobBuilder = jobBuilderFactory
.get(ANNOTATE_VARIANTS_JOB)
.incrementer(new NewJobIncrementer());
.incrementer(new NewJobIncrementer())
.validator(new AnnotationJobParametersValidator());
return jobBuilder.start(annotation).build().build();
}

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -30,15 +30,15 @@
import org.springframework.context.annotation.Import;

import uk.ac.ebi.eva.pipeline.configuration.ChunkSizeCompletionPolicyConfiguration;
import uk.ac.ebi.eva.pipeline.configuration.readers.NonAnnotatedVariantsMongoReaderConfiguration;
import uk.ac.ebi.eva.pipeline.configuration.readers.VariantsMongoReaderConfiguration;
import uk.ac.ebi.eva.pipeline.configuration.writers.VepAnnotationFileWriterConfiguration;
import uk.ac.ebi.eva.pipeline.io.readers.AnnotationFlatFileReader;
import uk.ac.ebi.eva.pipeline.listeners.StepProgressListener;
import uk.ac.ebi.eva.pipeline.model.VariantWrapper;
import uk.ac.ebi.eva.pipeline.parameters.JobOptions;

import static uk.ac.ebi.eva.pipeline.configuration.BeanNames.GENERATE_VEP_ANNOTATION_STEP;
import static uk.ac.ebi.eva.pipeline.configuration.BeanNames.NON_ANNOTATED_VARIANTS_READER;
import static uk.ac.ebi.eva.pipeline.configuration.BeanNames.VARIANTS_READER;
import static uk.ac.ebi.eva.pipeline.configuration.BeanNames.VEP_ANNOTATION_WRITER;

/**
Expand All @@ -51,14 +51,14 @@
*/
@Configuration
@EnableBatchProcessing
@Import({NonAnnotatedVariantsMongoReaderConfiguration.class, VepAnnotationFileWriterConfiguration.class,
@Import({VariantsMongoReaderConfiguration.class, VepAnnotationFileWriterConfiguration.class,
ChunkSizeCompletionPolicyConfiguration.class})
public class GenerateVepAnnotationStep {

private static final Logger logger = LoggerFactory.getLogger(GenerateVepAnnotationStep.class);

@Autowired
@Qualifier(NON_ANNOTATED_VARIANTS_READER)
@Qualifier(VARIANTS_READER)
private ItemStreamReader<VariantWrapper> nonAnnotatedVariantsReader;

@Autowired
Expand Down
20 changes: 16 additions & 4 deletions src/main/java/uk/ac/ebi/eva/pipeline/model/VariantWrapper.java
Original file line number Diff line number Diff line change
Expand Up @@ -25,12 +25,13 @@ public class VariantWrapper {
private Variant variant;
private String strand = "+";

public VariantWrapper(Variant variant) {
this.variant = variant.copyInEnsemblFormat();
public VariantWrapper(String chromosome, int start, int end, String reference, String alternate) {
this(new Variant(chromosome, start, end, reference, alternate));
}

public VariantWrapper(String chromosome, int start, int end, String reference, String alternate) {
this.variant = new Variant(chromosome, start, end, reference, alternate).copyInEnsemblFormat();
public VariantWrapper(Variant variant) {
this.variant = variant;
transformToEnsemblFormat();
}

public String getChr() {
Expand All @@ -53,4 +54,15 @@ public String getStrand() {
return strand;
}

private void transformToEnsemblFormat() {
variant.setEnd(variant.getStart() + variant.getReference().length() - 1);

if (variant.getReference().equals("")) {
variant.setReference("-");
}

if (variant.getAlternate().equals("")) {
variant.setAlternate("-");
}
}
}
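
To make the coordinate change in `transformToEnsemblFormat` concrete, here is a minimal sketch that applies the same three rules to a stand-in class. `SimpleVariant` is hypothetical and carries only the fields the transformation touches; it is not the `Variant` model used by the pipeline.

```java
class EnsemblFormatSketch {

    /** Minimal stand-in for the variant model; only the fields the transformation uses. */
    static final class SimpleVariant {
        final int start;
        int end;
        String reference;
        String alternate;

        SimpleVariant(int start, int end, String reference, String alternate) {
            this.start = start;
            this.end = end;
            this.reference = reference;
            this.alternate = alternate;
        }
    }

    /** Same rules as the diff: recompute the end from the reference length, use "-" for empty alleles. */
    static void transformToEnsemblFormat(SimpleVariant variant) {
        // The end is computed before the empty reference is replaced, so insertions get end = start - 1.
        variant.end = variant.start + variant.reference.length() - 1;

        if (variant.reference.isEmpty()) {
            variant.reference = "-";
        }
        if (variant.alternate.isEmpty()) {
            variant.alternate = "-";
        }
    }

    public static void main(String[] args) {
        SimpleVariant insertion = new SimpleVariant(100, 100, "", "A");
        transformToEnsemblFormat(insertion);
        System.out.println(insertion.start + " " + insertion.end + " "
                + insertion.reference + "/" + insertion.alternate);
    }
}
```

Note that the order of operations matters: computing the end after replacing an empty reference with "-" would give a length of 1 instead of 0.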
Original file line number Diff line number Diff line change
Expand Up @@ -69,6 +69,9 @@ public class AnnotationParameters {
@Value(PARAMETER + JobParametersNames.INPUT_FASTA + END)
private String inputFasta;

@Value(PARAMETER + JobParametersNames.ANNOTATION_OVERWRITE + "']?:false}")
private Boolean overwriteAnnotation;

public String getVepPath() {
return vepPath;
}
Expand Down Expand Up @@ -101,6 +104,10 @@ public String getInputFasta() {
return inputFasta;
}

public Boolean getOverwriteAnnotation() {
return overwriteAnnotation;
}

public String getVepOutput() {
return URLHelper.resolveVepOutput(outputDirAnnotation, studyId, fileId);
}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -94,7 +94,9 @@ public class JobParametersNames {

public static final String STATISTICS_SKIP = "statistics.skip";

public static final String STATISTICS_OVERWRITE = "statistics.overwrite";
public static final String STATISTICS_OVERWRITE = "statistics.overwrite"; // FIXME this is only used in tests

public static final String ANNOTATION_OVERWRITE = "annotation.overwrite";


/*
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,41 @@
/*
* Copyright 2015-2017 EMBL - European Bioinformatics Institute
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package uk.ac.ebi.eva.pipeline.parameters.validation;

import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersInvalidException;
import org.springframework.batch.core.JobParametersValidator;

import uk.ac.ebi.eva.pipeline.parameters.JobParametersNames;

/**
* Checks that the option to overwrite annotation has been filled in and it is "true" or "false".
*
* Throws JobParametersInvalidException if the overwrite annotation option is null, empty, or any text other
* than 'true' or 'false'.
*/
public class AnnotationOverwriteValidator implements JobParametersValidator {

@Override
public void validate(JobParameters parameters) throws JobParametersInvalidException {
String annotationOverwriteValue = parameters.getString(JobParametersNames.ANNOTATION_OVERWRITE);

ParametersValidatorUtil.checkIsValidString(
annotationOverwriteValue, JobParametersNames.ANNOTATION_OVERWRITE);
ParametersValidatorUtil.checkIsBoolean(
annotationOverwriteValue, JobParametersNames.ANNOTATION_OVERWRITE);
}
}
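
The two checks performed by the validator can be approximated as follows. The bodies of `ParametersValidatorUtil.checkIsValidString` and `checkIsBoolean` are not part of this diff, so the versions below are assumptions: they throw `IllegalArgumentException` in place of `JobParametersInvalidException` to stay free of Spring dependencies, and they compare case-insensitively, which the real utility may or may not do.

```java
class AnnotationOverwriteCheckSketch {

    static final String ANNOTATION_OVERWRITE = "annotation.overwrite";

    /** Hypothetical equivalent of checkIsValidString: rejects null and blank values. */
    static void checkIsValidString(String value, String parameterName) {
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException(parameterName + " must be specified");
        }
    }

    /** Hypothetical equivalent of checkIsBoolean: accepts only "true" or "false" (case-insensitive here). */
    static void checkIsBoolean(String value, String parameterName) {
        if (!"true".equalsIgnoreCase(value) && !"false".equalsIgnoreCase(value)) {
            throw new IllegalArgumentException(parameterName + " must be 'true' or 'false', but was: " + value);
        }
    }

    /** Runs both checks in the same order as the validator in the diff. */
    static void validate(String annotationOverwriteValue) {
        checkIsValidString(annotationOverwriteValue, ANNOTATION_OVERWRITE);
        checkIsBoolean(annotationOverwriteValue, ANNOTATION_OVERWRITE);
    }

    /** Convenience wrapper that reports the outcome instead of throwing. */
    static boolean isValid(String annotationOverwriteValue) {
        try {
            validate(annotationOverwriteValue);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("true"));
        System.out.println(isValid("maybe"));
    }
}
```

Parsing with `Boolean.parseBoolean` alone would not be enough for this purpose, because it silently maps any unrecognised text to `false` instead of rejecting it.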