Create helper tool that generates information for a peptide order form #947

Open
malachig opened this issue Apr 5, 2023 · 0 comments

Introduction to the issue
A very manual (slow and error-prone) step in producing a neoantigen vaccine design is creating the spreadsheet of peptides that is supplied to the manufacturer. The peptide manufacturer will ultimately synthesize peptides of ~25-35 amino acids. The prioritized class I peptides are typically 8-11 amino acids long, and the prioritized class II peptides are typically 12-18 amino acids long. We supply the manufacturer with a 51-mer peptide from which they can choose a ~25-35-mer target to synthesize. We therefore want a file that contains the 51-mers and also indicates where the shorter class I and class II peptides fall within them, to inform selection of a 25-35-mer (i.e. one that includes the target class I and/or class II sequences).

Currently, to create this spreadsheet, we use the pvacseq generate_protein_fasta command. We supply the TSV exported from pVACview using the --input-tsv and --aggregate-report-evaluation "Accept,Review" options. The first two columns of the resulting "*manufacturability.tsv" file contain the information we need to start creating the peptide order form, but a lot of manual intervention is still required. Nevertheless, using pvacseq generate_protein_fasta as a starting point, or incorporating the creation of this new file into that tool, could be a reasonable approach.

Description of Required Input Files:

  • Variants-VCF. annotated.expression.vcf.gz (and .tbi; both from immuno pipeline)
  • Phased-Variants-VCF. phased.vcf.gz (and .tbi; both from immuno pipeline)
  • Reviewed-Candidates-TSV. Annotated.Neoantigen_Candidates-Final.tsv (exported from pVACview after ITB)
  • ClassI-Aggregated-Report. TumorDNA.all_epitopes.aggregated.classI.tsv (from pVACseq)
  • ClassII-Aggregated-Report. TumorDNA.all_epitopes.aggregated.classII.tsv (from pVACseq)
  • Cancer-Genes-TSV. CancerGeneCensus-Mar2023.tsv (from Cancer Gene Census as used in pVACview)

Description of Desired Output Files:

  • Peptide-Fasta.
  • Peptide-List. Annotated_filtered.vcf-pass-51mer.fa.manufacturability.tsv (from pvacseq generate_protein_fasta)
  • Peptide-Order-Form. Peptides_51-mer.xlsx

Format of Peptide-Order-Form:
The target output file has 5 columns.

  1. ID. e.g. "MT.10.RAB11FIP5.ENST00000258098.6.missense.131L/Q". This is the current ID already created by pvacseq generate_protein_fasta. It contains enough information to unambiguously understand how the long peptide sequence was created and to link this record back to more information in pVACseq.
  2. CANDIDATE NEOANTIGEN. e.g. "7071-04-MT.10.RAB11FIP5". This is a shorthand name for the peptide, prepended with a patient identifier.
  3. CANDIDATE NEOANTIGEN AMINO ACID SEQUENCE WITH FLANKING RESIDUES. e.g. "CELVLTTMHRSLIGVDKFLGQATVAQDEVFGAGRAQHTQWYKLHSKPGKKE". This is the long peptide sequence (e.g. a 51-mer) already created by pvacseq generate_protein_fasta, except that it has been "highlighted" following the rules described below.
  4. RESTRICTING HLA ALLELE. The class I and/or class II HLA alleles that correspond to the peptides highlighted in the previous column. e.g. "HLA-B*15:01 / DQA1*01:01". Include the "best" HLA allele from the ClassI-Aggregated-Report and from the ClassII-Aggregated-Report. We might want some minimum criteria here, but for now just include the best.
  5. CANDIDATE NEOANTIGEN AMINO ACID SEQUENCE MW (CLIENT). e.g. "5711.56". This is the calculated molecular weight of the peptide in the 3rd column. This is currently done manually with EMBOSS PepStats, but it amounts to adding up the molecular weights of the individual amino acid residues (plus one water for the peptide termini) using a lookup table. This could be done with a few lines of Python (or an existing Python library).
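
As a rough illustration of columns 2 and 5, here is a minimal sketch. It assumes that the shorthand name is the patient identifier joined to the first three dot-separated fields of the ID, and that standard average residue masses are an acceptable stand-in for the EMBOSS PepStats lookup table. The function names `candidate_name` and `molecular_weight` are hypothetical and not part of pVACtools.

```python
# Average residue masses in Da (amino acid minus one water).
# Standard textbook values; assumed here, not copied from EMBOSS.
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167,
    "V": 99.1326, "T": 101.1051, "C": 103.1388, "L": 113.1594,
    "I": 113.1594, "N": 114.1038, "D": 115.0886, "Q": 128.1307,
    "K": 128.1741, "E": 129.1155, "M": 131.1926, "H": 137.1411,
    "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528  # one water added back for the free N- and C-termini


def molecular_weight(peptide: str) -> float:
    """Sum the residue masses plus one water, rounded to 2 decimals."""
    if not peptide:
        raise ValueError("Empty peptide sequence")
    try:
        total = sum(RESIDUE_MASS[aa] for aa in peptide.upper())
    except KeyError as err:
        raise ValueError(f"Unsupported residue: {err.args[0]}") from None
    return round(total + WATER, 2)


def candidate_name(peptide_id: str, patient_id: str) -> str:
    """Build the column 2 shorthand name.

    Assumption: the first three dot-separated fields of the ID
    (e.g. "MT", "10", "RAB11FIP5") make up the shorthand name, which
    breaks if a gene name itself contains a dot.
    """
    return f"{patient_id}-" + ".".join(peptide_id.split(".")[:3])


example_id = "MT.10.RAB11FIP5.ENST00000258098.6.missense.131L/Q"
example_seq = "CELVLTTMHRSLIGVDKFLGQATVAQDEVFGAGRAQHTQWYKLHSKPGKKE"
print(candidate_name(example_id, "7071-04"))
print(molecular_weight(example_seq))
```

With these masses, the example 51-mer comes out at about 5711.56, in line with the example value for column 5, although agreement with PepStats on every sequence is not guaranteed.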

Description of the Peptide Highlighting:
NOTE: Initially, all of this information could be encoded in a TSV without any painful text formatting. This might be a useful intermediate file to have anyway if downstream automated processing is added. But if the following "pretty" version of the file could be created automatically, that would save a lot of time:

Rules for highlighting peptide sequences in column 3.

a. The mutant positions as indicated in the Reviewed-Candidates-TSV should be underlined. Note that this is now the position within the long peptide. Presumably pvacseq generate_protein_fasta knows what this position is because it uses this information to create the long peptide flanking the mutation.
b. The best class I candidate neoantigen sequence as indicated in the Reviewed-Candidates-TSV should be shown in red text.
c. The best class II candidate neoantigen sequence as indicated in the Reviewed-Candidates-TSV should be shown in bold text.
d. If the annotated gene of the candidate as indicated in the Reviewed-Candidates-TSV is a cancer gene, the whole row should be highlighted green. The definition of a cancer gene here is simply an exact match between the gene name in the Reviewed-Candidates-TSV and the first column of the Cancer-Genes-TSV.
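
Rules a-d could be encoded independently of any spreadsheet library, for example as a list of styled text runs per peptide plus a cancer-gene lookup. The sketch below is only one possible approach. The function names are hypothetical, the class I and class II peptides are made-up substrings of the example 51-mer, and the mutant index (0-based 25, the centre of the 51-mer) is an assumption, not a value taken from pvacseq generate_protein_fasta.

```python
import csv
import io
from itertools import groupby


def highlight_runs(long_peptide, mutant_positions,
                   class1_peptide=None, class2_peptide=None):
    """Split the long peptide into runs of residues with identical styling.

    mutant_positions are 0-based indices into long_peptide (rule a).
    The first occurrence of each short peptide is marked (rules b and c).
    """
    n = len(long_peptide)
    underline, red, bold = [False] * n, [False] * n, [False] * n
    for pos in mutant_positions:
        underline[pos] = True
    for peptide, flags in ((class1_peptide, red), (class2_peptide, bold)):
        if not peptide:
            continue
        start = long_peptide.find(peptide)
        if start < 0:
            raise ValueError(f"{peptide} not found in long peptide")
        for i in range(start, start + len(peptide)):
            flags[i] = True
    styles = list(zip(underline, red, bold))
    runs = []
    # Group consecutive positions that share the same (underline, red, bold).
    for style, idx in groupby(range(n), key=styles.__getitem__):
        idx = list(idx)
        text = long_peptide[idx[0]:idx[-1] + 1]
        runs.append((text, dict(zip(("underline", "red", "bold"), style))))
    return runs


def load_cancer_genes(handle):
    """Collect the first column of a tab-separated cancer gene file (rule d)."""
    return {row[0] for row in csv.reader(handle, delimiter="\t") if row}


seq = "CELVLTTMHRSLIGVDKFLGQATVAQDEVFGAGRAQHTQWYKLHSKPGKKE"
# Hypothetical best class I / class II peptides, for illustration only.
runs = highlight_runs(seq, [25], "ATVAQDEVF", "FLGQATVAQDEVFGA")
for text, style in runs:
    print(text, style)

# Stand-in for the Cancer-Genes-TSV; the real column layout is not assumed.
genes = load_cancer_genes(io.StringIO("TP53\tx\nKRAS\ty\n"))
highlight_row_green = "RAB11FIP5" in genes
```

The same runs could be written to the intermediate TSV as plain coordinates, or passed to a rich-text cell writer in a spreadsheet library to produce the formatted order form.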

Actual examples of all input and output files to use as a test case
/storage1/fs1/mgriffit/Active/griffithlab/example_data/peptide_order/
