The scripts directory includes all the scripts for analyzing the mass spectrometry data, running the Monte Carlo estimates, and making all the figures.
These scripts assume the following file structure:
/scripts/
/data/
    /tara_ocean_smags/
    /antiox_gene_name_lists/
    /protein_expression_data/
        /nunn_data/
    /phytoplankton_genomes/
        /frag_genome/
        /phaeo_genome/
    /mass_spec_data/
        /frag_data/
            /mzml_converted/
            /fdr_idxml/
        /phaeo_data_3/
            /mzml_converted/
            /fdr_idxml/
For frag_data, I got the data from the PRIDE project with the ID PXD007098. I specifically used these data files:
D05_1.raw
D120_1.raw
D1_2.raw
L120_3.raw
T0_3.raw
For phaeo_data_3, I got the data from the PRIDE project with the ID PXD014877. I used the script download_phaeo_proteome_3.sh for this.
Data for the phytoplankton genomes for protein stoichiometry were downloaded from this awesome paper by Delmont et al. (2021), specifically with:
# annotation files
wget https://www.genoscope.cns.fr/tara/localdata/data/SMAGs-v1/SMAGs_v1_EggNog.tar.gz
# protein sequences
wget https://www.genoscope.cns.fr/tara/localdata/data/SMAGs-v1/SMAGs_v1_concat.faa.tar.gz
Other files ('Table_S03_statistics_nr_SMAGs_METdb.xlsx') were manually downloaded from https://www.genoscope.cns.fr/tara/, and the ferritin sequence used is from https://www.uniprot.org/uniprot/B6DMH6.fasta .
converting_tara_to_single_line_fasta.sh
Converts the large file of SMAGs to a single line fasta file.
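As a rough illustration of what a single-line FASTA conversion involves, here is a minimal Python sketch. It is not the contents of converting_tara_to_single_line_fasta.sh; the function name to_single_line_fasta and the toy records are made up for this example.

```python
import io


def to_single_line_fasta(lines):
    """Yield FASTA lines with each record's sequence joined onto one line."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header
                yield "".join(chunks)
            header, chunks = line, []
        else:
            chunks.append(line)
    # Emit the final record.
    if header is not None:
        yield header
        yield "".join(chunks)


# Toy input (hypothetical records, not real SMAG sequences).
example = io.StringIO(">seq1\nMKT\nAYI\n>seq2\nGG\n")
single_line = list(to_single_line_fasta(example))
print("\n".join(single_line))
```

Having one sequence per line makes later subsetting by sequence ID simpler, since each record is then exactly two lines.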
tara_oceans_antioxidant.R
Script that gets all the antioxidant protein sequence IDs and makes txt files of lists, for eventually subsetting the large fasta file.
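The subsetting step that these ID lists feed into could look roughly like the Python sketch below. This is an assumption about the general approach, not the code used in this repository; subset_fasta and the example IDs are hypothetical, and the sequence ID is assumed to be the first whitespace-delimited token of each header.

```python
def subset_fasta(lines, wanted_ids):
    """Yield only the FASTA records whose ID is in wanted_ids."""
    wanted = set(wanted_ids)
    keep = False
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(">"):
            # Assumption: the ID is the first token after '>'.
            parts = line[1:].split()
            keep = bool(parts) and parts[0] in wanted
        if keep and line:
            yield line


# Toy records and IDs (hypothetical).
records = [">a some description", "MK", ">b", "GG", ">c", "TT"]
subset = list(subset_fasta(records, ["a", "c"]))
print("\n".join(subset))
```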
antioxidant_stoich_from_seqs.ipynb
Jupyter notebook that calculates the stoichiometric composition of various phytoplankton proteins. The plots for these proteins are made in plotting_antiox_stoich.R.
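For readers unfamiliar with the idea, one simple way to get an elemental composition from a protein sequence is to sum the atoms of each free amino acid and subtract one water per peptide bond. The Python sketch below shows that calculation only; it is an assumption about the general approach, not the notebook's actual code, and it ignores cofactors (such as bound metals) and post-translational modifications. The names AMINO_ACID_ATOMS and protein_stoichiometry are made up for this example.

```python
from collections import Counter

# Elemental formulas of the 20 standard free amino acids.
AMINO_ACID_ATOMS = {
    "A": {"C": 3, "H": 7, "N": 1, "O": 2},
    "R": {"C": 6, "H": 14, "N": 4, "O": 2},
    "N": {"C": 4, "H": 8, "N": 2, "O": 3},
    "D": {"C": 4, "H": 7, "N": 1, "O": 4},
    "C": {"C": 3, "H": 7, "N": 1, "O": 2, "S": 1},
    "E": {"C": 5, "H": 9, "N": 1, "O": 4},
    "Q": {"C": 5, "H": 10, "N": 2, "O": 3},
    "G": {"C": 2, "H": 5, "N": 1, "O": 2},
    "H": {"C": 6, "H": 9, "N": 3, "O": 2},
    "I": {"C": 6, "H": 13, "N": 1, "O": 2},
    "L": {"C": 6, "H": 13, "N": 1, "O": 2},
    "K": {"C": 6, "H": 14, "N": 2, "O": 2},
    "M": {"C": 5, "H": 11, "N": 1, "O": 2, "S": 1},
    "F": {"C": 9, "H": 11, "N": 1, "O": 2},
    "P": {"C": 5, "H": 9, "N": 1, "O": 2},
    "S": {"C": 3, "H": 7, "N": 1, "O": 3},
    "T": {"C": 4, "H": 9, "N": 1, "O": 3},
    "W": {"C": 11, "H": 12, "N": 2, "O": 2},
    "Y": {"C": 9, "H": 11, "N": 1, "O": 3},
    "V": {"C": 5, "H": 11, "N": 1, "O": 2},
}


def protein_stoichiometry(sequence):
    """Return C, H, N, O, S atom counts for an unmodified peptide chain."""
    total = Counter()
    for residue in sequence.upper():
        # Raises KeyError for non-standard symbols such as X or *.
        total.update(AMINO_ACID_ATOMS[residue])
    # Each peptide bond releases one water molecule (H2O).
    n_bonds = max(len(sequence) - 1, 0)
    total["H"] -= 2 * n_bonds
    total["O"] -= n_bonds
    return {element: total[element] for element in "CHNOS"}


# Glycyl-alanine as a small check: C5H10N2O3.
print(protein_stoichiometry("GA"))
```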
Once the raw MS data have been downloaded, they need to be converted to mzML. They were converted with convert_raw_to_mzml.sh, frag_converting_file.sh, and phaeo_converting_file.sh.
They are then searched against the corresponding genomes with database-searching-openms.sh, database-searching-frag.sh, and database-searching-phaeo.sh. Before the search, a contaminant database (CRAP) is appended to the genomes with adding_crap_to_genomes.sh.
Peptides are quantified using the FeatureFinderIdentification method (Weisser et al 2017) with the scripts feature-finder-general.sh, feature-finder-phyto-proteomes.sh, and feature-finder-phyto-proteomes-phaeo.sh.
The output of this is then converted to csv with protein-quant-openms.sh, protein-quant-frag.sh, and protein-quant-phaeo.sh.
We first need to determine some of the parameters, specifically the distribution of fold-changes for protein expression. Three scripts separately analyze the fold-change distribution from three different data sets: nunn_protein_expression_fold_change.R (Nunn et al 2013, PLoS ONE); li_dist_vals.R (Li et al 2014, Cell); schmidt_dist_vals.R (Schmidt et al 2015, Nature Biotechnology). Each of these outputs a parameter file with the appropriate parameters of the fitted log-normal distribution.
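To make the fitting step concrete, here is a minimal Python sketch of estimating log-normal parameters from a set of fold-changes: take natural logs, then compute their mean and standard deviation. This is an illustration under stated assumptions, not the method in the R scripts above; fit_log_normal and the toy values are made up, and the sample standard deviation is used here (statistics.pstdev would give the maximum-likelihood estimate instead).

```python
import math
import statistics


def fit_log_normal(fold_changes):
    """Estimate (mu, sigma) of a log-normal from positive fold-changes."""
    logs = [math.log(x) for x in fold_changes]
    mu = statistics.fmean(logs)
    sigma = statistics.stdev(logs)  # sample standard deviation of the logs
    return mu, sigma


# Toy fold-changes (hypothetical, not taken from any of the data sets).
fold_changes = [0.5, 1.0, 2.0]
mu, sigma = fit_log_normal(fold_changes)
print(mu, sigma)
```

Fold-changes must be strictly positive for the logarithm to be defined, so any zero or missing values would need to be filtered out first.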
The Monte Carlo method is run using monte_carlo_antioxidants_both.R.
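As a loose sketch of how fitted log-normal parameters can feed a Monte Carlo estimate, the Python below draws fold-changes from a log-normal distribution and summarises them with a median and a 95% interval. What monte_carlo_antioxidants_both.R actually simulates is not described here, so everything in this example (the function names, the parameter values mu=0.0 and sigma=0.5, the number of draws, and the choice of summary) is an assumption for illustration only.

```python
import random
import statistics


def simulate_fold_changes(mu, sigma, n_draws, seed=0):
    """Draw n_draws fold-changes from a log-normal(mu, sigma)."""
    rng = random.Random(seed)  # fixed seed so the draws are reproducible
    return [rng.lognormvariate(mu, sigma) for _ in range(n_draws)]


def summarise(draws):
    """Return the median and an approximate 95% interval of the draws."""
    ordered = sorted(draws)
    n = len(ordered)
    return {
        "median": statistics.median(ordered),
        "lower_2.5": ordered[int(0.025 * n)],
        "upper_97.5": ordered[int(0.975 * n) - 1],
    }


# Hypothetical parameters; in practice these would come from a parameter file.
draws = simulate_fold_changes(mu=0.0, sigma=0.5, n_draws=10_000)
summary = summarise(draws)
print(summary)
```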