Quickstart

Mateo Torres edited this page Sep 22, 2021 · 7 revisions

This step-by-step guide is intended to get you started with S2F. We will go through installing S2F, and making your first prediction.

Requirements

You need to make sure that your computer meets the requirements, and that you have either downloaded this repository or cloned it using git. Important: if you want to use the installer with the provided configuration file, make sure that the binaries of the requirements are included in your PATH. These binaries are:

  • iprscan
  • phmmer
  • blastp
  • makeblastdb
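Before running the installer, it can help to confirm that these four binaries are reachable. The following is a minimal sketch, not part of S2F itself: the function name find_missing_binaries is hypothetical, and the check only tests whether each name resolves on the current PATH, not whether the installed version is suitable.

```python
import shutil

# Executables named in the requirements list above.
REQUIRED_BINARIES = ["iprscan", "phmmer", "blastp", "makeblastdb"]


def find_missing_binaries(names=REQUIRED_BINARIES):
    """Return the names that cannot be resolved on the current PATH.

    Hypothetical helper for illustration; S2F does not ship this function.
    """
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing_binaries()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All required binaries were found on PATH.")
```

If any name is reported as missing, add the directory that contains it to your PATH before continuing.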

Let's start by cloning this repository and moving into it to install S2F.

git clone https://github.com/paccanarolab/S2F
cd S2F

Install the requirements

S2F depends on several commonly used Python libraries. You can install all of them by running:

pip install -r requirements.txt

Install S2F

S2F comes with an interactive command-line installer. To start it, run:

python S2F.py install --config-file s2f.conf

This will download all of the required databases and save a configuration file with the options you have chosen. You can modify this configuration file later if you decide to, say, move the databases to a different drive. Important: this will download a copy of the STRING, GOA, and SwissProt databases, which are considerably large. The installation might take several hours, or even days, depending on the speed of your internet connection.

Running your first prediction

Required downloads

You will need to download the Gene Ontology go.obo file.

For this guide, let's assume that you have downloaded a FASTA file. (If you don't know where to get one, you can download the Supplementary Data from our website.) Let the name of the FASTA file be target.fasta.

Making your first prediction

The simplest way to make a prediction is to run the following command:

python S2F.py predict --config-file s2f.conf --alias myTarget --fasta target.fasta --obo go.obo

The process will then begin, and you will be able to find the results in S2F's installation directory. While S2F is running, messages will appear in the terminal to report the current step of the pipeline. A complete run (which includes the pairwise alignment of the provided proteome against the entire STRING database) takes between 5 and 10 hours on a computer with 12 cores. S2F saves intermediate results and maintains a cache of the aligned sequences, which reduces the runtime of subsequent runs significantly.

Let the configured output directory be ~/S2F-installation/output. For the prediction command used above, the prediction file will be located at ~/S2F-installation/output/myTarget/prediction.df. This is a tab-separated text file with the following columns:

  • Protein ID (matches the IDs in the FASTA file used as input)
  • GO term ID (from the provided go.obo file)
  • Score
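The three-column layout above can be read with a few lines of Python. This is a minimal sketch under stated assumptions: the file has no header row, fields are separated by single tab characters, and the score parses as a float. The function name read_predictions is hypothetical and is not part of S2F.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_predictions(handle: TextIO) -> Iterator[Tuple[str, str, float]]:
    """Yield (protein_id, go_term, score) tuples from a prediction file.

    Assumes tab-separated lines with exactly three fields and no header.
    Lines that do not have three fields are skipped.
    """
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # blank or malformed line
        protein_id, go_term, score = fields
        yield protein_id, go_term, float(score)


if __name__ == "__main__":
    # In-memory stand-in for a prediction.df file, used only for this demo.
    demo = io.StringIO("proteinA\tGO:0000001\t0.25\n")
    for record in read_predictions(demo):
        print(record)
```

Because the function is a generator, it reads one line at a time. To use it on a real file, pass an open file handle, for example open(path) on the prediction.df path described above.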

Minimal example

In this section, we go over the commands that you would run on an example FASTA file to get predictions using S2F. We assume that you have followed the installation instructions.

For this case, we will use the 83332.fasta file, which contains the protein sequences for Mycobacterium tuberculosis. (You can download a copy of this file in the Supplementary Data from our website.) Assuming you download this file to the directory where S2F.py is located, you simply need to run:

python S2F.py predict --config-file s2f.conf --alias 83332 --fasta 83332.fasta --obo go.obo

By default, the output file will be located at ~/S2F-installation/output/83332/prediction.df. As explained above, this is a tab-separated file. The first 10 lines look like this:

sp|A0A089QKZ7|Y155A_MYCTU       GO:0000001      9.224062214741884e-07
sp|A0A089QRB9|MSL3_MYCTU        GO:0000001      1.3479576439607364e-07
sp|E2FZM4|SOCA_MYCTU    GO:0000001      1.1725513954101507e-06
sp|E2FZM5|SOCB_MYCTU    GO:0000001      1.0743922748653777e-06
sp|I6WXS6|VPB51_MYCTU   GO:0000001      3.563783549701623e-07
sp|I6WZK7|MMCO_MYCTU    GO:0000001      1.0454690835351242e-06
sp|I6X486|PE25_MYCTU    GO:0000001      1.1318501983829803e-07
sp|I6X7F9|CDDTR_MYCTU   GO:0000001      3.906041154000092e-07
sp|I6X8R5|RV203_MYCTU   GO:0000001      7.748154286220518e-07
sp|I6XD65|PNCA_MYCTU    GO:0000001      1.0809237542558064e-07
... (there are 78939830 more lines in this file)
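Since the file has tens of millions of lines, it is usually more practical to process it as a stream than to load it whole. The sketch below keeps only the k highest-scoring GO terms for each protein. It is an illustration, not part of S2F: the function name top_terms_per_protein and the default k=3 are hypothetical, and the code assumes tab-separated lines with three fields and no header.

```python
import heapq
from collections import defaultdict


def top_terms_per_protein(lines, k=3):
    """Return {protein_id: [(score, go_term), ...]}, best score first.

    Keeps at most k entries per protein, so memory grows with the number
    of proteins rather than with the number of lines in the file.
    """
    heaps = defaultdict(list)  # protein_id -> min-heap of (score, go_term)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # blank or malformed line
        protein_id, go_term = fields[0], fields[1]
        entry = (float(fields[2]), go_term)
        heap = heaps[protein_id]
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            # Replace the current lowest-scoring entry for this protein.
            heapq.heapreplace(heap, entry)
    return {pid: sorted(heap, reverse=True) for pid, heap in heaps.items()}
```

An open file handle can be passed directly as lines, because iterating over a text file yields one line at a time.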