Merge commit '38a7b3cc53d6d65dbdfd65268ab07bd0951b446e'
gamcil committed Dec 4, 2024
2 parents ecae81c + 38a7b3c commit 1c2a03b
Showing 58 changed files with 2,586 additions and 386 deletions.
8 changes: 6 additions & 2 deletions lib/foldseek/.github/workflows/mac-arm64.yml
@@ -8,16 +8,20 @@ on:

jobs:
  build:
    runs-on: [self-hosted, macOS, ARM64]
    runs-on: macos-latest
    steps:
      - uses: actions/checkout@v3
        with:
          submodules: true

      - name: Dependencies
        run: |
          brew install -f --overwrite cmake libomp rustup
          rustup-init --profile minimal -q -y
      - name: Build
        run: |
          mkdir -p build
          rustup update
          cd build
          LIBOMP=$(brew --prefix libomp)
          cmake \
125 changes: 95 additions & 30 deletions lib/foldseek/README.md
@@ -14,26 +14,45 @@ Foldseek enables fast and sensitive comparisons of large protein structure sets.
# Table of Contents

- [Foldseek](#foldseek)
- [Webserver](#webserver)
- [Installation](#installation)
- [Memory requirements](#memory-requirements)
- [Tutorial Video](#tutorial-video)
- [Documentation](#documentation)
- [Quick Start](#quick-start)
- [Search](#search)
- [Output](#output-search)
- [Important Parameters](#important-search-parameters)
- [Alignment Mode](#alignment-mode)
- [Structure search from FASTA input](#structure-search-from-fasta-input)
- [Databases](#databases)
- [Create Custom Databases and Indexes](#create-custom-databases-and-indexes)
- [Cluster](#cluster)
- [Output](#output-cluster)
- [Important Parameters](#important-cluster-parameters)
- [Multimer](#multimersearch)
- [Output](#multimer-search-output)
- [Main Modules](#main-modules)
- [Examples](#examples)
- [Publications](#publications)
- [Table of Contents](#table-of-contents)
- [Webserver](#webserver)
- [Installation](#installation)
- [Memory requirements](#memory-requirements)
- [Tutorial Video](#tutorial-video)
- [Documentation](#documentation)
- [Quick start](#quick-start)
- [Search](#search)
- [Output Search](#output-search)
- [Tab-separated](#tab-separated)
- [Superpositioned Cα only PDB files](#superpositioned-cα-only-pdb-files)
- [Interactive HTML](#interactive-html)
- [Important search parameters](#important-search-parameters)
- [Alignment Mode](#alignment-mode)
- [Structure search from FASTA input](#structure-search-from-fasta-input)
- [Databases](#databases)
- [Create custom databases and indexes](#create-custom-databases-and-indexes)
- [Cluster](#cluster)
- [Output Cluster](#output-cluster)
- [Tab-separated cluster](#tab-separated-cluster)
- [Representative fasta](#representative-fasta)
- [All member fasta](#all-member-fasta)
- [Important cluster parameters](#important-cluster-parameters)
- [Multimersearch](#multimersearch)
- [Using Multimersearch](#using-multimersearch)
- [Multimer Search Output](#multimer-search-output)
- [Tab-separated-complex](#tab-separated-complex)
- [Complex Report](#complex-report)
- [Multimercluster](#multimercluster)
- [Output MultimerCluster](#output-multimercluster)
- [Tab-separated multimercluster](#tab-separated-multimercluster)
- [Representative multimer fasta](#representative-multimer-fasta)
- [Filtered search result](#filtered-search-result)
- [Important multimer cluster parameters](#important-multimer-cluster-parameters)
- [Main Modules](#main-modules)
- [Examples](#examples)
- [Rescore aligments using TMscore](#rescore-aligments-using-tmscore)
- [Query centered multiple sequence alignment](#query-centered-multiple-sequence-alignment)

## Webserver
Search your protein structures against the [AlphaFoldDB](https://alphafold.ebi.ac.uk/) and [PDB](https://www.rcsb.org/) in seconds using the Foldseek webserver ([code](https://github.com/soedinglab/mmseqs2-app)): [search.foldseek.com](https://search.foldseek.com) 🚀
@@ -238,6 +257,7 @@ MCAR...Q
| --cov-mode | Alignment | 0: coverage of query and target, 1: coverage of target, 2: coverage of query |
| --min-seq-id | Alignment | the minimum sequence identity to be clustered |
| --tmscore-threshold | Alignment | accept alignments with an alignment TMscore > thr |
| --tmscore-threshold-mode | Alignment | normalize TMscore by 0: alignment, 1: representative, 2: member length |
| --lddt-threshold | Alignment | accept alignments with an alignment LDDT score > thr |


@@ -300,9 +320,64 @@ The default output fields are: `query,target,fident,alnlen,mismatch,gapopen,qsta
1tim.pdb.gz 8tim.pdb.gz A,B A,B 0.98941 0.98941 0.999983,0.000332,0.005813,-0.000373,0.999976,0.006884,-0.005811,-0.006886,0.999959 0.298992,0.060047,0.565875 0
```

### Multimercluster
The `easy-multimercluster` module performs multimer-level structural clustering (supported input formats: PDB/mmCIF, flat or gzipped). By default, `easy-multimercluster` writes three output files, each named with the given output prefix followed by one of these suffixes: (1) `_cluster.tsv`, (2) `_rep_seq.fasta` and (3) `_cluster_report`. The first file (1) is a [tab-separated](#tab-separated-multimercluster) file that maps each representative multimer to its members, the second file (2) contains only the [representative sequences](#representative-multimer-fasta), and the third file (3) is a [tab-separated](#filtered-search-result) file describing the filtered alignments.

Make sure that chain names in PDB/mmCIF files do not contain underscores (`_`).

foldseek easy-multimercluster example/ clu tmp --multimer-tm-threshold 0.65 --chain-tm-threshold 0.5 --interface-lddt-threshold 0.65
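
A likely reason for the underscore rule, judging by the `postprocessFasta` helper in `data/easymultimercluster.sh`, is that the multimer name is recovered from each FASTA header by splitting on `_` and dropping the last field. The sketch below imitates that idea with a small hypothetical function (`multimer_name` is not part of Foldseek, and the real helper may differ in detail); it shows how a chain name containing an underscore would shift the split point.

```shell
#!/bin/sh
# Hypothetical helper (not part of Foldseek): derive the multimer name from a
# header of the form <multimer>_<chain> by dropping the last "_"-separated field.
multimer_name() {
    printf '%s\n' "$1" | awk '{
        n = split($0, parts, "_")
        name = parts[1]
        # Re-join every field except the last one (assumed to be the chain).
        for (j = 2; j < n; j++) name = name "_" parts[j]
        print name
    }'
}

multimer_name "5o002_A"      # prints 5o002
multimer_name "my_model_B"   # prints my_model
multimer_name "5o002_A_1"    # prints 5o002_A: a chain named "A_1" is misread
```

Underscores in the multimer name itself are handled by this kind of split; only an underscore in the chain name makes the result ambiguous.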

#### Output MultimerCluster
##### Tab-separated multimercluster
```
5o002 5o002
194l2 194l2
194l2 193l2
10mh121 10mh121
10mh121 10mh114
10mh121 10mh119
```
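
As a quick way to inspect this file, the following sketch counts the members of each cluster. It assumes the two columns are tab-separated (representative, then member); `cluster_sizes` is an illustrative name, not a Foldseek command.

```shell
#!/bin/sh
# Illustrative helper: print "<representative><TAB><member count>" per cluster.
# Reads the files given as arguments, or standard input if none are given.
cluster_sizes() {
    awk -F'\t' '
        { count[$1]++ }                                   # column 1 = representative
        END { for (rep in count) print rep "\t" count[rep] }
    ' "$@" | LC_ALL=C sort
}

# Demo on the six example rows from the listing above.
printf '5o002\t5o002\n194l2\t194l2\n194l2\t193l2\n10mh121\t10mh121\n10mh121\t10mh114\n10mh121\t10mh119\n' \
    | cluster_sizes
# prints:
# 10mh121	3
# 194l2	2
# 5o002	1
```

With the example command above, the same function could be run as `cluster_sizes clu_cluster.tsv`.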
##### Representative multimer fasta
```
#5o002
>5o002_A
SHGK...R
>5o002_B
SHGK...R
#194l2
>194l2_A0
KVFG...L
>194l2_A6
KVFG...L
#10mh121
...
```
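
The `#` lines group the chain entries of one representative multimer. A minimal sketch for summarising the file, assuming exactly this layout (`#<multimer>` followed by `>`-prefixed chain headers); `chains_per_multimer` is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: print "<multimer><TAB><number of chains>" for each
# "#"-delimited block of a representative multimer FASTA file.
chains_per_multimer() {
    awk '
        /^#/ { if (name != "") print name "\t" n      # flush the previous block
               name = substr($0, 2); n = 0; next }
        /^>/ { n++ }                                   # one header per chain
        END  { if (name != "") print name "\t" n }
    ' "$@"
}

# Demo on the first two blocks of the example above.
printf '%s\n' '#5o002' '>5o002_A' 'SHGK...R' '>5o002_B' 'SHGK...R' \
              '#194l2' '>194l2_A0' 'KVFG...L' '>194l2_A6' 'KVFG...L' \
    | chains_per_multimer
# prints:
# 5o002	2
# 194l2	2
```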
##### Filtered search result
The `_cluster_report` file lists the alignments that remain after filtering, before clustering. Each row starts with the two aligned multimer identifiers, followed by `qcoverage, tcoverage, multimer qTm, multimer tTm, interface lddt, ustring, tstring`.
```
5o0f2 5o0f2 1.000 1.000 1.000 1.000 1.000 1.000,0.000,0.000,0.000,1.000,0.000,0.000,0.000,1.000 0.000,0.000,0.000
5o0f2 5o0d2 1.000 1.000 0.999 0.992 1.000 0.999,0.000,-0.000,-0.000,0.999,-0.000,0.000,0.000,0.999 -0.004,-0.001,0.084
5o0f2 5o082 1.000 0.990 0.978 0.962 0.921 0.999,-0.025,-0.002,0.025,0.999,-0.001,0.002,0.001,0.999 -0.039,0.000,-0.253
```
The query and target coverages here represent the sum of the coverages of all aligned chains, divided by the total query and target multimer length respectively.
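
The report can be post-filtered with standard tools. The sketch below keeps only rows whose interface LDDT reaches a chosen cutoff; it assumes tab-separated columns in the order shown above (identifiers in columns 1 and 2, interface LDDT in column 7), and `report_above` is an illustrative name rather than a Foldseek command.

```shell
#!/bin/sh
# Illustrative helper: report_above THRESHOLD [FILE...]
# Prints query, target and interface LDDT for rows with LDDT >= THRESHOLD.
report_above() {
    thr="$1"; shift
    awk -F'\t' -v thr="$thr" '
        $7 + 0 >= thr + 0 { print $1 "\t" $2 "\t" $7 }   # column 7 = interface lddt
    ' "$@"
}

# Demo on the three example rows above (printf repeats the format per row).
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    5o0f2 5o0f2 1.000 1.000 1.000 1.000 1.000 '1.000,0.000,0.000,0.000,1.000,0.000,0.000,0.000,1.000' '0.000,0.000,0.000' \
    5o0f2 5o0d2 1.000 1.000 0.999 0.992 1.000 '0.999,0.000,-0.000,-0.000,0.999,-0.000,0.000,0.000,0.999' '-0.004,-0.001,0.084' \
    5o0f2 5o082 1.000 0.990 0.978 0.962 0.921 '0.999,-0.025,-0.002,0.025,0.999,-0.001,0.002,0.001,0.999' '-0.039,0.000,-0.253' \
    | report_above 0.95
# prints:
# 5o0f2	5o0f2	1.000
# 5o0f2	5o0d2	1.000
```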

#### Important multimer cluster parameters

| Option | Category | Description |
|-------------------|-----------------|-----------------------------------------------------------------------------------------------------------|
| -e | Sensitivity | List matches below this E-value (range 0.0-inf, default: 0.001); increasing it reports more distant structures |
| --alignment-type| Alignment | 0: 3Di Gotoh-Smith-Waterman (local, not recommended), 1: TMalign (global, slow), 2: 3Di+AA Gotoh-Smith-Waterman (local, default) |
| -c | Alignment | List matches above this fraction of aligned (covered) residues (see --cov-mode) (default: 0.0); higher coverage = more global alignment |
| --cov-mode | Alignment | 0: coverage of query and target (cluster multimers only with same chain numbers), 1: coverage of target, 2: coverage of query |
| --multimer-tm-threshold | Alignment | accept alignments with multimer alignment TMscore > thr |
| --chain-tm-threshold | Alignment | accept alignments if every single chain TMscore > thr |
| --interface-lddt-threshold | Alignment | accept alignments with an interface LDDT score > thr |

## Main Modules
- `easy-search` fast protein structure search
- `easy-cluster` fast protein structure clustering
- `easy-multimersearch` fast protein multimer-level structure search
- `easy-multimercluster` fast protein multimer-level structure clustering
- `createdb` create a database from protein structures (PDB, mmCIF, mmJSON)
- `databases` download pre-assembled databases

@@ -324,16 +399,6 @@ foldseek createtsv queryDB targetDB aln_tmscore aln_tmscore.tsv

Output format `aln_tmscore.tsv`: query and target identifiers, TMscore, translation(3) and rotation vector=(3x3)

### Cluster search results
The following command performs an all-against-all alignments of the input structures and retains only the alignments, which cover 80% of the sequence (-c 0.8) (read more about alignment coverage options [here](https://github.com/soedinglab/MMseqs2/wiki#how-to-set-the-right-alignment-coverage-to-cluster)). It then clusters the results using a greedy set cover algorithm. The clustering mode can be adjusted using --cluster-mode, read more [here](https://github.com/soedinglab/MMseqs2/wiki#clustering-modes). The clustering output format is described [here](https://github.com/soedinglab/MMseqs2/wiki#cluster-tsv-format).

```
foldseek createdb example/ db
foldseek search db db aln tmpFolder -c 0.8
foldseek clust db aln clu
foldseek createtsv db db clu clu.tsv
```

### Query centered multiple sequence alignment
Foldseek can output multiple sequence alignments in a3m format using the following commands.
To convert a3m to FASTA format, the following script can be used [reformat.pl](https://raw.githubusercontent.com/soedinglab/hh-suite/master/scripts/reformat.pl) (`reformat.pl in.a3m out.fas`).
8 changes: 4 additions & 4 deletions lib/foldseek/azure-pipelines.yml
@@ -121,10 +121,10 @@ jobs:
targetPath: $(Build.SourcesDirectory)/build/src/foldseek
artifactName: foldseek-linux-$(SIMD)

- job: build_macos_11
displayName: macOS 11
- job: build_macos
displayName: macOS
pool:
vmImage: 'macos-11'
vmImage: 'macos-12'
steps:
- checkout: self
submodules: true
@@ -153,7 +153,7 @@ jobs:
pool:
vmImage: 'ubuntu-latest'
dependsOn:
- build_macos_11
- build_macos
- build_ubuntu_2004
- build_ubuntu_cross_2004
steps:
2 changes: 2 additions & 0 deletions lib/foldseek/data/CMakeLists.txt
@@ -15,6 +15,8 @@ set(COMPILED_RESOURCES
vendor.js.zst
multimersearch.sh
easymultimersearch.sh
multimercluster.sh
easymultimercluster.sh
)

set(GENERATED_OUTPUT_HEADERS "")
163 changes: 163 additions & 0 deletions lib/foldseek/data/easymultimercluster.sh
@@ -0,0 +1,163 @@
#!/bin/sh -e
fail() {
echo "Error: $1"
exit 1
}

notExists() {
[ ! -f "$1" ]
}

exists() {
[ -f "$1" ]
}

abspath() {
if [ -d "$1" ]; then
(cd "$1"; pwd)
elif [ -f "$1" ]; then
if [ -z "${1##*/*}" ]; then
echo "$(cd "${1%/*}"; pwd)/${1##*/}"
else
echo "$(pwd)/$1"
fi
elif [ -d "$(dirname "$1")" ]; then
echo "$(cd "$(dirname "$1")"; pwd)/$(basename "$1")"
fi
}

mapCmplName2ChainKeys() {
awk -F"\t" 'FNR==1 {++fIndex}
fIndex==1 {
repName[$1]=1
if (match($1, /MODEL/)){
tmpName[$1]=1
}else{
tmpName[$1"_MODEL_1"]=1
}
next
}
fIndex==2{
if (match($2, /MODEL/)){
if ($2 in tmpName){
repId[$1]=1
}else{
ho[1]=1
}
}else{
if ($2 in repName){
repId[$1]=1
}
}
next
}
{
if ($3 in repId){
print $1
}
}
' "${1}" "${2}.source" "${2}.lookup" > "${3}"
}

postprocessFasta() {
awk ' BEGIN {FS=">"}
$0 ~/^>/ {
# match($2, /(.*).pdb*/)
split($2,parts,"_")
complex=""
for (j = 1; j < length(parts); j++) {
complex = complex parts[j]
if (j < length(parts)-1){
complex=complex"_"
}
}
if (!(complex in repComplex)) {
print "#"complex
repComplex[complex] = ""
}
}
{print $0}
' "${1}" > "${1}.tmp" && mv "${1}.tmp" "${1}"
}

if notExists "${TMP_PATH}/query.dbtype"; then
# shellcheck disable=SC2086
"$MMSEQS" createdb "${INPUT}" "${TMP_PATH}/query" ${CREATEDB_PAR} \
|| fail "query createdb died"
fi

if notExists "${TMP_PATH}/multimer_clu.dbtype"; then
# shellcheck disable=SC2086
"$MMSEQS" multimercluster "${TMP_PATH}/query" "${TMP_PATH}/multimer_clu" "${TMP_PATH}" ${MULTIMERCLUSTER_PAR} \
|| fail "Multimercluster died"
fi

SOURCE="${TMP_PATH}/query"
INPUT="${TMP_PATH}/latest/multimer_db"
if notExists "${TMP_PATH}/cluster.tsv"; then
# shellcheck disable=SC2086
"$MMSEQS" createtsv "${INPUT}" "${INPUT}" "${TMP_PATH}/multimer_clu" "${TMP_PATH}/cluster.tsv" ${THREADS_PAR} \
|| fail "Convert Alignments died"
# shellcheck disable=SC2086
"$MMSEQS" createtsv "${INPUT}" "${INPUT}" "${TMP_PATH}/multimer_clu_filt_info" "${TMP_PATH}/cluster_report" ${THREADS_PAR} \
|| fail "Convert Alignments died"
fi

if notExists "${TMP_PATH}/multimer_rep_seqs.dbtype"; then
mapCmplName2ChainKeys "${TMP_PATH}/cluster.tsv" "${SOURCE}" "${TMP_PATH}/rep_seqs.list"
# shellcheck disable=SC2086
"$MMSEQS" createsubdb "${TMP_PATH}/rep_seqs.list" "${SOURCE}" "${TMP_PATH}/multimer_rep_seqs" ${CREATESUBDB_PAR} \
|| fail "createsubdb died"
fi

if notExists "${TMP_PATH}/multimer_rep_seq.fasta"; then
# shellcheck disable=SC2086
"$MMSEQS" result2flat "${SOURCE}" "${SOURCE}" "${TMP_PATH}/multimer_rep_seqs" "${TMP_PATH}/multimer_rep_seq.fasta" ${VERBOSITY_PAR} \
|| fail "result2flat died"
postprocessFasta "${TMP_PATH}/multimer_rep_seq.fasta"
fi

#TODO: generate fasta file for all sequences
# if notExists "${TMP_PATH}/multimer_all_seqs.fasta"; then
# # shellcheck disable=SC2086
# "$MMSEQS" createseqfiledb "${INPUT}" "${TMP_PATH}/multimer_clu" "${TMP_PATH}/multimer_clust_seqs" ${THREADS_PAR} \
# || fail "Result2repseq died"

# # shellcheck disable=SC2086
# "$MMSEQS" result2flat "${INPUT}" "${INPUT}" "${TMP_PATH}/multimer_clust_seqs" "${TMP_PATH}/multimer_all_seqs.fasta" ${VERBOSITY_PAR} \
# || fail "result2flat died"
# fi

# mv "${TMP_PATH}/multimer_all_seqs.fasta" "${RESULT}_all_seqs.fasta"
mv "${TMP_PATH}/multimer_rep_seq.fasta" "${RESULT}_rep_seq.fasta"
mv "${TMP_PATH}/cluster.tsv" "${RESULT}_cluster.tsv"
mv "${TMP_PATH}/cluster_report" "${RESULT}_cluster_report"

if [ -n "${REMOVE_TMP}" ]; then
rm "${INPUT}.0"
# shellcheck disable=SC2086
"$MMSEQS" rmdb "${TMP_PATH}/multimer_db" ${VERBOSITY_PAR}
# shellcheck disable=SC2086
# "$MMSEQS" rmdb "${TMP_PATH}/multimer_clu_seqs" ${VERBOSITY_PAR}
# shellcheck disable=SC2086
"$MMSEQS" rmdb "${TMP_PATH}/multimer_rep_seqs" ${VERBOSITY_PAR}
# shellcheck disable=SC2086
"$MMSEQS" rmdb "${TMP_PATH}/multimer_rep_seqs_h" ${VERBOSITY_PAR}
# shellcheck disable=SC2086
"$MMSEQS" rmdb "${TMP_PATH}/complex_clu" ${VERBOSITY_PAR}
# shellcheck disable=SC2086
"$MMSEQS" rmdb "${TMP_PATH}/query" ${VERBOSITY_PAR}
# shellcheck disable=SC2086
"$MMSEQS" rmdb "${TMP_PATH}/query_h" ${VERBOSITY_PAR}
# shellcheck disable=SC2086
"$MMSEQS" rmdb "${INPUT}" ${VERBOSITY_PAR}
# shellcheck disable=SC2086
"$MMSEQS" rmdb "${INPUT}_h" ${VERBOSITY_PAR}
# shellcheck disable=SC2086
"$MMSEQS" rmdb "${TMP_PATH}/query_ca" ${VERBOSITY_PAR}
# shellcheck disable=SC2086
"$MMSEQS" rmdb "${TMP_PATH}/query_ss" ${VERBOSITY_PAR}
rm "${TMP_PATH}/rep_seqs.list"
rm -rf "${TMP_PATH}/latest"
rm -f "${TMP_PATH}/easymultimercluster.sh"
fi
6 changes: 6 additions & 0 deletions lib/foldseek/data/easystructuresearch.sh
@@ -51,6 +51,12 @@ if notExists "${TMP_PATH}/alis.dbtype"; then
|| fail "Convert Alignments died"
fi

if [ -n "${TAXONOMY}" ]; then
# shellcheck disable=SC2086
"$MMSEQS" taxonomyreport "${TARGET}${INDEXEXT}" "${INTERMEDIATE}" "${RESULTS}_report" ${TAXONOMYREPORT_PAR} \
|| fail "taxonomyreport died"
fi

if [ -n "${REMOVE_TMP}" ]; then
if [ -n "${GREEDY_BEST_HITS}" ]; then
# shellcheck disable=SC2086