Merge commit '38a7b3cc53d6d65dbdfd65268ab07bd0951b446e'

steineggerlab · Dec 4, 2024 · 1c2a03b
2 parents ecae81c + 38a7b3c
commit 1c2a03b
Showing 58 changed files with 2,586 additions and 386 deletions.
diff --git a/lib/foldseek/.github/workflows/mac-arm64.yml b/lib/foldseek/.github/workflows/mac-arm64.yml
@@ -8,16 +8,20 @@ on:
 
 jobs:
   build:
-    runs-on: [self-hosted, macOS, ARM64]
+    runs-on: macos-latest
     steps:
       - uses: actions/checkout@v3
         with:
           submodules: true
 
+      - name: Dependencies
+        run: |
+          brew install -f --overwrite cmake libomp rustup
+          rustup-init --profile minimal -q -y
+
       - name: Build
         run: |
           mkdir -p build
-          rustup update
           cd build
           LIBOMP=$(brew --prefix libomp)
           cmake \

diff --git a/lib/foldseek/README.md b/lib/foldseek/README.md
@@ -14,26 +14,45 @@ Foldseek enables fast and sensitive comparisons of large protein structure sets.
 # Table of Contents
 
 - [Foldseek](#foldseek)
-- [Webserver](#webserver)
-- [Installation](#installation)
-- [Memory requirements](#memory-requirements)
-- [Tutorial Video](#tutorial-video)
-- [Documentation](#documentation)
-- [Quick Start](#quick-start)
-  - [Search](#search)
-    - [Output](#output-search)
-    - [Important Parameters](#important-search-parameters)
-    - [Alignment Mode](#alignment-mode)
-    - [Structure search from FASTA input](#structure-search-from-fasta-input)
-  - [Databases](#databases)
-    - [Create Custom Databases and Indexes](#create-custom-databases-and-indexes)
-  - [Cluster](#cluster)
-    - [Output](#output-cluster)
-    - [Important Parameters](#important-cluster-parameters)
-  - [Multimer](#multimersearch)
-    - [Output](#multimer-search-output)
-- [Main Modules](#main-modules)
-- [Examples](#examples)
+  - [Publications](#publications)
+- [Table of Contents](#table-of-contents)
+  - [Webserver](#webserver)
+  - [Installation](#installation)
+  - [Memory requirements](#memory-requirements)
+  - [Tutorial Video](#tutorial-video)
+  - [Documentation](#documentation)
+  - [Quick start](#quick-start)
+    - [Search](#search)
+      - [Output Search](#output-search)
+        - [Tab-separated](#tab-separated)
+        - [Superpositioned Cα only PDB files](#superpositioned-cα-only-pdb-files)
+        - [Interactive HTML](#interactive-html)
+      - [Important search parameters](#important-search-parameters)
+      - [Alignment Mode](#alignment-mode)
+      - [Structure search from FASTA input](#structure-search-from-fasta-input)
+    - [Databases](#databases)
+      - [Create custom databases and indexes](#create-custom-databases-and-indexes)
+    - [Cluster](#cluster)
+      - [Output Cluster](#output-cluster)
+        - [Tab-separated cluster](#tab-separated-cluster)
+        - [Representative fasta](#representative-fasta)
+        - [All member fasta](#all-member-fasta)
+      - [Important cluster parameters](#important-cluster-parameters)
+    - [Multimersearch](#multimersearch)
+      - [Using Multimersearch](#using-multimersearch)
+      - [Multimer Search Output](#multimer-search-output)
+        - [Tab-separated-complex](#tab-separated-complex)
+        - [Complex Report](#complex-report)
+    - [Multimercluster](#multimercluster)
+      - [Output MultimerCluster](#output-multimercluster)
+        - [Tab-separated multimercluster](#tab-separated-multimercluster)
+        - [Representative multimer fasta](#representative-multimer-fasta)
+        - [Filtered search result](#filtered-search-result)
+      - [Important multimer cluster parameters](#important-multimer-cluster-parameters)
+  - [Main Modules](#main-modules)
+  - [Examples](#examples)
+    - [Rescore aligments using TMscore](#rescore-aligments-using-tmscore)
+    - [Query centered multiple sequence alignment](#query-centered-multiple-sequence-alignment)
 
 ## Webserver 
 Search your protein structures against the [AlphaFoldDB](https://alphafold.ebi.ac.uk/) and [PDB](https://www.rcsb.org/) in seconds using the Foldseek webserver ([code](https://github.com/soedinglab/mmseqs2-app)): [search.foldseek.com](https://search.foldseek.com) 🚀
@@ -238,6 +257,7 @@ MCAR...Q
 | --cov-mode      | Alignment  | 0: coverage of query and target, 1: coverage of target, 2: coverage of query                               |
 | --min-seq-id      | Alignment  | the minimum sequence identity to be clustered                               |
 | --tmscore-threshold      | Alignment  | accept alignments with an alignment TMscore > thr                               |
+| --tmscore-threshold-mode    | Alignment  | normalize TMscore by 0: alignment, 1: representative, 2: member length                             |
 | --lddt-threshold      | Alignment  | accept alignments with an alignment LDDT score > thr                               |
 
 
@@ -300,9 +320,64 @@ The default output fields are: `query,target,fident,alnlen,mismatch,gapopen,qsta
 1tim.pdb.gz 8tim.pdb.gz A,B A,B 0.98941 0.98941 0.999983,0.000332,0.005813,-0.000373,0.999976,0.006884,-0.005811,-0.006886,0.999959 0.298992,0.060047,0.565875  0
 ```
 
+### Multimercluster
+The `easy-multimercluster` module performs multimer-level structural clustering (supported input formats: PDB/mmCIF, flat or gzipped). By default, `easy-multimercluster` writes three output files with the following suffixes: (1) `_cluster.tsv`, (2) `_rep_seq.fasta` and (3) `_cluster_report`. The first file (1) is a [tab-separated](#tab-separated-multimercluster) file mapping each representative multimer to its members, while the second file (2) contains only the [representative sequences](#representative-multimer-fasta). The third file (3) is also a [tab-separated](#filtered-search-result) file describing the filtered alignments.
+
+Make sure chain names in PDB/mmCIF files do not contain underscores (`_`).
+
+    foldseek easy-multimercluster example/ clu tmp --multimer-tm-threshold 0.65 --chain-tm-threshold 0.5 --interface-lddt-threshold 0.65
+
+#### Output MultimerCluster
+##### Tab-separated multimercluster
+```
+5o002	   5o002
+194l2	   194l2
+194l2	   193l2
+10mh121	 10mh121
+10mh121	 10mh114
+10mh121	 10mh119
+```
+##### Representative multimer fasta
+```
+#5o002
+>5o002_A
+SHGK...R
+>5o002_B
+SHGK...R
+#194l2
+>194l2_A0
+KVFG...L
+>194l2_A6
+KVFG...L
+#10mh121
+...
+```
+##### Filtered search result
+For each alignment that passes filtering (before clustering), the `_cluster_report` lists the query and target identifiers followed by `qcoverage, tcoverage, multimer qTm, multimer tTm, interface lddt, ustring, tstring`.
+```
+5o0f2	5o0f2	1.000	1.000	1.000	1.000	1.000	1.000,0.000,0.000,0.000,1.000,0.000,0.000,0.000,1.000	0.000,0.000,0.000
+5o0f2	5o0d2	1.000	1.000	0.999	0.992	1.000	0.999,0.000,-0.000,-0.000,0.999,-0.000,0.000,0.000,0.999	-0.004,-0.001,0.084
+5o0f2	5o082	1.000	0.990	0.978	0.962	0.921	0.999,-0.025,-0.002,0.025,0.999,-0.001,0.002,0.001,0.999	-0.039,0.000,-0.253
+```
+The query and target coverages here represent the sum of the coverages of all aligned chains, divided by the total query and target multimer length respectively.
+
+#### Important multimer cluster parameters
+
+| Option            | Category        | Description                                                                                               |
+|-------------------|-----------------|-----------------------------------------------------------------------------------------------------------|
+| -e              | Sensitivity     | List matches below this E-value (range 0.0-inf, default: 0.001); increasing it reports more distant structures |
+| --alignment-type| Alignment       | 0: 3Di Gotoh-Smith-Waterman (local, not recommended), 1: TMalign (global, slow), 2: 3Di+AA Gotoh-Smith-Waterman (local, default) |
+| -c              | Alignment  | List matches above this fraction of aligned (covered) residues (see --cov-mode) (default: 0.0); higher coverage = more global alignment |
+| --cov-mode      | Alignment  | 0: coverage of query and target (cluster multimers only with same chain numbers), 1: coverage of target, 2: coverage of query |
+| --multimer-tm-threshold      | Alignment  | accept alignments with multimer alignment TMscore > thr |
+| --chain-tm-threshold      | Alignment  | accept alignments if every single chain TMscore > thr |
+| --interface-lddt-threshold      | Alignment  | accept alignments with an interface LDDT score > thr |
+
 ## Main Modules
 - `easy-search`       fast protein structure search  
 - `easy-cluster`      fast protein structure clustering  
+- `easy-multimersearch`       fast protein multimer-level structure search  
+- `easy-multimercluster`       fast protein multimer-level structure clustering  
 - `createdb`          create a database from protein structures (PDB,mmCIF, mmJSON)
 - `databases`         download pre-assembled databases
 
@@ -324,16 +399,6 @@ foldseek createtsv queryDB targetDB aln_tmscore aln_tmscore.tsv
 
 Output format `aln_tmscore.tsv`: query and target identifiers, TMscore, translation(3) and rotation vector=(3x3)
 
-### Cluster search results 
-The following command performs an all-against-all alignments of the input structures and retains only the alignments, which cover 80% of the sequence (-c 0.8) (read more about alignment coverage options [here](https://github.com/soedinglab/MMseqs2/wiki#how-to-set-the-right-alignment-coverage-to-cluster)). It then clusters the results using a greedy set cover algorithm. The clustering mode can be adjusted using --cluster-mode, read more [here](https://github.com/soedinglab/MMseqs2/wiki#clustering-modes). The clustering output format is described [here](https://github.com/soedinglab/MMseqs2/wiki#cluster-tsv-format).
-
-```
-foldseek createdb example/ db
-foldseek search db db aln tmpFolder -c 0.8 
-foldseek clust db aln clu
-foldseek createtsv db db clu clu.tsv
-```
-
 ### Query centered multiple sequence alignment 
 Foldseek can output multiple sequence alignments in a3m format using the following commands. 
 To convert a3m to FASTA format, the following script can be used [reformat.pl](https://raw.githubusercontent.com/soedinglab/hh-suite/master/scripts/reformat.pl) (`reformat.pl in.a3m out.fas`).
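
Note on the `_cluster.tsv` output documented in the Multimercluster hunk above: each row pairs a representative multimer with one member. A minimal sketch of how such a file could be summarized per cluster is given here. It assumes exactly two tab-separated columns and strips the space padding that appears in the README sample; the helper name `summarize_clusters` is hypothetical and not part of Foldseek.

```sh
#!/bin/sh
# Hypothetical helper (not part of Foldseek): print one line per cluster as
#   representative <TAB> member count <TAB> comma-separated members
# Assumes the input has two tab-separated columns: representative, member.
summarize_clusters() {
    awk -F'\t' '
        {
            rep = $1; mem = $2
            # Drop space padding around the identifiers
            gsub(/^ +| +$/, "", rep)
            gsub(/^ +| +$/, "", mem)
            if (!(rep in size)) {
                # First time this representative is seen: remember input order
                order[++n] = rep
                members[rep] = mem
            } else {
                members[rep] = members[rep] "," mem
            }
            size[rep]++
        }
        END {
            for (i = 1; i <= n; i++) {
                printf "%s\t%d\t%s\n", order[i], size[order[i]], members[order[i]]
            }
        }
    ' "$1"
}

# Usage, with the "clu" prefix from the README command:
#   summarize_clusters clu_cluster.tsv
```

Clusters are reported in the order their representatives first appear, so the summary stays aligned with the original file.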

diff --git a/lib/foldseek/azure-pipelines.yml b/lib/foldseek/azure-pipelines.yml
@@ -121,10 +121,10 @@ jobs:
           targetPath: $(Build.SourcesDirectory)/build/src/foldseek
           artifactName: foldseek-linux-$(SIMD)
 
-  - job: build_macos_11
-    displayName: macOS 11
+  - job: build_macos
+    displayName: macOS
     pool:
-      vmImage: 'macos-11'
+      vmImage: 'macos-12'
     steps:
       - checkout: self
         submodules: true
@@ -153,7 +153,7 @@ jobs:
     pool:
       vmImage: 'ubuntu-latest'
     dependsOn:
-      - build_macos_11
+      - build_macos
       - build_ubuntu_2004
       - build_ubuntu_cross_2004
     steps:

diff --git a/lib/foldseek/data/CMakeLists.txt b/lib/foldseek/data/CMakeLists.txt
@@ -15,6 +15,8 @@ set(COMPILED_RESOURCES
         vendor.js.zst
         multimersearch.sh
         easymultimersearch.sh
+        multimercluster.sh
+        easymultimercluster.sh
         )
 
 set(GENERATED_OUTPUT_HEADERS "")

diff --git a/lib/foldseek/data/easymultimercluster.sh b/lib/foldseek/data/easymultimercluster.sh
@@ -0,0 +1,163 @@
+#!/bin/sh -e
+fail() {
+    echo "Error: $1"
+    exit 1
+}
+
+notExists() {
+	[ ! -f "$1" ]
+}
+
+exists() {
+	[ -f "$1" ]
+}
+
+abspath() {
+    if [ -d "$1" ]; then
+        (cd "$1"; pwd)
+    elif [ -f "$1" ]; then
+        if [ -z "${1##*/*}" ]; then
+            echo "$(cd "${1%/*}"; pwd)/${1##*/}"
+        else
+            echo "$(pwd)/$1"
+        fi
+    elif [ -d "$(dirname "$1")" ]; then
+        echo "$(cd "$(dirname "$1")"; pwd)/$(basename "$1")"
+    fi
+}
+
+mapCmplName2ChainKeys() {
+    awk -F"\t" 'FNR==1 {++fIndex}
+        fIndex==1 {
+            repName[$1]=1
+            if (match($1, /MODEL/)){
+                tmpName[$1]=1
+            }else{
+                tmpName[$1"_MODEL_1"]=1 
+            }
+            next
+        }
+        fIndex==2{
+            if (match($2, /MODEL/)){
+                if ($2 in tmpName){
+                repId[$1]=1
+                }else{
+                    ho[1]=1
+                }
+            }else{
+                if ($2 in repName){
+                repId[$1]=1
+                }
+            }
+            next
+        }
+        {
+            if ($3 in repId){
+                print $1
+            }
+        }
+    ' "${1}" "${2}.source" "${2}.lookup" > "${3}"
+}
+
+postprocessFasta() {
+    awk ' BEGIN {FS=">"}
+    $0 ~/^>/ {
+        # match($2, /(.*).pdb*/)
+        split($2,parts,"_")
+        complex=""
+        for (j = 1; j < length(parts); j++) {
+            complex = complex parts[j]
+            if (j < length(parts)-1){
+                complex=complex"_" 
+            }
+        }
+        if (!(complex in repComplex)) {
+            print "#"complex
+            repComplex[complex] = ""
+        }
+    }
+    {print $0}
+    ' "${1}" > "${1}.tmp" && mv "${1}.tmp" "${1}"
+}
+
+if notExists "${TMP_PATH}/query.dbtype"; then
+    # shellcheck disable=SC2086
+    "$MMSEQS" createdb "${INPUT}" "${TMP_PATH}/query" ${CREATEDB_PAR} \
+        || fail "query createdb died"
+fi
+
+if notExists "${TMP_PATH}/multimer_clu.dbtype"; then
+    # shellcheck disable=SC2086
+    "$MMSEQS" multimercluster "${TMP_PATH}/query" "${TMP_PATH}/multimer_clu" "${TMP_PATH}" ${MULTIMERCLUSTER_PAR} \
+        || fail "Multimercluster died"
+fi
+
+SOURCE="${TMP_PATH}/query"
+INPUT="${TMP_PATH}/latest/multimer_db"
+if notExists "${TMP_PATH}/cluster.tsv"; then
+    # shellcheck disable=SC2086
+    "$MMSEQS" createtsv "${INPUT}" "${INPUT}" "${TMP_PATH}/multimer_clu" "${TMP_PATH}/cluster.tsv" ${THREADS_PAR}   \
+        || fail "Convert Alignments died"
+    # shellcheck disable=SC2086
+    "$MMSEQS" createtsv "${INPUT}" "${INPUT}" "${TMP_PATH}/multimer_clu_filt_info" "${TMP_PATH}/cluster_report" ${THREADS_PAR}  \
+        || fail "Convert Alignments died"
+fi
+
+if notExists "${TMP_PATH}/multimer_rep_seqs.dbtype"; then
+    mapCmplName2ChainKeys "${TMP_PATH}/cluster.tsv" "${SOURCE}" "${TMP_PATH}/rep_seqs.list" 
+    # shellcheck disable=SC2086
+    "$MMSEQS" createsubdb "${TMP_PATH}/rep_seqs.list" "${SOURCE}" "${TMP_PATH}/multimer_rep_seqs" ${CREATESUBDB_PAR} \
+        || fail "createsubdb died"
+fi
+
+if notExists "${TMP_PATH}/multimer_rep_seq.fasta"; then
+    # shellcheck disable=SC2086
+    "$MMSEQS" result2flat "${SOURCE}" "${SOURCE}"  "${TMP_PATH}/multimer_rep_seqs" "${TMP_PATH}/multimer_rep_seq.fasta" ${VERBOSITY_PAR} \
+            || fail "result2flat died"
+    postprocessFasta "${TMP_PATH}/multimer_rep_seq.fasta"
+fi
+
+#TODO: generate fasta file for all sequences
+# if notExists "${TMP_PATH}/multimer_all_seqs.fasta"; then
+#     # shellcheck disable=SC2086
+#     "$MMSEQS" createseqfiledb "${INPUT}" "${TMP_PATH}/multimer_clu" "${TMP_PATH}/multimer_clust_seqs" ${THREADS_PAR} \
+#             || fail "Result2repseq  died"
+
+#     # shellcheck disable=SC2086
+#     "$MMSEQS" result2flat "${INPUT}" "${INPUT}" "${TMP_PATH}/multimer_clust_seqs" "${TMP_PATH}/multimer_all_seqs.fasta" ${VERBOSITY_PAR} \
+#             || fail "result2flat died"
+# fi
+
+# mv "${TMP_PATH}/multimer_all_seqs.fasta"  "${RESULT}_all_seqs.fasta"
+mv "${TMP_PATH}/multimer_rep_seq.fasta"  "${RESULT}_rep_seq.fasta"
+mv "${TMP_PATH}/cluster.tsv"  "${RESULT}_cluster.tsv"
+mv "${TMP_PATH}/cluster_report"  "${RESULT}_cluster_report"
+
+if [ -n "${REMOVE_TMP}" ]; then
+    rm "${INPUT}.0"
+    # shellcheck disable=SC2086
+    "$MMSEQS" rmdb "${TMP_PATH}/multimer_db" ${VERBOSITY_PAR}
+    # shellcheck disable=SC2086
+    # "$MMSEQS" rmdb "${TMP_PATH}/multimer_clu_seqs" ${VERBOSITY_PAR}
+    # shellcheck disable=SC2086
+    "$MMSEQS" rmdb "${TMP_PATH}/multimer_rep_seqs" ${VERBOSITY_PAR}
+    # shellcheck disable=SC2086
+    "$MMSEQS" rmdb "${TMP_PATH}/multimer_rep_seqs_h" ${VERBOSITY_PAR}
+    # shellcheck disable=SC2086
+    "$MMSEQS" rmdb "${TMP_PATH}/complex_clu" ${VERBOSITY_PAR}
+    # shellcheck disable=SC2086
+    "$MMSEQS" rmdb "${TMP_PATH}/query" ${VERBOSITY_PAR}
+    # shellcheck disable=SC2086
+    "$MMSEQS" rmdb "${TMP_PATH}/query_h" ${VERBOSITY_PAR}
+    # shellcheck disable=SC2086
+    "$MMSEQS" rmdb "${INPUT}" ${VERBOSITY_PAR}
+    # shellcheck disable=SC2086
+    "$MMSEQS" rmdb "${INPUT}_h" ${VERBOSITY_PAR}
+    # shellcheck disable=SC2086
+    "$MMSEQS" rmdb "${TMP_PATH}/query_ca" ${VERBOSITY_PAR}
+    # shellcheck disable=SC2086
+    "$MMSEQS" rmdb "${TMP_PATH}/query_ss" ${VERBOSITY_PAR}
+    rm "${TMP_PATH}/rep_seqs.list"
+    rm -rf "${TMP_PATH}/latest"
+    rm -f "${TMP_PATH}/easymultimercluster.sh"
+fi
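
Note on `postprocessFasta` in the script above: the awk code splits each FASTA header on `_` and rejoins every field except the last, so the multimer name is the header with its final `_<chain>` suffix removed. The sketch below shows the same derivation with POSIX parameter expansion, under the assumption that the header consists of the identifier only. The function name `complex_of_chain` and the `x_MODEL_1_B` identifier are hypothetical. One difference: for a header without any underscore the awk loop yields an empty name, whereas this sketch returns the header unchanged.

```sh
#!/bin/sh
# Hypothetical helper (not part of Foldseek): derive the multimer name from a
# chain identifier by removing the shortest trailing "_<chain>" suffix.
complex_of_chain() {
    printf '%s\n' "${1%_*}"
}

complex_of_chain "5o002_A"        # identifier taken from the README sample
complex_of_chain "x_MODEL_1_B"    # hypothetical identifier with a model suffix
```

Because only the last `_`-delimited field is treated as the chain name, a chain name that itself contains an underscore would be cut in the wrong place, which is presumably why the README asks for underscore-free chain names.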
diff --git a/lib/foldseek/data/easystructuresearch.sh b/lib/foldseek/data/easystructuresearch.sh
@@ -51,6 +51,12 @@ if notExists "${TMP_PATH}/alis.dbtype"; then
         || fail "Convert Alignments died"
 fi
 
+if [ -n "${TAXONOMY}" ]; then
+    # shellcheck disable=SC2086
+    "$MMSEQS" taxonomyreport "${TARGET}${INDEXEXT}" "${INTERMEDIATE}" "${RESULTS}_report" ${TAXONOMYREPORT_PAR} \
+        || fail "taxonomyreport died"
+fi
+
 if [ -n "${REMOVE_TMP}" ]; then
     if [ -n "${GREEDY_BEST_HITS}" ]; then
         # shellcheck disable=SC2086