Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Running the hmmsearch #79

Open
SergeyBaikal opened this issue Oct 14, 2024 · 3 comments
Open

Running the hmmsearch #79

SergeyBaikal opened this issue Oct 14, 2024 · 3 comments
Labels
question Further information is requested

Comments

@SergeyBaikal
Copy link

Dear authors, I find it difficult to run the program without knowing python. How can I run a similar command with the pyhmmer and extract only the best bit score? Could you please help me?

hmmsearch --tblout /home/Results_vog.txt -E 1e-5 --notextw --cpu 8 /home/Vog_219.hmm /home/proteins.fasta

Results_vog.txt > one_target_best_bit_score.tsv
@althonos
Copy link
Owner

althonos commented Nov 8, 2024

Hi Sergey, unfortunately I'm a bit at loss here, the whole point of PyHMMER is to make it easier to run HMMER from Python ; I have not started making a CLI so at the moment it's not so straightforward to run directly through the command line.

Unless you have more than one HMM in your query file, you will not benefit from multiprocessing either, so I'd still encourage using the original hmmsearch for that purpose.

Nevertheless, I can provide you with a Python script that does what you want if that's helpful. Do you need to get the best bit score per target sequence for a single HMM?

@SergeyBaikal
Copy link
Author

Dear althonos

Sorry for the long reply. I guess I didn't really understand what the PyHMMER is for.
Yes, I was looking for a script to extract the best matches for the hmmsearch.
But eventually came across a simple command
awk '!x[$3]++' ./My_data/Results.txt > ./My_data/besthit.txt
using hmmscan --tblout. Maybe that's not quite right.
If you have the ability to make a parser to extract the best result from hmmsearch I would appreciate it.

I found part of the solution here linnabrown/run_dbcan#27 (comment)
But I wanted a thing that would just show the best result from the results hmmsearch I got and not remove some matches.

@althonos
Copy link
Owner

Hi @SergeyBaikal,

This would probably work if you wanted to use PyHMMER: it uses a dictionary to store the best hit for each target, and then displays a small TSV table:

import pyhmmer

# load queries
alphabet = pyhmmer.easel.Alphabet.amino()
with pyhmmer.easel.SequenceFile("/home/proteins.fasta", digital=True, alphabet=alphabet) as seq_file:
    sequences = seq_file.read_block()

# record best hits by score
best_hits: dict[str, pyhmmer.plan7.Hit] = {}

# launch hmmer and collect best hit per target
with pyhmmer.plan7.HMMFile("/home/Vog_219.hmm") as hmm_file:
    for hits in pyhmmer.hmmer.hmmsearch(hmm_file, sequences):
        for hit in hits:
            hit_name = hit.name.decode()
            if hit_name not in best_hits or best_hits[hit_name].score < hit.score:
                best_hits[hit_name] = hit

# display best hits
for hit in best_hits.values():
    _decode = lambda x: None if x is None else x.decode()
    print(
        _decode(hit.hits.query.name),
        _decode(hit.hits.query.accession),
        _decode(hit.name),
        _decode(hit.accession),
        hit.evalue,
        hit.score,
        hit.bias,
        sep="\t"
    )

@althonos althonos added the question Further information is requested label Jan 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

2 participants