These scripts are scrapers that collect substance information from biomedical databases.
Running: run the scripts directly with Python 3 (dependencies: pandas, requests, BeautifulSoup with the lxml parser, urllib3).
Before running a scraper, check that your environment satisfies these dependencies.
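As a quick sanity check before running, a snippet like the one below (not part of the scrapers themselves) confirms that each required package can be imported:

```python
# Quick dependency check: confirms the packages the scrapers rely on are
# importable in the current Python 3 environment.
import importlib

for module in ("pandas", "requests", "bs4", "lxml", "urllib3"):
    try:
        importlib.import_module(module)
        print(f"{module}: OK")
    except ImportError:
        print(f"{module}: MISSING - install it (e.g. via pip) before running the scrapers")
```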
Version 1.1: searches Uniprot for protein subcellular location and optionally selects secreted proteins.
Update:
For some proteins, no Uniprot subcellular location annotation is available in the database.
Gene Ontology (Cellular Component) terms have been added as search targets to address this. The final table therefore contains two subcellular location columns, one from the Uniprot annotation and one from the GO annotation.
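As an illustration of where the two columns can come from (a hedged sketch, not the script's own scraping code; the JSON field names are assumptions based on the rest.uniprot.org layout):

```python
# Sketch: fetch the Uniprot subcellular location annotation and the GO
# Cellular Component terms for one accession via the UniProt REST API.
# Field names below are assumptions about the rest.uniprot.org JSON layout.
import requests

def fetch_locations(accession: str):
    url = f"https://rest.uniprot.org/uniprotkb/{accession}.json"
    entry = requests.get(url, timeout=30).json()

    # Column 1: Uniprot "SUBCELLULAR LOCATION" annotation, when present.
    uniprot_locs = []
    for comment in entry.get("comments", []):
        if comment.get("commentType") == "SUBCELLULAR LOCATION":
            for loc in comment.get("subcellularLocations", []):
                uniprot_locs.append(loc.get("location", {}).get("value", ""))

    # Column 2: GO Cellular Component terms (values prefixed with "C:").
    go_cc = []
    for xref in entry.get("uniProtKBCrossReferences", []):
        if xref.get("database") == "GO":
            for prop in xref.get("properties", []):
                value = prop.get("value", "")
                if prop.get("key") == "GoTerm" and value.startswith("C:"):
                    go_cc.append(value[2:])

    return "; ".join(uniprot_locs), "; ".join(go_cc)

# Example: fetch_locations("P01308")  # insulin, annotated as Secreted
```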
1. Input file: path to your ID file, a single-column list of protein Uniprot IDs; csv or xlsx format is recommended.
e.g. D:\Users\work_dir\test.csv
2. Output file: path for the output file (csv format) containing Uniprot IDs and their corresponding subcellular locations.
Leaving out the filename extension is fine and recommended; the output filename will end with _sub_loc.csv automatically.
e.g. D:\Users\work_dir\out_test ---> D:\Users\work_dir\out_test_sub_loc.csv
3. Secreted protein selection: accepts an uppercase Y or N, which determines whether secreted proteins are written to a separate file xxx_secreted.csv, where xxx is the same output name as in 2 (see the sketch after this list).
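For illustration, here is a minimal pandas sketch of the secreted-protein selection step; the column names used are assumptions, not necessarily the headers the script writes:

```python
# Sketch: select rows whose subcellular location suggests a secreted protein
# and write them to a separate xxx_secreted.csv file. The column names
# ("Uniprot_location", "GO_CC_location") are assumptions for illustration.
import pandas as pd

def write_secreted(result_csv: str, out_prefix: str) -> None:
    df = pd.read_csv(result_csv)
    combined = (
        df["Uniprot_location"].fillna("") + ";" + df["GO_CC_location"].fillna("")
    ).str.lower()
    mask = combined.str.contains("secreted|extracellular")
    df[mask].to_csv(f"{out_prefix}_secreted.csv", index=False)
```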
You can check the Uniprot ID csv or xlsx example file in Example_data/Uniprot_scraper_example. Here is an example of running it:
- When showing:
Please input your Uniprot ID list file directory (csv format recommended, e.g. D:\Users\work_dir\test.csv):
Enter the path of the input file on your computer.
For example, if I put Protein-20240315.csv under the folder C:\User\Desktop\, I should enter:
C:\User\Desktop\Protein-20240315.csv
- When showing:
Please input your output file directory (with output name you want, e.g. D:\Users\work_dir\out_test):
For example, if I want to put the result under the folder C:\User\Desktop\Result\ and name it 20240315, I should enter:
C:\User\Desktop\Result\20240315
- When showing:
Do you want to select secreted proteins as an independent file? (Y/N):
For example, if I want to select all secreted proteins or proteins located in the extracellular space, I enter Y.
Finally, you will get the same output files as 20240315_sub_loc.csv and 20240315_secreted.csv in Example_data/Uniprot_scraper_example.
Version 1.0: searches HMDB for metabolite descriptions according to CAS ID.
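As a rough illustration of this lookup (a sketch only; the search URL pattern and the HTML parsing below are assumptions about hmdb.ca and may need adjusting to the current page layout):

```python
# Sketch: look up a metabolite on HMDB by CAS number and pull its description.
# The search endpoint and HTML structure are assumptions about hmdb.ca and
# may need adjusting if the site layout changes.
import requests
from bs4 import BeautifulSoup

def hmdb_description(cas_id: str) -> str:
    # Query HMDB's search page with the CAS number (assumed endpoint).
    resp = requests.get("https://hmdb.ca/unearth/q",
                        params={"query": cas_id, "searcher": "metabolites"},
                        timeout=30)
    soup = BeautifulSoup(resp.text, "lxml")

    # Assume the first result link points at /metabolites/HMDBxxxxxxx.
    link = soup.find("a", href=lambda h: h and h.startswith("/metabolites/HMDB"))
    if link is None:
        return ""

    page = requests.get("https://hmdb.ca" + link["href"], timeout=30)
    page_soup = BeautifulSoup(page.text, "lxml")

    # Assume the description sits in a cell next to a "Description" header.
    header = page_soup.find("th", string="Description")
    cell = header.find_next("td") if header else None
    return cell.get_text(strip=True) if cell else ""
```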
1. Input file: path to your metabolite list file, a column list of CAS IDs; csv or xlsx format is recommended.
e.g. D:\Users\work_dir\test.csv
2. Output file: path for the output file (csv or xlsx) containing CAS IDs and their corresponding descriptions from HMDB.
Leaving out the filename extension is recommended; the output filename will end with .csv or .xlsx automatically (matching your input).
e.g. D:\Users\work_dir\out_test ---> D:\Users\work_dir\out_test.csv
3. Sheet: you will need this parameter only when using an xlsx file as input. It determines which sheet the program processes.
e.g. 1 = Sheet1, 2 = Sheet2. You can also enter the name of a sheet.
If you want all sheets searched, just press Enter (see the sketch after this list).
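To make the sheet parameter and output naming concrete, here is a small pandas sketch; it mirrors the behaviour described above but is illustrative, and the script's exact logic may differ:

```python
# Sketch: read the metabolite list with an optional sheet choice, and build
# the output path with the same extension as the input. Illustrative only.
import os
import pandas as pd

def load_input(path: str, sheet: str = "") -> pd.DataFrame:
    if path.lower().endswith(".xlsx"):
        if sheet == "":
            # Just pressing Enter: read every sheet and stack them.
            sheets = pd.read_excel(path, sheet_name=None)  # dict of DataFrames
            return pd.concat(sheets.values(), ignore_index=True)
        # "1" -> first sheet (pandas uses a 0-based index); otherwise a sheet name.
        sheet_name = int(sheet) - 1 if sheet.isdigit() else sheet
        return pd.read_excel(path, sheet_name=sheet_name)
    return pd.read_csv(path)

def output_path(out_prefix: str, input_path: str) -> str:
    # The output keeps the same extension as the input, e.g. out_test -> out_test.csv
    ext = os.path.splitext(input_path)[1] or ".csv"
    return out_prefix + ext
```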
You can check the example xlsx file in Example_data/HMDB_scraper_example. Here is an example of running it:
- When showing:
Please input your list file directory of metabolites (csv or xlsx format recommended, e.g. D:\Users\work_dir\test.csv):
Enter the path of the input file Metabolite_searching.xlsx on your computer.
For example, if I put it under the folder C:\User\Desktop\, I should enter:
C:\User\Desktop\Metabolite_searching.xlsx
- When showing:
Please input your output file directory (with output name you want, e.g. D:\Users\work_dir\out_test):
For example, if I want to put the result under the folder C:\User\Desktop\Result\ and name it Result_HMDB, I should enter:
C:\User\Desktop\Result\Result_HMDB
- When showing:
Which sheet do you want to search for?
For example, if I want to search all sheets in the input file, I just press Enter on my keyboard.
Finally, you will get the same output file as Result_HMDB.xlsx in Example_data/HMDB_scraper_example. Have a try!
Thanks to Yusong Zhang from Shandong University for asking me to develop these tools.
I will keep updating these scrapers as more requirements and useful information come up.
If you run into any issue, please contact [email protected]