Unveiling Medical Insights: Advanced Topic Extraction from Scientific Articles

This repository accompanies the research paper "Unveiling Medical Insights: Advanced Topic Extraction from Scientific Articles," which explores the use of advanced Natural Language Processing (NLP) techniques for extracting key topics from scientific literature, with a focus on breast cancer research. The work leverages the TextRank algorithm and Large Language Models (LLMs) using the TripleA tool to analyze and extract topics from nearly 10,000 scientific paper abstracts.

Key Features:
Usage
Basic concept
- Topic Extraction
- Co-occurrence topic map
How to use
- Install from source
- Run Step by Step
Dataset
GraphML Files
Graph Info
Article
Contributors
License

Key Features:

Data Extraction: The repository includes scripts for retrieving and processing scientific paper abstracts using the PubMed API.
Topic Extraction: Implements TextRank and an open-source LLM (Mistral) to extract and compare topics from the abstracts.
Graph Construction: Generates co-occurrence graphs of extracted topics, enabling visualization and further analysis of relationships between key terms.
Visualization: Utilizes VOSviewer for visualizing the co-occurrence graphs, providing insights into trends and patterns within the data.
Comparative Analysis: The repository offers a comparative analysis of the performance of TextRank and LLMs in topic extraction, showing that LLMs tend to produce more clustered and interconnected topic networks.

Usage

The pipeline is designed for reproducibility, allowing researchers to apply the methodology to other domains or datasets. Potential applications include bibliometric analysis, trend identification in research fields, and development of knowledge graphs for clinical decision support. The full code, datasets, and documentation are available within this repository to facilitate further research and application in the biomedical field.

Basic concept

Topic Extraction

Topic Extraction, also known as "automatic topic discovery" or "topic modeling," is a text analysis technique used to identify the main themes or concepts in a collection of documents or a large text body. The goal is to automatically summarize and categorize the content by extracting the most relevant keywords, phrases, or topics, thereby enabling users to better understand and explore the information. Topic extraction algorithms detect patterns and probabilistically determine topics based on the frequency and distribution of words, as well as their co-occurrence in the text. These techniques help in a variety of applications, such as sentiment analysis, content recommendation, and search optimization.
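To make the idea of ranking terms by co-occurrence concrete, here is a minimal, simplified sketch of a TextRank-style keyword ranker using only the Python standard library. It is an illustration under stated assumptions, not the code used in this repository: the function name `textrank_keywords`, the tiny stopword list, the window size, and the sample text are all hypothetical, and it ranks single words rather than noun phrases.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "and", "are", "as", "for", "in", "is", "of", "on", "the", "to", "with"}

# Hypothetical sample text, not taken from the dataset.
SAMPLE_TEXT = (
    "Breast cancer therapy targets cancer cells. "
    "Cancer drug delivery improves cancer treatment."
)


def textrank_keywords(text, window=3, damping=0.85, iterations=50, top_n=5):
    """Rank single words by PageRank over a word co-occurrence graph."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

    # Undirected graph: link each word to the words that follow it
    # within a sliding window of `window` tokens.
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1 : i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Fixed number of PageRank update sweeps (no convergence check).
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


print(textrank_keywords(SAMPLE_TEXT))
```

Words that co-occur with many different neighbours accumulate the highest scores, which is why a hub term tends to rank first in a short text like the sample.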

Co-occurrence topic map

A Co-occurrence Topic Map is a visual representation of the relationships between topics, keywords, or concepts based on their co-occurrence within a collection of documents or a large text body. In a Co-occurrence Topic Map, topics that frequently appear together in the text are connected by lines, edges, or proximity, indicating a thematic or semantic relationship between the connected items. The map can be used to explore and analyze the structure and content of a text corpus, identify key themes or trends, and support navigation and knowledge discovery. This type of visualization can provide a comprehensive and intuitive overview of the data, revealing hidden patterns and enabling users to gain insights that might be difficult to discern from the raw text alone.
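The counting behind such a map can be sketched in a few lines. The example below assumes each document is represented by a list of topic strings and weights an edge by the number of documents in which both topics appear. The function name `co_occurrence_edges` and the sample topic lists are hypothetical, and the normalisation (lowercasing, de-duplicating) is an assumption rather than the repository's documented behaviour.

```python
from collections import Counter
from itertools import combinations


def co_occurrence_edges(topic_lists):
    """Count how many documents mention each unordered pair of topics."""
    edges = Counter()
    for topics in topic_lists:
        # Normalise case and de-duplicate so a pair counts once per document.
        unique = sorted({t.strip().lower() for t in topics})
        edges.update(combinations(unique, 2))
    return edges


# Hypothetical topic lists for three documents.
docs = [
    ["Drug delivery", "Liposomes", "TNBC therapy"],
    ["Drug delivery", "Liposomes"],
    ["Chemotherapy", "Drug delivery"],
]
edges = co_occurrence_edges(docs)
for (a, b), weight in edges.most_common(3):
    print(f"{a} -- {b}: {weight}")
```

Sorting the topics before pairing keeps each pair in a canonical order, so ("a", "b") and ("b", "a") are never counted as separate edges.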

How to use

If you only need the outputs of this project, you can use the published dataset and apply other methods or analyses to it. GraphML files are provided for the different co-occurrence graphs. To generate a new dataset and produce your own outputs, you need to run every step of the pipeline. The pipeline uses the TripleA library; its installation is covered in the "Install from source" section.

Install from source

Clone repository:

git clone https://github.com/mjafarpour87/medical-insights.git

or

git clone git@github.com:mjafarpour87/medical-insights.git

Create virtual environment:

python -m venv venv

Activate virtual environment:

Windows

.\venv\Scripts\activate

Linux

source venv/bin/activate

Install requirements:

pip install -r requirements.txt

Install the latest version of TripleA from GitHub:

pip install git+https://github.com/EhsanBitaraf/triple-a.git

Download the spaCy English model:

python -m spacy download en_core_web_sm

Run Step by Step

#	File Name	Description
Step 1	step01_check_config.py	Check the TripleA configuration.
Step 2	step02_get_pubmed.py	Retrieve relevant papers with a minimum level of content quality, using the search strategy `("Breast Cancer"[Title]) AND (Therapy[Title])`.
Step 3	step03_move_state_forward.py	Run the TripleA operators that process paper metadata and content through successive states, including extracting keywords and MeSH terms from the metadata.
Step 4	step04_extract_topic_textrank.py	Extract topics from the title and abstract with TextRank.
Step 5	step05_extract_topic_with_llm.py	Extract topics from the title and abstract with an LLM. A template is used to extract topics from the abstract of each article, which you can see here. We used the Mistral-7B-Instruct-v0.2 model.
Step 6	step06_repair_response.py	Repair the JSON format of the LLM responses.
Step 7	step07_export_dataset.py	Export the dataset.
Step 8	step08_generate_co_occurrence_graph.py	Generate the co-occurrence graphs and export them in GraphML and VOSviewer formats.
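Step 6 exists because LLM replies are not always valid JSON. As a rough illustration of what such a repair can involve, the sketch below extracts the outermost JSON span from a noisy reply and removes trailing commas. This is an assumption about the kind of damage being repaired, not a description of `step06_repair_response.py`; the function name `repair_llm_json` and the sample reply are hypothetical.

```python
import json
import re


def repair_llm_json(response):
    """Best-effort extraction of a JSON value from a noisy LLM reply."""
    # Keep only the outermost [...] or {...} span, dropping surrounding prose.
    match = re.search(r"[\[{].*[\]}]", response, flags=re.DOTALL)
    if match is None:
        return None
    candidate = match.group(0)
    # Remove trailing commas that appear right before a closing bracket or brace.
    candidate = re.sub(r",\s*([\]}])", r"\1", candidate)
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None  # Give up rather than guess at the intended content.


# Hypothetical reply with surrounding prose and a trailing comma.
reply = 'Here are the topics: {"topics": ["Liposomes", "Drug delivery",]} Hope this helps.'
print(repair_llm_json(reply))
```

The trailing-comma rule is deliberately naive: it would also alter a string value that happens to contain `,]`, so a production repair step would need a more careful approach.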

Dataset

Step 7 produces a dataset that can be used in the following steps and in other studies. Each record contains the article title, publication year, PMID, and keywords, together with the topics extracted with the LLM and the topics extracted with TextRank. The dataset can also be downloaded from here. It is stored in JSON format.

Each article in the dataset has the following JSON structure:

{
        "title": "Review of recent preclinical and clinical research on ligand-targeted liposomes as delivery systems in triple negative breast cancer therapy.",
        "year": "2024",
        "pmid": "38520185",
        "keywords": [
            "Triple negative breast cancer",
            "drug carriers",
            "ligand-targeted liposomes",
            "liposome"
        ],
        "textrank_topics": [
            "TNBC treatment",
            "progressed TNBC treatment",
            "various treatment methods",
            "targeted treatment",
            "triple negative breast cancer therapy",
            "targeted drug carriers",
            "TNBC",
            "breast cancer patients",
            "appropriate treatment",
            "drug delivery"
        ],
        "llm_topics": [
            "Triple-negative breast cancer (TNBC)",
            "Chemotherapy",
            "Targeted treatment",
            "Liposomes",
            "Drug delivery",
            "Ligand-targeted liposomes",
            "TNBC therapy",
            "Preclinical research",
            "Clinical research",
            "MDR cancer cells"
        ]
    },
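With records in this shape, the two topic lists can be compared directly. The sketch below computes a case-insensitive Jaccard overlap between `textrank_topics` and `llm_topics` for each article. It is only an illustration of how the fields can be consumed: the Jaccard measure is not the comparison used in the paper, the function name `topic_overlap` is hypothetical, and the inline record is a shortened, made-up example rather than real data.

```python
import json


def topic_overlap(article):
    """Jaccard overlap between TextRank and LLM topics (case-insensitive)."""
    textrank = {t.lower() for t in article["textrank_topics"]}
    llm = {t.lower() for t in article["llm_topics"]}
    union = textrank | llm
    return len(textrank & llm) / len(union) if union else 0.0


# Shortened illustrative record using the same field names as the dataset.
articles = json.loads("""
[
  {
    "title": "Example article",
    "textrank_topics": ["TNBC treatment", "targeted treatment", "drug delivery"],
    "llm_topics": ["Chemotherapy", "Targeted treatment", "Drug delivery", "Liposomes"]
  }
]
""")

for article in articles:
    print(article["title"], round(topic_overlap(article), 3))
```

Exact string matching undercounts overlap when the two methods phrase the same concept differently (for example "TNBC" versus "Triple-negative breast cancer (TNBC)"), so the score should be read as a lower bound on agreement.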

If you use this dataset in other scientific work, please cite it as follows:

Bitaraf, Ehsan (2024). Topic Extraction Dataset. figshare. Dataset. https://doi.org/10.6084/m9.figshare.25533532

GraphML Files

GraphML output is available for all three co-occurrence graphs:

Co-occurrence topic graph (topic extraction with LLM) GraphML file
Co-occurrence topic graph (topic extraction with TextRank) GraphML file
Co-occurrence keyword graph GraphML file
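These files are plain XML, so they can be inspected without any graph library. The sketch below counts nodes and edges in a GraphML document using the standard library and the GraphML XML namespace. The function name `graph_size` and the inline sample graph are hypothetical; the attributes stored in the repository's actual files are not shown here and may differ.

```python
import xml.etree.ElementTree as ET

# The XML namespace defined by the GraphML format.
GRAPHML_NS = {"g": "http://graphml.graphdrawing.org/xmlns"}

# Hypothetical miniature graph for illustration only.
SAMPLE = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph id="G" edgedefault="undirected">
    <node id="drug delivery"/>
    <node id="liposomes"/>
    <node id="chemotherapy"/>
    <edge source="drug delivery" target="liposomes"/>
    <edge source="drug delivery" target="chemotherapy"/>
  </graph>
</graphml>"""


def graph_size(graphml_text):
    """Return (node_count, edge_count) for a GraphML document."""
    root = ET.fromstring(graphml_text)
    nodes = root.findall(".//g:node", GRAPHML_NS)
    edges = root.findall(".//g:edge", GRAPHML_NS)
    return len(nodes), len(edges)


print(graph_size(SAMPLE))
```

To apply this to a downloaded file, read its contents into a string first and pass that string to `graph_size`.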

Graph Info

Method	Graph Nodes	Graph Edges	Graph Average Degree	Graph Density	Graph Average Clustering Coefficient	Graph Degree Assortativity Coefficient	Components
LLM	45806	357482	7.804261450465004	0.00034076024235192684	0.9052766558811535	-0.06969792652358299	514
Keyword	15555	337659	21.707425265188043	0.0027912338003327816	0.8692939688222773	-0.1513162822347413	69
Textrank	41185	288024	6.9934199344421515	0.00033961829518464216	0.8905241147162223	-0.07659374291256468	86

A comparison of the constructed topic/keyword co-occurrence networks with metrics.
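Two of the metrics above can be recomputed from the node and edge counts alone. The sketch below uses the standard density formula for an undirected simple graph, 2E / (N(N-1)), which reproduces the reported density values. It also shows that the reported "Graph Average Degree" values match E / N rather than the more common 2E / N; that is an observation from the arithmetic, not a statement about how the authors computed the figure. The function names are hypothetical.

```python
def density(nodes, edges):
    """Density of an undirected simple graph: 2E / (N(N-1))."""
    return 2 * edges / (nodes * (nodes - 1))


def edges_per_node(nodes, edges):
    """Edges divided by nodes; matches the reported average degree column."""
    return edges / nodes


# (nodes, edges) copied from the table above.
REPORTED = {
    "LLM": (45806, 357482),
    "Keyword": (15555, 337659),
    "Textrank": (41185, 288024),
}

for method, (n, e) in REPORTED.items():
    print(f"{method}: density={density(n, e):.6g}, edges/node={edges_per_node(n, e):.6g}")
```

Under the 2E / N convention, each average degree would be twice the value listed in the table, which is worth keeping in mind when comparing against numbers from other tools.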

Topic Co-occurrence network using TextRank Algorithm

Topic Co-occurrence network using LLM

Keyword co-occurrence network

Article

The paper was accepted and published at MIE 2024. To cite this work:

Bitaraf E, Jafarpour M, Shool S, Saboori Amleshi R. Unveiling Medical Insights: Advanced Topic Extraction from Scientific Articles. Stud Health Technol Inform. 2024 Aug 22;316:944-948. doi: 10.3233/SHTI240566. PMID: 39176947.

Contributors

Made with contrib.rocks.

Please see our contributing guidelines for more details on how to get involved.

License

This repository is available under the MIT License.
