Skip to content

ricardofoliveira61/ML-data-analysis

Repository files navigation

Intelligent systems for bioinformatics- Group 1

This work is developed in the ambit of curricular unit intelligent systems for bioinformatics of the Bioinformatic Master by:

This work consists in the analysis of a dataset through the utilization of machine learning algorithms, recurring to Python as the programming language. The entire analysis is present on a Jupyter Notebook, organized in sections (explained later on) containing succinct explanations of the procedures and decisions taken throughout the analysis.

For this work we selected the GDSC1 dataset. This dataset contains the wet lab IC50 for 208 drugs in 1000 cancer cells lines and can be used to design models that can predict drug response since the same compound can have different levels of responses in different patients. With this we aim to design a model that given a pair of drug and cell line genomics profile can predict the drug response and find the best drug to treat certain patient. In this dataset the RMD normalized gene expression was used for cancer lines and the SMILES for drugs. Y is the log normalized IC50.

To have access to the dataset use the following code

from tdc.multi_pred import DrugRes
data = DrugRes(name = 'GDSC1')
split = data.get_split()

Notebook sections

1. Preprocessing and data exploration

  • Review of all documentation available about the dataset
  • Load the dataset and realize an exploratory analysis
  • Prepare the dataset with the generation and selection of features and treatment of the missing values

This stage corresponds to the 1st section of the Notebook where:

  • The dataset must the described according to the documentation
  • Summarize the characteristics of the data through an exploratory analysis
  • Description of the preprocessing steps justifying the choices
  • Include graphics that represent the main characteristics of the dataset

2. Non-supervised learning

  • Utilization of the adequate visualization and dimensionality reduction technique
  • Application of clustering methods

This stage corresponds to the section 2 of the Notebook where:

  • The results must be analyzed and the procedures explain

3. Machine Learning

  • Compare the behavior of different models/methods of machine learning through the calculation of the performance metrics
  • Present the best model for the dataset

This stage correspond to the section 3 of the notebook and all the results must be reported and analyzed in a critical way

4. Deep Leaning

  • Utilization of deep learning methods similarly to the stage 3

This stage correspond to the section 4 of the notebook and must report the results and have a critical analysis.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published