In this session, we will cover some basic features of the Entrez module in the Biopython package. The session will introduce you to scripting automated searches of the NCBI PubMed database, as well as some rudimentary approaches to analysing text data. Anyone who has attended both of our previous Python workshops will have all the necessary background to complete this session. If you were unable to attend those sessions, all of their Jupyter notebooks are posted in repositories within the IC-Computational-Biology-Society organisation.
NB: This session does not cover natural language processing or machine learning. Nevertheless, it should give you a foundation for further work with dedicated Python packages such as NLTK. By the end of the session, you should be able to construct your own dataset of NCBI PubMed text data, on which you could then start training machine learning models.
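To give a flavour of what the session involves, here is a minimal sketch, not the session's actual code. It assumes Biopython is installed and uses Entrez.esearch and Entrez.read to fetch PubMed IDs, followed by a simple word count of the kind meant by rudimentary text analysis. The function names search_pubmed and word_counts, the min_length threshold, and the sample sentence are all hypothetical choices made for illustration.

```python
import re
from collections import Counter


def search_pubmed(term, email, max_results=5):
    """Return a list of PubMed IDs matching `term` (needs network access)."""
    # Imported lazily so that word_counts() below works without Biopython.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks for a contact address with each request
    handle = Entrez.esearch(db="pubmed", term=term, retmax=max_results)
    try:
        record = Entrez.read(handle)
    finally:
        handle.close()
    return list(record["IdList"])


def word_counts(text, min_length=4):
    """Count lower-cased words of at least `min_length` letters in `text`."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if len(w) >= min_length)


# Placeholder sentence written for this sketch, not a real PubMed abstract.
sample = "Protein folding is hard. Protein structure prediction helps folding studies."
counts = word_counts(sample)
print(counts.most_common(2))
```

To run a live search, you would call something like search_pubmed("protein folding", "your.name@example.com"), replacing the placeholder address with your own. The word count runs offline, so you can try it before the session.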
If you are attending our virtual interactive session on Microsoft Teams, please make sure you can run Anaconda. It can be obtained from Imperial College's AppsAnywhere platform or from the official Anaconda website; the latter is recommended only if you cannot access AppsAnywhere or are completing the tutorial outside of the scheduled session.
This tutorial is intended for educational use. If you would like to use any of this material for teaching or other purposes outside the remit of the Imperial College Computational Biology Society, please contact the author listed below.
Joseph I. J. Ellaway
Email: [email protected]