generated from carpentries/workbench-template-md
Commit
Auto-generated via {sandpaper} Source : ab05b04 Branch : main Author : Eva Viviani <[email protected]> Time : 2024-05-06 08:29:05 +0000 Message : Merge pull request #8 from esciencecenter-digital-skills/episode-1-develop Episode 1 develop
1 parent
728f100
commit 6eee140
Showing 7 changed files with 169 additions and 185 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,58 @@ | ||
--- | ||
title: "Welcome" | ||
teaching: 5 | ||
exercises: 0 | ||
--- | ||
|
||
:::::: questions | ||
- Who is this lesson for? | ||
- What will be covered in this lesson? | ||
:::::: | ||
|
||
:::::: objectives | ||
- Identify the target audience | ||
- Identify the learning goals of the lesson | ||
:::::: | ||
|
||
# Welcome | ||
This is a hands-on introduction to Natural Language Processing (NLP). NLP refers to a set of techniques that apply statistical methods, | ||
with or without insights from linguistics, to understand natural (i.e., human) language in order to solve real-world tasks. | ||
|
||
This course is designed to equip researchers in the humanities and social sciences with the foundational | ||
skills needed to carry out text-based research projects. | ||
|
||
## What will we be covering in this lesson? | ||
|
||
This lesson provides a high-level introduction to NLP with particular emphasis on applications in the humanities and the social | ||
sciences. Throughout the lesson we will focus on one particular problem: how to identify key entities in text (such as people, | ||
places, companies, dates and more) and label each of them with the right category name. Towards the end of the lesson, | ||
we will also cover other types of applications (such as topic modelling and text generation). | ||
|
||
After following this lesson, learners will be able to: | ||
|
||
- Explain and differentiate the core topics in NLP | ||
- Identify what kinds of tasks NLP techniques excel at, and what their limitations are | ||
- Structure a typical NLP pipeline | ||
- Extract vector representations of individual words, and visualise and manipulate them | ||
- Apply a machine learning algorithm to textual data to extract and categorise names of entities (e.g., places, people) | ||
- Apply popular tools and libraries used to solve other tasks in NLP (such as topic modelling and text generation) | ||
|
||
## Software packages required | ||
The lesson is coded entirely in Python. We are going to use Jupyter notebooks throughout the lesson and the following packages: | ||
|
||
- spacy | ||
- gensim | ||
- transformers | ||
|
||
## Dataset | ||
In this lesson, we'll use N books from [Project Gutenberg](https://www.gutenberg.org/), in their Plain Text UTF-8 version. | ||
|
||
- The Adventures of Sherlock Holmes by Arthur Conan Doyle - [Full text](https://www.gutenberg.org/cache/epub/1661/pg1661.txt) - [Wikipedia](https://en.wikipedia.org/wiki/The_Adventures_of_Sherlock_Holmes) | ||
- The Count of Monte Cristo by Alexandre Dumas - [Full text](https://www.gutenberg.org/cache/epub/1184/pg1184.txt) - [Wikipedia](https://en.wikipedia.org/wiki/The_Count_of_Monte_Cristo) | ||
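As a rough sketch of how these books could be loaded, the snippet below downloads a Plain Text UTF-8 file and trims the Project Gutenberg header and license text. The helper names `download_book` and `strip_gutenberg_boilerplate` are hypothetical, and the code assumes that the files mark the body of the book with lines starting with `*** START OF` and `*** END OF`; if a file does not, the text is returned unchanged.

```python
import re
import urllib.request

# URLs copied from the list of books above.
BOOK_URLS = {
    "sherlock_holmes": "https://www.gutenberg.org/cache/epub/1661/pg1661.txt",
    "monte_cristo": "https://www.gutenberg.org/cache/epub/1184/pg1184.txt",
}


def download_book(url: str) -> str:
    """Fetch a Plain Text UTF-8 book and return it as a string (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        # "utf-8-sig" also drops a leading byte order mark, if the file has one.
        return response.read().decode("utf-8-sig")


def strip_gutenberg_boilerplate(text: str) -> str:
    """Keep only the text between the '*** START OF' and '*** END OF' marker lines.

    Assumption: the file uses these marker lines. If a marker is missing,
    nothing is cut on that side.
    """
    start = re.search(r"^\*\*\* ?START OF.*$", text, flags=re.MULTILINE)
    end = re.search(r"^\*\*\* ?END OF.*$", text, flags=re.MULTILINE)
    begin = start.end() if start else 0
    finish = end.start() if end else len(text)
    return text[begin:finish].strip()


# Example usage (requires an internet connection, so it is left commented out):
# raw = download_book(BOOK_URLS["sherlock_holmes"])
# book = strip_gutenberg_boilerplate(raw)
# print(book[:200])
```

Keeping the download and the clean-up in separate functions makes it possible to try the clean-up step on a small string before fetching a whole book.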
|
||
|
||
:::::: keypoints | ||
- This lesson on natural language processing in Python is for researchers working in the humanities and/or social sciences | ||
- This lesson is an introduction to NLP and aims to implement a first set of practical NLP applications from scratch | ||
:::::: | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,97 @@ | ||
--- | ||
title: "Episode 1" | ||
teaching: 10 | ||
exercises: 2 | ||
--- | ||
|
||
:::::: questions | ||
- What is natural language processing (NLP)? | ||
- Why is it important to learn about NLP? | ||
- What are some classic tasks associated with NLP? | ||
:::::: | ||
|
||
:::::: objectives | ||
- Recognise the importance and benefits of learning about NLP | ||
- Identify and describe classic tasks and challenges in NLP | ||
- Explore practical applications of natural language processing in industry and research | ||
:::::: | ||
|
||
## Introducing NLP | ||
|
||
### What is NLP? | ||
Natural language processing (NLP) is an area of research and application that focuses on making natural (i.e., human) language accessible to computers so that computers can be used to perform useful tasks (Chowdhury & Chowdhury, 2023). Research in NLP is highly interdisciplinary, drawing on concepts from computer science, linguistics, logic, mathematics, psychology, and other fields. In the past decade, NLP has evolved significantly with advances in technology, to the point that it has become embedded in our daily lives: automatic language translation and ChatGPT are only two examples. This evolution has enhanced its applications and expanded its interaction with fields such as artificial intelligence and machine learning, reaching practically every other research field. | ||
|
||
### Why do we care? | ||
The past decade's breakthroughs have resulted in NLP being increasingly used in a range of diverse domains such as retail (e.g., customer service chatbots), healthcare (e.g., AI-assisted hearing devices), finance (e.g., anomaly detection in monetary transactions), law (e.g., legal research), and many more. These applications are possible because NLP researchers have developed, and continue to develop, tools and techniques that make computers understand and manipulate language effectively. | ||
|
||
With so many contributions and such impressive advances in recent years, it is an exciting time to start bringing NLP techniques into your own work. Thanks to dedicated Python libraries, these tools are now more accessible. They offer modularity, so you can build on them in your own code, and scalability, so you can process vast amounts of text efficiently. | ||
|
||
Whether dealing with text or audio, NLP tools provide a means to handle and interpret language data to meet specific needs and objectives. Even those without advanced programming skills can leverage these tools to address problems in the social sciences, the humanities, or any field where language plays a crucial role. In a nutshell, NLP opens up possibilities, making sophisticated techniques accessible to a broad audience. | ||
|
||
:::::::::::: challenge | ||
## NLP in the real world | ||
|
||
Name three to five products that you use on a daily basis and that rely on NLP techniques. To solve this exercise you can get | ||
some help from the web. | ||
|
||
|
||
:::::: solution | ||
These are some of the most popular NLP-based products that we use on a daily basis: | ||
|
||
- Voice-based assistants (e.g., Alexa, Siri, Cortana) | ||
- Machine translation (e.g., Google Translate, Amazon Translate) | ||
- Search engines (e.g., Google, Bing, DuckDuckGo) | ||
- Keyboard autocompletion on smartphones | ||
- Spam filtering | ||
- Spell and grammar checking apps | ||
:::::: | ||
:::::::::::: | ||
|
||
## What is NLP typically good at? | ||
|
||
Here's a collection of fundamental tasks in NLP: | ||
|
||
- Text classification | ||
- Information extraction | ||
- NER (named entity recognition) | ||
- Next word prediction | ||
- Text summarization | ||
- Question answering | ||
- Topic modeling | ||
- Machine translation | ||
- Conversational agent | ||
|
||
In this lesson we are going to look at the NER and topic modeling tasks in detail, and learn how to develop solutions that work for these particular use cases. Specifically, our goal in this lesson will be to identify characters and locations in novels, and determine the most relevant topics in these books. However, it is useful to have an understanding of the other tasks and their challenges. | ||
|
||
### Text classification | ||
|
||
The goal of text classification is to assign a category label to a text or a document based on its content. This task is used, for example, in spam filtering (is this email spam or not?) and sentiment analysis (is this text positive or negative?). | ||
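To make the idea concrete, here is a minimal sketch of a spam filter built as a Naive Bayes classifier using only the Python standard library. Naive Bayes is chosen here purely for illustration and is not necessarily the approach used later in the lesson; the training sentences and the function names `train_naive_bayes` and `classify` are invented for this example.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training data, invented for this sketch.
TRAIN = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("notes from the project meeting", "ham"),
]


def train_naive_bayes(examples):
    """Count how often each word occurs under each label."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter()            # label -> number of documents
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        # Start from the prior probability of the label.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the score.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


model = train_naive_bayes(TRAIN)
print(classify("free prize", model))
print(classify("project meeting on monday", model))
```

With only four training sentences this classifier is a toy, but the structure (count words per label, then score a new text against each label) is the same one a real bag-of-words classifier follows.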
|
||
### Information extraction | ||
|
||
This term refers to a collection of techniques for extracting relevant information from a text or a document and finding relationships between the extracted pieces of information. This task is useful for discovering cause-effect links and populating databases. Examples include finding and classifying relations among entities mentioned in a text (e.g., X is the child of Y) or geospatial relations (e.g., Amsterdam is north of Brussels). | ||
|
||
### Named Entity Recognition (NER) | ||
|
||
The task of detecting names, dates, language names, events, works of art, countries, organisations, and many other kinds of entities in a text. | ||
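As a very small illustration of what the input and output of NER look like, the sketch below matches a text against a hand-written list of known names (a gazetteer). This dictionary lookup is a deliberately simple stand-in and not how statistical NER models work; the entries in `GAZETTEER`, the label names, and the function `find_entities` are all hypothetical.

```python
import re

# Hypothetical gazetteer: a hand-written mapping from known names to labels.
GAZETTEER = {
    "Sherlock Holmes": "PERSON",
    "Watson": "PERSON",
    "London": "LOCATION",
    "Baker Street": "LOCATION",
}


def find_entities(text, gazetteer=GAZETTEER):
    """Return (entity, label) pairs in the order they appear in the text."""
    found = []
    for name, label in gazetteer.items():
        # \b keeps us from matching inside longer words (e.g. "Londoner").
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text):
            found.append((match.start(), match.group(), label))
    # Sort by position in the text, then drop the position.
    return [(name, label) for _, name, label in sorted(found)]


sentence = "Sherlock Holmes and Watson left Baker Street to cross London."
for entity, label in find_entities(sentence):
    print(f"{entity}: {label}")
```

A lookup like this only finds names that are already in the list, which is one reason machine learning approaches are used for NER in practice.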
|
||
### Next word prediction | ||
|
||
This task involves predicting what the next word in a sentence will be, based on the history of previous words. | ||
Speech recognition, spelling correction, and handwriting recognition all rely on an implementation of this task. | ||
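One of the simplest ways to predict the next word is a bigram model, which looks only at the single previous word and picks the word that followed it most often in a training text. The sketch below uses a made-up toy corpus and hypothetical function names (`build_bigram_model`, `predict_next`); modern systems condition on a much longer history.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus, invented for this sketch.
corpus = "the cat sat on the mat . the cat ate the fish . the dog sat on the rug ."


def build_bigram_model(text):
    """Count, for every word, which words follow it and how often."""
    tokens = text.lower().split()
    following = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        following[current][nxt] += 1
    return following


def predict_next(word, model):
    """Return the most frequent follower of `word`, or None if the word is unseen."""
    candidates = model.get(word.lower())
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]


bigrams = build_bigram_model(corpus)
print(predict_next("sat", bigrams))
print(predict_next("the", bigrams))
```

In this corpus "sat" is always followed by "on", and "the" is followed by "cat" more often than by any other word, so those are the predictions the model makes.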
|
||
### Text summarization | ||
|
||
The task of creating short summaries of longer documents while retaining the core content. | ||
|
||
### Question answering | ||
|
||
The task of building a system that answers questions posed in natural (i.e., human) language. For example, many websites nowadays offer customer service in the form of a chatbot. | ||
|
||
### Topic modeling | ||
|
||
The task of discovering topical structure in documents. Topics describe the content of a document. For instance, a topic model run on a document narrating the events of WWII might output topics covering the war, troops, geographical locations, weapons, etc. | ||
|
||
### Machine translation | ||
|
||
The task of translating a piece of text from one language to another. |
This file was deleted.
This file was deleted.