differences for PR #8
actions-user committed May 1, 2024
1 parent 2bcb58a commit dc2e2a2
Showing 8 changed files with 169 additions and 230 deletions.
58 changes: 58 additions & 0 deletions 01-welcome.md
@@ -0,0 +1,58 @@
---
title: "Welcome"
teaching: 5
exercises: 0
---

:::::: questions
- Who is this lesson for?
- What will be covered in this lesson?
::::::

:::::: objectives
- Identify the target audience
- Identify the learning goals of the lesson
::::::

# Welcome
This is a hands-on introduction to Natural Language Processing (NLP). NLP refers to a set of techniques that apply statistical methods,
with or without insights from linguistics, to understand natural (i.e., human) language in order to solve real-world tasks.

This course is designed to equip researchers in the humanities and social sciences with the foundational
skills needed to carry out text-based research projects.

## What will we be covering in this lesson?

This lesson provides a high-level introduction to NLP with particular emphasis on applications in the humanities and the social
sciences. Throughout the lesson we will focus on one particular problem: how to identify key information in text (such as people,
places, companies, dates and more) and label each item with the right category name. Towards the end of the lesson,
we will also cover other types of applications (such as topic modelling and text generation).

After following this lesson, learners will be able to:

- Explain and differentiate the core topics in NLP
- Identify what kinds of tasks NLP techniques excel at, and what their limitations are
- Structure a typical NLP pipeline
- Extract vector representations of individual words, and visualise and manipulate them
- Apply a machine learning algorithm to textual data to extract and categorise names of entities (e.g., places, people)
- Apply popular tools and libraries used to solve other tasks in NLP (such as topic modelling and text generation)

## Software packages required
The lesson is coded entirely in Python. We are going to use Jupyter notebooks throughout the lesson and the following packages:

- spacy
- gensim
- transformers

## Dataset
In this lesson, we'll use N books from [Project Gutenberg](https://www.gutenberg.org/), in their Plain Text UTF-8 version.

- The Adventures of Sherlock Holmes by Arthur Conan Doyle - [Full text](https://www.gutenberg.org/cache/epub/1661/pg1661.txt) - [Wikipedia](https://en.wikipedia.org/wiki/The_Adventures_of_Sherlock_Holmes)
- The Count of Monte Cristo by Alexandre Dumas - [Full text](https://www.gutenberg.org/cache/epub/1184/pg1184.txt) - [Wikipedia](https://en.wikipedia.org/wiki/The_Count_of_Monte_Cristo)
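
As a rough sketch of how these plain-text files could be prepared, the snippet below downloads one book and trims the Project Gutenberg licence header and footer. The helper names (`fetch_book`, `strip_gutenberg_boilerplate`) are hypothetical, and the code assumes the files contain the usual `*** START OF ...` and `*** END OF ...` marker lines; inspect the downloaded text before relying on this.

```python
import urllib.request

# URL taken from the dataset list above.
SHERLOCK_URL = "https://www.gutenberg.org/cache/epub/1661/pg1661.txt"


def fetch_book(url: str) -> str:
    """Download a plain-text book and decode it as UTF-8 (dropping a BOM if present)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8-sig")


def strip_gutenberg_boilerplate(text: str) -> str:
    """Keep only the text between the START and END marker lines.

    Assumption: marker lines begin with '*** START OF' and '*** END OF'.
    If a marker is missing, that side of the text is left untouched.
    """
    lines = text.splitlines()
    start, end = 0, len(lines)
    for i, line in enumerate(lines):
        if line.startswith("*** START OF"):
            start = i + 1
        elif line.startswith("*** END OF"):
            end = i
            break
    return "\n".join(lines[start:end]).strip()
```

Calling `strip_gutenberg_boilerplate(fetch_book(SHERLOCK_URL))` should then return the body of the book as a single string, ready for further processing.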


:::::: keypoints
- This lesson on natural language processing in Python is for researchers working in the humanities and/or social sciences
- This lesson is an introduction to NLP and aims to implement a first set of practical NLP applications from scratch
::::::

97 changes: 97 additions & 0 deletions 02-introduction.md
@@ -0,0 +1,97 @@
---
title: "Episode 1"
teaching: 10
exercises: 2
---

:::::: questions
- What is natural language processing (NLP)?
- Why is it important to learn about NLP?
- What are some classic tasks associated with NLP?
::::::

:::::: objectives
- Recognise the importance and benefits of learning about NLP
- Identify and describe classic tasks and challenges in NLP
- Explore practical applications of natural language processing in industry and research
::::::

## Introducing NLP

### What is NLP?
Natural language processing (NLP) is an area of research and application that focuses on making natural (i.e., human) language accessible to computers so that they can be used to perform useful tasks (Chowdhury & Chowdhury, 2023). Research in NLP is highly interdisciplinary, drawing on concepts from computer science, linguistics, logic, mathematics, psychology, and other fields. In the past decade, NLP has advanced to the point that it has become embedded in our daily lives: automatic language translation and ChatGPT are only two examples. This evolution has broadened its applications and deepened its interaction with fields such as artificial intelligence and machine learning, so that it now reaches practically every research field.

### Why do we care?
The past decade's breakthroughs have resulted in NLP being increasingly used in a range of diverse domains such as retail (e.g., customer service chatbots), healthcare (e.g., AI-assisted hearing devices), finance (e.g., anomaly detection in monetary transactions), law (e.g., legal research), and many more. These applications are possible because NLP researchers have developed, and continue to develop, tools and techniques that let computers understand and manipulate language effectively.

With so many contributions and such impressive advances in recent years, it is an exciting time to start bringing NLP techniques into your own work. These tools are now easily accessible through dedicated Python libraries that offer modularity (i.e., you can build on them in your own code) and scalability (i.e., you can process vast amounts of text efficiently), without requiring you to be an advanced Python programmer.

Whether dealing with text or audio, NLP tools provide a means to handle and interpret language data to meet specific needs and objectives. Even those without advanced programming skills can leverage these tools to address problems in the social sciences, the humanities, or any field where language plays a crucial role. In a nutshell, NLP opens up possibilities by making sophisticated techniques accessible to a broad audience.

:::::::::::: challenge
## NLP in the real world

Name three to five products that you use on a daily basis and that rely on NLP techniques. To solve this exercise you can get
some help from the web.


:::::: solution
These are some of the most popular NLP-based products that we use on a daily basis:

- Voice-based assistants (e.g., Alexa, Siri, Cortana)
- Machine translation (e.g., Google Translate, Amazon Translate)
- Search engines (e.g., Google, Bing, DuckDuckGo)
- Keyboard autocompletion on smartphones
- Spam filtering
- Spell and grammar checking apps
::::::
::::::::::::

## What is NLP typically good at?

Here's a collection of fundamental tasks in NLP:

- Text classification
- Information extraction
- NER (named entity recognition)
- Next word prediction
- Text summarization
- Question answering
- Topic modeling
- Machine translation
- Conversational agent

In this lesson we are going to look at the NER and topic modeling tasks in detail, and learn how to develop solutions that work for these particular use cases. Specifically, our goal will be to identify characters and locations in novels, and to determine the most relevant topics in these books. However, it is useful to have an understanding of the other tasks and their challenges.

### Text classification

The goal of text classification is to assign a category label to a text or a document based on its content. This task is used, for example, in spam filtering (is this email spam or not?) and sentiment analysis (is this text positive or negative?).
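
To make the spam example concrete, here is a minimal sketch of a naive Bayes classifier with add-one smoothing, written with the standard library only. The training sentences, the labels `spam`/`ham`, and the helper names are made up for illustration; a real project would use far more data and most likely an established library.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def train(examples):
    """Count words per label and documents per label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    return word_counts, label_counts


def classify(text, word_counts, label_counts):
    """Return the label with the highest log-probability (add-one smoothing)."""
    vocabulary = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        total_words = sum(word_counts[label].values())
        # Start from the log prior of the label, then add one term per word.
        score = math.log(label_counts[label] / total_docs)
        for word in tokenize(text):
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        scores[label] = score
    return max(scores, key=scores.get)


# Made-up training data, for illustration only.
training_data = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for tomorrow", "ham"),
    ("lunch with the project team tomorrow", "ham"),
]
model = train(training_data)
print(classify("free prize money", *model))             # spam
print(classify("agenda for the team meeting", *model))  # ham
```

The same structure carries over to sentiment analysis: only the labels and the training texts change.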

### Information extraction

This term refers to a collection of techniques for extracting relevant pieces of information from a text or a document and finding relationships between them. This task is useful for discovering cause-effect links and populating databases. Examples include finding and classifying relations among entities mentioned in a text (e.g., X is the child of Y) and geospatial relations (e.g., Amsterdam is north of Brussels).

### Named Entity Recognition (NER)

The task of detecting mentions of entities such as names, dates, language names, events, works of art, organisations, and many more.
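
To illustrate what the output of this task looks like, here is a deliberately simple sketch that tags entities by looking them up in a hand-made dictionary (a gazetteer). This is not the machine learning approach the lesson builds towards; the `GAZETTEER` entries and the `find_entities` helper are hypothetical, and a lookup like this cannot recognise names it has never seen.

```python
import re

# Hypothetical hand-made lookup table: surface form -> entity category.
GAZETTEER = {
    "Sherlock Holmes": "PERSON",
    "Watson": "PERSON",
    "Baker Street": "LOCATION",
    "London": "LOCATION",
}


def find_entities(text):
    """Return (entity text, category, start offset) for every gazetteer match."""
    # Try longer names first so a full name wins over any shorter overlap.
    names = sorted(GAZETTEER, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    return [
        (m.group(1), GAZETTEER[m.group(1)], m.start())
        for m in pattern.finditer(text)
    ]


sentence = "Sherlock Holmes and Watson left Baker Street for the north of London."
for entity, category, offset in find_entities(sentence):
    print(f"{offset:>3}  {category:<8}  {entity}")
```

Each match carries the text span, its category, and its position, which is the same kind of information a trained NER model returns.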

### Next word prediction

This task involves predicting what the next word in a sentence will be based on the history of previous words.
Speech recognition, spelling correction, and handwriting recognition all rely on some form of this task.
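
A minimal sketch of the idea is a bigram model, which predicts the next word from counts of which word followed which in some training text. Note the simplification: it looks at only the single previous word rather than the full history. The toy corpus and the function names are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Map each word to a Counter of the words that follow it."""
    words = text.lower().split()
    followers = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        followers[current][following] += 1
    return followers


def predict_next(followers, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    counts = followers.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Toy corpus, for illustration only.
corpus = "the dog barked . the dog slept . the cat slept ."
model = train_bigrams(corpus)
print(predict_next(model, "the"))     # dog
print(predict_next(model, "holmes"))  # None
```

Returning `None` for unseen words makes the main weakness of pure counting visible: the model has nothing to say about words outside its training text.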

### Text summarization

The task of creating short summaries of longer documents while retaining the core content.

### Question answering

The task of building a system that answers questions posed in natural (i.e., human) language.

### Topic modeling

The task of discovering topical structure in documents.

### Machine translation

The task of translating a piece of text from one language to another.
19 changes: 9 additions & 10 deletions config.yaml
@@ -18,7 +18,7 @@ created: 2023-11-16 # FIXME

# Comma-separated list of keywords for the lesson
keywords: 'software, data, lesson, The Carpentries, NLP, English,
social sciences, pre-alpha' #
social sciences, pre-alpha' #

# Life cycle stage of the lesson
# possible values: pre-alpha, alpha, beta, stable
@@ -28,13 +28,13 @@ life_cycle: 'pre-alpha' # FIXME
license: 'CC-BY 4.0'

# Link to the source repository for this lesson
source: 'https://github.com/carpentries/workbench-template-md' # FIXME
source: 'https://github.com/esciencecenter-digital-skills/Natural-language-processing'

# Default branch of your lesson
branch: 'main'

# Who to contact if there are any issues
contact: 'l . ootes at esciencecenter.nl' # FIXME
contact: 'l . ootes at esciencecenter.nl'

# Navigation ------------------------------------------------
#
@@ -59,19 +59,18 @@ contact: 'l . ootes at esciencecenter.nl' # FIXME
# - another-learner.md

# Order of episodes in your lesson
episodes:
- introduction.md
- episode01.md
- episode02.md
episodes:
- 01-welcome.md
- 02-introduction.md

# Information for Learners
learners:
learners:

# Information for Instructors
instructors:
instructors:

# Learner Profiles
profiles:
profiles:

# Customisation ---------------------------------------------
#
57 changes: 0 additions & 57 deletions episode01.md

This file was deleted.

43 changes: 0 additions & 43 deletions episode02.md

This file was deleted.

2 changes: 1 addition & 1 deletion index.md
@@ -13,4 +13,4 @@ Before joining this course, participants should have:

- Fundamentals of NLP: Introduce terminology, basic concepts and possible applications of NLP.
- Data acquisition and Pre-processing: Preprocessing techniques such as tokenization, stemming, lemmatization, and removing stop words
- Text analysis and feature extraction: Extract features from text, including TF-IDF and word embeddings
- Text analysis and feature extraction: Extract features from text, including TF-IDF and word embeddings