From dc2e2a27848010cd02a262b2524a63dbebf8c63b Mon Sep 17 00:00:00 2001
From: GitHub Actions
Date: Wed, 1 May 2024 13:39:36 +0000
Subject: [PATCH] differences for PR #8

---
 01-welcome.md      |  58 +++++++++++++++++++++++
 02-introduction.md |  97 ++++++++++++++++++++++++++++++++++++++
 config.yaml        |  19 ++++----
 episode01.md       |  57 -----------------------
 episode02.md       |  43 -----------------
 index.md           |   2 +-
 introduction.md    | 114 ---------------------------------------------
 md5sum.txt         |   9 ++--
 8 files changed, 169 insertions(+), 230 deletions(-)
 create mode 100644 01-welcome.md
 create mode 100644 02-introduction.md
 delete mode 100644 episode01.md
 delete mode 100644 episode02.md
 delete mode 100644 introduction.md

diff --git a/01-welcome.md b/01-welcome.md
new file mode 100644
index 00000000..0eb1ddce
--- /dev/null
+++ b/01-welcome.md
@@ -0,0 +1,58 @@
+---
+title: "Welcome"
+teaching: 5
+exercises: 0
+---
+
+:::::: questions
+- Who is this lesson for?
+- What will be covered in this lesson?
+::::::
+
+:::::: objectives
+- Identify the target audience
+- Identify the learning goals of the lesson
+::::::
+
+# Welcome
+This is a hands-on introduction to Natural Language Processing (or NLP). NLP refers to a set of techniques that apply statistical methods,
+with or without insights from linguistics, to understand natural (i.e., human) language for the sake of solving real-world tasks.
+
+This course is designed to equip researchers in the humanities and social sciences with the foundational
+skills needed to carry out text-based research projects.
+
+## What will we be covering in this lesson?
+
+This lesson provides a high-level introduction to NLP, with particular emphasis on applications in the humanities and the social
+sciences. Throughout the lesson we will focus on solving one particular problem: how to identify key information in text (such as people,
+places, companies, and dates) and label each item with the right category name. Towards the end of the lesson,
+we will also cover other types of applications (such as topic modelling and text generation).
+
+After following this lesson, learners will be able to:
+
+- Explain and differentiate the core topics in NLP
+- Identify the kinds of tasks NLP techniques excel at, and their limitations
+- Structure a typical NLP pipeline
+- Extract vector representations of individual words, and visualise and manipulate them
+- Apply a machine learning algorithm to textual data to extract and categorise names of entities (e.g., places, people)
+- Apply popular tools and libraries used to solve other tasks in NLP (such as topic modelling and text generation)
+
+## Software packages required
+The lesson is coded entirely in Python. We are going to use Jupyter notebooks throughout the lesson and the following packages:
+
+- spacy
+- gensim
+- transformers
+
+## Dataset
+In this lesson, we'll use the following books from [Project Gutenberg](https://www.gutenberg.org/), in their Plain Text UTF-8 versions:
+
+- The Adventures of Sherlock Holmes by Arthur Conan Doyle - [Full text](https://www.gutenberg.org/cache/epub/1661/pg1661.txt) - [Wikipedia](https://en.wikipedia.org/wiki/The_Adventures_of_Sherlock_Holmes)
+- The Count of Monte Cristo by Alexandre Dumas - [Full text](https://www.gutenberg.org/cache/epub/1184/pg1184.txt) - [Wikipedia](https://en.wikipedia.org/wiki/The_Count_of_Monte_Cristo)
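+
+If you would like to fetch a copy of a book yourself, here is a minimal sketch using only the Python standard library (the local file name is just an example choice):
+
+```python
+from urllib.request import urlretrieve
+
+# Plain Text UTF-8 version of The Adventures of Sherlock Holmes on Project Gutenberg
+url = "https://www.gutenberg.org/cache/epub/1661/pg1661.txt"
+
+# Download the file and save it locally (the file name is an arbitrary choice)
+urlretrieve(url, "sherlock_holmes.txt")
+
+# Print the first few hundred characters to check that the download worked
+with open("sherlock_holmes.txt", encoding="utf-8") as f:
+    print(f.read(500))
+```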
+
+
+:::::: keypoints
+- This lesson on natural language processing in Python is for researchers working in the humanities and/or the social sciences
+- This lesson is an introduction to NLP and aims to implement a first set of practical NLP applications from scratch
+::::::
+
diff --git a/02-introduction.md b/02-introduction.md
new file mode 100644
index 00000000..ed859150
--- /dev/null
+++ b/02-introduction.md
@@ -0,0 +1,97 @@
+---
+title: "Episode 1"
+teaching: 10
+exercises: 2
+---
+
+:::::: questions
+- What is natural language processing (NLP)?
+- Why is it important to learn about NLP?
+- What are some classic tasks associated with NLP?
+::::::
+
+:::::: objectives
+- Recognise the importance and benefits of learning about NLP
+- Identify and describe classic tasks and challenges in NLP
+- Explore practical applications of natural language processing in industry and research
+::::::
+
+## Introducing NLP
+
+### What is NLP?
+Natural language processing (NLP) is an area of research and application that focuses on making natural (i.e., human) language accessible to computers so that they can use it to perform useful tasks (Chowdhury & Chowdhury, 2023). Research in NLP is highly interdisciplinary, drawing on concepts from computer science, linguistics, logic, mathematics, psychology, etc. In the past decade, NLP has evolved significantly with advances in technology, to the point that it has become embedded in our daily lives: automatic language translation and ChatGPT are just two examples. This evolution has broadened its applications and deepened its interaction with fields such as artificial intelligence and machine learning, reaching practically every other research field.
+
+### Why do we care?
+The past decade's breakthroughs have resulted in NLP being used in an increasingly diverse range of domains such as retail (e.g., customer service chatbots), healthcare (e.g., AI-assisted hearing devices), finance (e.g., anomaly detection in monetary transactions), law (e.g., legal research), and many more. These applications are possible because NLP researchers have developed (and keep developing) tools and techniques that allow computers to understand and manipulate language effectively.
+
+With so many contributions and such impressive advances in recent years, it is an exciting time to start bringing NLP techniques into your own work. Thanks to dedicated Python libraries, these tools are accessible without being an advanced Python programmer: they offer modularity (you can build upon them in your own code) and scalability (they can process vast amounts of text efficiently). Whether dealing with text or audio, NLP tools provide a means to handle and interpret language data to meet specific needs and objectives. Even those without advanced programming skills can leverage them to address problems in the social sciences, the humanities, or any field where language plays a crucial role.
+In a nutshell, NLP opens up possibilities, making sophisticated techniques accessible to a broad audience.
+
+:::::::::::: challenge
+## NLP in the real world
+
+Name three to five products that you use on a daily basis and that rely on NLP techniques. To solve this exercise you can get
+some help from the web.
+
+
+:::::: solution
+These are some of the most popular NLP-based products that we use on a daily basis:
+
+- Voice-based assistants (e.g., Alexa, Siri, Cortana)
+- Machine translation (e.g., Google Translate, Amazon Translate)
+- Search engines (e.g., Google, Bing, DuckDuckGo)
+- Keyboard autocompletion on smartphones
+- Spam filtering
+- Spell and grammar checking apps
+::::::
+::::::::::::
+
+## What is NLP typically good at?
+
+Here's a collection of fundamental tasks in NLP:
+
+- Text classification
+- Information extraction
+- Named entity recognition (NER)
+- Next word prediction
+- Text summarization
+- Question answering
+- Topic modeling
+- Machine translation
+- Conversational agents
+
+In this lesson we are going to look at the NER and topic modeling tasks in detail, and learn how to develop solutions that work for these particular use cases. Specifically, our goal will be to identify characters and locations in novels, and to determine the most relevant topics in these books. However, it is useful to have an understanding of the other tasks and their challenges.
+
+### Text classification
+
+The goal of text classification is to assign a category label to a text or document based on its content. This task is used, for example, in spam filtering (is this email spam or not?) and sentiment analysis (is this text positive or negative?).
+
+### Information extraction
+
+With this term we refer to a collection of techniques for extracting relevant information from a text or document and finding relationships between the extracted pieces. This task is useful for discovering cause-effect links and for populating databases, for instance by finding and classifying relations among entities mentioned in a text (e.g., X is the child of Y) or geospatial relations (e.g., Amsterdam is north of Brussels).
+
+### Named Entity Recognition (NER)
+
+The task of detecting and classifying mentions of entities in text, such as names of people, dates, languages, events, works of art, organisations, and more.
+
+### Next word prediction
+
+This task involves predicting what the next word in a sentence will be, based on the history of previous words.
+Speech recognition, spelling correction, and handwriting recognition all rely on an implementation of this task.
+
+### Text summarization
+
+The task of creating short summaries of longer documents while retaining the core content.
+
+### Question answering
+
+The task of building a system that answers questions posed in natural (i.e., human) language.
+
+### Topic modeling
+
+The task of discovering the topical structure of a collection of documents.
+
+### Machine translation
+
+The task of translating a piece of text from one language to another.
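+
+To give a concrete flavour of the NER task we will build towards, here is a minimal sketch of what entity extraction looks like with spaCy. This assumes the small English model `en_core_web_sm` has been installed (e.g., with `python -m spacy download en_core_web_sm`); the example sentence is our own.
+
+```python
+import spacy
+
+# Load spaCy's small English pipeline, which ships with a pre-trained NER component
+nlp = spacy.load("en_core_web_sm")
+
+doc = nlp("Sherlock Holmes and Dr. Watson lived at 221B Baker Street in London.")
+
+# Print each detected entity together with its predicted label (e.g., PERSON, GPE)
+for ent in doc.ents:
+    print(ent.text, ent.label_)
+```
+
+The exact entities and labels depend on the model version, but this kind of output (people, places, organisations) is what the rest of the lesson will teach you to produce, inspect, and refine on the novels introduced in the previous episode.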
diff --git a/config.yaml b/config.yaml index 9575d2bd..64235d07 100644 --- a/config.yaml +++ b/config.yaml @@ -18,7 +18,7 @@ created: 2023-11-16 # FIXME # Comma-separated list of keywords for the lesson keywords: 'software, data, lesson, The Carpentries, NLP, English, -social sciences, pre-alpha' # +social sciences, pre-alpha' # # Life cycle stage of the lesson # possible values: pre-alpha, alpha, beta, stable @@ -28,13 +28,13 @@ life_cycle: 'pre-alpha' # FIXME license: 'CC-BY 4.0' # Link to the source repository for this lesson -source: 'https://github.com/carpentries/workbench-template-md' # FIXME +source: 'https://github.com/esciencecenter-digital-skills/Natural-language-processing' # Default branch of your lesson branch: 'main' # Who to contact if there are any issues -contact: 'l . ootes at esciencecenter.nl' # FIXME +contact: 'l . ootes at esciencecenter.nl' # Navigation ------------------------------------------------ # @@ -59,19 +59,18 @@ contact: 'l . ootes at esciencecenter.nl' # FIXME # - another-learner.md # Order of episodes in your lesson -episodes: -- introduction.md -- episode01.md -- episode02.md +episodes: +- 01-welcome.md +- 02-introduction.md # Information for Learners -learners: +learners: # Information for Instructors -instructors: +instructors: # Learner Profiles -profiles: +profiles: # Customisation --------------------------------------------- # diff --git a/episode01.md b/episode01.md deleted file mode 100644 index 2d9e3f11..00000000 --- a/episode01.md +++ /dev/null @@ -1,57 +0,0 @@ ---- -title: "Episode 1: Apply tokenization, clean and pre-process textual data" -teaching: 0 -exercises: 4 ---- - -:::::::::::::::::::::::::::::::::::::: questions - -- What different types of preprocessing steps are there? - -:::::::::::::::::::::::::::::::::::::::::::::::: - -::::::::::::::::::::::::::::::::::::: objectives - -- Apply tokenization, lower-casing and stopwords removal - -:::::::::::::::::::::::::::::::::::::::::::::::: - -::: challenge - -Before starting this exercise, a few packages have to be imported. To do this, execute the following: - -Import and download the following packages: -```python -from nltk.tokenize import word_tokenize -import nltk -from nltk.corpus import stopwords -nltk.download('stopwords') -``` - -In this exercise we will do some preprocessing on the text: -"Natural language processing (NLP) is an interdisciplinary subfield of computer science and linguistics." - -- As a first step; apply lower casing on the given text. - -:::::: solution - -Then, lower case the text: - -```python -text = "Natural language processing (NLP) is an interdisciplinary subfield of computer science and linguistics." -text.lower() -``` -:::::: -::: - -::: challenge -A second step in preprocessing the text, apply tokenisation on the lower-cased text. -If you do not have the lower-cased text available, you can use the input text. - -:::::: solution - -```python -words = word_tokenize(text) -``` -:::::: -::: \ No newline at end of file diff --git a/episode02.md b/episode02.md deleted file mode 100644 index 9d6e7768..00000000 --- a/episode02.md +++ /dev/null @@ -1,43 +0,0 @@ ---- -title: "Episode 2" -teaching: 10 -exercises: 2 ---- - -:::::::::::::::::::::::::::::::::::::: questions - -- How to I preprocess my text data? -- What is a vector space? 
- -:::::::::::::::::::::::::::::::::::::::::::::::: - -::::::::::::::::::::::::::::::::::::: objectives - -- Explain how to use markdown with The Carpentries Workbench -- Demonstrate how to include pieces of code, figures, and nested challenge blocks - -:::::::::::::::::::::::::::::::::::::::::::::::: - -> ## Learning Objectives - - - - - - - - -from doc: -> After following this lesson, learners will be able to: -> - Apply tokenization and lemmatization techniques on a specific test case -> - Clean and pre-process textual data (lower-case text, remove stop-words) -> - Recall what different preprocessing steps there are -> - Explain what a vector space is -> - Explain what the cosinee similarity is and compute it -> - Plot word embeddings -> - Explain what document embedding and TF-IDF is -> - Explain how a word2vec model works -> - Train a word2vec model -> - Explain the difference between GPT and BERT -> - use GPT2Tokenizer from the library transformers -> - Explain the difference between word token embeddings vs word position embeddings diff --git a/index.md b/index.md index f02f7e91..ccb272a1 100644 --- a/index.md +++ b/index.md @@ -13,4 +13,4 @@ Before joining this course, participants should have: - Fundamentals of NLP: Introduce terminology, basic concepts and possible applications of NLP. - Data acquisition and Pre-processing: Preprocessing techniques such as tokenization, stemming, lemmatization, and removing stop words -- Text analysis and feature extraction: Extract features from text, including TF-IDF and word embeddings \ No newline at end of file +- Text analysis and feature extraction: Extract features from text, including TF-IDF and word embeddings diff --git a/introduction.md b/introduction.md deleted file mode 100644 index 7065d231..00000000 --- a/introduction.md +++ /dev/null @@ -1,114 +0,0 @@ ---- -title: "Using Markdown" -teaching: 10 -exercises: 2 ---- - -:::::::::::::::::::::::::::::::::::::: questions - -- How do you write a lesson using Markdown and `{sandpaper}`? - -:::::::::::::::::::::::::::::::::::::::::::::::: - -::::::::::::::::::::::::::::::::::::: objectives - -- Explain how to use markdown with The Carpentries Workbench -- Demonstrate how to include pieces of code, figures, and nested challenge blocks - -:::::::::::::::::::::::::::::::::::::::::::::::: - -## Introduction - -This is a lesson created via The Carpentries Workbench. It is written in -[Pandoc-flavored Markdown](https://pandoc.org/MANUAL.txt) for static files and -[R Markdown][r-markdown] for dynamic files that can render code into output. -Please refer to the [Introduction to The Carpentries -Workbench](https://carpentries.github.io/sandpaper-docs/) for full documentation. - -What you need to know is that there are three sections required for a valid -Carpentries lesson: - - 1. `questions` are displayed at the beginning of the episode to prime the - learner for the content. - 2. `objectives` are the learning objectives for an episode displayed with - the questions. - 3. `keypoints` are displayed at the end of the episode to reinforce the - objectives. - -:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: instructor - -Inline instructor notes can help inform instructors of timing challenges -associated with the lessons. They appear in the "Instructor View" - -:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: - -::::::::::::::::::::::::::::::::::::: challenge - -## Challenge 1: Can you do it? - -What is the output of this command? 
- -```r -paste("This", "new", "lesson", "looks", "good") -``` - -:::::::::::::::::::::::: solution - -## Output - -```output -[1] "This new lesson looks good" -``` - -::::::::::::::::::::::::::::::::: - - -## Challenge 2: how do you nest solutions within challenge blocks? - -:::::::::::::::::::::::: solution - -You can add a line with at least three colons and a `solution` tag. - -::::::::::::::::::::::::::::::::: -:::::::::::::::::::::::::::::::::::::::::::::::: - -## Figures - -You can use standard markdown for static figures with the following syntax: - -`![optional caption that appears below the figure](figure url){alt='alt text for -accessibility purposes'}` - -![You belong in The Carpentries!](https://raw.githubusercontent.com/carpentries/logo/master/Badge_Carpentries.svg){alt='Blue Carpentries hex person logo with no text.'} - -::::::::::::::::::::::::::::::::::::: callout - -Callout sections can highlight information. - -They are sometimes used to emphasise particularly important points -but are also used in some lessons to present "asides": -content that is not central to the narrative of the lesson, -e.g. by providing the answer to a commonly-asked question. - -:::::::::::::::::::::::::::::::::::::::::::::::: - - -## Math - -One of our episodes contains $\LaTeX$ equations when describing how to create -dynamic reports with {knitr}, so we now use mathjax to describe this: - -`$\alpha = \dfrac{1}{(1 - \beta)^2}$` becomes: $\alpha = \dfrac{1}{(1 - \beta)^2}$ - -Cool, right? - -::::::::::::::::::::::::::::::::::::: keypoints - -- Use `.md` files for episodes when you want static content -- Use `.Rmd` files for episodes when you need to generate output -- Run `sandpaper::check_lesson()` to identify any issues with your lesson -- Run `sandpaper::build_lesson()` to preview your lesson locally - -:::::::::::::::::::::::::::::::::::::::::::::::: - -[r-markdown]: https://rmarkdown.rstudio.com/ diff --git a/md5sum.txt b/md5sum.txt index 37d186fc..c6a04a8a 100644 --- a/md5sum.txt +++ b/md5sum.txt @@ -1,12 +1,11 @@ "file" "checksum" "built" "date" "CODE_OF_CONDUCT.md" "c93c83c630db2fe2462240bf72552548" "site/built/CODE_OF_CONDUCT.md" "2022-08-05" "LICENSE.md" "b24ebbb41b14ca25cf6b8216dda83e5f" "site/built/LICENSE.md" "2023-04-07" -"config.yaml" "d1f7c7f1ab6a1fff354bec6a6eb5006e" "site/built/config.yaml" "2023-11-16" -"index.md" "4a19341ab14d2faa6a43c57bcc39c16b" "site/built/index.md" "2023-11-16" +"config.yaml" "5add325162bfca643b538708d8f2dcb6" "site/built/config.yaml" "2024-05-01" +"index.md" "7cce00f7cd30382ee6c4812fe539c1de" "site/built/index.md" "2024-05-01" "links.md" "8184cf4149eafbf03ce8da8ff0778c14" "site/built/links.md" "2022-04-22" -"episodes/introduction.md" "6c55d31b41d322729fb3276f8d4371fc" "site/built/introduction.md" "2023-07-24" -"episodes/episode01.md" "97e029136e304c0497783517fc72d013" "site/built/episode01.md" "2023-11-28" -"episodes/episode02.md" "43ca8a61a3196b32ffb901d555b25388" "site/built/episode02.md" "2023-11-16" +"episodes/01-welcome.md" "f0ac201161feaa49017b7bded1817779" "site/built/01-welcome.md" "2024-05-01" +"episodes/02-introduction.md" "34dda3d102a843afc5237a4e64929bcd" "site/built/02-introduction.md" "2024-05-01" "instructors/instructor-notes.md" "cae72b6712578d74a49fea7513099f8c" "site/built/instructor-notes.md" "2023-03-16" "learners/reference.md" "1c7cc4e229304d9806a13f69ca1b8ba4" "site/built/reference.md" "2023-03-16" "learners/setup.md" "61568b36c8b96363218c9736f6aee03a" "site/built/setup.md" "2023-04-07"