GitHub - https://github.com/bbengfort
- NLP as a subset of AI
- NLTK is one of many NLP library suites available
- great because it's open source (not a black box) and the source can easily be browsed (unlike many Java jars)
- Quick Python review
- Google has been successful because it has had a huge training set built from people clicking on the right 'answer'
- what is required?
- Domain knowledge
- A corpus in the domain
- the NLP pipeline: http://bengfort.com/presentations/discourses-in-language-processing/img/skyview/pipeline.png
- today, we are ignoring the first two columns
- morphology: the study of the forms of things, words in particular:
- orthographic rules: puppy -> puppies
- morphological rules: goose -> geese or fish
- parsing tasks: stemming (or lemmatization) and tokenization
- tokens = symbols of language
- words = tokens with meaning
- stem = the root form of a word, roughly what you would look up in the dictionary
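- a minimal sketch of these parsing tasks in NLTK, assuming the `punkt` and `wordnet` data packages have been downloaded:

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Tokenization: split raw text into tokens
tokens = nltk.word_tokenize("The geese chased two puppies.")

# Stemming: crude affix stripping; the result need not be a real word
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])
# 'puppies' -> 'puppi', 'chased' -> 'chase'

# Lemmatization: maps to the dictionary (lookup) form
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(t) for t in tokens])
# 'geese' -> 'goose', 'puppies' -> 'puppy' (default noun POS)
```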
- syntax = the study of the rules for formation of sentences
- a sentence diagram: http://bengfort.com/presentations/discourses-in-language-processing/img/skyview/parse.png
- hierarchical and ordered
- S = sentence, NP = noun phrase, VP = verb phrase
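- such a parse can be built by hand with NLTK's Tree class; a minimal sketch, using the NLTK 3 `Tree.fromstring` spelling:

```python
from nltk import Tree

# A hand-built parse: hierarchical (phrases contain phrases) and ordered
t = Tree.fromstring('(S (NP (DT the) (NN dog)) (VP (VBD saw) (NP (DT a) (NN cat))))')
t.pretty_print()
print(t.label())     # 'S'
print(t[1].label())  # 'VP'
```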
- semantics = the study of meaning
- Leveraging NLTK (https://github.com/nltk/nltk)
- "NLP is perfect for MapReduce" (Hadoop)
- major packages:
- Utilities:
- probability (frequency and conditional frequency distributions; see the sketch after this list)
- text, data, grammar, and corpus
- tree (An impressive tree data structure and subclasses)
- draw (Visualizations in Tkinter)
- Language Processing
- tokenize, stem (Morphological Processing, Segmentation)
- collocations, models (NGram Analysis)
- tag, chunk (Tagging and named entity Recognition)
- parse (Syntactic Parsing)
- sem (Semantic Analyses)
- more: classification, clustering
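- a hedged sketch of the probability utilities over the tagged Brown corpus (assumes `nltk.download('brown')`):

```python
from nltk import FreqDist, ConditionalFreqDist
from nltk.corpus import brown

# Frequency distribution: raw counts of word forms in the news category
fd = FreqDist(w.lower() for w in brown.words(categories='news'))
print(fd.most_common(5))

# Conditional frequency distribution: word counts conditioned on POS tag
cfd = ConditionalFreqDist(
    (tag, word.lower())
    for word, tag in brown.tagged_words(categories='news'))
print(cfd['NN'].most_common(5))  # the most common singular nouns
```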
- Utilities:
- built-in corpora of NLTK (all can be found in `nltk.corpus`):
    - `gutenberg`: Small selection of literature
    - `shakespeare`: For Elizabethan comparisons
    - `webtext`: Forums, personal ads, a movie script, reviews
    - `nps_chat`: Chat text collected by the Naval Postgraduate School
    - `brown`: First million-word electronic corpus
    - `reuters`: News articles (1.3 million words)
    - `inaugural`: Inaugural addresses
    - `switchboard`: Transcribed phone conversations
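- all of these share the common corpus reader API; a minimal sketch on `gutenberg` (assumes `nltk.download('gutenberg')`; file ids shown are examples):

```python
from nltk.corpus import gutenberg

print(gutenberg.fileids()[:2])  # e.g. ['austen-emma.txt', 'austen-persuasion.txt']
emma = gutenberg.words('austen-emma.txt')
print(len(emma), emma[:6])
print(len(gutenberg.sents('austen-emma.txt')))
```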
- organizing your own corpus
```
Corpus Root
|-- README
|-- categories.txt
|-- texta.txt
+-- textb.txt
```
- Overrides for walking directory structures
- ZipFile Corpus Readers for compression
- accessing a corpus
```python
from nltk.corpus import PlaintextCorpusReader

root = '/home/student/Corpora/books/'
corpus = PlaintextCorpusReader(root, '.*')

corpus.words()
# ['The', 'Project', 'Gutenberg', 'EBook', 'of', 'An', ...]

print(corpus.readme())
# "The corpus of categorized books for the NLTK Workshop"
```
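- given the categories.txt mapping from the layout above, a categorized reader is a small step further; a hedged sketch (the file pattern and category file names are assumptions about the workshop corpus):

```python
from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# cat_file points at the categories.txt mapping file ids to categories
corpus = CategorizedPlaintextCorpusReader(
    '/home/student/Corpora/books/', r'text.*\.txt',
    cat_file='categories.txt')

print(corpus.categories())
print(corpus.fileids(categories=corpus.categories()[0]))
```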
- beyond plaintext
- Tagged and Categorized Corpora
- HTML: `nltk.clean_html` or BeautifulSoup
    - protip: use Instapaper and `clean_html` together so you don't have to reinvent body parsers for websites
- RSS: the `feedparser` library
- PDF: `pypdf`
- MS Word: `pywin32`
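- for the HTML case, a minimal sketch of the BeautifulSoup route (the URL is a placeholder):

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Fetch a page and strip the markup down to its text content
html = urlopen('http://example.com/article.html').read()
text = BeautifulSoup(html, 'html.parser').get_text()
print(text[:200])
```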
- segmentation = splitting raw text into sentences (or other segments)
- is simply splitting on punctuation enough?
- enter punkt sentence tokenizer
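- a minimal sketch: `sent_tokenize` wraps the pretrained Punkt model (assumes `nltk.download('punkt')`):

```python
from nltk.tokenize import sent_tokenize

text = "Mr. Brown paid $4.50 for lunch. He left at 2 p.m. to catch a train."
print(sent_tokenize(text))
# Naively splitting on '.' would break on 'Mr.', '$4.50', and 'p.m.'
```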
- word tokenization
- splitting on whitespace is usually not enough
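- a sketch of the difference: with `word_tokenize` (which also assumes the punkt data), contractions and punctuation get their own tokens:

```python
from nltk.tokenize import word_tokenize

sentence = "They'll visit New York, won't they?"
print(sentence.split())
# ["They'll", 'visit', 'New', 'York,', "won't", 'they?']
print(word_tokenize(sentence))
# ['They', "'ll", 'visit', 'New', 'York', ',', 'wo', "n't", 'they', '?']
```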
- n-grams = a contiguous sequence of N items from a sequence of text
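- a minimal sketch with `nltk.util.ngrams`:

```python
from nltk.util import ngrams

tokens = ['a', 'contiguous', 'sequence', 'of', 'items']
print(list(ngrams(tokens, 2)))   # bigrams
# [('a', 'contiguous'), ('contiguous', 'sequence'), ('sequence', 'of'), ('of', 'items')]
```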
- things to look into
- what other projects like http://overview.ap.org/ exist?
- http://www.antlr.org/ - ANother Tool for Language Recognition
- NLTK being "Production ready analytics with Numpy or Pandas"
- hadoop streaming w/ NLTK
- NLP framework for Ruby - https://github.com/louismullie/treat
- health-specific! http://idash.ucsd.edu/nlp-and-data-modeling