Web information extraction and retrieval
The main objective of this course is to learn about information retrieval and search over large corpora of documents. An important example of such a corpus is the World Wide Web, with its billions of documents. We will learn techniques and approaches for web crawling, information retrieval, indexing, searching the hidden web, and information extraction. Students who pass the course should be able to develop programs for automatic web search and structured data extraction from web pages (including search and extraction from online social media).
Topics
- Web crawling, Basic architecture, Breadth-first crawlers, Preferential crawlers, Implementation details
- Universal crawlers, Apache Nutch, Focused crawlers, Topical crawlers
- Adaptive and intelligent crawlers, Crawler quality evaluation, Crawler Ethics and Conflicts
- Basic concepts of IR, IR models
- Relevance feedback, Evaluation measures, Text and web page pre-processing
- Inverted index and compression, Latent semantic indexing
- Web search, Meta-search, Web spamming
- Wrapper Induction, Instance-Based Wrapper Learning
- Automatic Wrapper Generation
- String Matching and Tree Matching
- Multiple Alignment, Partial Tree Alignment
- Building DOM Trees, Extraction Based on a Single List Page or Multiple Pages
- Web data extraction (hands-on examples)
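
To give a flavour of the hands-on extraction topics, the following is a minimal sketch of pulling data records out of a single list page. It is an illustration only, not course material: the class `ListItemExtractor`, the function `extract_list_items`, and the sample page are hypothetical names and data invented for this example, and it assumes that every record sits in its own `<li>` element. It uses only Python's standard `html.parser` module.

```python
from html.parser import HTMLParser


class ListItemExtractor(HTMLParser):
    """Collects the visible text of every top-level <li> element."""

    def __init__(self):
        super().__init__()
        self.items = []     # finished records, one string per <li>
        self._depth = 0     # greater than 0 while inside an <li>
        self._buffer = []   # text fragments of the record being read

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._depth += 1
            if self._depth == 1:
                self._buffer = []   # a new top-level record starts here

    def handle_endtag(self, tag):
        if tag == "li" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Join the fragments and collapse runs of whitespace.
                text = " ".join("".join(self._buffer).split())
                if text:
                    self.items.append(text)

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


def extract_list_items(html):
    """Return the text of each top-level <li> in the given HTML string."""
    parser = ListItemExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


# Hypothetical list page, assumed for illustration only.
SAMPLE_PAGE = """
<html><body>
<ul>
  <li><b>Camera A</b> <span>120 EUR</span></li>
  <li><b>Camera B</b> <span>95 EUR</span></li>
</ul>
</body></html>
"""

print(extract_list_items(SAMPLE_PAGE))
# ['Camera A 120 EUR', 'Camera B 95 EUR']
```

The sketch hard-codes the record boundary (`<li>`) and folds any nested list into its enclosing record. The wrapper induction and tree alignment topics above concern discovering such boundaries and the fields inside them automatically, which this example does not attempt.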