Web information extraction and retrieval
The main objective of this course is to learn about information retrieval and search in large document corpora. An important example of such a corpus is the World Wide Web, with its billions of documents. We will learn techniques and approaches for web crawling, information retrieval, indexing, searching the hidden web, and information extraction, among other topics. Students who pass the course should be able to develop programs for automatic web search and structured data extraction from web pages (including search and extraction from online social media).
Web crawling, Basic architecture, Breadth-first crawlers, Preferential crawlers, Implementation details
Universal crawlers, Apache Nutch, Focused crawlers, Topical crawlers
Adaptive and intelligent crawlers, Crawler quality evaluation, Crawler Ethics and Conflicts
Basic concepts of IR, IR models
Relevance feedback, Evaluation measures, Text and web page pre-processing
Inverted index and compression, Latent semantic indexing
Web search, Meta-search, Web spamming
Wrapper Induction, Instance-Based Wrapper Learning
Automatic Wrapper Generation
String Matching and Tree Matching
Multiple Alignment, Partial Tree Alignment
Building DOM Trees, Extraction Based on a Single List Page or Multiple Pages
Web data extraction (hands-on examples)