Web Information Extraction and Retrieval

The main objective of this course is to learn about information retrieval and search over large corpora of documents. An important example of such a corpus is the World Wide Web, with its billions of documents. We will learn techniques and approaches for web crawling, information retrieval, indexing, searching the hidden web, and information extraction. Students who pass the course should be able to develop programs for automatic web search and structured data extraction from web pages (including search and extraction from online social media).

Topics

Web crawling, Basic architecture, Breadth-first crawlers, Preferential crawlers, Implementation details
Universal crawlers, Apache Nutch, Focused crawlers, Topical crawlers
Adaptive and intelligent crawlers, Crawler quality evaluation, Crawler Ethics and Conflicts
Basic concepts of IR, IR models
Relevance feedback, Evaluation measures, Text and web page pre-processing
Inverted index and compression, Latent semantic indexing
Web search, Meta-search, Web spamming
Wrapper Induction, Instance-Based Wrapper Learning
Automatic Wrapper Generation
String Matching and Tree Matching
Multiple Alignment, Partial Tree Alignment
Building DOM Trees, Extraction Based on a Single List Page or Multiple Pages
Web data extraction (hands-on examples)
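
As a taste of the first topic, here is a minimal sketch of a breadth-first crawler's visiting order. It is not taken from the course materials: the link graph `TOY_WEB` and the names `breadth_first_crawl` and `toy_links` are hypothetical, and the "web" is an in-memory dictionary, so no pages are actually fetched.

```python
from collections import deque

# Hypothetical toy link graph standing in for real pages (no network access).
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "a"],
    "d": [],
}


def toy_links(url):
    """Return the outgoing links of a page in the toy graph."""
    return TOY_WEB.get(url, [])


def breadth_first_crawl(seed, get_links, max_pages=100):
    """Return pages in the order a breadth-first crawler would visit them."""
    frontier = deque([seed])  # FIFO queue of URLs waiting to be visited
    seen = {seed}             # URLs already queued, so none is visited twice
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


if __name__ == "__main__":
    print(breadth_first_crawl("a", toy_links))
```

A preferential crawler would follow the same loop but replace the FIFO queue with a priority queue ordered by some estimate of page importance or topical relevance.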



Programming assignment instructions
