Kien Pham edited this page Apr 1, 2015 · 13 revisions

What is inside the data output directory?

data_target contains relevant pages.

data_negative contains irrelevant pages. By default, the crawler does not save irrelevant pages.

data_monitor contains the current status of the crawler.

data_url and data_backlinks are where the persistent storage keeps information about the frontier and the crawled link graph.
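To get a quick overview of what a crawl has saved so far, you can count the files in the page folders. The following is a minimal sketch, not part of ACHE: the function name `count_files` is hypothetical, and it assumes only that pages are stored as regular files somewhere under data_target and data_negative (the exact internal layout is not specified here, so it walks the folders recursively).

```python
from pathlib import Path


def count_files(output_dir, subdirs=("data_target", "data_negative")):
    """Return {subdir: number of regular files found under it}."""
    root = Path(output_dir)
    counts = {}
    for name in subdirs:
        folder = root / name
        if folder.is_dir():
            # rglob("*") also visits nested folders; only regular files count.
            counts[name] = sum(1 for p in folder.rglob("*") if p.is_file())
        else:
            # A missing folder (e.g. data_negative when saving is off) counts as 0.
            counts[name] = 0
    return counts
```

Calling `count_files("path/to/output")` returns a dictionary with one count per folder.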

When to stop the crawler?

Unless you stop it, the crawler exits when the number of visited pages exceeds the limit in the settings, which is 9M by default. To decide whether to stop the crawler earlier, check the file data_monitor/harvestinfo.csv to see how many pages have been downloaded. Its 1st, 2nd, and 3rd columns are the number of relevant pages, the number of visited pages, and the timestamp.
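Reading the most recent row of that file can be scripted. The sketch below is not part of ACHE: the function name `latest_harvest_row` is hypothetical, and it assumes the file is comma-separated with no header row. The timestamp is returned as a plain string because its format is not described here.

```python
import csv


def latest_harvest_row(path):
    """Return (relevant, visited, timestamp) from the last complete row, or None."""
    last = None
    with open(path, newline="") as f:
        for row in csv.reader(f):
            # Skip blank or partially written lines.
            if len(row) >= 3:
                last = row
    if last is None:
        return None
    # Column order as described above: relevant pages, visited pages, timestamp.
    return int(last[0]), int(last[1]), last[2].strip()
```

Dividing the first value by the second gives the share of visited pages that were relevant, which can help you decide whether the crawl is still worth running.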

How could I limit the number of visited pages?

By default, the crawler exits when the number of visited pages reaches 9M. You can change this limit by setting the VISITED_PAGE_LIMIT key in the configuration file.

Where to report bugs?

We welcome reports of any issue related to ACHE. Here is a guideline for using GitHub's issue tracker: https://guides.github.com/features/issues/

What is the format for crawled data?

By default, the crawler stores crawled pages in HTML format without metadata. You can choose to store data in CBOR format instead by changing the value of the DATA_FORMAT key in the configuration file. We are going to support dumping data directly to Elasticsearch, so stay tuned!

How can I save irrelevant pages?

This is not enabled by default, so you need to turn it on by changing the value of the SAVE_NEGATIVE_PAGES key in the configuration file.

Does ACHE crawl webpages in languages other than English?

No. Any page that is not in English is considered irrelevant.

Is there a limit on the number of pages crawled per website?

Yes, we limit this number so that the crawler does not get trapped in particular domains. The default is 100, but you can change it with the MAX_PAGES_PER_DOMAIN key in the configuration file.
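To illustrate the idea behind a per-domain cap, here is a minimal sketch. It is not ACHE's implementation: the class `DomainLimiter` and its method `allow` are hypothetical names, and it assumes that "domain" means the host part of the URL.

```python
from collections import Counter
from urllib.parse import urlparse


class DomainLimiter:
    """Allow at most max_pages_per_domain URLs for each host."""

    def __init__(self, max_pages_per_domain=100):
        self.max_pages = max_pages_per_domain
        self.seen = Counter()

    def allow(self, url):
        # Assumption: the host part of the URL identifies the domain.
        domain = urlparse(url).netloc.lower()
        if self.seen[domain] >= self.max_pages:
            return False  # cap reached for this domain; skip the URL
        self.seen[domain] += 1
        return True
```

A crawler loop would call `allow(url)` before fetching each page and skip the URL when it returns `False`, so one very large site cannot use up the whole page budget.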