Aécio Santos edited this page Mar 28, 2016 · 13 revisions

What is inside the data output directory?

With the default data format, the data output directory contains the following folders:

  • data_target contains the relevant pages.
  • data_negative contains the irrelevant pages. By default, the crawler does not save irrelevant pages.
  • data_monitor contains the current status of the crawler.
  • data_url and data_backlinks are where the persistent storage keeps data for the frontier and the crawl graph.

When to stop the crawler?

Unless you stop it, the crawler exits when the number of visited pages exceeds the limit in the configuration, which defaults to 9 million. You can look at the file data_monitor/harvestinfo.csv to see how many pages have been downloaded and decide whether you want to stop the crawler. The 1st, 2nd, and 3rd columns are the number of relevant pages, the number of visited pages, and a timestamp, respectively.
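For example, a quick way to read the latest line of that file and compute the harvest rate (the comma-separated layout below is assumed from the column description above; the exact format in your crawl may differ):

```python
import csv
import io

# Hypothetical sample content of data_monitor/harvestinfo.csv.
# Columns (per the description above): relevant pages, visited pages, timestamp.
sample = "95,120,1459166400000\n100,150,1459167000000\n"

# In a real crawl you would read the actual file instead:
# rows = list(csv.reader(open("data_monitor/harvestinfo.csv")))
rows = list(csv.reader(io.StringIO(sample)))

relevant, visited, timestamp = rows[-1]
harvest_rate = int(relevant) / int(visited)
print(f"relevant={relevant} visited={visited} harvest_rate={harvest_rate:.2f}")
```

If the harvest rate (relevant pages divided by visited pages) drops and stays low, that is usually a good signal to stop the crawl.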

How could I limit the number of visited pages?

The crawler exits when the number of visited pages reaches the configured limit. You can modify it by changing the target_storage.visited_page_limit key in the configuration file.
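As a sketch, in ache.yml this could look like the following (the value shown is an arbitrary example, not a recommendation):

```yaml
target_storage:
  # stop the crawl after this many visited pages (example value)
  visited_page_limit: 1000000
```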

What is the format for crawled data?

By default, the crawler stores crawled pages in HTML format without metadata. You can choose to store data in other formats, as described in the README, by changing the value of the target_storage.data_format.type key in the configuration file. Indexing web pages directly into Elasticsearch is also available; check out the ELASTICSEARCH data format.
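A minimal sketch of switching the data format in ache.yml (see the README for the full list of supported format names):

```yaml
target_storage:
  data_format:
    # e.g. index pages directly into Elasticsearch
    type: ELASTICSEARCH
```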

How can I save irrelevant pages?

This feature is disabled by default, so you need to turn it on by changing the value of the target_storage.store_negative_pages key in the configuration file.
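In ache.yml this could look like:

```yaml
target_storage:
  # also keep pages classified as irrelevant (saved under data_negative)
  store_negative_pages: true
```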

Does ACHE crawl webpages in languages other than English?

Yes. By default, ACHE performs language detection and tries to crawl only pages with content in English. You can enable or disable language detection in the configuration file ache.yml by changing the target_storage.english_language_detection_enabled key.
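For example, to let the crawler keep pages in any language, disable the detector in ache.yml:

```yaml
target_storage:
  # set to false to crawl pages in languages other than English
  english_language_detection_enabled: false
```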

Is there any limit on number of crawled webpages per website?

Yes, we limit this number so that the crawler does not get trapped in particular domains. The default is 100; however, you can change it in the configuration file with the link_storage.max_pages_per_domain key.
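In ache.yml this could look like (the value shown is the stated default):

```yaml
link_storage:
  # maximum number of pages downloaded per domain (default: 100)
  max_pages_per_domain: 100
```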

Where to report bugs?

We welcome user feedback. Please submit any suggestions or bug reports using the GitHub issue tracker (https://github.com/ViDA-NYU/ache/issues).