Aécio Santos edited this page Mar 28, 2016 · 13 revisions

What is inside the data output directory?

By default, the data output directory contains the following folders:

  • data_target contains the relevant pages.
  • data_negative contains the irrelevant pages. By default, the crawler does not save irrelevant pages.
  • data_monitor contains the current status of the crawler.
  • data_url and data_backlinks are the persistent stores that keep the data of the frontier and the crawled link graph.

When to stop the crawler?

Unless you stop it manually, the crawler exits when the number of visited pages exceeds the limit in the configuration, which is 9M by default. You can look at the file data_monitor/harvestinfo.csv to see how many pages have been downloaded and decide whether you want to stop the crawler. The first, second, and third columns are the number of relevant pages, the number of visited pages, and the timestamp, respectively.
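As a minimal sketch of how you might monitor progress, the snippet below reads the last line of a harvestinfo.csv-style file and computes the harvest rate (the sample data, the comma delimiter, and the timestamp format are assumptions for illustration):

```python
import csv
import io

# Hypothetical sample of data_monitor/harvestinfo.csv.
# Columns: relevant pages, visited pages, timestamp (format assumed).
sample = """10,50,1459180800
12,60,1459180860
15,75,1459180920
"""

last = None
for row in csv.reader(io.StringIO(sample)):
    if row:
        last = row

relevant, visited = int(last[0]), int(last[1])
timestamp = last[2]

# Fraction of visited pages that were judged relevant.
harvest_rate = relevant / visited
print(f"relevant={relevant} visited={visited} harvest_rate={harvest_rate:.2f}")
```

To use this against a real crawl, replace the sample string with the contents of data_monitor/harvestinfo.csv and adjust the delimiter if needed.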

How can I limit the number of visited pages?

The crawler exits when the number of visited pages reaches 9M in the default setting. You can modify this limit by changing the VISITED_PAGE_LIMIT key in the configuration file.
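For example, a configuration entry along these lines would lower the limit (the exact key syntax depends on your configuration file format; the value shown is illustrative):

```
# Stop the crawler after 1 million visited pages
VISITED_PAGE_LIMIT  1000000
```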

Where do I report bugs?

We welcome user feedback. Please submit any suggestions or bug reports using the GitHub issue tracker (https://github.com/ViDA-NYU/ache/issues).

What is the format for crawled data?

In the default setting, the crawler stores crawled pages in HTML format without metadata. You can choose to store data in CBOR format instead by changing the value of the DATA_FORMAT key in the configuration file. We are going to support dumping data directly to Elasticsearch, stay tuned!
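A configuration entry for switching the storage format might look like this (the value name CBOR is from the text above; the key syntax is illustrative):

```
# Store crawled pages in CBOR format, which includes metadata
DATA_FORMAT  CBOR
```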

How can I save irrelevant pages?

This feature is off by default, so you need to turn it on by changing the value of the SAVE_NEGATIVE_PAGES key in the configuration file.
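For instance, a configuration entry along these lines would enable saving irrelevant pages into the data_negative folder (the boolean value syntax is an assumption):

```
# Also save pages classified as irrelevant
SAVE_NEGATIVE_PAGES  TRUE
```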

Does ACHE crawl webpages in languages other than English?

Yes. By default, ACHE performs language detection and tries to crawl only pages with content in English. You can enable or disable language detection in the configuration file ache.yml by changing the key target_storage.english_language_detection_enabled.
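Since the key is given in dotted form, an ache.yml entry to disable the English-only filter would presumably look like this:

```yaml
# ache.yml: allow pages in languages other than English
target_storage:
  english_language_detection_enabled: false
```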

Is there any limit on number of crawled webpages per website?

Yes, we limit this number so that the crawler does not get trapped in particular domains. The default is 100; you can change it in the configuration file with the MAX_PAGES_PER_DOMAIN key.
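As an illustration, a configuration entry raising the per-domain limit could look like this (key syntax assumed; the default value 100 is from the text above):

```
# Visit at most 500 pages per domain instead of the default 100
MAX_PAGES_PER_DOMAIN  500
```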