FAQs
data_target contains relevant pages.
data_negative contains irrelevant pages. By default, the crawler does not save irrelevant pages.
data_monitor contains the current status of the crawler.
data_url and data_backlinks are where the persistent storage keeps the frontier and the crawled link graph.
Unless you stop it, the crawler exits when the number of visited pages exceeds the limit in the settings, which is 9M by default. You can look at the file data_monitor/harvestinfo.csv to see how many pages have been downloaded and decide whether you want to stop the crawler. The 1st, 2nd, and 3rd columns are the number of relevant pages, the number of visited pages, and the timestamp.
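As a rough illustration of reading that file, the sketch below parses the last valid row of data_monitor/harvestinfo.csv and computes the share of visited pages that were relevant. The column order comes from the description above. The delimiter is an assumption (the parser accepts commas, tabs, or spaces), and the names `HarvestRow`, `parse_harvest_line`, `latest_status`, and `harvest_rate` are hypothetical helpers, not part of ACHE.

```python
import os
import re
from typing import NamedTuple, Optional


class HarvestRow(NamedTuple):
    relevant: int   # 1st column: number of relevant pages
    visited: int    # 2nd column: number of visited pages
    timestamp: str  # 3rd column: kept as raw text, its format is not assumed


def parse_harvest_line(line: str) -> Optional[HarvestRow]:
    """Parse one line; return None for blank, header, or malformed lines."""
    # Assumption: fields are separated by commas, tabs, or spaces.
    fields = re.split(r"[,\t ]+", line.strip(), maxsplit=2)
    if len(fields) < 3:
        return None
    try:
        return HarvestRow(int(fields[0]), int(fields[1]), fields[2])
    except ValueError:
        return None


def latest_status(path: str = "data_monitor/harvestinfo.csv") -> Optional[HarvestRow]:
    """Return the last parseable row of the file, or None if there is none."""
    last = None
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            row = parse_harvest_line(line)
            if row is not None:
                last = row
    return last


def harvest_rate(row: HarvestRow) -> float:
    """Fraction of visited pages that were classified as relevant."""
    return row.relevant / row.visited if row.visited else 0.0


if __name__ == "__main__":
    path = "data_monitor/harvestinfo.csv"
    status = latest_status(path) if os.path.exists(path) else None
    if status is None:
        print("No crawl status found yet.")
    else:
        print(f"relevant={status.relevant} visited={status.visited} "
              f"rate={harvest_rate(status):.2%}")
```

Run it from the crawler's output directory while the crawl is in progress. If the visited count is close to the configured limit, or the relevant share is too low to be useful, that is a reasonable point to stop the crawler by hand.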
By default, the crawler exits when the number of visited pages reaches 9M. You can modify this limit by changing the VISITED_PAGE_LIMIT key in the configuration file.
We welcome user feedback. Please submit any suggestions or bug reports using the GitHub issue tracker (https://github.com/ViDA-NYU/ache/issues).
By default, the crawler stores crawled pages in HTML format without metadata. You can choose to store data in CBOR format by changing the value of the DATA_FORMAT key in the configuration file. We are going to support dumping data directly to ElasticSearch, stay tuned!
Saving irrelevant pages is not enabled by default, so you need to turn this feature on by changing the value of the SAVE_NEGATIVE_PAGES key in the configuration file.
Yes. ACHE performs language detection and tries to crawl only pages with content in English. You can enable or disable language detection in the configuration file ache.yml by changing the key target_storage.english_language_detection_enabled.
Yes, we limit the number of pages crawled per domain so that the crawler does not get trapped in particular domains. The default is 100; you can change it in the configuration file with the MAX_PAGES_PER_DOMAIN key.
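To show what a per-domain cap does in practice, here is a minimal sketch of the general idea. It is not ACHE's actual implementation: the `DomainLimiter` class is hypothetical, and it treats the URL's host name as the "domain", which is an assumption.

```python
from collections import Counter
from urllib.parse import urlparse


class DomainLimiter:
    """Hypothetical helper that caps how many pages are accepted per host."""

    def __init__(self, max_pages_per_domain: int = 100):
        self.max_pages = max_pages_per_domain
        self.counts = Counter()

    def allow(self, url: str) -> bool:
        # Assumption: the host name stands in for the domain.
        host = urlparse(url).hostname or ""
        if self.counts[host] >= self.max_pages:
            return False  # this host already reached its quota
        self.counts[host] += 1
        return True


if __name__ == "__main__":
    limiter = DomainLimiter(max_pages_per_domain=2)
    for url in ["http://example.com/a", "http://example.com/b",
                "http://example.com/c", "http://other.org/"]:
        print(url, limiter.allow(url))
```

With a cap of 2, the third example.com URL is rejected while other.org is still accepted, so a single large or endlessly self-linking site cannot use up the whole crawl budget.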