FAQs
The default data format contains the following folders:
- data_target contains relevant pages.
- data_negative contains irrelevant pages. By default, the crawler does not save irrelevant pages.
- data_monitor contains the current status of the crawler.
- data_url and data_backlinks are where the persistent storage keeps the data of the frontier and the crawled graph.
Unless you stop it, the crawler exits when the number of visited pages exceeds the configured limit, which is 9M by default. To decide whether to stop the crawler earlier, check the file data_monitor/harvestinfo.csv to see how many pages have been downloaded. Its first, second, and third columns are the number of relevant pages, the number of visited pages, and a timestamp.
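As a rough illustration of checking crawl progress, the sketch below parses the contents of data_monitor/harvestinfo.csv using the column order described above. The field delimiter is an assumption (the code accepts commas, tabs, or spaces), and the names `HarvestRow`, `parse_harvest_info`, and `should_stop` are hypothetical helpers, not part of ACHE.

```python
import re
from typing import List, NamedTuple


class HarvestRow(NamedTuple):
    relevant: int   # column 1: number of relevant pages
    visited: int    # column 2: number of visited pages
    timestamp: str  # column 3: kept as text, since its exact format is not assumed


def parse_harvest_info(text: str) -> List[HarvestRow]:
    """Parse harvestinfo.csv content; the delimiter (comma/tab/space) is an assumption."""
    rows = []
    for line in text.splitlines():
        fields = [f for f in re.split(r"[,\s]+", line.strip()) if f]
        if len(fields) < 3:
            continue  # skip blank or incomplete lines
        try:
            rows.append(HarvestRow(int(fields[0]), int(fields[1]), fields[2]))
        except ValueError:
            continue  # skip lines whose counts are not integers (e.g. a header)
    return rows


def should_stop(rows: List[HarvestRow], page_budget: int) -> bool:
    """Return True once the latest row reports at least page_budget visited pages."""
    return bool(rows) and rows[-1].visited >= page_budget


# Made-up sample content for demonstration only.
sample = "10\t100\t1500000000000\n25\t250\t1500000060000\n"
rows = parse_harvest_info(sample)
latest = rows[-1]
print(f"relevant={latest.relevant} visited={latest.visited}")
print(f"harvest rate={latest.relevant / latest.visited:.2f}")
```

To use it on a real crawl, read the file from your data output folder and pass its text to `parse_harvest_info`; the ratio of relevant to visited pages gives a quick sense of whether the crawl is still productive.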
The crawler exits when the number of visited pages reaches the configured limit. You can modify this limit by changing the target_storage.visited_page_limit key in the configuration file.
By default, the crawler stores crawled pages in HTML format without metadata. You can choose to store data in the other formats described in the README by changing the value of the target_storage.data_format.type key in the configuration file. Indexing web pages directly into ElasticSearch is also available; see the ELASTICSEARCH data format.
Saving irrelevant pages is not a default setting, so you need to turn this feature on by changing the value of the target_storage.store_negative_pages key in the configuration file.
Yes. ACHE does language detection and tries to crawl only pages with content in English. You can enable or disable language detection in the configuration file ache.yml by changing the target_storage.english_language_detection_enabled key.
Yes, we limit the number of pages crawled per domain so that the crawler is not trapped by particular domains. The default is 100, but you can change it in the configuration file with the link_storage.max_pages_per_domain key.
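The configuration keys mentioned in this FAQ can be gathered in one place. The sketch below renders them as `key: value` lines that could be pasted into ache.yml. It assumes the file accepts flat dotted keys written this way; the values are illustrative overrides rather than recommendations, and `render_yaml_line` and `render_overrides` are hypothetical helper names.

```python
from typing import Dict, Union

ConfigValue = Union[int, bool, str]

# Keys are the ones named in this FAQ; values are illustrative overrides only.
overrides: Dict[str, ConfigValue] = {
    "target_storage.visited_page_limit": 500000,
    "target_storage.data_format.type": "ELASTICSEARCH",
    "target_storage.store_negative_pages": True,
    "target_storage.english_language_detection_enabled": False,
    "link_storage.max_pages_per_domain": 1000,
}


def render_yaml_line(key: str, value: ConfigValue) -> str:
    """Format one setting as a YAML 'key: value' line."""
    if isinstance(value, bool):  # test bool first: bool is a subclass of int
        text = "true" if value else "false"
    else:
        text = str(value)
    return f"{key}: {text}"


def render_overrides(settings: Dict[str, ConfigValue]) -> str:
    """Join all settings into a block of lines, one setting per line."""
    return "\n".join(render_yaml_line(k, v) for k, v in settings.items())


print(render_overrides(overrides))
```

Keeping the overrides in a single mapping makes it easy to see at a glance which defaults a given crawl departs from.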
We welcome user feedback. Please submit any suggestions or bug reports using the GitHub issue tracker (https://github.com/ViDA-NYU/ache/issues).