This repository has been archived by the owner on Nov 14, 2019. It is now read-only.

include_urls doesn't work #130

Open
viktor-svirsky opened this issue Aug 13, 2017 · 3 comments

@viktor-svirsky

Hi guys,

I have the following config for the crawler:

    {
      "_index": ".river_web",
      "_type": "config",
      "_id": "http-fesscodelibsorg_web",
      "_version": 1,
      "found": true,
      "_source": {
        "index": "http-fesscodelibsorg",
        "type": "http-fesscodelibsorg_web",
        "urls": [ "http://fess.codelibs.org/" ],
        "include_urls": [ "http://fess.codelibs.org/11.2/install/.*" ],
        "max_depth": 10,
        "max_access_count": 10,
        "num_of_thread": 5,
        "interval": 1000,
        "robots_txt": true,
        "target": [
          {
            "pattern": {
              "url": "http://fess.codelibs.org/.*",
              "mimeType": "text/html"
            },
            "properties": {
              "title": { "text": "title" },
              "body": { "text": "body" }
            }
          }
        ]
      }
    }

Here urls is ["http://fess.codelibs.org/"] and include_urls is ["http://fess.codelibs.org/11.2/install/.*"]. As I understand it, the crawler should start from http://fess.codelibs.org/ and index the pages that match the http://fess.codelibs.org/11.2/install/.* pattern. However, I get no results.

I checked the site, and the pages exist.

Please advise what I am doing wrong.

@marevol
Contributor

marevol commented Aug 13, 2017

The pattern http://fess.codelibs.org/11.2/install/.* does not match the start URL http://fess.codelibs.org/.
So include_urls also needs to contain http://fess.codelibs.org/.
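
The mismatch described above can be reproduced outside the crawler. The following is a minimal sketch, not the crawler's actual filter code: it assumes that each include_urls entry is compiled as a Java regex and must match the whole URL, and the class and method names (IncludeUrlCheck, isIncluded) are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class IncludeUrlCheck {

    // Returns true if the URL matches at least one include pattern.
    // Assumption: whole-string matching, as done by Pattern.matches.
    static boolean isIncluded(String url, List<String> includePatterns) {
        for (String regex : includePatterns) {
            if (Pattern.matches(regex, url)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String startUrl = "http://fess.codelibs.org/";

        // include_urls exactly as in the reported config.
        List<String> original = Arrays.asList(
                "http://fess.codelibs.org/11.2/install/.*");

        // include_urls with the start URL added, as suggested in the reply.
        List<String> fixed = Arrays.asList(
                "http://fess.codelibs.org/",
                "http://fess.codelibs.org/11.2/install/.*");

        System.out.println(isIncluded(startUrl, original)); // false
        System.out.println(isIncluded(startUrl, fixed));    // true
    }
}
```

Under that assumption, the start URL is rejected by the original list and accepted once it is listed explicitly.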

@viktor-svirsky
Author

viktor-svirsky commented Aug 15, 2017

Thanks for the clarification. However, I have an additional question:

Do include_urls and exclude_urls use regexp format or some special syntax?

I have a case where I need to skip URLs of the following type:
https://hostname.com/?printer=1, which is a duplicate of the page https://hostname.com/. Every page has this special argument (printer=1) for printing, and I want to avoid indexing such pages. Another example is https://hostname.com/ and https://hostname.com/index.html: we would prefer to exclude the index.html and index.php pages.

Thanks

@marevol
Contributor

marevol commented Aug 15, 2017

Both use Java regex format.
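
Given Java regex syntax, the two cases from the question could be covered by patterns along these lines. This is only a sketch: the pattern strings and the names ExcludeUrlCheck and isExcluded are hypothetical, and it assumes each exclude_urls entry must match the whole URL.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class ExcludeUrlCheck {

    // Hypothetical exclude_urls candidates, written as Java string literals.
    static final List<String> EXCLUDE_PATTERNS = Arrays.asList(
            // Any URL whose query string has the parameter printer=1.
            ".*[?&]printer=1(&.*)?",
            // Any URL ending in /index.html or /index.php.
            ".*/index\\.(html|php)");

    // Returns true if the URL matches at least one exclude pattern.
    // Assumption: whole-string matching, as done by Pattern.matches.
    static boolean isExcluded(String url) {
        for (String regex : EXCLUDE_PATTERNS) {
            if (Pattern.matches(regex, url)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isExcluded("https://hostname.com/?printer=1"));  // true
        System.out.println(isExcluded("https://hostname.com/index.html"));  // true
        System.out.println(isExcluded("https://hostname.com/index.php"));   // true
        System.out.println(isExcluded("https://hostname.com/"));            // false
    }
}
```

When such a pattern is placed in the JSON config, the backslash must be escaped for JSON as well, so the second entry would be written there as ".*/index\\.(html|php)".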
