include_urls doesn't work #130

viktor-svirsky · 2017-08-13T09:23:19Z

Hi guys,

I have the following config for the crawler:

{
  _index: ".river_web",
  _type: "config",
  _id: "http-fesscodelibsorg_web",
  _version: 1,
  found: true,
  _source: {
    index: "http-fesscodelibsorg",
    type: "http-fesscodelibsorg_web",
    urls: [ "http://fess.codelibs.org/" ],
    include_urls: [ "http://fess.codelibs.org/11.2/install/.*" ],
    max_depth: 10,
    max_access_count: 10,
    num_of_thread: 5,
    interval: 1000,
    robots_txt: true,
    target: [
      {
        pattern: { url: "http://fess.codelibs.org/.*", mimeType: "text/html" },
        properties: { title: { text: "title" }, body: { text: "body" } }
      }
    ]
  }
}

Here urls is ["http://fess.codelibs.org/"] and include_urls is ["http://fess.codelibs.org/11.2/install/.*"]. As I understand it, the crawler should start from http://fess.codelibs.org/ and index the pages that match the http://fess.codelibs.org/11.2/install/.* pattern. However, I get no results.

I checked the source, and the pages exist.

Please advise what I am doing wrong.

marevol · 2017-08-13T12:40:34Z

http://fess.codelibs.org/11.2/install/.* does not match http://fess.codelibs.org/.
So, include_urls needs to contain http://fess.codelibs.org/.
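
To illustrate the point above, here is a minimal sketch using only the standard java.util.regex API. The class and method names (IncludeUrlCheck, isIncluded) are hypothetical, and the full-match semantics are an assumption; how the crawler applies include_urls internally is not shown in this thread. The sketch only demonstrates that the start URL fails the original pattern and passes once the start URL is added to the list.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class IncludeUrlCheck {
    // Hypothetical helper (not the crawler's API): returns true if the URL
    // fully matches at least one of the include patterns.
    static boolean isIncluded(String url, List<String> includePatterns) {
        for (String pattern : includePatterns) {
            if (Pattern.matches(pattern, url)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String startUrl = "http://fess.codelibs.org/";

        // include_urls as in the original config
        List<String> original = Arrays.asList(
                "http://fess.codelibs.org/11.2/install/.*");

        // include_urls with the start URL added
        List<String> fixed = Arrays.asList(
                "http://fess.codelibs.org/",
                "http://fess.codelibs.org/11.2/install/.*");

        System.out.println(isIncluded(startUrl, original)); // false
        System.out.println(isIncluded(startUrl, fixed));    // true
    }
}
```

If the start URL is rejected, the crawler never fetches it, so it never discovers the links under /11.2/install/.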

viktor-svirsky · 2017-08-15T10:29:48Z

Thanks for the clarification. However, I have an additional question:

Do include_urls and exclude_urls use regexp format or some special syntax?

I have a case where I need to exclude the following type of URL:
https://hostname.com/?printer=1, which is a duplicate of the page https://hostname.com/. All pages have this special argument (printer=1) for printing, and I want to avoid indexing such pages. Another example is https://hostname.com/ and https://hostname.com/index.html: we would prefer to exclude the index.html and index.php pages.

Thanks

marevol · 2017-08-15T14:03:19Z

They use Java regex format.
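
As a rough sketch of what Java regex patterns for the two cases in the question might look like: the patterns below are illustrative suggestions, not values confirmed in this thread, and the class and method names (ExcludeUrlCheck, isExcluded) are hypothetical. The check uses full-match semantics as an assumption; the patterns start with .* so that they cover the whole URL.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class ExcludeUrlCheck {
    // Candidate exclude_urls values (assumptions, to be verified against real URLs):
    // 1. any URL carrying the printer=1 query argument
    // 2. any URL whose path ends in index.html or index.php (optionally with a query)
    static final List<String> EXCLUDE_PATTERNS = Arrays.asList(
            ".*[?&]printer=1(&.*)?",
            ".*/index\\.(html|php)(\\?.*)?");

    // Hypothetical helper: true if the URL fully matches any exclude pattern.
    static boolean isExcluded(String url) {
        for (String pattern : EXCLUDE_PATTERNS) {
            if (Pattern.matches(pattern, url)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isExcluded("https://hostname.com/"));            // false
        System.out.println(isExcluded("https://hostname.com/?printer=1"));  // true
        System.out.println(isExcluded("https://hostname.com/index.html"));  // true
        System.out.println(isExcluded("https://hostname.com/index.php"));   // true
    }
}
```

In the JSON config the backslash needs the same doubling as in the Java string literal, for example ".*/index\\.(html|php)(\\?.*)?".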

marevol added the question label Aug 13, 2017
