#WebScraper

##Config file specification

The main body consists of a single JSON object whose properties are the page types:

%page_name% - a name of your choice for this page type

{
  %page_name%: {
    ...
  },
  ...
}

Each page property can have 5 items:

test - list of conditions that identify this page type; make sure that the conditions match only one page type
pageLinks - list of child pages, each with a selector for the element from which the URL can be extracted
properties - list of properties to extract from this page type, each with a selector for the wanted element
languages - list of links to this page in other languages
pagination - selector for the element containing the next-page link, if the page has pagination

"test": [%condition%,...],
"pageLinks":{
  %name%: %selector%,
  ...
},
"properties":{
  %name%: %extractor%:%selector%
},
"languages":{
  %lang_identifier%: %selector%
},
"pagination":%selector%
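
As an illustration of how the five items fit together, here is a minimal sketch in Python (not part of WebScraper itself) that assembles a config for one page type and serializes it to JSON. The page name, selectors, language identifier, and the exact form of the condition string are all assumptions made for the example; conditions, extractors, and selectors are described in the sections that follow.

```python
import json

# Hypothetical page type: every name, selector, and condition below is
# illustrative and not taken from a real site or from WebScraper itself.
config = {
    "product_list": {
        # Assumed form: each condition is a JSON string holding one expression.
        "test": ['doc.UrlContains("/products")'],
        # Child pages: name -> selector for the element that holds the URL.
        "pageLinks": {"product": "ul.products a.item"},
        # Properties: name -> "<extractor>:<selector>".
        "properties": {"heading": "innertext:h1"},
        # Other-language versions: language identifier -> selector.
        "languages": {"de": "a.lang-de"},
        # Selector for the element containing the next-page link.
        "pagination": "a.next",
    }
}

# json.dumps escapes the double quotes inside the condition string.
text = json.dumps(config, indent=2)
print(text)
```

Generating the file this way leaves the quote escaping to the JSON serializer; writing the same JSON by hand works just as well.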

Conditions:

Access page object:

doc

Get an element using a CSS selector or XPath:

.Css(...)
.XPath(...)

Access an element's inner text and work with it:

.InnerText = "..."

InnerText is of type string, so you can access/call string properties/methods:

.InnerText.StartsWith(...)

Check whether the URL contains a value:

doc.UrlContains(...)

Check page language:

doc.Language = ...

Extractors:

innertext extractor - extracts the inner text only
innerhtml extractor - extracts the inner HTML of an element; all content of the element (e.g. images) is downloaded and the path to the local temporary folder is put in place of the href
outerhtml extractor - extracts the outer HTML of an element; all content of the element (e.g. images) is downloaded and the path to the local temporary folder is put in place of the href
image extractor - downloads the image in an img tag to a temporary folder and prints its path to the output file
download extractor - downloads the item in an anchor tag to a temporary folder and prints its path to the output file
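
A property value combines one of these extractors with a selector in the form %extractor%:%selector%. The following Python sketch shows one way to build such values and catch a mistyped extractor name early. The helper `property_value`, the example selectors, and the assumption that the extractor names are spelled in a config exactly as listed above are illustrative only.

```python
# Extractor names as listed above; their exact spelling in a config is assumed.
EXTRACTORS = {"innertext", "innerhtml", "outerhtml", "image", "download"}


def property_value(extractor: str, selector: str) -> str:
    """Return '<extractor>:<selector>', rejecting unknown extractor names."""
    if extractor not in EXTRACTORS:
        raise ValueError(f"unknown extractor: {extractor!r}")
    return f"{extractor}:{selector}"


# Hypothetical property names and selectors, for illustration only.
properties = {
    "title": property_value("innertext", "h1.title"),
    "photo": property_value("image", "img.main"),
    "manual": property_value("download", "a.manual"),
}
print(properties)
```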

Selectors:

Both CSS and XPath are valid selectors. However, XPath has to be written in a special format (except when used in an .XPath(...) condition):

*[xpath>'%path%']

where %path% is a valid XPath expression. If the XPath contains quotes, write them in the JSON as escaped double quotes (\").
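
To make the escaping concrete, this small Python sketch wraps an XPath expression in the selector form and shows how it appears once serialized as a JSON string. The helper name `xpath_selector` and the example path are hypothetical; the sketch only demonstrates the quoting rule stated above.

```python
import json


def xpath_selector(path: str) -> str:
    """Wrap an XPath expression in the *[xpath>'...'] selector form."""
    return f"*[xpath>'{path}']"


# Hypothetical XPath that contains double quotes.
selector = xpath_selector('//div[@class="price"]')
print(selector)              # *[xpath>'//div[@class="price"]']
print(json.dumps(selector))  # "*[xpath>'//div[@class=\"price\"]']"
```

Single quotes inside the path are not handled here, since the path itself is delimited by single quotes.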
