Provide optional extraction directives #24

bejean · 2012-10-14T10:26:12Z

What about providing optional extraction directives?

In the majority of cases the extraction algorithm works great. But for some web sites it can fail to extract the relevant content. For these web sites it would be possible to "help" snacktory focus on a specific part of the page content by providing it with a Jsoup selector. For instance, we could have something like:

ArticleTextExtractor extractor = new ArticleTextExtractor();
extractor.setTextSelector("div.article_content");
extractor.setTitleSelector("h2", "first");
String dateRegEx = "xxxx";
extractor.setDateSelector("#published", dateRegEx);

JResult res = extractor.extractContent(rawData);
text = res.getText();
title = res.getTitle();
date = res.getDate();
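
The dateRegEx value above is left as a placeholder. As a rough sketch of how such a date directive might behave, assuming the selector has already yielded the text of the #published element, the regular expression could simply pick the first matching substring. The class name DateDirectiveSketch, the method extractDate, and the sample text are hypothetical and not part of snacktory.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DateDirectiveSketch {

    // Hypothetical helper: returns the first substring of the selected
    // element's text that matches dateRegEx, or empty if nothing matches.
    public static Optional<String> extractDate(String selectedText, String dateRegEx) {
        Matcher matcher = Pattern.compile(dateRegEx).matcher(selectedText);
        return matcher.find() ? Optional.of(matcher.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        // Assumed text content of the element matched by "#published".
        String publishedText = "Published on 2012-10-14 by the editorial team";
        // Assumed pattern for an ISO-style date; a real site may need another one.
        String isoDate = "\\d{4}-\\d{2}-\\d{2}";
        System.out.println(extractDate(publishedText, isoDate).orElse("no date found"));
    }
}
```

Returning an Optional keeps the "no date on this page" case explicit, so a caller could fall back to whatever date detection the extractor already performs.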

karussell · 2012-10-14T12:13:53Z

Hmm, I don't find this solution that useful, as one could simply use jsoup directly for those failing sites. Also, I would rather adapt the core to cover the failing site. Let me think about it.

bejean · 2012-10-14T13:40:56Z

Providing a scope to snacktory for the text extraction means the snacktory algorithm still runs, but only within that scope. We still need the snacktory algorithm.

karussell · 2012-10-15T06:08:58Z

I see what you mean!

karussell closed this as completed Oct 14, 2012

karussell reopened this Oct 14, 2012

karussell changed the title ~~Provide optionnal extraction directives~~ Provide optional extraction directives Apr 2, 2014

rborer pushed a commit to finity-ai/snacktory that referenced this issue Jun 19, 2017

Merge pull request karussell#24 from skyshard/abhishek/bad_content_ex…

632057a

…traction Fixed bad extraction of article body
