This repository has been archived by the owner on Mar 9, 2021. It is now read-only.

Provide optional extraction directives #24

Open
bejean opened this issue Oct 14, 2012 · 3 comments

Comments

@bejean
Contributor

bejean commented Oct 14, 2012

What about providing optional extraction directives?

In the majority of cases the extraction algorithm works great, but for some web sites it fails to extract the relevant content. For these sites it should be possible to "help" snacktory focus on a specific part of the page by providing it with a Jsoup selector. For instance, we could have something like:

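ArticleTextExtractor extractor = new ArticleTextExtractor();
// Proposed setters (not part of snacktory today): each one narrows
// extraction to the elements matched by a Jsoup selector.
extractor.setTextSelector("div.article_content");
extractor.setTitleSelector("h2", "first");
String dateRegEx = "xxxx"; // placeholder for a site-specific date pattern
extractor.setDateSelector("#published", dateRegEx);

// Extraction runs as usual, but honors the directives set above.
JResult res = extractor.extractContent(rawData);
text = res.getText();
title = res.getTitle();
date = res.getDate();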

@karussell
Owner

Hmm, I don't find this solution that useful, since one could simply use jsoup directly for those failing sites. Also, I would rather adapt the core so that it handles the failing site. Let me think about it.

@karussell karussell reopened this Oct 14, 2012
@bejean
Contributor Author

bejean commented Oct 14, 2012

Providing snacktory with a scope for text extraction means the snacktory algorithm still runs, but only within that scope. We still need the snacktory algorithm.
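
To make the "scope first, then run the algorithm" idea concrete, here is a minimal sketch using only the Java standard library. Everything in it is hypothetical: `ScopedExtraction`, `extract`, and `between` are illustration names, the marker-based scope is a toy stand-in for a Jsoup selector, and the tag-stripping lambda is a toy stand-in for the snacktory algorithm. The point is only the wiring: the scope is optional, and when it does not match, extraction falls back to the whole page.

```java
import java.util.Optional;
import java.util.function.Function;

public class ScopedExtraction {

    // Runs the extractor on the scoped fragment when the scope matches,
    // and on the whole page otherwise, so the directive stays optional.
    static String extract(String html,
                          Function<String, Optional<String>> scope,
                          Function<String, String> extractor) {
        String input = scope.apply(html).orElse(html);
        return extractor.apply(input);
    }

    // Toy stand-in for a selector (assumption): returns the text between
    // the first occurrence of `open` and the next occurrence of `close`.
    // A real version would use a Jsoup selector rather than indexOf.
    static Function<String, Optional<String>> between(String open, String close) {
        return html -> {
            int start = html.indexOf(open);
            if (start < 0) return Optional.empty();
            int end = html.indexOf(close, start + open.length());
            if (end < 0) return Optional.empty();
            return Optional.of(html.substring(start + open.length(), end));
        };
    }

    public static void main(String[] args) {
        String page = "<div id=\"nav\">Home</div>"
                + "<div class=\"article_content\">Body text</div>";

        // Toy stand-in for the extraction algorithm (assumption): strips tags.
        Function<String, String> stripTags =
                s -> s.replaceAll("<[^>]*>", " ").trim();

        Function<String, Optional<String>> scope =
                between("<div class=\"article_content\">", "</div>");

        // With a scope, only the article body reaches the extractor.
        System.out.println(extract(page, scope, stripTags));

        // Without a matching scope, the extractor sees the whole page.
        System.out.println(extract(page, h -> Optional.empty(), stripTags));
    }
}
```

In a real setup the scope function would presumably wrap a Jsoup selector and the extractor would wrap snacktory's `extractContent`, which is the difference from using jsoup alone: the selector only narrows the input, and the algorithm still decides what the article text is.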

@karussell
Owner

I see what you mean!

@karussell karussell changed the title from "Provide optionnal extraction directives" to "Provide optional extraction directives" Apr 2, 2014
rborer pushed a commit to finity-ai/snacktory that referenced this issue Jun 19, 2017
…traction

Fixed bad extraction of article body