This repository has been archived by the owner on Mar 9, 2021. It is now read-only.
What about providing optional extraction directives?
In the majority of cases the extraction algorithm works great, but for some web sites it fails to extract the relevant content. For these sites it could be possible to "help" snacktory focus on a specific part of the page by providing it with a Jsoup selector. For instance, we could have something like:
// Proposed API: the three set*Selector methods below do not exist yet.
ArticleTextExtractor extractor = new ArticleTextExtractor();
// Restrict text extraction to the element(s) matched by a Jsoup selector.
extractor.setTextSelector("div.article_content");
// Use the first <h2> as the title.
extractor.setTitleSelector("h2", "first");
// Placeholder regular expression applied to the text of "#published".
String dateRegEx = "xxxx";
extractor.setDateSelector("#published", dateRegEx);
JResult res = extractor.extractContent(rawData);
text = res.getText();
title = res.getTitle();
date = res.getDate();
Hmm, I don't find this solution that useful, as one could simply use jsoup directly for those failing sites. Also, I would rather adapt the core to handle the failing sites. Let me think about it.
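The dateRegEx in the snippet above is only a placeholder, so here is a rough sketch of how the regular-expression half of a date directive might behave once the text of the selected element is available. Everything in it is an assumption made for illustration: the class DateDirectiveSketch, the method extractDate, the sample text and the ISO-style pattern are all made up, and none of it is part of snacktory or jsoup. It uses only java.util.regex.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not part of snacktory or jsoup. */
public class DateDirectiveSketch {

    /**
     * Searches elementText for dateRegEx and returns the first capture group,
     * or the whole match when the pattern declares no capture group.
     * Returns an empty Optional when nothing matches.
     */
    public static Optional<String> extractDate(String elementText, String dateRegEx) {
        Matcher m = Pattern.compile(dateRegEx).matcher(elementText);
        if (!m.find()) {
            return Optional.empty();
        }
        // group(1) can be null if the group did not take part in the match.
        return Optional.ofNullable(m.groupCount() >= 1 ? m.group(1) : m.group());
    }

    public static void main(String[] args) {
        // Assumed text of the element matched by "#published" (made-up sample).
        String publishedText = "Published on 2013-05-17 by the editorial team";
        // Assumed ISO-style pattern; a real site would need its own expression.
        String dateRegEx = "(\\d{4}-\\d{2}-\\d{2})";
        // prints "2013-05-17"
        System.out.println(extractDate(publishedText, dateRegEx).orElse("no date found"));
    }
}
```

Returning an Optional rather than null would let a caller fall back to the regular extraction when a site-specific pattern stops matching, but that is just one possible design choice.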