Many websites only extract partial content #42
Comments
I'm no longer actively developing this library. Most work I do is integrating pull requests (hint hint ;)) |
Yeah, I know :P I was only asking for a starting point, but well... I'll try to find where I can improve the library. |
A quick-and-dirty solution is to use jsoup to alter the DOM before extractContent runs, and fetchAsString is the place to do that. Here's a quick class that restructures articles from Slate.com. A more generalized solution would be ideal, but this works well enough for our current needs.
import java.io.IOException;
import java.net.MalformedURLException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
// Also import HtmlFetcher from this library's package.

public class SlateOverrideFetcher extends HtmlFetcher {
@Override
public String fetchAsString(String urlAsString, int timeout)
throws MalformedURLException, IOException {
String result = super.fetchAsString(urlAsString, timeout, true);
return removeDiv(result);
}
/**
* Remove extraneous div tags in section.content
*/
protected String removeDiv(String html) {
// The second argument of Jsoup.parse(String, String) is a base URI, not a charset, so it is omitted here.
Document doc = Jsoup.parse(html);
Element content = doc.select(".content").first();
if (content == null) {
// Not a page with the expected layout; leave the HTML untouched.
return html;
}
// Collect the inner HTML of every wrapper div and make it the direct content of .content.
StringBuilder builder = new StringBuilder();
Elements divs = doc.select(".text, .section, .parbase");
for (Element div : divs) {
builder.append(div.html());
}
content.html(builder.toString());
return doc.html();
}
} |
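A more general variant of the class above could take the selectors as parameters instead of hard-coding Slate's layout. The following is only a minimal sketch, assuming jsoup is on the classpath; `DomFlattener` and `flatten` are hypothetical names and not part of the library. It also swaps the technique slightly: rather than rebuilding the container from the wrappers' inner HTML, it unwraps each wrapper in place with jsoup's `Node.unwrap()`, so content outside the wrappers is kept and nested wrappers are not duplicated.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

/** Hypothetical helper (not part of the library): flattens wrapper elements inside a container. */
public class DomFlattener {

    /**
     * Replaces every element matching wrapperSelector inside the first element
     * matching containerSelector with that element's own children.
     * Returns the input unchanged when no container is found.
     */
    public static String flatten(String html, String containerSelector, String wrapperSelector) {
        Document doc = Jsoup.parse(html);
        Element container = doc.select(containerSelector).first();
        if (container == null) {
            return html; // nothing to restructure
        }
        // select() may also match the container itself, so skip it explicitly.
        for (Element wrapper : container.select(wrapperSelector)) {
            if (wrapper != container) {
                wrapper.unwrap(); // removes the wrapper, moving its children up to its parent
            }
        }
        return doc.html();
    }
}
```

An overridden fetchAsString could then return something like `DomFlattener.flatten(result, ".content", ".text, .section, .parbase")`, with a different pair of selectors for each site that needs it.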
I also encountered the same problem. Example: http://www.2cto.com/kf/201310/249427.html. It only extracts part of the text. |
@rubdottocom Hello, have you resolved this bug yet? |
I've just created pull request #47 with a change that improves this issue, at least in my testing. |
Fix Issue #42 to improve content identification
Merged. If someone wants to be added as a contributor - let me know via email! |
…tent_extraction Fix extraction issues
Hi Peter,
I've noticed that for many websites I can extract only part of the content. For example, for http://sheldonbrown.com/brandt/patching.html I get only part of the article, starting from "Assuming that a patch was properly".
Do you know whether this is because the library needs more development to complete the TODO "only top text supported at the moment"?
If so, could you give me some guidelines on how I can work on improving that?
Thank you so much.