Many websites only extract partial content #42

rubdottocom · 2015-01-23T17:07:10Z

Hi Peter,

I've noticed that for many websites I can extract only part of the content. For example, with http://sheldonbrown.com/brandt/patching.html I get only part of the article, starting from "Assuming that a patch was properly".

Do you know whether this is because the library needs more development to complete the TODO "only top text supported at the moment"?

If so, could you give me some guidelines on how I can work to improve that?

Thank you so much.

karussell · 2015-01-23T17:31:32Z

I'm no longer actively developing this library. Most work I do is integrating pull requests (hint hint ;))

rubdottocom · 2015-01-23T17:40:10Z

Yeah, I know :P I was only asking for a starting point, but well... I'll try to find where I can improve the library.

incubator · 2015-01-24T19:22:11Z

A quick and dirty solution is to use JSoup to alter the DOM before extractContent runs, which can be done in fetchAsString. Here's a quick class that restructures articles from Slate.com. A more generalized solution would be ideal, but this works well enough for our current needs.

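import java.io.IOException;
import java.net.MalformedURLException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import de.jetwick.snacktory.HtmlFetcher;

public class SlateOverrideFetcher extends HtmlFetcher {

    @Override
    public String fetchAsString(String urlAsString, int timeout)
            throws MalformedURLException, IOException {
        // Fetch the raw HTML, then restructure it before extractContent sees it.
        String result = super.fetchAsString(urlAsString, timeout, true);
        return removeDiv(result);
    }

    /**
     * Remove extraneous div wrappers in section.content
     */
    protected String removeDiv(String html) {
        // The second argument of Jsoup.parse(String, String) is a base URI,
        // not a charset, so the already-decoded string is parsed directly.
        Document doc = Jsoup.parse(html);
        Element content = doc.select(".content").first();
        if (content == null) {
            // Page does not have the expected layout; leave it unchanged.
            return html;
        }

        // Collect the inner HTML of every wrapper div and put it back
        // directly under the content element.
        Elements divs = doc.select(".text, .section, .parbase");
        StringBuilder builder = new StringBuilder();
        for (Element div : divs) {
            builder.append(div.html());
        }
        content.html(builder.toString());

        return doc.html();
    }

}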
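
A more generalized variant of the same idea might look like the sketch below. It is only an illustration under assumptions: `ContainerFlattener`, its constructor parameters, and `flatten` are hypothetical names, not part of snacktory or jsoup. The selectors are passed in rather than hard-coded, wrappers nested inside another matching wrapper are skipped so their content is not duplicated, and the HTML is returned unchanged when nothing matches.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

/** Hypothetical helper (not part of snacktory or jsoup). */
class ContainerFlattener {

    private final String containerSelector;
    private final String fragmentSelector;

    ContainerFlattener(String containerSelector, String fragmentSelector) {
        this.containerSelector = containerSelector;
        this.fragmentSelector = fragmentSelector;
    }

    /** Replaces the container's children with the inner HTML of the matched wrappers. */
    String flatten(String html) {
        Document doc = Jsoup.parse(html);
        Element container = doc.select(containerSelector).first();
        if (container == null) {
            return html; // layout not recognised: leave the page alone
        }

        StringBuilder merged = new StringBuilder();
        for (Element fragment : container.select(fragmentSelector)) {
            // select() can return the container itself, and nested wrappers
            // are already covered by their outermost matching ancestor.
            if (fragment == container || hasMatchingAncestor(fragment, container)) {
                continue;
            }
            merged.append(fragment.html());
        }
        if (merged.length() == 0) {
            return html; // nothing to merge: do not empty the container
        }

        container.html(merged.toString());
        return doc.html();
    }

    /** True if an ancestor strictly between fragment and container also matches. */
    private boolean hasMatchingAncestor(Element fragment, Element container) {
        for (Element parent = fragment.parent();
             parent != null && parent != container;
             parent = parent.parent()) {
            if (parent.is(fragmentSelector)) {
                return true;
            }
        }
        return false;
    }
}
```

Under these assumptions, an overriding fetcher like the one above could call `new ContainerFlattener(".content", ".text, .section, .parbase").flatten(result)` in place of `removeDiv(result)`.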

haochun · 2015-06-11T09:13:26Z

I also encountered the same problem. For example, with http://www.2cto.com/kf/201310/249427.html it extracts only part of the text.

haochun · 2015-06-11T09:17:37Z

@rubdottocom Hello, have you resolved this bug yet?

nzv8fan · 2016-05-11T03:49:12Z

I've just created pull request #47 with a change that improves this issue, at least in my testing.

Fix Issue #42 to improve content identification

karussell · 2016-05-11T08:46:54Z

Merged. If someone wants to be added as a contributor - let me know via email!

karussell added a commit that referenced this issue May 11, 2016

Merge pull request #47 from nzv8fan/issue42

cb24ab4

Fix Issue #42 to improve content identification

rborer pushed a commit to finity-ai/snacktory that referenced this issue Jul 3, 2017

Merge pull request karussell#42 from skyshard/abhishek/fix_author_con…

2a917e7

…tent_extraction Fix extraction issues
