This repository has been archived by the owner on Mar 9, 2021. It is now read-only.

Many websites only extract partial content #42

Open
rubdottocom opened this issue Jan 23, 2015 · 7 comments

Comments

@rubdottocom

Hi Peter,

I've noticed that for many websites I can only extract part of the content. For example, for http://sheldonbrown.com/brandt/patching.html I only get part of the article, starting from: "Assuming that a patch was properly".

Do you know whether this is because the library still needs the work described in the TODO "only top text supported at the moment"?

If so, could you give me some guidelines on how I could work on improving that?

Thank you so much.

@karussell
Owner

I'm no longer actively developing this library. Most work I do is integrating pull requests (hint hint ;))

@rubdottocom
Author

Yeah, I know :P I was only asking for a starting point, but well... I'll try to find where I can improve the library.

@incubator

A quick and dirty solution is to use JSoup to alter the DOM before extractContent runs; overriding fetchAsString is a convenient place to do that. Here's a quick class that restructures articles from Slate.com. A more generalized solution would be ideal, but this works well enough for our current needs.

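import java.io.IOException;
import java.net.MalformedURLException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import de.jetwick.snacktory.HtmlFetcher;

public class SlateOverrideFetcher extends HtmlFetcher {

    @Override
    public String fetchAsString(String urlAsString, int timeout)
            throws MalformedURLException, IOException {

        // Fetch the raw HTML as usual, then restructure it before extraction.
        String result = super.fetchAsString(urlAsString, timeout, true);
        return removeDiv(result);
    }

    /**
     * Flattens the article body: replaces the contents of the first
     * ".content" element with the inner HTML of the text-bearing divs.
     */
    protected String removeDiv(String html) {
        // One-argument parse: the second argument of Jsoup.parse(String, String)
        // is a base URI, not a charset, so "UTF-8" does not belong there.
        Document doc = Jsoup.parse(html);
        Element content = doc.select(".content").first();
        if (content == null) {
            // Not the expected page layout: leave the HTML untouched.
            return html;
        }

        Elements divs = doc.select(".text, .section, .parbase");
        StringBuilder builder = new StringBuilder();
        for (Element div : divs) {
            builder.append(div.html());
        }
        content.html(builder.toString());

        return doc.html();
    }

}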
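
As a rough idea of what a more generalized version might look like, the per-site rewrite could be looked up by host name instead of being hard-coded in one subclass. The sketch below is only an illustration: `SitePreprocessors`, `register`, and `apply` are made-up names, not part of snacktory, and the lambda in `main` is a plain string replacement standing in for a JSoup-based rewrite like `removeDiv` above.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

/**
 * Hypothetical registry (not part of snacktory) that maps a host name
 * to an HTML rewrite step applied before content extraction.
 */
public class SitePreprocessors {

    private final Map<String, UnaryOperator<String>> byHost = new LinkedHashMap<>();

    /** Registers a rewrite step for one host, e.g. "www.slate.com". */
    public SitePreprocessors register(String host, UnaryOperator<String> step) {
        byHost.put(host.toLowerCase(Locale.ROOT), step);
        return this;
    }

    /** Applies the step registered for the URL's host, or returns the HTML unchanged. */
    public String apply(String url, String html) {
        // URI.create throws IllegalArgumentException for a malformed URL.
        String host = URI.create(url).getHost();
        if (host == null) {
            return html;
        }
        UnaryOperator<String> step = byHost.get(host.toLowerCase(Locale.ROOT));
        return step == null ? html : step.apply(html);
    }

    public static void main(String[] args) {
        SitePreprocessors pre = new SitePreprocessors()
                // Stand-in for a JSoup-based rewrite such as removeDiv(html).
                .register("www.slate.com",
                        html -> html.replace("<div class=\"parbase\">", "<div>"));

        System.out.println(pre.apply("http://www.slate.com/a.html",
                "<div class=\"parbase\">x</div>"));
        System.out.println(pre.apply("http://example.com/", "<p>untouched</p>"));
    }
}
```

A single fetcher subclass could then call something like `pre.apply(urlAsString, result)` inside `fetchAsString`, so supporting a new site means registering one more step rather than writing another subclass.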

@haochun

haochun commented Jun 11, 2015

I've also run into the same problem. Example: http://www.2cto.com/kf/201310/249427.html. Only part of the text is extracted.

@haochun

haochun commented Jun 11, 2015

@rubdottocom Hello, have you resolved this bug yet?

@nzv8fan
Contributor

nzv8fan commented May 11, 2016

I've just created pull request #47 with a change that improves this issue, at least in my testing.

karussell added a commit that referenced this issue May 11, 2016
Fix Issue #42 to improve content identification
@karussell
Owner

Merged. If someone wants to be added as a contributor - let me know via email!

rborer pushed a commit to finity-ai/snacktory that referenced this issue Jul 3, 2017