Splitting the text in scoring #113
Comments
This is a typical counter-intuitive situation where a "more is better" strategy doesn't work. More separators aren't better, because this scoring was designed with a different goal in mind.
Thank you for replying. Yes, I applied the changes and they were actually effective. Before, I could not get the content of a page, only the footer; after the changes I was able to get the content. Maybe it is because of the input I used, which was news pages. In most of those pages the main text is one big block that may contain no commas, while the footer contains lots of commas. For example,
.
Thanks for a valid counter-example. This package is designed for news pages, but it was modeled on English ones and doesn't consider this use case. I would rather suggest a discount on comma counting, and I will consider implementing it in the next update; I try to update the package at least once every 3 months.
@faridhaziyev please don't close this issue.
In the score_paragraphs method, the content score is calculated like this:
content_score += len(inner_text.split(','))
But I think it should be as below, because the text may contain no commas at all.
content_score += len(re.split(' |,',inner_text))
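To make the difference concrete, here is a minimal sketch comparing the two terms on a comma-free body sentence and a comma-heavy footer. The function names and both sample strings are made up for illustration and are not part of the package; only the two split expressions come from the lines above.

```python
import re


def comma_score(inner_text: str) -> int:
    # Term as quoted above: number of comma-separated pieces.
    return len(inner_text.split(','))


def space_or_comma_score(inner_text: str) -> int:
    # Proposed term: split on a single space or a single comma.
    return len(re.split(' |,', inner_text))


# Hypothetical sample texts, invented for this comparison.
body = "The council approved the budget after a long debate on Monday"
footer = "Home, About, Contact, Privacy, Terms"

print(comma_score(body), comma_score(footer))
print(space_or_comma_score(body), space_or_comma_score(footer))
```

With comma-only splitting the footer outscores the body; with the proposed split the body comes out ahead. Note that each ", " pair is split twice and leaves an empty string in between, so the footer count is still inflated by its commas.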
Also, I think this could be added: do not take into account non-word tokens and words shorter than 3 characters.
inner_text = " ".join(re.findall(r"[^\d\W]{3,}", inner_text))
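As a rough illustration of what this filter keeps, the sketch below wraps the same expression in a helper. The helper name and the sample sentence are hypothetical, chosen only to show the effect.

```python
import re


def keep_long_words(inner_text: str) -> str:
    # [^\d\W] matches a word character that is not a digit
    # (letters and the underscore); {3,} requires runs of 3 or more.
    return " ".join(re.findall(r"[^\d\W]{3,}", inner_text))


# Hypothetical sample sentence, invented for this example.
sample = "In 2019, the EU and UK met 3 times to talk"
print(keep_long_words(sample))
```

Numbers, punctuation, and words of one or two letters are dropped. Two side effects are worth knowing: the underscore counts as a word character here, and a mixed token such as "abc123def" is broken into its letter runs rather than discarded.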