doc.summary() doesn't work on this website #112

Open
MChrys opened this issue Jan 23, 2019 · 0 comments


MChrys commented Jan 23, 2019

summary() does not seem to work on websites where the text is split across many tags.
I encountered this problem specifically on this website:
https://start.lesechos.fr/actu-entreprises/services/a-19-ans-il-est-le-plus-jeune-patissier-prime-au-guide-michelin-13983.php

import requests
from readability import Document

url = "https://start.lesechos.fr/actu-entreprises/services/a-19-ans-il-est-le-plus-jeune-patissier-prime-au-guide-michelin-13983.php"
page = requests.get(url).text
doc = Document(page)
doc.summary()
<html><body><div><div id="outer-main">\n\n\n\n<p class="ads tag1">\n\n</p>\n\n\n\n\n\n\n<a
href="" target="_blank" class="btn-piston "/>\n\n\n\n<article>\n<div id="content">\n<div
id="news">\n<div class="grid">\n<div class="contain">\n<div class="row">\n\n<div class="col
full">\n\n<span class="cat">Délices sucrés</span>\n<h1 class="page-title nobg">\nA 19 ans, il est le
plus jeune pâtissier primé au Guide Michelin</h1>\n<p class="meta">\n<span class="author">\nPar
Camille Wong</span>\n|\n<time datetime="2019-01-22T13:12">\n22/01/2019 à 14:30,</time>\nmis à
jour le 22/01/2019</p>\n\n\n<div class="picture first">\n<figure>\n\n<figcaption>\n<p
class="legend">Jessy Rhinn-Auvray (à gauche), 19 ans, et son mentor Nicolas Stamm, 46 ans, lors de la
cérémonie du Guide Michelin, le 21 janvier.\n <strong>@DR</strong
</p>\n</figcaption>\n</figure>\n</div>\n</div>\n\n</div>\n</div>\n</div>\n</div>\n</div>\n<
article>\n\n</div>\n\n\n</div></body></html>

Almost none of the article's paragraphs appear in this output.

Maybe you could add an option to the Document object, like:

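# Proposed option (pseudocode): join the text of every high-scoring
# candidate instead of returning only the best one.
if aggregation_mean:
    aggregation = ""
    best_score = self.select_best_candidate(candidates).score
    # select_worst_candidate() is a hypothetical helper that would have to be added.
    worst_score = self.select_worst_candidate(candidates).score
    for c in candidates:
        if c.score >= best_score - worst_score:
            aggregation += c.text

    return aggregation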
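
To make the idea above concrete, here is a minimal standalone sketch of the same aggregation rule. It does not touch readability's internals: `Candidate` and `aggregate_candidates` are hypothetical names invented for illustration, and the scores are made up. It only shows that keeping every block whose score is at least (best - worst) retains sibling paragraphs while dropping low-scoring noise.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical stand-in for a scored content block (not readability's type)."""
    text: str
    score: float


def aggregate_candidates(candidates):
    """Join the text of every candidate scoring at least (best - worst)."""
    if not candidates:
        return ""
    scores = [c.score for c in candidates]
    threshold = max(scores) - min(scores)
    # Keep document order; separate the retained blocks with a newline.
    return "\n".join(c.text for c in candidates if c.score >= threshold)


# Made-up scores: two article paragraphs in sibling tags plus one noise block.
blocks = [
    Candidate("First paragraph of the article.", 10.0),
    Candidate("Second paragraph in a sibling tag.", 9.0),
    Candidate("Navigation links", 1.0),
]
print(aggregate_candidates(blocks))
```

With these scores the threshold is 10 - 1 = 9, so both paragraphs are kept and the navigation block is dropped. Whether this threshold is a good choice for real pages is an open question; it is simply the rule proposed above.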

I just tried Safari's Reader mode, and it works perfectly on this page. It seems to be based on Arc90's Readability as well.
