
Data is no longer being extracted [Referer : None] #12

Open
trivius86 opened this issue Jan 4, 2024 · 2 comments

Comments

@trivius86

Hi, unfortunately I wanted to scrape the data again today, but at the moment the script no longer seems to extract anything.
Here is what it prints:

```
2024-01-04 16:10:59 [scrapy.utils.log] INFO: Scrapy 2.9.0 started (bot: sofifa)
2024-01-04 16:10:59 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.21.0, Twisted 22.2.0, Python 3.9.12 (main, Apr 5 2022, 01:53:17) - [Clang 12.0.0 ], pyOpenSSL 21.0.0 (OpenSSL 1.1.1n 15 Mar 2022), cryptography 3.4.8, Platform macOS-10.16-x86_64-i386-64bit
2024-01-04 16:10:59 [scrapy.crawler] INFO: Overridden settings:
{'AUTOTHROTTLE_ENABLED': True,
'AUTOTHROTTLE_START_DELAY': 32,
'BOT_NAME': 'sofifa',
'NEWSPIDER_MODULE': 'sofifa.spiders',
'SPIDER_MODULES': ['sofifa.spiders'],
'USER_AGENT': 'sofifa (+http://www.gof.com)'}
2024-01-04 16:10:59 [py.warnings] WARNING: /Users/alessandroagostinelli/opt/anaconda3/lib/python3.9/site-packages/scrapy/utils/request.py:232: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.

It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.

See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
return cls(crawler)

2024-01-04 16:10:59 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2024-01-04 16:10:59 [scrapy.extensions.telnet] INFO: Telnet Password: 58a2503b8ae9f86c
2024-01-04 16:10:59 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.throttle.AutoThrottle']
2024-01-04 16:10:59 [sofifa] INFO: Scraping year 240019
2024-01-04 16:10:59 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-01-04 16:10:59 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-01-04 16:10:59 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-01-04 16:10:59 [scrapy.core.engine] INFO: Spider opened
2024-01-04 16:10:59 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-01-04 16:10:59 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-01-04 16:10:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://sofifa.com/players?set=true&col=oa&sort=desc&showCol%5B%5D=pi&showCol%5B%5D=ae&showCol%5B%5D=hi&showCol%5B%5D=wi&showCol%5B%5D=pf&showCol%5B%5D=oa&showCol%5B%5D=pt&showCol%5B%5D=bo&showCol%5B%5D=bp&showCol%5B%5D=gu&showCol%5B%5D=jt&showCol%5B%5D=le&showCol%5B%5D=vl&showCol%5B%5D=wg&showCol%5B%5D=rc&showCol%5B%5D=ta&showCol%5B%5D=cr&showCol%5B%5D=fi&showCol%5B%5D=he&showCol%5B%5D=sh&showCol%5B%5D=vo&showCol%5B%5D=ts&showCol%5B%5D=dr&showCol%5B%5D=cu&showCol%5B%5D=fr&showCol%5B%5D=lo&showCol%5B%5D=bl&showCol%5B%5D=to&showCol%5B%5D=ac&showCol%5B%5D=sp&showCol%5B%5D=ag&showCol%5B%5D=re&showCol%5B%5D=ba&showCol%5B%5D=tp&showCol%5B%5D=so&showCol%5B%5D=ju&showCol%5B%5D=st&showCol%5B%5D=sr&showCol%5B%5D=ln&showCol%5B%5D=te&showCol%5B%5D=ar&showCol%5B%5D=in&showCol%5B%5D=po&showCol%5B%5D=vi&showCol%5B%5D=pe&showCol%5B%5D=cm&showCol%5B%5D=td&showCol%5B%5D=ma&showCol%5B%5D=sa&showCol%5B%5D=sl&showCol%5B%5D=tg&showCol%5B%5D=gd&showCol%5B%5D=gh&showCol%5B%5D=gk&showCol%5B%5D=gp&showCol%5B%5D=gr&showCol%5B%5D=tt&showCol%5B%5D=bs&showCol%5B%5D=wk&showCol%5B%5D=sk&showCol%5B%5D=aw&showCol%5B%5D=dw&showCol%5B%5D=ir&showCol%5B%5D=pac&showCol%5B%5D=sho&showCol%5B%5D=pas&showCol%5B%5D=dri&showCol%5B%5D=def&showCol%5B%5D=phy&r=240019> (referer: None)
```

And it keeps showing the same screen without downloading any data. What could it be?

@m-kloeckner

> And it keeps showing the same screen without downloading any data. What could it be?

Sofifa changed the code on their site a little bit, so the club name is no longer retrieved. See line 37 in sofifa.py:

```python
club_name = player.css("td:nth-child(6) div.ellipsis a::text").get()
```

Since club_name is never retrieved and always ends up empty, the current player is skipped:

```python
if club_name is None or str.isdecimal(club_name):
    # skip fake players or players that do not play in any club
    continue
```

You can check my fork to see what needs to be changed. Actually, a lot of the Scrapy selectors need to be changed. I might merge my changes, but I'm still working on improvements.
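
For reference, here is a minimal sketch of the kind of guard discussed above inside the spider's parse loop. The row selector and the href-based club lookup are assumptions for illustration, not the actual changes from the fork:

```python
# Hypothetical sketch, not the fork's actual code: look the club up via its
# team link instead of a fixed column index, so small layout changes on
# sofifa.com are less likely to break the selector silently.
for player in response.css("table tbody tr"):
    club_name = player.css('td a[href*="/team/"]::text').get()

    if club_name is None or club_name.strip().isdecimal():
        # Skip fake players or players that do not play in any club,
        # as in the original sofifa.py logic.
        continue

    # ... extract the remaining columns here ...
```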

@trivius86
Author

Actually, I had also managed to fix it and download the data, but now if I try again the site keeps blocking my access, since it returns a 403 error:

```
2024-03-28 16:30:16 [scrapy.utils.log] INFO: Scrapy 2.9.0 started (bot: sofifa)
2024-03-28 16:30:16 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.21.0, Twisted 22.2.0, Python 3.9.12 (main, Apr 5 2022, 01:53:17) - [Clang 12.0.0 ], pyOpenSSL 21.0.0 (OpenSSL 1.1.1n 15 Mar 2022), cryptography 3.4.8, Platform macOS-10.16-x86_64-i386-64bit
2024-03-28 16:30:16 [scrapy.crawler] INFO: Overridden settings:
{'AUTOTHROTTLE_ENABLED': True,
'AUTOTHROTTLE_START_DELAY': 32,
'BOT_NAME': 'sofifa',
'NEWSPIDER_MODULE': 'sofifa.spiders',
'SPIDER_MODULES': ['sofifa.spiders'],
'USER_AGENT': 'sofifa (http://www.fifaproject.com)'}
2024-03-28 16:30:16 [py.warnings] WARNING: /Users/alessandroagostinelli/opt/anaconda3/lib/python3.9/site-packages/scrapy/utils/request.py:232: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.

It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.

See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
return cls(crawler)

2024-03-28 16:30:16 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2024-03-28 16:30:16 [scrapy.extensions.telnet] INFO: Telnet Password: d7d6f7b7f05ee763
2024-03-28 16:30:16 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.throttle.AutoThrottle']
2024-03-28 16:30:16 [sofifa] INFO: Scraping year 240024
2024-03-28 16:30:16 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-03-28 16:30:16 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-03-28 16:30:16 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-03-28 16:30:16 [scrapy.core.engine] INFO: Spider opened
2024-03-28 16:30:16 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-03-28 16:30:16 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-03-28 16:30:16 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://sofifa.com/players?set=true&col=oa&sort=desc&showCol%5B%5D=pi&showCol%5B%5D=ae&showCol%5B%5D=hi&showCol%5B%5D=wi&showCol%5B%5D=pf&showCol%5B%5D=oa&showCol%5B%5D=pt&showCol%5B%5D=bo&showCol%5B%5D=bp&showCol%5B%5D=gu&showCol%5B%5D=jt&showCol%5B%5D=le&showCol%5B%5D=vl&showCol%5B%5D=wg&showCol%5B%5D=rc&showCol%5B%5D=ta&showCol%5B%5D=cr&showCol%5B%5D=fi&showCol%5B%5D=he&showCol%5B%5D=sh&showCol%5B%5D=vo&showCol%5B%5D=ts&showCol%5B%5D=dr&showCol%5B%5D=cu&showCol%5B%5D=fr&showCol%5B%5D=lo&showCol%5B%5D=bl&showCol%5B%5D=to&showCol%5B%5D=ac&showCol%5B%5D=sp&showCol%5B%5D=ag&showCol%5B%5D=re&showCol%5B%5D=ba&showCol%5B%5D=tp&showCol%5B%5D=so&showCol%5B%5D=ju&showCol%5B%5D=st&showCol%5B%5D=sr&showCol%5B%5D=ln&showCol%5B%5D=te&showCol%5B%5D=ar&showCol%5B%5D=in&showCol%5B%5D=po&showCol%5B%5D=vi&showCol%5B%5D=pe&showCol%5B%5D=cm&showCol%5B%5D=td&showCol%5B%5D=ma&showCol%5B%5D=sa&showCol%5B%5D=sl&showCol%5B%5D=tg&showCol%5B%5D=gd&showCol%5B%5D=gh&showCol%5B%5D=gk&showCol%5B%5D=gp&showCol%5B%5D=gr&showCol%5B%5D=tt&showCol%5B%5D=bs&showCol%5B%5D=wk&showCol%5B%5D=sk&showCol%5B%5D=aw&showCol%5B%5D=dw&showCol%5B%5D=ir&showCol%5B%5D=pac&showCol%5B%5D=sho&showCol%5B%5D=pas&showCol%5B%5D=dri&showCol%5B%5D=def&showCol%5B%5D=phy&r=240024> (referer: None)
2024-03-28 16:30:16 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://sofifa.com/players?set=true&col=oa&sort=desc&showCol%5B%5D=pi&showCol%5B%5D=ae&showCol%5B%5D=hi&showCol%5B%5D=wi&showCol%5B%5D=pf&showCol%5B%5D=oa&showCol%5B%5D=pt&showCol%5B%5D=bo&showCol%5B%5D=bp&showCol%5B%5D=gu&showCol%5B%5D=jt&showCol%5B%5D=le&showCol%5B%5D=vl&showCol%5B%5D=wg&showCol%5B%5D=rc&showCol%5B%5D=ta&showCol%5B%5D=cr&showCol%5B%5D=fi&showCol%5B%5D=he&showCol%5B%5D=sh&showCol%5B%5D=vo&showCol%5B%5D=ts&showCol%5B%5D=dr&showCol%5B%5D=cu&showCol%5B%5D=fr&showCol%5B%5D=lo&showCol%5B%5D=bl&showCol%5B%5D=to&showCol%5B%5D=ac&showCol%5B%5D=sp&showCol%5B%5D=ag&showCol%5B%5D=re&showCol%5B%5D=ba&showCol%5B%5D=tp&showCol%5B%5D=so&showCol%5B%5D=ju&showCol%5B%5D=st&showCol%5B%5D=sr&showCol%5B%5D=ln&showCol%5B%5D=te&showCol%5B%5D=ar&showCol%5B%5D=in&showCol%5B%5D=po&showCol%5B%5D=vi&showCol%5B%5D=pe&showCol%5B%5D=cm&showCol%5B%5D=td&showCol%5B%5D=ma&showCol%5B%5D=sa&showCol%5B%5D=sl&showCol%5B%5D=tg&showCol%5B%5D=gd&showCol%5B%5D=gh&showCol%5B%5D=gk&showCol%5B%5D=gp&showCol%5B%5D=gr&showCol%5B%5D=tt&showCol%5B%5D=bs&showCol%5B%5D=wk&showCol%5B%5D=sk&showCol%5B%5D=aw&showCol%5B%5D=dw&showCol%5B%5D=ir&showCol%5B%5D=pac&showCol%5B%5D=sho&showCol%5B%5D=pas&showCol%5B%5D=dri&showCol%5B%5D=def&showCol%5B%5D=phy&r=240024>: HTTP status code is not handled or not allowed
2024-03-28 16:30:16 [scrapy.core.engine] INFO: Closing spider (finished)
2024-03-28 16:30:16 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1489,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 11403,
'downloader/response_count': 1,
'downloader/response_status_count/403': 1,
'elapsed_time_seconds': 0.292422,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2024, 3, 28, 15, 30, 16, 875328),
'httpcompression/response_bytes': 24337,
'httpcompression/response_count': 1,
'httperror/response_ignored_count': 1,
'httperror/response_ignored_status_count/403': 1,
'log_count/DEBUG': 2,
'log_count/INFO': 12,
'log_count/WARNING': 1,
'memusage/max': 63377408,
'memusage/startup': 63377408,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2024, 3, 28, 15, 30, 16, 582906)}
2024-03-28 16:30:16 [scrapy.core.engine] INFO: Spider closed (finished)
```
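
For anyone hitting the same 403: one common workaround (not taken from this thread, and no guarantee sofifa.com won't still block it) is to send browser-like headers and slow the crawl down in the project's settings.py. A minimal sketch, where the User-Agent string and header values are assumptions:

```python
# settings.py - hypothetical tweaks that often help with 403 responses.
# The User-Agent and headers below are assumptions, not project code.
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)

DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
    "Referer": "https://sofifa.com/",
}

# Throttle the crawl so the site is less likely to block it.
AUTOTHROTTLE_ENABLED = True
DOWNLOAD_DELAY = 2
```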
