Adds detection for various bots (matomo-org#7739)
* Add another user agent for Qwantify
* Add test for PagePeeker
* Add another test for SemrushBot
* Improves DuckDuckBot
* Adds detection for DuckAssistBot
* Adds detection for RedekenBot
* Adds detection for semaltbot
* Adds detection for MakeMerryBot
* Adds detection for Timpibot
* Add generic bot test
* Adds detection for ValidBot
* Adds detection for NameProtect
* Adds detection for CLASSLA-web
* Add generic bot test
* Improves detection for generic bots
* Move heritrix at the bottom
* Fix Arquivo.pt test
* Adds detection for Domain Codex
* Adds detection for Swisscows Favicons
* Adds detection for leak.info
* Adds detection for Workona
* Adds detection for Bloglines
* Improves detection for generic bots
* Adds detection for Marginalia
* Adds detection for VU Server Health Scanner
* Improves detection for generic bots
* Improves detection for generic bots
* Improves detection for generic bots
* Adds detection for Functionize
* Adds detection for Prerender

---------

Co-authored-by: Tutik Alexsandr <[email protected]>
liviuconcioiu and sanchezzzhak authored Aug 1, 2024
1 parent 6da4f09 commit 67b225e
Showing 4 changed files with 388 additions and 31 deletions.
6 changes: 0 additions & 6 deletions Tests/Parser/Client/fixtures/mobile_app.yml
@@ -2057,12 +2057,6 @@
type: mobile app
name: Teams
version: 24004.1304.2655.7488
-
user_agent: Report Runner
client:
type: mobile app
name: Report Runner
version: ""
-
user_agent: Mozilla/5.0 (iPhone; CPU iPhone OS 15_4_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148 Zalo iOS/448 ZaloTheme/light ZaloLanguage/en
client:
251 changes: 242 additions & 9 deletions Tests/fixtures/bots.yml
@@ -831,18 +831,27 @@
-
user_agent: DuckDuckBot/1.0; (+http://duckduckgo.com/duckduckbot.html)
bot:
name: DuckDuckGo Bot
name: DuckDuckBot
category: Search bot
url: https://duckduckgo.com/duckduckbot
url: https://duckduckgo.com/duckduckgo-help-pages/results/duckduckbot/
producer:
name: DuckDuckGo
url: https://duckduckgo.com/
-
user_agent: Mozilla/5.0 (compatible; DuckDuckGo-Favicons-Bot/1.0; +http://duckduckgo.com)
bot:
name: DuckDuckGo Bot
name: DuckDuckBot
category: Search bot
url: https://duckduckgo.com/duckduckbot
url: https://duckduckgo.com/duckduckgo-help-pages/results/duckduckbot/
producer:
name: DuckDuckGo
url: https://duckduckgo.com/
-
user_agent: DuckAssistBot/1.1; (+http://duckduckgo.com/duckassistbot.html)
bot:
name: DuckAssistBot
category: Search bot
url: https://duckduckgo.com/duckduckgo-help-pages/results/duckassistbot/
producer:
name: DuckDuckGo
url: https://duckduckgo.com/
@@ -2475,7 +2484,16 @@
name: Quora
url: http://www.quora.com
-
user_agent: 'Mozilla/5.0 (compatible; Qwantify/2.2w; +https://www.qwant.com/)/*'
user_agent: Mozilla/5.0 (compatible; Qwantify/2.2w; +https://www.qwant.com/)
bot:
name: Qwantify
category: Crawler
url: https://www.qwant.com/
producer:
name: Qwant Corporation
url: https://www.qwant.com/
-
user_agent: Mozilla/5.0 (compatible; Qwantify-prod34997/1.0; +https://help.qwant.com/bot/)
bot:
name: Qwantify
category: Crawler
@@ -5063,6 +5081,15 @@
producer:
name: Jožef Stefan Institute
url: https://www.ijs.si/ijsw/JSI
-
user_agent: Mozilla/5.0 (compatible; CLASSLA-web; +https://www.clarin.si/info/classla-web-crawler/)
bot:
name: CLASSLA-web
category: Crawler
url: https://www.clarin.si/info/classla-web-crawler/
producer:
name: Jožef Stefan Institute
url: https://www.ijs.si/ijsw/JSI
-
user_agent: "Electronic Frontier Foundation's Do Not Track Verifier (for questions or concerns email [email protected])"
bot:
@@ -6705,12 +6732,12 @@
-
user_agent: Arquivo-web-crawler (compatible; heritrix/3.4.0-20200304 +https://arquivo.pt/faq-crawling)
bot:
name: Heritrix
name: Arquivo.pt
category: Crawler
url: https://webarchive.jira.com/wiki/display/Heritrix/Heritrix
url: https://sobre.arquivo.pt/en/help/crawling-and-archiving-web-content/
producer:
name: The Internet Archive
url: https://archive.org
name: FCT|FCCN
url: https://www.fct.pt/
-
user_agent: Arquivo-web-crawler (compatible; brozzler/1.5 +https://arquivo.pt/faq-crawling)
bot:
@@ -7803,3 +7830,209 @@
producer:
name: Meins und Vogel GmbH
url: https://muv.com/
-
user_agent: Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36 (compatible; PagePeeker/3.0; +https://pagepeeker.com/robots/)
bot:
name: PagePeeker
category: Crawler
url: https://pagepeeker.com/robots/
producer:
name: PAGEPEEKER SRL
url: https://pagepeeker.com/
-
user_agent: Mozilla/5.0 (compatible; SemrushBot-SWA/0.1; +http://www.semrush.com/bot.html)
bot:
name: SemrushBot
category: Crawler
url: https://www.semrush.com/bot/
producer:
name: Semrush Inc.
url: https://www.semrush.com/
-
user_agent: Mozilla/5.0 (compatible; RedekenBot/0.1; +https://www.redeken.com/bot/)
bot:
name: RedekenBot
category: Crawler
url: https://www.redeken.com/en/help/bot.html
producer:
name: Redeken
url: https://www.redeken.com/
-
user_agent: semaltbot/0.1 (+http://semalt.net)
bot:
name: semaltbot
category: Crawler
url: https://semalt.net/
producer:
name: Semalt LP
url: https://semalt.net/
-
user_agent: Mozilla/5.0 (compatible; MakeMerryBot/1.0; +https://makemerry.app/bots)
bot:
name: MakeMerryBot
category: Crawler
url: https://makemerry.app/bots
-
user_agent: Timpibot/0.9 (+http://www.timpi.io)
bot:
name: Timpibot
category: Crawler
url: https://timpi.io/
producer:
name: Timpi Inc.
url: https://timpi.io/
-
user_agent: Mozilla/5.0 (compatible; Timpibot/0.8; +http://www.timpi.io)
bot:
name: Timpibot
category: Crawler
url: https://timpi.io/
producer:
name: Timpi Inc.
url: https://timpi.io/
-
user_agent: 'Tublm.com/Bot/fubpdfdotcom/Bot/Bot -❤️- +https://tublm.com/game/2048_merge'
bot:
name: Generic Bot
-
user_agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.1 Safari/605.1.15 (compatible; Validbot; +https://www.validbot.com)
bot:
name: ValidBot
category: Crawler
url: https://www.validbot.com/
producer:
name: Jake Olefsky LLC
url: https://www.validbot.com/
-
user_agent: NPBot
bot:
name: NameProtectBot
category: Crawler
url: https://www.cscglobal.com/cscglobal/home/
producer:
name: NameProtect, Inc.
url: https://www.cscglobal.com/
-
user_agent: Mozilla/5.0 (compatible; CuriousCatgirl Research; +https://curiouscatgirl.cynthia.dev)
bot:
name: Generic Bot
-
user_agent: xx032_bo9vs83_2a
bot:
name: Generic Bot
-
user_agent: Mozilla/5.0 (compatible; heritrix/3.3.0-SNAPSHOT-20160721-2308 +https://www.domaincodex.com)
bot:
name: Domain Codex
category: Crawler
url: https://www.domaincodex.com/
producer:
name: Erie Data Systems, LLC
url: https://www.eriedatasys.com/
-
user_agent: Swisscows Favicons
bot:
name: Swisscows Favicons
category: Crawler
url: https://swisscows.com/
producer:
name: Swisscows AG
url: https://swisscows.com/
-
user_agent: Mozilla/4.0 (compatible; fluid/0.0; +http://www.leak.info/bot.html)
bot:
name: leak.info
category: Crawler
url: http://www.leak.info/
-
user_agent: workona-favicon-service/1.0.0
bot:
name: Workona
category: Crawler
url: https://workona.com/
producer:
name: Workona, Inc.
url: https://workona.com/
-
user_agent: Bloglines/3.1 (http://www.bloglines.com)
bot:
name: Bloglines
category: Crawler
url: https://web.archive.org/web/20140309033202/http://www.bloglines.com/
producer:
name: Reply!, Inc.
url: https://www.reply.com/
-
user_agent: 'shadowforce.io - sslshed/0.1'
bot:
name: Generic Bot
-
user_agent: search.marginalia.nu
bot:
name: Marginalia
category: Crawler
url: https://www.marginalia.nu/marginalia-search/for-webmasters/
producer:
name: Marginalia
url: https://www.marginalia.nu/
-
user_agent: Mozilla/5.0 (compatible;vu-server-health-scanner/1.0;https://130.37.198.75/index.html)
bot:
name: VU Server Health Scanner
category: Security Checker
url: https://130.37.198.75/index.html
producer:
name: VU Amsterdam
url: https://vu.nl/en
-
user_agent: Searcherxweb
bot:
name: Generic Bot
-
user_agent: Mozilla/5.0 (platform; rv:geckoversion) Gecko/geckotrail Firefox/firefoxversion
bot:
name: Generic Bot
-
user_agent: Report Runner
bot:
name: Generic Bot
-
user_agent: Node.js
bot:
name: Generic Bot
-
user_agent: Mozilla/5.0 (X11; Windows x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36 Functionize
bot:
name: Functionize
category: Crawler
url: https://www.functionize.com/
producer:
name: Functionize, Inc.
url: https://www.functionize.com/
-
user_agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/W.X.Y.Z Safari/537.36 Prerender (+https://github.com/prerender/prerender)
bot:
name: Prerender
category: Crawler
url: https://docs.prerender.io/docs/33-overview-of-prerender-crawlers
producer:
name: saas.group Inc.
url: https://saas.group/
-
user_agent: Mozilla/5.0 (Linux; Android 11; Pixel 5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/W.X.Y.Z Mobile Safari/537.36 Prerender (+https://github.com/prerender/prerender)
bot:
name: Prerender
category: Crawler
url: https://docs.prerender.io/docs/33-overview-of-prerender-crawlers
producer:
name: saas.group Inc.
url: https://saas.group/
-
user_agent: Prerender (+https://github.com/prerender/prerender)
bot:
name: Prerender
category: Crawler
url: https://docs.prerender.io/docs/33-overview-of-prerender-crawlers
producer:
name: saas.group Inc.
url: https://saas.group/