Combating AI Bots

Once again I am requesting that Brave's bots use a proper user agent to identify themselves, or better yet, that Brave publish the IP ranges it crawls from, like Google and Bing do. I fully understand the reasoning behind https://search.brave.com/help/brave-search-crawler, but the internet is a changed place with the advent of AI, and this is no longer a valid approach.

Let me explain.

I manage hosting for a plethora of websites, two of which are very large forums (eevblog.com and forums.realgm.com). These two forums are seeing an ENORMOUS amount of bot traffic originating from IP ranges that belong to data centers such as Microsoft, Google Cloud, OVH, and DigitalOcean. A huge amount of the traffic is coming from GTT.net, Tencent, Amazon, and Bytedance.

Most of these bots fake normal user agents to avoid being filtered out, ignore crawl rate limits, and take smaller sites offline once they decide to start crawling them. As such, we have had to take measures to protect these sites, such as forcing all IPs that originate from data centers to always go through a CAPTCHA challenge.

This has been very successful in solving the bot problem. To ensure that our search rankings are not affected, we use the IP ranges published by search engines like Google, Bing, Yandex, etc., or identify the search engines by IP network ownership (e.g., AppleBot), and let these through.
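A minimal sketch of how such an allow-list check can work, using Python's standard `ipaddress` module. The ranges shown are illustrative examples only; a real deployment would fetch the JSON range files that Google, Bing, etc. publish and refresh them periodically:

```python
import ipaddress

# Illustrative allow-list. In practice these come from the search
# engines' published range files and are refreshed on a schedule.
TRUSTED_CRAWLER_RANGES = [
    ipaddress.ip_network("66.249.64.0/19"),   # example Googlebot-style range
    ipaddress.ip_network("157.55.39.0/24"),   # example Bingbot-style range
]

def is_trusted_crawler(ip: str) -> bool:
    """Return True if the client IP falls inside a published crawler range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in TRUSTED_CRAWLER_RANGES)

# Requests matching a trusted range bypass the CAPTCHA; everything
# else originating from a data-center range gets challenged.
```

Because Brave publishes no such ranges, its crawler inevitably falls into the "challenge" bucket along with the abusive bots.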

To give an idea of how problematic this is, here are the bandwidth stats for EEVBlog before and after implementing anti-bot measures (note that each large reduction in traffic came as we found and flagged another abusive data-center-owned range):

The problem is that Brave's policy is now impacting search results. EEVBlog is already showing up as a CAPTCHA prompt in your search results.

We have no way to fix this, and with the way AI is going, eventually all websites will end up listed like this in your search engine, making it useless.

You need to re-assess this policy ASAP before your search engine becomes unusable.

Perhaps even evaluate a middle ground: crawl as you do now, but if you get a CAPTCHA response, re-queue the crawl from one of your published IP ranges with a bot that identifies itself as your crawler. You could even cache the fact that a domain requires this, so you don't need to repeat the anonymous attempt on future crawls.
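The proposed fallback could be sketched as follows. This is a hedged illustration only: the fetch helpers, the CAPTCHA check, and the in-memory cache are hypothetical stand-ins, not Brave's actual crawler code (which would persist the cache and inspect real responses):

```python
# Domains known to challenge anonymous crawls; persisted in practice.
DOMAINS_REQUIRING_IDENTIFIED_CRAWL = set()

def looks_like_captcha(body: str) -> bool:
    """Crude check; a real crawler would inspect status codes and markup."""
    return "captcha" in body.lower()

def fetch_anonymous(url: str) -> str:
    """Stub: a crawl from an unpublished IP with a generic user agent."""
    return "<html>CAPTCHA challenge</html>"  # what a protected site returns

def fetch_identified(url: str) -> str:
    """Stub: a crawl from a published IP range, identified as the crawler."""
    return "<html>page content</html>"

def crawl(url: str, domain: str) -> str:
    if domain in DOMAINS_REQUIRING_IDENTIFIED_CRAWL:
        return fetch_identified(url)  # skip the anonymous attempt entirely
    response = fetch_anonymous(url)
    if looks_like_captcha(response):
        # Remember the domain and re-queue with identification.
        DOMAINS_REQUIRING_IDENTIFIED_CRAWL.add(domain)
        return fetch_identified(url)
    return response
```

The cache means the extra round trip only happens the first time a domain challenges the crawler; subsequent crawls go straight to the identified path.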

@gnif

Note to self, in order to be clear on expected search results . . .

Currently using Brave Browser (iOS).

Brave Search (AI assistance and Discussions settings are Disabled) result shows the EEVBlog Cloudflare CAPTCHA link - re OP’s concern:

DuckDuckGo Search (all settings switches are Disabled, except 2 re page breaks) result shows what I imagine EEVBlog and OP wish:

And a year 2023 post about Brave Search Crawling:

This topic was automatically closed after 60 days. New replies are no longer allowed.