Blocking AI
Usually I am quite open to new ideas and techniques, but there are those rare occasions where I need to take a stand based on my ethical perspective. In this case, it is the use of AI. I have never used it, at least not as I see it. I am not going to go into why, but rather keep to the subject at hand: how to make sure that bots, crawlers and AI are blocked from your websites.
It started as a joke at first: someone mentioned how to confuse AI with an HTTP header, and I did in fact add that header to my proxy. That got me thinking whether there was a way to block, or at least reduce, the number of bots that visit my sites. Sometimes you want them, so that people can search for your site, but some of them scrape it for their own gain. We cannot have that, can we? Therefore I first looked at robots.txt, which all bots should read and follow, but at best it relies on proper etiquette. I found a site describing how to do this 1, but its list was quite minimal, so I found a repository with thousands of them 2. However, as there is no guarantee that the bots will follow it, I went on to block by User-agent inside the reverse proxy as well. After reading more about how Nginx handles this, I found a site that gave me the initial idea for my integration 3.
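For reference, the entries in such a robots.txt boil down to pairs like the following; the bot names here are just common examples, not the full generated list:

```
# Ask specific crawlers to stay away from the whole site.
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
```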
The next step was to combine these findings with the infrastructure I have built up with Ansible, starting with parsing an existing robots.txt file that is updated daily 24.
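The download step could look roughly like the task below; this is a sketch, and the source URL and destination path are assumptions rather than my exact values:

```yaml
# Fetch the daily-generated robots.txt so it can be parsed locally.
# URL and destination are placeholders for whichever generated list is used.
- name: Download the generated robots.txt with AI crawlers
  ansible.builtin.get_url:
    url: https://raw.githubusercontent.com/ai-robots-txt/ai.robots.txt/main/robots.txt
    dest: /etc/nginx/ai-robots.txt
    mode: "0644"
```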
As the existing AI-blocking projects generate these lists on the fly, and I would rather not copy their code, I instead download their generated robots.txt file and parse it. For that parsing, I created a filter that reads robots.txt and returns a list of bots.
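The filter can be a small Ansible filter plugin along these lines; a sketch, where the plugin file and the filter name robots_to_bots are my own placeholders:

```python
# filter_plugins/robots.py
# Sketch: extract User-agent names from a robots.txt document.
def robots_to_bots(content):
    """Return the bot names listed in User-agent lines, skipping '*'."""
    bots = []
    for line in content.splitlines():
        line = line.strip()
        if line.lower().startswith("user-agent:"):
            agent = line.split(":", 1)[1].strip()
            if agent and agent != "*" and agent not in bots:
                bots.append(agent)
    return bots


class FilterModule(object):
    def filters(self):
        return {"robots_to_bots": robots_to_bots}
```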
Then add $aiagent as a check in all the running hosts to block the bots from access.
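In Nginx this is commonly done with a map from the User-agent header to $aiagent, plus a check in each server block; the sketch below assumes the bot names are templated in from the parsed list, and the names shown are only examples:

```nginx
# Map the User-agent header to $aiagent (1 = known AI crawler).
# In the real template the regexes come from the parsed robots.txt list.
map $http_user_agent $aiagent {
    default      0;
    ~*GPTBot     1;
    ~*CCBot      1;
    ~*ClaudeBot  1;
}

server {
    listen 443 ssl;
    server_name blog.aposoc.net;

    # Deny matched AI crawlers before anything else is served.
    if ($aiagent) {
        return 403;
    }
}
```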
Finally, create two Ansible tasks that add this to the reverse proxy.
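Those two tasks could look something like the following; the paths, template name, variable names, and handler are assumptions, and robots_to_bots is the hypothetical filter sketched above:

```yaml
# Sketch: read the downloaded robots.txt and render the $aiagent map from it.
- name: Read the downloaded robots.txt
  ansible.builtin.slurp:
    src: /etc/nginx/ai-robots.txt
  register: ai_robots

- name: Template the $aiagent map into the reverse proxy configuration
  ansible.builtin.template:
    src: ai-agents.conf.j2
    dest: /etc/nginx/conf.d/ai-agents.conf
    mode: "0644"
  vars:
    ai_bots: "{{ ai_robots.content | b64decode | robots_to_bots }}"
  notify: Reload nginx
```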
I then tried this with curl to see if it worked, using curl -A 008 https://blog.aposoc.net/, which returned 403 as expected. The next step is to make this feature optional for sites that do not need it. Is this the best way to solve this? Not really, since a bot can simply spoof its User-agent, so I dug deeper, found a couple of lists of IP ranges and added them to an AIBots group in my firewall 1.
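As an illustration only (my actual firewall may well work differently), such an AIBots group could be expressed as an nftables set like this; the addresses are RFC 5737 documentation ranges, not real crawler networks:

```
# Illustration: an AIBots set of source ranges that get dropped on input.
# Table, chain, and addresses are placeholders.
table inet filter {
    set AIBots {
        type ipv4_addr
        flags interval
        elements = { 192.0.2.0/24, 198.51.100.0/24 }
    }

    chain input {
        type filter hook input priority 0; policy accept;
        ip saddr @AIBots drop
    }
}
```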
Was this the nail in the coffin for the bots? Not at all. The only real way to deal with them is to block scrapers entirely, which is close to impossible when the scraping is done right. I do not think web scraping is bad in itself; it is what it is used for that makes me act, and I now have a wall to fend off companies feeding their AI. Web scraping as such is another subject, and I might write about it in a later post.