Our CEO, Tim Lloyd, authored this post on managing robotic traffic for the Society for Scholarly Publishing’s blog, the Scholarly Kitchen. We’ve reproduced the article below, or you can read the Scholarly Kitchen’s original version.

You may be familiar with the Netflix series ‘Love, Death & Robots’, which features a diverse range of short stories (if not, I highly recommend it). The title reflects the complexity of our relationship with robotic futures, which swing from existential threat to utopian promise and everything in between.

Web-based robotic traffic (‘bots’ for the rest of this article) exhibits the same complexity. Some bots overwhelm websites with aggressive crawling, while others perform valuable site indexing. This article explores our increasingly ambiguous relationship with bots, and why scholarly content and service providers need a more nuanced approach to managing them.

The problem, part 1: AI

Bots have long been one of the more tedious aspects of website management: a necessary evil that provided some benefits, such as discovery, but was generally viewed as a manageable nuisance.
However, the recent emergence of large language models (LLMs) has fueled AI-driven demand for training content. Handily, the significant expansion of open-access (OA) publishing models over the last decade greatly increased the supply of publicly available content, with roughly 50% of global scholarly content now published OA in one form or another. The result is growing hordes of bots mindlessly scraping everything they encounter, and at a scale we’ve never seen before.

The most immediate impact of bots is on platform capacity. Anyone managing a content platform will have experienced increasingly aggressive bot traffic in the last 12-18 months. At LibLynx, we first started experiencing spikes toward the end of 2024 as bots hit individual clients that use our access control service. By spring of 2025, multiple clients were being hit at the same time, and we implemented sophisticated changes to our architecture to ensure traffic spikes didn’t impact availability.

Every publisher I’ve spoken to since has had the same experience. Content hosting platforms, such as OA and institutional repositories, and even library services have also fallen victim to the bots. The costs are significant, as platforms trade off (i) increasing (mostly idle) server capacity in order to cope with spikes, versus (ii) the disruption involved in blocking IP ranges and temporarily provisioning extra capacity on demand.

A less visible, medium-term consequence is the poisoning of OA usage data, which much of our community relies on to justify ongoing support for publishing and hosting open content. OA content platforms – both publishers and repositories – are reporting massive increases in usage in the last 1-2 years, but one of the industry’s worst-kept secrets is that this usage doesn’t reflect human engagement with scholarly content. Many publishers have privately shared their fears that OA usage stats are effectively junk, and our research indicates that the same may be true of institutional repositories. If you publish any open content and rely on usage reporting to support that business model, you should be seriously concerned.

But there’s another wrinkle: the advent of AI has also created a growing ecosystem of bots crawling content for training and/or inference in AI models. Content providers will value much of this traffic for discovery. Some of it also comprises agentic bots acting on behalf of users, which represents valuable usage that can help replace the elephant-sized gap in usage reports caused by zero-click searches. We can’t simply block bots – we need to identify them and determine their intent before we can effectively manage bot activity.

The problem, part 2: Bot Management Solutions

Unfortunately, traditional bot management solutions are typically designed to meet the needs of global corporate enterprises driven primarily by security – and not the needs of scholarly publishers. The 800lb gorilla in this room is Cloudflare, which is likely used by most scholarly organisations (including ourselves at LibLynx).

Tools like Cloudflare do a fine job of protecting networks, but they lack the granularity and control that publishers need to determine intent. Our own experience, and that of many publishers and platform vendors we’ve spoken with, is that Cloudflare only seems to offer two modes of bot management:

Strong: Apply strict controls and you block it all – which unfortunately includes cornerstones of publishing, such as proxied users (who look suspicious because lots of different devices and user agents flow through a single IP address) and the sort of AI bots that we want to let into our content (like text/data mining applications). So, this strict model doesn’t work for publishers.

Weak: The alternative is weak controls that don’t really stop much at all – the aggressive crawling continues and the bad bots are mixed in with the good. A lot of publishers are therefore stuck with the weak controls, which are ineffective and imprecise, but at least their publishing platforms can still operate.

There are obviously many other solutions available aside from Cloudflare. Most of them rely on layering multiple tools and techniques (the Swiss cheese model of defense in depth; see the sketch after this list), often including:

  • Hints that the access request comes from a bot, e.g. user agents, suspicious headers, browser fingerprinting;
  • Tasks that bots find hard (expensive) to solve, such as CAPTCHAs or proof-of-work challenges; and
  • Block lists of IP addresses that require constant updating.
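
To make the Swiss cheese model concrete, here’s a minimal sketch of layered request scoring. Everything here is illustrative: the regular expression, header check, IP prefixes, and threshold are hypothetical stand-ins for the much richer signal sets real tools use.

```python
import re

# Layer 1: user-agent hints. Real tools use curated signature lists;
# this pattern is an illustrative stand-in.
KNOWN_BOT_UA = re.compile(r"bot|crawler|spider|scrape", re.IGNORECASE)

# Layer 3: a block list of IP prefixes (documentation-only example
# ranges here) that would need constant updating in practice.
BLOCKED_PREFIXES = ("192.0.2.", "198.51.100.")

def bot_score(request: dict) -> int:
    """Score a request across several imperfect layers of evidence."""
    score = 0
    ua = request.get("user_agent", "")
    if not ua or KNOWN_BOT_UA.search(ua):
        score += 2  # missing or bot-like user agent
    # Layer 2: header hints. Real browsers almost always send Accept-Language.
    if "accept-language" not in request.get("headers", {}):
        score += 1
    if request.get("ip", "").startswith(BLOCKED_PREFIXES):
        score += 3  # known-bad source
    return score

request = {"ip": "198.51.100.7", "user_agent": "ExampleCrawler/1.0", "headers": {}}
# Above a threshold, challenge (CAPTCHA, proof-of-work) or block outright.
action = "challenge-or-block" if bot_score(request) >= 3 else "serve"
print(action)
```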

In my conversations with a wide range of organisations across our community that are struggling with bot management, it’s become clear that these approaches bring their own problems: significant ongoing maintenance; a poorer user experience; privacy concerns; steep technical learning curves; and, ultimately, the fact that they just don’t work that well. Our own experience has been similar, with several hours recently invested in identifying and blocking a range of IP addresses from a geographically based botnet as part of the ongoing game of whack-a-mole.

The scholarly community needs a more nuanced approach to bots.

Our research

Service providers like LibLynx are at ground zero for this issue, as we manage access control for c. 30 scholarly publishers from across the community. We process hundreds of millions of requests each month, and even small bot attacks hitting multiple publishers at once can feel like a tsunami when they hit our infrastructure. Which is to say that we’re really motivated to better understand this problem. On the basis that you can’t control what you can’t measure, we decided to do some R&D on bots.

We analysed a sample of 50m access requests in December 2025, generated from a wide range of our publishing partners, using a combination of techniques to classify and categorize bot traffic. We started by extracting a wide range of known bot signatures, such as the number and range of URLs requested, the volume of requests, and the pattern of requests. We then used machine learning techniques to analyse these signals and create clusters of activity associated with specific behaviors, such as Distributed Denial of Service (DDoS)-style attacks or vulnerability scanning. We also used IP-range data to validate legitimate bots and identify spoofing.
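
The details of our pipeline are beyond the scope of this post, but to illustrate the idea, a toy version of the clustering step might look like the following, using scikit-learn’s KMeans on a handful of hypothetical per-source features (request volume, URL diversity, burstiness). The feature values and cluster count are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical per-source features extracted from access logs, loosely
# mirroring the signals named above. One row per traffic source.
features = np.array([
    # req/hour, distinct URLs, burstiness (0-1)
    [12000, 11500, 0.90],  # crawls everything, fast: volume-crawler-like
    [40,    35,    0.10],  # modest, varied browsing: human-like
    [9000,  3,     0.95],  # hammers a few URLs: DDoS-like
    [55,    48,    0.15],
])

X = StandardScaler().fit_transform(features)  # put features on a common scale
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(labels)  # sources sharing a label exhibit similar (possibly bot) behavior
```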

At a high level, we’re classifying bots into Verified and Unverified.

Verified bots are ones where we’re able to confirm that they are who they claim to be, typically by verifying their user agent and IP address with the parent organization. They automate valuable tasks that would be impossible for humans to perform manually at scale, and they come in lots of different flavours. Examples include Search Engine Optimization (SEO) and site visibility, site monitoring and performance tracking, customer support and engagement (e.g. chatbots), content aggregation, and data accuracy and validation.
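
For illustration, one widely used verification technique is the reverse-then-forward DNS check that operators such as Google document for their crawlers. The sketch below is a simplified version; the IP address and domain suffixes are examples only.

```python
import socket

def verify_crawler(ip: str, allowed_suffixes: tuple) -> bool:
    """Reverse-then-forward DNS check for a crawler's claimed identity.

    1. Reverse-resolve the IP and check the hostname belongs to the
       operator's domain (e.g. googlebot.com for Googlebot).
    2. Forward-resolve that hostname and confirm it maps back to the
       same IP, which defeats simple user-agent spoofing.
    """
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
        if not hostname.endswith(allowed_suffixes):
            return False
        return ip in socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return False

# Example: verify a request whose user agent claims to be Googlebot.
print(verify_crawler("66.249.66.1", (".googlebot.com", ".google.com")))
```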

Unverified bots are ones that we can’t verify as legitimate. We further break these down into two groups:

  • Aggressive: These bots can be destructive to platform stability because of the scale of their activity or the nature of what they’re doing. Sub-groups we’ve classified based on their behavior include Volume Crawlers, DDoS-like bots, Vulnerability Scanners, and Burst Scrapers.
  • Sophisticated: These bots put more effort into hiding their activity. Sub-groups here include Suspicious User Agent and Spoofed Crawler (bots pretending to be legitimate).

One of the challenges of managing bots is simply understanding their intent, so you can decide whether they’re harmful, benign, or actively beneficial in relation to your scholarly platform. Unless you’re actively working in infrastructure engineering, long lists of bot names are unlikely to mean much. While there are plenty of sites that index bots, their focus is on basic descriptions of what each bot does, and it takes cognitive effort to translate that into what it means for your content and services.

Accordingly, we’re also developing a taxonomy of bots designed for non-technical users in scholarly publishing so they can easily make informed decisions about which bots to let onto their platforms and how to customize those journeys through their platforms. We’ll be testing this taxonomy with a variety of community stakeholders over the next few months and will share our results with the community once we’ve determined what works best.

What did we find?

First, we identified that 64% of the access requests came from human users, which meant that 36%, or just over one-third, were robotic in nature. Of that robotic traffic, 14% of total requests came from bots we would allow in, because we were able to verify them as legitimate, and a bigger group, 21%, came from bots we would block, because we were unable to verify them.

(Image 1: the proportion of human vs robotic traffic in a sample of 50m access requests across c. 30 scholarly publishers)

In both cases, the bot identifications were made with high certainty because of the strong correlation between the attributes and behavior of these access requests and known bot activity. Overall, less than 1.5% of the bot traffic was identified with low or medium certainty, which indicates that the analysis is very effective. If it looks and behaves like a bot, it’s almost certainly a bot.

We’ve since done a similar analysis of usage from a large publisher that doesn’t use LibLynx, and the results were virtually identical, so we’re confident that our analysis is more broadly representative.

How do we categorize bot activity?

Of the original 50m access requests, 18m represented bots of various types. The diagram below ranks the top 30 or so bot categories out of a total of over 80 that we identified in our sample.

(Image 2: a breakdown of 18m bot access requests by category, from largest to smallest)

Broadly speaking, most of the bot activity falls into the aggressive or malicious category (in red), which is why all our platforms struggle under its weight. In the middle ground are some legitimate bots (in blue). And the long tail includes a lot of lesser-known bots and spoofed bots (in pink). In the sample above, categories of bot requests are displayed based on our taxonomy of bot activities.

There’s a lot going on here. And the more you investigate, the more complexity you find.

What about a benchmark?

Since doing our initial analysis in December, we’ve also processed samples on behalf of third parties who aren’t current clients of ours. This example compares the activity on our platform with that on the institutional repository of a major US research institution.

(Image 3: the proportion of human vs robotic traffic in a sample of 30m access requests to an institutional repository)

You’ll recall from my earlier pie chart (Image 1) that the green slice of the pie reflects human usage across our publishing clients, which accounts for roughly two-thirds of activity. That falls to only one quarter of total access requests to the institutional repository.

More interesting is how much of the repository’s activity comes from verified bots: just over half in our analysis, compared to only 14% across the 30 or so publishers we sampled using LibLynx for access control. Intuitively, this makes sense: the full-text open access content in repositories is rich honey for verified robot bees, while controlled-access scholarly content offers more limited value (metadata). In addition, open repositories tend to have less active curbs on bots, due both to their mission (the dissemination of knowledge) and to more limited infrastructure resources. Image 3 is a good visual representation of the problem OA repositories have with usage reporting flooded by bot activity.

Solutions for Scholarly Publishing

One of my key takeaways from this analysis is that we need a more nuanced approach to robotic traffic.

Historically, we’ve tended to view bots as a necessary but manageable evil. There have always been a certain number of bots that we wanted to let into our platforms for SEO and discovery purposes, and we’ve always dealt with the occasional unwanted probing of our sites or straight-up bot attacks. But they were not too frequent, and generally manageable.

And we have actively worked, with decreasing success, to exclude bots from our usage reports, which have traditionally focused on human usage only.

We are now entering an environment where robotic traffic is replacing human traffic. And the poster child here is obviously AI discovery, where agentic tools are accessing content and services behind the scenes to prepare summary answers provided back to our users.

We can expect a growing army of bot-driven services that will assist us in making our content and services more discoverable. A good example is OpenAI, which already deploys three different bots for different purposes (a sample robots.txt policy follows the list):

  • GPTBot crawls content for training LLMs,
  • OAI-SearchBot indexes websites for inclusion in answers, and
  • ChatGPT-User is used by AI agents to respond to certain user actions.
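
Because these bots identify themselves with distinct user agent tokens, a publisher can express a different policy for each. As a hypothetical example (not a recommendation), a robots.txt that opts out of training crawls while welcoming search indexing and user-driven agents might look like this:

```
# Hypothetical policy: opt out of LLM training, but allow search
# indexing and user-driven agent access. These directives are advisory;
# compliant crawlers honor them, while the unverified hordes typically
# ignore them, which is why detection still matters.
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /
```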

Many more bots are in development and will be deployed over the coming years. We’ll want to include some proportion of that legitimate bot traffic in our usage reports, because it will reflect legitimate human actions. Using the OpenAI example, the last of those three bots is essentially an AI agent and therefore represents user-driven activity that we may want to include in usage reports. And we’re already facing that growing locust-like horde of bots that we want to filter out and turn away.

We need to re-imagine the goal as filtering bots, not blocking bots.

The first step is to detect robotic traffic based on its behavior and classify it based on its intent. This allows content and service platforms to make a fast, up-front decision on whether to process an access request or simply block it.

The second step is to attach granular metadata to those access requests that enter the platform, so you can make informed decisions about what level of service to provide (see the routing sketch after this list). For example:

  • Do you let this bot crawl your content? Maybe there’s a more efficient way to enable that now that you know the request is a bot not a human user? For example, a headless architecture where the “body” (a content repository) is separated from the “head” (frontend website) could enable content to be more efficiently delivered to bots without impacting website performance.
  • Do you let some bots access enhanced metadata? This could be metadata curated to drive discovery in other systems, beyond what is usually available.
  • Does identifying bots interested in your content create commercial opportunities? Maybe you want to contact the service operating the bot to see if they’d like to work with you on improving access (or maybe pay for enhanced access)?
  • Do bots appear in your usage reporting? Having this metadata can also help you decide which robotic traffic should flow into usage reports.
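
To illustrate what filtering, rather than blocking, might look like in practice, here’s a minimal routing sketch. The verdict fields and service levels are hypothetical, not a description of any particular platform’s API.

```python
from dataclasses import dataclass

@dataclass
class BotVerdict:
    """Metadata attached by the detection step. Field names are illustrative."""
    is_bot: bool
    verified: bool
    category: str  # e.g. "search-index", "ai-agent", "volume-crawler"

def route(request: dict, verdict: BotVerdict) -> str:
    """Choose a service level from bot metadata, rather than block/allow."""
    if not verdict.is_bot:
        return "serve-full-site"         # humans get the normal experience
    if not verdict.verified:
        return "block"                   # unverified bots are turned away
    if verdict.category == "search-index":
        return "serve-headless-content"  # efficient delivery from the repository
    if verdict.category == "ai-agent":
        return "serve-and-count-usage"   # user-driven activity: report it
    return "serve-rate-limited"          # cautious default for other verified bots

print(route({"path": "/article/123"}, BotVerdict(True, True, "ai-agent")))
```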

Takeaways

Scholarly publishing needs better solutions that:

  • Work automatically in the background without requiring expensive engineers to play whack-a-mole
  • Don’t require deep technical skills to configure and manage
  • Minimally impact the user experience
  • Protect user privacy by capturing and storing minimal information
  • Filter in the bots that scholarly publishers value, while filtering out the rest

One of my favourite short stories from ‘Love, Death & Robots’ is about three robots visiting Earth long after humans wiped themselves out with their poor choices (‘Three Robots’). It’s sharply humorous, with the humans in the story killed by “the long heedless autumn of their own self-regard”. Let’s make better choices with our bot future, starting with a more nuanced approach to filtering our bots.