Got Organizations? The key to powerful OA usage analytics

Organizational data is the key to demonstrating publishing impact for open content. Generic IP address data providers struggle to identify organizations beyond the telecoms and internet service providers that register IP addresses – curated organizational metadata is the answer.

Organizational data reveals the communities that are engaging with your content, and – when combined with place and subject – tells stories that communicate publishing impact. A policy think tank in Western Europe reading an international relations paper on weapons control. A medical institute in West Africa downloading new malarial research. A government department in East Asia researching papers on the relative benefits of renewable energy technologies.

In contrast, most usage analysis focuses on user data – such as where a user’s IP address is located – which tells us something about the individual user but nothing about their broader organizational context. While it’s interesting to use IP address geolocation to see a breakdown of users by country, it’s not actionable data if you’re aiming to demonstrate impact (a reader in Western Europe read a paper on weapons control – so what?). You don’t know why they accessed the content. And the growth of remote work makes a user’s location an increasingly meaningless statistic.

We recently tested a set of c. 20k random IP addresses from theIPregistry.org against a generic IP address database that claims to be the most trusted source of data – only 1.5% of them were correctly matched to the affiliated organization. Why? This vendor was simply using the internet registration information, and so 98% of the organizations listed were Internet Service Providers … not very useful analytics.

Let’s dive a little deeper into why.

IP address metadata is not all the same

Understanding who’s accessing open content (or being denied access to paywalled content) is critical information to enable publishers to demonstrate impact. And one key thing you typically know about your users is what IP addresses they use.

Note: you could alternatively force people to log in, or use invasive and unethical tracking services – practices that are increasingly unsustainable in an environment where frictionless access and data privacy rule.

Databases of IP addresses are two-a-penny (or even free) on the internet – do a search for “IP address database” and you’ll find pages of providers. As IP addresses are freely discoverable, this isn’t surprising, so why is the quality of IP addresses such an issue for publishers?

The answer lies in IP address metadata. Simply put, it’s fairly easy to add geolocation data to IP addresses – and it’s very difficult to determine the user’s organizational affiliation.

Let’s start with what an IP address is (and isn’t): it’s a numerical identifier that uniquely identifies a connection to the internet. Because it identifies a connection point, it can also be used as a rough guide to location – or at least the location of that connection point. This might be where the user is located, or it might be where the server is located that the user’s access is being routed through (which may be quite different!).

Although IP addresses are formally registered and the name of the registering organization can be queried, this name is uncontrolled text that typically lists an intermediary, such as a telecoms company or internet service provider – not the name of the underlying organization that is actually using the IP address.
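
To make the difference concrete, here is a minimal Python sketch of the two kinds of lookup. Everything in it is an assumption for illustration: the table names, the organization names, and the address ranges (reserved documentation ranges) are hypothetical, and it does not represent how any particular vendor or theIPregistry.org stores its data. It simply prefers a curated, organization-validated range when one exists and falls back to the registrant name otherwise.

```python
import ipaddress

# Hypothetical curated table: CIDR range -> organization validated for that range.
CURATED_RANGES = {
    "192.0.2.0/24": "Example University",
    "198.51.100.0/25": "Example Policy Institute",
}

# Hypothetical registration (WHOIS-style) names, typically an ISP or telecom.
REGISTRANT_NAMES = {
    "192.0.2.0/24": "Example Telecom Ltd",
    "198.51.100.0/24": "Example ISP Inc",
    "203.0.113.0/24": "Example ISP Inc",
}


def _lookup(ip, table):
    """Return the name for the most specific range containing ip, or None."""
    addr = ipaddress.ip_address(ip)
    best = None
    for cidr, name in table.items():
        net = ipaddress.ip_network(cidr)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, name)
    return best[1] if best else None


def identify(ip):
    """Prefer curated organizational data; fall back to the registrant name."""
    org = _lookup(ip, CURATED_RANGES)
    if org is not None:
        return org, "curated"
    registrant = _lookup(ip, REGISTRANT_NAMES)
    if registrant is not None:
        return registrant, "registrant"
    return None, "unknown"
```

With these sample tables, an address inside a curated range resolves to the organization using it, while an address known only from registration data resolves to the intermediary that registered the block.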

As a result, most commercial IP address databases focus heavily on geolocation as their value-add. Geolocation is useful for tasks such as filtering permissions by geographical licensing or applying legal or regulatory requirements. Software applications can use clever algorithms to derive locations based on probe data (using response times to determine distance) and other hints, such as WHOIS registration data.

However, geolocation data is much less useful than organizational affiliation when it comes to understanding publishing impact.

Publishers need organizational affiliation

While it can be interesting to know that 5% of your users were from Brazil, based on geolocation data, it also can be highly misleading. That 5% may or may not be humans, and may or may not actually be based in Brazil (the user’s connection may simply be routed through a server in Brazil).

Much more valuable in terms of understanding publishing impact is the organizational affiliation of users. Are government entities using our material? If so, which ones? Which pharmaceutical companies in Japan, or think tanks in France, were downloading our ebooks? Was this groundbreaking research used by healthcare organizations in the areas of the world most impacted by the underlying condition?

But organizational affiliation is difficult. It obviously helps to start with a large base of curated IP addresses like theIPregistry.org. But a lot of painstaking research is needed to turn generic registration information into actionable organizational metadata, such as country, organizational type and category, and the research areas that organizations focus on.
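
As a rough illustration of why that metadata matters, the sketch below counts usage events by an organizational attribute. It assumes each event has already been resolved to an organization name (or to nothing); the metadata table, field names, and organizations are hypothetical examples, not a real schema.

```python
from collections import Counter

# Hypothetical curated metadata keyed by organization name.
ORG_METADATA = {
    "Example University": {"country": "GH", "type": "Academic"},
    "Example Health Institute": {"country": "GH", "type": "Healthcare"},
    "Example Ministry": {"country": "JP", "type": "Government"},
}


def usage_by(events, field):
    """Count usage events by one metadata field of the resolved organization.

    events is a list of organization names; None means the IP address
    could not be matched to an organization.
    """
    counts = Counter()
    for org in events:
        meta = ORG_METADATA.get(org)
        counts[meta[field] if meta else "Unidentified"] += 1
    return counts


# Sample events, again purely illustrative.
events = [
    "Example University",
    "Example Ministry",
    "Example Ministry",
    None,
    "Example Health Institute",
]
by_type = usage_by(events, "type")
by_country = usage_by(events, "country")
```

The same events can be grouped by organization type, by country, or by any other curated field, which is what turns raw access logs into answers to questions like those above.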

The only reliably accurate way to link IP addresses to organizations is to have them validated by the organization itself, which is why theIPregistry.org’s database is uniquely valuable. TheIPregistry.org has cataloged over 12,500 organizations worldwide that provide (and more importantly update) their IP addresses, including identifying which addresses are on campus and which proxy remote users. As anyone who has ever tried to maintain IP addresses themselves knows, it’s an extremely painful process with many opportunities for data entry mistakes, and one where the service provider is often the last to know about a change. And it takes a lot of work to keep the information up to date: theIPregistry.org processes in excess of 2,000 updates, submitted by content-subscribing organizations, every month. Our own experience at LibLynx mirrors the industry experience that, over the last year, up to 13% of IP addresses manually input into publishers’ authentication systems are inaccurate. Historically, when PSI has cleaned IP address data for publishers, up to 40% of IP addresses have been removed.

Also, PSI invests heavily in the detailed research needed to index the organizations behind the 98% of anonymous IP addresses, revealing communities of users that were previously hidden from publishers.

So, have you got organizations? If not, then contact us to learn how we can help you demonstrate impact for open access publishing, or reveal the organizations denied access to your paywalled content and services.
