Open-Source Data Is a Risky Gamble for Businesses
Many organizations rely on open-source data to reap the benefits of data-driven intelligence without paying third-party providers. AI companies, for instance, often leverage this data to train machine-learning models, while the largest global companies use it to fuel their products and operations. These datasets come from a variety of sources, including governments, intergovernmental organizations (IGOs), nongovernmental organizations (NGOs), universities, and online platforms.
But while open-source data can save companies money up front, using it comes with downsides. Often, these datasets contain inaccuracies that need to be fixed before the data can be analyzed. Other times, “open-source” data is anything but: it comes with licensing restrictions that may prevent commercial use, or carries source-specific biases that skew data-driven insights.
With these risks in mind, let’s take a closer look at key providers of open-source data as well as at the pros and cons of using it to guide corporate strategy.
Where to find open-source data
Open-source data is easy to obtain. With the click of a button, anyone can discover that the US government offers free data, including demographic datasets, business and employment data, as well as more general statistics and indexes. Elsewhere, IGOs such as the World Bank provide economic and development indicators for countries around the globe. Universities often fund research centers that offer open-source data, such as Columbia University’s Center on Poverty and Social Policy.
Many companies use these open-source platforms to access free data. For example, AI firms regularly use community-published data from Kaggle to train their models. Equally notable in this regard is OpenStreetMap (OSM), which some of the world’s largest corporations and many governments use for location data.
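To get a sense of just how low the barrier to entry is, consider the short Python sketch below, which pulls fast-food locations from OSM through the public Overpass API. The endpoint, tag filter, and coordinates here are illustrative assumptions rather than any particular company’s workflow.

```python
# A minimal sketch of pulling point-of-interest data from OpenStreetMap
# via the public Overpass API. The tag filter and coordinates are
# illustrative assumptions, not a production pipeline.
import requests

OVERPASS_URL = "https://overpass-api.de/api/interpreter"

# Overpass QL: fast-food nodes within 5 km of downtown Columbus, Ohio.
query = """
[out:json][timeout:25];
node["amenity"="fast_food"](around:5000,39.9612,-82.9988);
out body;
"""

response = requests.post(OVERPASS_URL, data={"data": query}, timeout=60)
response.raise_for_status()
elements = response.json().get("elements", [])

# Print a small sample of names and coordinates.
for node in elements[:10]:
    tags = node.get("tags", {})
    print(tags.get("name", "<unnamed>"), node["lat"], node["lon"])
```

A few lines of code and an internet connection are all it takes; the harder questions are about what that data is worth once you have it.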
When using free data goes wrong
Given the accessibility and widespread use of open-source data, it’s important to think about some of the business problems these datasets present.
Imagine that Wendy’s is trying to obtain demographic information about a US city in which they want to open a new location. In hopes of better understanding their potential customer base, they develop insights using data from the US Census Bureau.
In choosing this route, Wendy’s could run into an issue that often arises with free data provided by governments: the US census is conducted only once every ten years. Given that the last census was taken in 2020, the data might fail to fully account for the urban exodus that many cities experienced in the wake of COVID-19, and that gap could lead to misinformed strategic decisions. It also highlights a more general problem with open-source government data: because it is rarely updated, it loses integrity quickly after release and is not replenished and tested for accuracy as frequently as proprietary datasets.
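One way analysts can hedge against this staleness is to cross-check the decennial count against a more recent survey estimate before acting on it. The sketch below does that for a single city using the Census Bureau’s public API; the variable codes and FIPS identifiers are assumptions chosen for illustration.

```python
# A minimal sketch of sanity-checking a 2020 decennial census count against
# a more recent ACS estimate before using it for site selection. Variable
# codes and FIPS identifiers below are illustrative assumptions; a free
# Census API key may be needed for heavier use.
import requests

DEC_2020 = "https://api.census.gov/data/2020/dec/pl"
ACS_2022 = "https://api.census.gov/data/2022/acs/acs5"

# Example geography: San Francisco city, CA (state FIPS 06, place FIPS 67000).
geo = {"for": "place:67000", "in": "state:06"}

dec = requests.get(DEC_2020, params={"get": "NAME,P1_001N", **geo}, timeout=30).json()
acs = requests.get(ACS_2022, params={"get": "NAME,B01003_001E", **geo}, timeout=30).json()

# Responses are lists of rows; row 0 is the header, row 1 the data.
pop_2020 = int(dec[1][1])
pop_recent = int(acs[1][1])

change = (pop_recent - pop_2020) / pop_2020
print(f"{dec[1][0]}: 2020 count {pop_2020:,}, recent estimate {pop_recent:,} ({change:+.1%})")
```

A check like this won’t fix stale data, but it can flag when a market has shifted enough that the headline census figures shouldn’t be taken at face value.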
OpenStreetMap provides another example of the risks associated with using open-source data for strategic decision making. Launched in 2004, OSM is the leading database for crowdsourced volunteered geographic information (VGI). Many consider the service a neutral source of location data: the “Wikipedia for maps.”
In recent years, however, OSM has undergone a startling evolution. Since 2014, private companies have flocked to the service not just to get data but to provide it. As corporate influence grows on the platform, OSM users are increasingly realizing that their location data likely contains biases (some of which can be detected with the browser tool Crowd Lens). If Wendy’s were to rely on US census data and geospatial datasets from OSM to make decisions about site selection, they’d run the risk of misreading the market landscape and wasting time and money.
In addition to corporate bias, OSM suffers from other limitations as well. Chief among these is the fact that the platform does not conduct systematic quality checks on its data. As with other free datasets, this can force companies to spend precious resources on cleaning, deduplicating, and otherwise fixing the data. What’s worse, businesses that use this data without the personnel to realize just how imprecise it is risk making costly decisions about site selection, expansion, and logistics based on faulty information, potentially miring themselves in years of suboptimal performance.
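To make that cleanup burden concrete, the sketch below shows the sort of crude deduplication pass a team might run over crowdsourced point-of-interest records. The matching rule, a normalized name plus coordinates rounded to roughly 100 meters, is an assumption for illustration; real pipelines need far more careful matching.

```python
# A minimal sketch of deduplicating crowdsourced point-of-interest records.
# The matching rule (normalized name + coordinates rounded to ~100 m) is a
# deliberately crude assumption; the sample records are made up.
import pandas as pd

pois = pd.DataFrame(
    {
        "name": ["Wendy's", "Wendys", "Wendy's #482", "Burger Stop"],
        "lat": [39.96121, 39.96119, 39.96120, 39.95001],
        "lon": [-82.99881, -82.99879, -82.99880, -82.99150],
    }
)

def normalize(name: str) -> str:
    # Lowercase and keep only letters and spaces, so "Wendy's #482"
    # and "Wendys" reduce to the same brand key.
    cleaned = "".join(ch for ch in name.lower() if ch.isalpha() or ch.isspace())
    return " ".join(cleaned.split())

pois["name_key"] = pois["name"].map(normalize)
pois["lat_key"] = pois["lat"].round(3)  # ~100 m buckets
pois["lon_key"] = pois["lon"].round(3)

deduped = pois.drop_duplicates(subset=["name_key", "lat_key", "lon_key"])
print(f"{len(pois) - len(deduped)} duplicate record(s) dropped")
```

Multiply that small exercise across millions of records and dozens of attributes, and “free” data starts to carry a real engineering bill.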
Second, the accuracy of open-source data varies considerably across regions; international locations in particular often have low geospatial data integrity. While projects like Missing Maps have attempted to address this issue, it remains risky to develop location intelligence, particularly in international markets, on the basis of OSM’s VGI.
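Teams can at least get a rough read on this variability before committing to OSM-based analysis, for example by comparing how many building footprints are mapped in similarly sized areas. The sketch below uses the Overpass API for such a comparison; the bounding boxes are arbitrary examples, and raw counts are only a crude proxy for completeness, not a verdict on either city.

```python
# A rough sketch of comparing OSM coverage across regions by counting mapped
# building footprints in similarly sized bounding boxes. The areas below are
# arbitrary examples; counts are only a crude completeness proxy.
import requests

OVERPASS_URL = "https://overpass-api.de/api/interpreter"

# Bounding boxes as (south, west, north, east), each roughly 1 km on a side.
AREAS = {
    "central Paris": (48.850, 2.340, 48.860, 2.350),
    "central Niamey": (13.500, 2.100, 13.510, 2.110),
}

for label, bbox in AREAS.items():
    bbox_str = ",".join(str(v) for v in bbox)
    query = f"""
    [out:json][timeout:25];
    way["building"]({bbox_str});
    out ids;
    """
    resp = requests.post(OVERPASS_URL, data={"data": query}, timeout=120)
    resp.raise_for_status()
    count = len(resp.json().get("elements", []))
    print(f"{label}: {count} building footprints mapped")
```

A large gap between comparable areas doesn’t prove the data is wrong, but it is a warning sign that conclusions drawn in one market may not transfer to another.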
Weigh the pros and cons of different data sources
Ultimately, businesses hoping to build strategies on open-source data should consider the long-term financial picture. While open-source data is free, it can cost a great deal of capital in the long run. For some companies, especially those with robust engineering teams, these costs might be acceptable. For others, however, it might be wiser to dedicate spend to an initial data purchase to avoid ROI headaches down the line.
Geoff Michener is the CEO and co-founder of dataplor.