Good Data vs Bad: How to Decide What to Keep and What to Discard

May 29, 2019 / May 28, 2019 by Josh Cohen

1. Define what “good” data is

The first step is nailing down a definition of, or criteria for, what “good” data actually means to your business. Simply put, “good” data is information that enables you to make smart decisions about your business and to generate actionable insights that you can immediately put into practice. However, for data to be actionable, it must be accurate.

For example, a retail brand might use location data for attribution purposes in order to see which forms of media are driving real-world visits to physical stores. They might then seek to optimize their media spend based on those visits. However, if the data on those visits isn’t accurate, they run the risk of spending money on the wrong types of media or dayparts, or of targeting the wrong segments. To say the least, that is not what anyone would call “useful” data.

But how do you know what data is accurate? What makes it accurate, and how do you measure it? The key is having a reliable underlying data source with clear and objective criteria for how to measure it.

When I think about the type of location data I can trust on the job, it’s all about the quality of the signals we’re able to collect. Beyond basic latitude and longitude (lat/lng) pings, I want to make sure we’re able to gather them within a consistent area and time frame to provide a level of confidence that a user actually “did stop” within a particular venue, as opposed to a ping that results from just passing by. While this is just one example, by clearly defining the precision of the data set as user pings within a consistent area and time frame, I’ve enabled a system that ignores pings that are wrong and thus detrimental to my business goals.
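One way to make the “consistent area and time frame” criterion concrete is a simple dwell check. The sketch below is illustrative only: the `Ping` type, the `is_stop` function, and the 50-meter and 5-minute thresholds are assumptions made for this example, not a description of any production system.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


@dataclass
class Ping:  # hypothetical record type for this sketch
    lat: float        # degrees
    lng: float        # degrees
    timestamp: float  # seconds since epoch


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two lat/lng points."""
    dlat = radians(lat2 - lat1)
    dlng = radians(lng2 - lng1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlng / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def is_stop(pings, venue_lat, venue_lng, radius_m=50.0, min_dwell_s=300.0):
    """True if an unbroken run of pings stays within radius_m of the venue
    for at least min_dwell_s seconds. Thresholds are assumed values."""
    run_start = None
    for p in sorted(pings, key=lambda p: p.timestamp):
        if haversine_m(p.lat, p.lng, venue_lat, venue_lng) <= radius_m:
            if run_start is None:
                run_start = p.timestamp  # first in-range ping of a new run
            if p.timestamp - run_start >= min_dwell_s:
                return True
        else:
            run_start = None  # user left the area, so the run is broken
    return False
```

A single ping near the venue never qualifies under this rule, which is the point: one passing signal is treated as a pass-by, not a visit.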

2. Account for potential inaccuracies

There are formidable challenges in all areas of data collection that you need to consider and address before deciding which data sets are, in fact, useful and worth keeping. One of the biggest data accuracy issues I’ve encountered working in location technology, for example, is the technical challenge presented by GPS signals. GPS by nature is “noisy” and is only accurate to within a wide radius around each signal. This means that in dense or urban areas (New York City, shopping malls), it can be difficult to know if a ping is truly coming from where GPS says it is. The signal might bounce off other buildings, and there’s also the “verticality” issue of knowing what floor of a building a given signal is coming from, even if the lat/lng are accurate.
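As a rough illustration of accounting for GPS noise, the sketch below drops pings whose reported accuracy radius is too wide to be useful. It assumes each ping carries an `accuracy_m` field (an estimated error radius in meters, which mobile location APIs commonly report); the field name and the 30-meter cutoff are hypothetical choices for this example.

```python
def filter_by_accuracy(pings, max_accuracy_m=30.0):
    """Keep only pings whose reported accuracy radius is known and
    no wider than max_accuracy_m (an assumed cutoff)."""
    return [
        p for p in pings
        if p.get("accuracy_m") is not None and p["accuracy_m"] <= max_accuracy_m
    ]


# Hypothetical sample data: one tight fix, one wide fix, one with no accuracy.
sample = [
    {"lat": 40.7580, "lng": -73.9855, "accuracy_m": 10.0},
    {"lat": 40.7581, "lng": -73.9850, "accuracy_m": 150.0},
    {"lat": 40.7579, "lng": -73.9853, "accuracy_m": None},
]
kept = filter_by_accuracy(sample)
```

Note that a filter like this does nothing for the “verticality” problem: a ping can have a tight horizontal radius and still come from the wrong floor.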

What hurts data accuracy for many people — and therefore leads to poor decisions — is simply a lack of awareness of the technical challenges faced by data collection methods and technologies, no matter where they’re coming from. That’s why it’s important to carefully dissect data collection methodologies of any source or vendor with which you’re working. Consider how the data was sourced, the reputation of the company with which you’re dealing, and the consent mechanisms in place. No data sources are 100% infallible; it’s critical to be aware of any gaps or inaccuracies well in advance of making any key business decisions.

3. Decide what to keep

So, you’ve clearly defined good data, selected your sources, and accounted for any potential inaccuracies that you might encounter. In the end, you’ll need to make the hard and fast decision of whether or not to actually keep, use, and analyze certain data sets to create actionable insights that will drive bottom-line results.

In my space of location technology, my team and I spent a lot of time trying to understand the best ways to tell good data from bad by looking at the real-world, on-the-ground datasets generated in the first place. What I’ve learned is that the longer the period over which these datasets are collected, the more accurate and reliable they are. Therefore, if data isn’t based on reliable, real-world collection methods, or covers only a short period of time, odds are you should strongly consider throwing it out. (In Foursquare’s case, we discard 80% of third-party data we receive that doesn’t fit our criteria for quality.)

You also want to cross-reference data with other sources. If you have a dataset that isn’t validated by more than one source, either find a way to confirm what you’re working with is accurate, or hold off on using it until you can. Make sure to look at each data source, the specific publishers behind it, whether or not the data comes from an aggregator, and how frequently it is updated.
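The cross-referencing rule can be sketched as a corroboration count. The example below assumes records arrive as `(key, source)` pairs, where `key` identifies the thing being claimed (for instance, a visit) and `source` names who reported it; both the record shape and the `corroborated_keys` function are hypothetical.

```python
from collections import defaultdict


def corroborated_keys(records, min_sources=2):
    """Return the set of keys reported by at least min_sources distinct sources.

    records: iterable of (key, source) pairs (an assumed shape for this sketch).
    """
    sources_by_key = defaultdict(set)
    for key, source in records:
        sources_by_key[key].add(source)  # a set ignores repeats from one source
    return {k for k, srcs in sources_by_key.items() if len(srcs) >= min_sources}


# Hypothetical sample: visit-1 is confirmed by two sources,
# visit-2 is reported twice but by the same source.
records = [
    ("visit-1", "source-A"),
    ("visit-1", "source-B"),
    ("visit-2", "source-A"),
    ("visit-2", "source-A"),
]
confirmed = corroborated_keys(records)
```

Counting distinct sources rather than raw reports matters here: the same aggregator repeating itself is not independent validation.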

It’s important to validate data sources on a regular cadence to determine if you can still trust the source and/or publisher, and if it fails your standards, have a system that automatically filters those publishers out of your panel. While you may not need to validate each source on a weekly basis, it’s critical to work with all of your data sources and/or vendors to come up with a suitable validation period as well as an efficient way of weeding out bad sources, preferably in an automated fashion.
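A minimal sketch of that kind of automated review might look like the following. Everything in it is an assumption for illustration: the `Publisher` type, the `pass_rate` quality measure, the 90% threshold, and the 30-day validation window would all need to be agreed with your own sources and vendors.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Publisher:  # hypothetical record type for this sketch
    name: str
    pass_rate: float      # share of sampled records passing quality checks, 0.0 to 1.0
    last_validated: date


def review_panel(publishers, today, min_pass_rate=0.9, max_age=timedelta(days=30)):
    """Split publishers into (trusted, dropped, due_for_revalidation) name lists."""
    trusted, dropped, due = [], [], []
    for pub in publishers:
        if pub.pass_rate < min_pass_rate:
            dropped.append(pub.name)   # fails the quality bar: filter out
        elif today - pub.last_validated > max_age:
            due.append(pub.name)       # passed before, but the check is stale
        else:
            trusted.append(pub.name)
    return trusted, dropped, due


# Hypothetical panel reviewed on a fixed date.
panel = [
    Publisher("pub-a", 0.97, date(2019, 5, 20)),
    Publisher("pub-b", 0.60, date(2019, 5, 25)),
    Publisher("pub-c", 0.95, date(2019, 3, 1)),
]
trusted, dropped, due = review_panel(panel, today=date(2019, 5, 29))
```

Keeping “due for revalidation” separate from “dropped” means a stale check triggers a fresh look rather than an automatic removal.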

We all want to make better data-driven decisions, no matter the industry or type of product. But it’s essential to fine-tune and correct your data usage strategy from the get-go by defining what “good” data means for your business and use cases, and by accounting for potential inaccuracies from various sources or vendors who provide you with data. Only then can you bring actual datasets into focus, systematically throw out what’s useless, and use the best available data to meet your business objectives.

As senior vice president of product, Josh leads product development across Foursquare, including our suite of enterprise products—Pinpoint, Attribution, and our developer tools—as well as our consumer apps. Previously, Josh was at Google, where he was the group product manager for Publisher Advertising Platforms and Business Product Manager for Google News, responsible for global product strategy, marketing, and publisher outreach. He was also vice president of business development for the consumer media team at Reuters Media and director of business development for SmartMoney.com, a joint venture between Dow Jones and Hearst. Josh holds degrees from the University of Michigan and Columbia Business School, where he graduated Beta Gamma Sigma.