Imperfection is something we have to deal with every day. We as people are far from perfect in all the things we do. No matter how hard we try, it is always an interesting battle. But let us not go into the psychoanalysis of perfectionism. How about the data world? We always strive to get our data as perfect as possible. Data quality is top of mind in every company, which is logical if you want to trust your data in the decision-making process. However, not all data can be perfect. The world of big data is often not so perfect. Let us explore the world of imperfect big data.
Imperfect big data
For companies that build their competitive advantage on big data analytics, data sources are everything: they are what lets the business run better than its competitors. However, everybody knows that if you put garbage into your high-powered analytic system, not much value can come out. But this garbage can also have great potential that has not yet been discovered. Therefore, great big data strategists consider their sources very carefully and learn how to live with the uncertainty of imperfect big data.
Usability of Imperfect Big Data
With the aid of recent big data analytic technology, many companies are, for example, combing social media feeds to conduct sentiment analysis. There is no possible way to get a high level of data quality from a Twitter, Facebook or LinkedIn feed. But it’s still very useful, provided you’re clear on its confidence rating, a very important concept in profiling data sources. A confidence rating is an overall assessment (typically in percentage terms) of your data source’s quality (also called the “what”). Without the concept of a confidence rating, you’re likely to over-cleanse your imperfect big data. Instead of ignoring imperfect data, map it through with a confidence rating that provides a disclaimer for what you’re analyzing.
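To make the idea concrete, here is a minimal Python sketch of mapping data through with a confidence rating instead of cleansing it away. The source names, the rating values, and the helper names (`SOURCE_CONFIDENCE`, `tag_with_confidence`, `weighted_sentiment`) are assumptions made up for illustration, and weighting sentiment by confidence is just one possible way to use the rating.

```python
from dataclasses import dataclass

# Hypothetical confidence ratings per source, as fractions between 0 and 1.
# Real ratings would come from profiling each data source.
SOURCE_CONFIDENCE = {"twitter": 0.60, "facebook": 0.70, "crm": 0.95}


@dataclass
class ScoredRecord:
    source: str
    sentiment: float   # assumed scale: -1.0 (negative) to 1.0 (positive)
    confidence: float  # the rating travels with the record as a disclaimer


def tag_with_confidence(source, sentiment, default=0.5):
    """Attach the source's confidence rating instead of dropping the record."""
    rating = SOURCE_CONFIDENCE.get(source, default)
    return ScoredRecord(source, sentiment, rating)


def weighted_sentiment(records):
    """Average sentiment, weighting each record by its confidence rating."""
    total_weight = sum(r.confidence for r in records)
    if total_weight == 0:
        return 0.0
    return sum(r.sentiment * r.confidence for r in records) / total_weight


records = [
    tag_with_confidence("twitter", 1.0),
    tag_with_confidence("crm", -1.0),
]
print(round(weighted_sentiment(records), 3))
```

In this sketch nothing is thrown away: a low-confidence tweet still contributes to the result, but less than a record from a well-profiled source.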
In the days of clear and structured data, the question of what to do with imperfect data was much easier. The structure and clarity allowed us to identify natural keys that signaled whether to insert, update, or delete. Now, with unstructured data, you must solve the same problem, but in a whole different way.
You can do this by identifying rules that classify your source data in a way that allows you to identify natural keys. These keys must be published so that subscribers, like the analytic system, can take action. It is important that you have clear communication between your change data capture system and your analytic system.
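As a rough illustration of this approach, the Python sketch below applies one classification rule to unstructured posts to derive a natural key, decides whether the change is an insert, update, or delete, and publishes the result to subscribers. The rule itself (author plus first hashtag), the post fields, and the function names are all hypothetical; a real change data capture system would have its own rules and transport.

```python
import re


def natural_key(post):
    """Hypothetical rule: lowercased author plus the first hashtag in the text."""
    match = re.search(r"#(\w+)", post["text"])
    if match is None:
        return None  # the rule cannot classify this post
    return (post["author"].lower(), match.group(1).lower())


def detect_change(store, post):
    """Compare a post with what is already stored and return a change event."""
    key = natural_key(post)
    if key is None:
        return None
    if post.get("deleted"):
        if key not in store:
            return None
        del store[key]
        return {"key": key, "action": "delete"}
    if key not in store:
        store[key] = post["text"]
        return {"key": key, "action": "insert"}
    if store[key] != post["text"]:
        store[key] = post["text"]
        return {"key": key, "action": "update"}
    return None  # nothing changed


def publish(event, subscribers):
    """Hand the published key and action to every subscriber."""
    for subscriber in subscribers:
        subscriber(event)


store, received = {}, []
subscribers = [received.append]  # stand-in for the analytic system
posts = [
    {"author": "Ann", "text": "Love the new #Phone"},
    {"author": "ann", "text": "The #phone battery is bad"},
    {"author": "ann", "text": "The #phone battery is bad", "deleted": True},
]
for post in posts:
    event = detect_change(store, post)
    if event is not None:
        publish(event, subscribers)
print([e["action"] for e in received])
```

The design point is that the subscriber only ever sees the published key and action, so the analytic system does not need to know how the key was derived from the raw text.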