Big Data – Data Sources Make a Difference

October 24, 2013 by Jennifer Cobb

The ability to work with data that is big, fast and varied all at once is relatively new. It wasn’t so long ago that amassing the variety, velocity and volume that characterize big data sets was prohibitively expensive. With cheap storage, ubiquitous cloud services and new tools, what was once costly is now relatively cheap. And when the cost drops, interesting things start to happen.

At least that’s the story we are hearing. Leading analyst firm Gartner, whose primary customers are very large companies, recently produced a report on big data asking: how real is it? The statistics are revealing. While 64% of companies plan to deploy big data solutions, only 8% have actually done so. This suggests that while interest is huge, big data remains largely untested in practice. The biggest struggle companies reported was how to extract value from big data. Many have started to amass the virtual haystack, but how do they find the needle? A related question: do they even have the right haystack?

Data Sources Remain Uniform

The vast majority of companies are amassing volume from the data that is easiest to gather. This data resides largely in transactions and logs, as the following chart shows:

Gartner data types

While much of the promise of big data is about “unconventional” data sources combined with new technologies, the practice so far seems to be aggregating large amounts of data that are already flowing through production systems, such as transaction and log data. Other machine-readable data, such as social media, sensor data and emails, are gaining ground. These sources are a bit more unconventional from an enterprise analytics point of view, but still relatively easy to capture by pulling from the right repositories.

What is largely missing from the big data picture is free-form text and data from static documents. In some industries, such as government, healthcare and utilities, information captured on forms, in PDFs and even in handwritten documents represents a very large and valuable repository of insight.

For example, many utilities have decades of forms that capture block-by-block, house-by-house information about power, water and gas resources. Similarly, many healthcare organizations have reams of lab reports that could deliver valuable insights for disease research, if the data were properly de-identified and managed according to HIPAA standards. As the Gartner report notes, “Healthcare's primary objective of improving process efficiency is addressed by leveraging machine data with one of the most opaque data sources — handwritten notes.”

The following chart offers an intriguing, industry-by-industry view of where much of this valuable data exists, data that could add significantly to big data projects.

Gartner Chart with Highlights


At Captricity, we excel at extracting data from static documents of all types, as indicated by the highlighted rows above.  The data we generate from your sources is highly accurate and retains clear data provenance, something that can get lost in many big data sources.  We are always happy to hear from you and discuss how we can help with your data-driven projects.

