Managing Unstructured Data with Big Data Notions

Data is traditionally structured. Structure makes the data easy to read and simple to consume. Are those statements actually true? I have a hard time accepting that perception. What if I told you we collectively perceive data as traditionally structured because we choose to ignore unstructured data? All that remains is the structured data, and because that becomes the primary focus, it feels natural to claim that data is traditionally structured and that it is easier to read and simpler to consume.

Unstructured data is defined as information that does not have a pre-defined model or is not organized in a pre-defined manner. This is not a new phenomenon. However, the sheer quantity of unstructured data is on the rise across all industry verticals, and there is growing interest in ingesting that data. One of the largest examples is in the healthcare space. Take a minute to think about all the pre-printed medical forms, questionnaires, insurance forms, procedure and diagnosis forms, and claim forms you are exposed to as a patient or guarantor. That is only what you see. Those same forms can exist in a variety of formats and layouts depending on the insurance company, the hospital or medical facility, or the state or locality where the services are performed. The difference may be a single field or a completely different-looking form altogether. Whether typed into an electronic form or scanned into a system as an image, the data is likely unstructured, making it nearly impossible to develop software that reads, recognizes, and accurately ingests all of the data all of the time.

Instead of setting unstructured data aside, we need to use technology to reduce the constraints and structure the data. This will allow for the consumption of more data, which is really what enterprises are after. By understanding the data, you are creating a modern architecture that is, ideally, easy to use, repeatable, and secure. What could you accomplish with more data? What could society accomplish with more data, specifically in the healthcare space? Could we prevent outbreaks? Could we cure disease?

All enterprises deal with unstructured data. Some have more than others, but unstructured data exists everywhere. There are many ways of thinking about unstructured data, getting it into a structured format, and ultimately loading it into a system or data warehouse. It can be done manually, with users reading and inputting the data. It can be done electronically, with extract, transform, load (ETL) processes handling most of the data. Or it can be a combination of electronic and manual effort. Making the process as electronic and systematic as possible will be very helpful, but remember that it is nearly impossible to consume 100% of the data this way. Some manual effort will still be required when, for example, a form field is blank or its value has been entered in the next box.
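
The combined electronic-and-manual approach described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a production pipeline: the field names, the `REQUIRED_FIELDS` tuple, and the `route_records` function are all hypothetical. Records with every required field present are cleaned and passed on for loading, while records with a blank or missing field are set aside for a person to review.

```python
from typing import Dict, List, Tuple

# Hypothetical required fields for a claim form; real forms will differ.
REQUIRED_FIELDS = ("patient_name", "claim_id", "diagnosis_code")

Record = Dict[str, str]


def route_records(records: List[Record]) -> Tuple[List[Record], List[Record]]:
    """Split extracted records into those ready to load and those needing manual review."""
    ready, manual = [], []
    for record in records:
        # A field counts as missing if it is absent or contains only whitespace.
        missing = [f for f in REQUIRED_FIELDS if not record.get(f, "").strip()]
        if missing:
            # Keep the original record and note which fields a person must fill in.
            manual.append({**record, "missing_fields": ", ".join(missing)})
        else:
            # Transform step: trim stray whitespace before loading.
            ready.append({f: record[f].strip() for f in REQUIRED_FIELDS})
    return ready, manual


extracted = [
    {"patient_name": " Jane Doe ", "claim_id": "C-100", "diagnosis_code": "J10"},
    {"patient_name": "John Roe", "claim_id": "", "diagnosis_code": "E11"},
]
ready, manual = route_records(extracted)
print(len(ready), "ready to load;", len(manual), "sent to manual review")
```

The point of the design is that the manual queue is an explicit output of the process rather than an afterthought, so the share of data that still needs human effort can be measured over time.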

You need to ask the right questions when deciding to ingest and make sense of unstructured data. It is important to know, or at least question, what you are possibly missing or what you can possibly gain from this additional data. Do not go blindly into the effort of managing unstructured data. Also, do not get caught up in the immediate issues or only a few use cases. Architect the process for a long-term and comprehensive build-out. You may know the challenges you are facing today, but consider what challenges you may face down the road. Regulation, security, and technology are examples of ever-changing items in our industry that may present future challenges. Finally, data governance must be addressed. This is particularly true in the healthcare industry. The days of only a few fields being deemed personally identifiable information (PII) are fading away. If you are dealing with healthcare data, you are better off considering everything as PII. Secure access to the data on a need-to-know basis, or at least with the vision that not everyone requires access to everything.
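
As a rough illustration of need-to-know access, the sketch below treats every field as sensitive and releases only the fields explicitly listed for a role. The roles, field names, and the `view_for_role` function are hypothetical and stand in for whatever access-control mechanism your platform actually provides.

```python
from typing import Dict, Set

# Hypothetical roles and the fields each role needs; anything not listed is withheld.
ROLE_FIELDS: Dict[str, Set[str]] = {
    "billing": {"claim_id", "amount_due"},
    "clinician": {"claim_id", "diagnosis_code", "patient_name"},
}


def view_for_role(record: Dict[str, str], role: str) -> Dict[str, str]:
    """Return only the fields the role is explicitly allowed to see."""
    allowed = ROLE_FIELDS.get(role, set())  # unknown roles see nothing
    return {field: value for field, value in record.items() if field in allowed}


record = {
    "claim_id": "C-100",
    "patient_name": "Jane Doe",
    "diagnosis_code": "J10",
    "amount_due": "125.00",
}
print(view_for_role(record, "billing"))
```

Because the default is to withhold, a new field added to the data stays hidden until someone decides which roles need it.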

The concept of ingesting unstructured data is relatively new in the debt collection industry. Much of what I have presented crosses the line between what is considered data integration and what is considered Big Data. Managing unstructured data requires integration processes, but understanding why unstructured data is important is a Big Data thought. Regardless of your approach, plan to pay for the sins of the past. Having failed to capture all the data, or having randomly dropped data into fields that “look” good, may end up skewing your results as you ingest unstructured data.

* This article is also published by Collection Advisor


Announcing! The Big Data & Integration Summit NYC 2013

Actian Corporation and Emprise Technologies are co-hosting The Big Data & Integration Summit on September 26, 2013 in NYC and invite CIOs and IT Directors to attend and join in the conversations (#BDISNYC13). This event is free and features a fast-paced agenda.

Register Now for the Big Data & Integration Summit NYC 2013

Additionally, attendees will join our panel of experts for a round-table discussion on the Big Data & Integration challenges facing CIOs now. Talk with Actian Chief Technologist Jim Falgout about Hadoop, Big Data Analytics, and more.

As CEO of Emprise Technologies, I’ve seen just about every cause there is for integration project failure. Often, more than one issue is slowing down the project. Sometimes a confluence of events develops into a periodic “perfect storm” that derails integration projects and causes failure. I’m teaming up with Actian’s Chief Technologist, Jim Falgout, to share the secrets we’ve learned for ensuring data integration and big data project success.

Don’t miss out on the opportunity to be part of the Big Data & Integration Summit NYC 2013. Register Now! Do you have any topics to suggest for the Summit? Provide us with your comments below. This is YOUR Summit!

Summit Agenda

Register Now for the Big Data & Integration Summit NYC 2013

“The Cost of Poor Data Management”

It is surprising that data quality is still viewed as a luxury rather than a necessity. As an unapologetic data quality advocate, I’ve written white papers and blog posts about the value of good data management. It takes the efforts of many to change habits. In her blog post, The Costs of Poor Data Management, on the Data Integration Blog, Julie Hunt breaks down the impact data quality has on business.

Here’s an infographic on the cost that poor data quality can impose on a business.

Global research - Bad customer data costs you millions

She points out that the areas of data quality deserving the greatest focus are specific to each organization. If you read my post, “Avoiding Data Quality Pitfalls,” you know that I’m a proponent of good data governance. Update early and often. My top four suggestions are:

  • Translation Tables
  • Stored Procedures
  • Database Views
  • Validation Lookups, Tables, and Rules

What are yours? Read Julie’s post, and send me your comments.
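
To make two of the suggestions above concrete, here is a minimal Python sketch of a translation table combined with a validation lookup. In practice these would more likely live in the database as tables, views, or stored procedures. The state values, the `STATE_TRANSLATION` and `VALID_STATES` names, and the `clean_state` function are all hypothetical examples.

```python
from typing import Dict, Optional

# Hypothetical translation table: maps variants seen in source data to one standard value.
STATE_TRANSLATION: Dict[str, str] = {
    "ny": "NY",
    "n.y.": "NY",
    "new york": "NY",
    "nj": "NJ",
    "new jersey": "NJ",
}

# Hypothetical validation lookup: the only values allowed into the target system.
VALID_STATES = {"NY", "NJ"}


def clean_state(raw: str) -> Optional[str]:
    """Translate a raw value to its standard form, or return None if it fails validation."""
    key = raw.strip().lower()
    # Fall back to the upper-cased input when no translation is defined.
    translated = STATE_TRANSLATION.get(key, raw.strip().upper())
    return translated if translated in VALID_STATES else None


for value in ["New York", " n.y. ", "nj", "Texas"]:
    print(repr(value), "->", clean_state(value))
```

Values that fail validation come back as `None` instead of being loaded as-is, which keeps questionable data out of the target fields until someone reviews it.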