Managing Unstructured Data with Big Data Notions

Data is traditionally structured. Structure makes the data easy to read and simple to consume. Are those statements actually true? I have a hard time accepting that perception. What if I told you we collectively perceive data as traditionally structured because we choose to ignore unstructured data? All that remains is the structured data, and because that becomes the primary focus, it feels natural to claim data is traditionally structured, easy to read, and simple to consume.

Unstructured data is defined as information that does not have a pre-defined model or is not organized in any sensible manner. This is not a new phenomenon. However, the sheer quantity of unstructured data is on the rise across all industry verticals, and there is growing interest in the ingestion of that data. One of the largest examples of this is in the healthcare space. Take a minute to think about all the pre-printed medical forms, questionnaires, insurance forms, procedure and diagnosis forms, and claim forms you are exposed to as a patient or guarantor. That is what you see. Those same forms can exist in a variety of formats and layouts based on the insurance company, the hospital or medical facility, or the state or locality where the services are performed. It may be a single field variance or a completely different-looking form altogether. Whether typed into an electronic form or scanned into a system as an image, the data is likely unstructured, making it nearly impossible to develop software that reads, recognizes, and accurately ingests all of the data all of the time.

Instead of setting unstructured data aside, we need to use technology to reduce the constraints and structure the data. This will allow for the consumption of more data, which is really what enterprises are after. By understanding the data, you are creating a modern architecture that is ideally easy to use, repeatable, and secure. What could you accomplish with more data? What could society accomplish with more data, specifically in the healthcare space? Could we prevent outbreaks? Could we cure disease?

All enterprises deal with unstructured data. Some have more than others, but unstructured data does exist everywhere. There are many methods of thinking about unstructured data, getting it into a structured format, and ultimately loading it into the system or data warehouse. It can be done manually with users reading and inputting the data, it can be done electronically with extract-transform-load (ETL) processes that handle most of the data, or it can be a combination of electronic and manual effort. Making the process as electronic and systematic as possible will be very helpful, but remember it is nearly impossible to consume 100% of the data this way. Some manual effort will still be required when a form field is blank or has been moved to the next box, for example.
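The combined electronic-plus-manual approach can be sketched in a few lines. This is a minimal illustration only: the field names and patterns are hypothetical, and a real pipeline would maintain one pattern set per known form variant and would sit behind an OCR step. The key idea is that anything the automated pass cannot extract (a blank field, a value shifted into the next box) is routed to manual review rather than guessed at.

```python
import re

# Hypothetical field patterns for one form layout; real deployments
# would maintain one pattern set per known form variant.
FIELD_PATTERNS = {
    "patient_name": re.compile(r"Patient Name:\s*(.+)"),
    "dob":          re.compile(r"Date of Birth:\s*([\d/]+)"),
    "claim_id":     re.compile(r"Claim (?:ID|Number):\s*(\S+)"),
}

def extract_fields(ocr_text):
    """Pull known fields from OCR text; route misses to manual review."""
    record, needs_review = {}, []
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(ocr_text)
        if match and match.group(1).strip():
            record[field] = match.group(1).strip()
        else:
            needs_review.append(field)  # blank or shifted field
    return record, needs_review

sample = "Patient Name: Jane Doe\nDate of Birth: 01/02/1980\nClaim Number:"
record, review = extract_fields(sample)
```

Here the blank claim number lands in the review queue instead of being silently dropped, which is exactly the manual effort the automated pass cannot eliminate.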

You need to ask the right questions while making the decision to ingest and make sense of unstructured data. It is important to know, or at least question, what you are possibly missing or what you can possibly gain from this additional data. Do not go blindly into the effort of managing unstructured data. Also, do not get caught up in the immediate issues or only a few use cases. Architect the process for a long-term and comprehensive build-out. You may know the challenges you are facing today, but consider what challenges you may face down the road. Regulation, security, and technology are examples of ever-changing items in our industry that may present future challenges. Finally, data governance must be addressed. This is particularly true in the healthcare industry. The days of only a few fields being deemed personally identifiable information (PII) are fading away. If you are dealing with healthcare data, you are better off considering everything as PII. Secure access to the data on a need-to-know basis, or at least with the vision that not everyone requires access to everything.
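Need-to-know access can be expressed very simply. The sketch below assumes a toy role map with made-up role and field names; in practice this enforcement belongs in the database or access layer, not in application code, but the principle is the same: every field is masked unless the role is explicitly cleared to see it.

```python
# A minimal sketch of need-to-know access, assuming a hypothetical role map;
# real systems would enforce this in the database or access layer.
ROLE_VISIBLE_FIELDS = {
    "billing":   {"claim_id", "amount_due"},
    "clinician": {"patient_name", "dob", "diagnosis"},
    "auditor":   {"claim_id", "patient_name", "dob", "amount_due", "diagnosis"},
}

def redact_for_role(record, role):
    """Return only the fields the role is cleared to see; mask the rest."""
    visible = ROLE_VISIBLE_FIELDS.get(role, set())
    return {k: (v if k in visible else "***") for k, v in record.items()}

record = {"patient_name": "Jane Doe", "dob": "01/02/1980",
          "claim_id": "C-1001", "amount_due": "250.00"}
redacted = redact_for_role(record, "billing")
```

Note the default for an unknown role is an empty set, so the safe failure mode is seeing nothing, which matches the "everything is PII" posture.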

The concept of ingesting unstructured data is relatively new in the debt collection industry. Much of what I have presented crosses the line between what is considered data integration and what is considered Big Data. Managing unstructured data requires integration processes but understanding why unstructured data is important is a Big Data thought. Regardless of your approach, plan to pay for the sins of the past. Not capturing all the data and randomly dropping data into fields that “look” good may end up skewing your results as you ingest unstructured data.

* This article is also published by Collection Advisor


Data Integration Dos and Don’ts

David Linthicum, CTO of Cloud Technology Partners, recently discussed the Data Integration Dos and Don'ts on the Actian blog.

In this article, David notes that many enterprises have deployed some sort of data integration technology within the last 20 years. While many enterprise insiders believe they have the problem solved, most don't. His advice? There needs to be a continued focus on what the technology does and what value it brings to the organization.

Data integration is not something you just drop in and hope for the best. Its use requires careful planning. IT typically does the planning, selects the technology, and handles ongoing operations.

However, the need for data integration typically comes from outside of IT. Those who understand that data should be shared between systems, as needed and when needed, in support of core business processes, are typically the ones crying for more and better data integration technology. IT responds to those requests reactively.

David continues to explain that now things are changing more quickly than they have in the past, including new impacts on IT as well as end users. Specifically, these changes include:

  • The use of public cloud resources as a place to host and operate applications and data stores. This increases the integration challenges for enterprise IT, and requires a new way of thinking about data integration and data integration technology.
  • The rise of big data systems, both in the cloud and on-premise, where the amount of data stored could go beyond a petabyte. These systems have very specialized data integration requirements, not to mention the ability for the data integration solution to scale.
  • The rise of complex and mixed data models. This includes NoSQL-type databases that typically serve a single purpose. Moreover, databases are emerging that focus on high performance, and thus need a data integration solution that can keep up.

To support these newer systems, those who leverage data integration approaches and technology have more decisions to make. Indeed, these can be boiled down to some simple dos and don’ts.

Do create a data integration plan and architecture. Whether or not you have existing data integration solutions in place, you need to consider your data integration requirements, which typically include lists of source and target data stores, performance, security, governance, data cleansing, etc. This needs to be defined in enough detail that both IT and non-IT staff can understand and follow the plan. It should also include a logical and physical data integration architecture, as well as a detailed roadmap, so the amount of ambiguity is reduced.
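One way to keep such a plan unambiguous is to make it machine-readable. The sketch below uses invented source, target, and requirement names purely for illustration; the fields mirror the requirement categories listed above (sources, targets, security, governance, cleansing, performance).

```python
# A sketch of a machine-readable integration plan; all names are
# illustrative, not a real system inventory.
INTEGRATION_PLAN = {
    "sources": [
        {"name": "claims_db",  "type": "postgres", "contains_pii": True},
        {"name": "form_scans", "type": "s3",       "contains_pii": True},
    ],
    "targets": [
        {"name": "warehouse", "type": "redshift"},
    ],
    "requirements": {
        "security":    "encrypt in transit and at rest; need-to-know access",
        "governance":  "treat all healthcare fields as PII",
        "cleansing":   "validate dates and IDs before load",
        "performance": "nightly batch within a 4-hour window",
    },
}

def pii_sources(plan):
    """List sources that need the stricter PII handling path."""
    return [s["name"] for s in plan["sources"] if s.get("contains_pii")]
```

A plan in this form can be reviewed by non-IT stakeholders and consumed by tooling, which is one way to reduce the ambiguity the roadmap is meant to eliminate.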

Do allocate enough budget. In many cases, there are just not enough resources focused on the data integration problem. If you develop a plan, the tasks and technology in that plan need to be funded. Lack of funding typically means data integration efforts die the death of a thousand cuts, and the data integration solutions don't solve the problems they should solve. That costs far more than any money you think you're saving.

Don’t take the technology for granted. Many enterprises believe that most data integration solutions are the same, and don’t spend the time they should to evaluate and test data integration technology. Available data integration technology varies a great deal in terms of function and the problem patterns it can address. You need to become an expert of sorts in what’s available, what it does, and how it will work and play within your infrastructure to solve your business problems.

Don’t neglect security, governance, and performance. Many who implement data integration solutions overlook security, governance, and even performance. They do this for a few reasons. Typically, they lack an understanding of how these concepts relate to data integration, and/or they lack an adequate budget (see above). The reality is that these concepts must be baked into the data integration solution from logical architecture to physical deployment. If you miss these items, you’ll have to retrofit them down the line. This is almost impossible, certainly costly, and let’s not forget the cost of the risk you’ll incur.

Linthicum believes that while some of this seems obvious, most of what’s stated here is not followed by enterprise managers when they define, design, and deploy data integration solutions and technology. The end result is a system that misses some of the core reasons for deploying data integration in the first place, and does not deliver the huge value that this technology can bring.

The good news for most enterprises is that data integration technology continues to improve and has adapted to emerging infrastructure changes, including the use of cloud, big data, etc. However, a certain amount of discipline and planning must still occur.

The Big Data & Integration Summit was a Success

The Big Data & Integration Summit was a success, and our presentations are now available to the public for viewing.

Emerging IT Trends – The Age of Data

Big Data is Big Business

These are truly exciting times! The volume and velocity of data available to every business is astounding and continues to grow. IT industry leaders are talking about where technology is going, what the future holds and the impact all of this will have on the world.

Robin Bloor took a minute to review the path to the present, in his guest post “The Age of Data”, on the Actian blog this week, before revealing the vision he and IT Industry thought leader, Mike Hoskins have of the future for data.

“Mike Hoskins, CTO of Actian (formerly Pervasive Software) suggested to me in a recent conversation that we have entered the Age of Data. Is this the case?” Bloor begins his post with a review of history. “The dawn of the IT industry could certainly be described as the Age of Iron. Even in the mainframe days there were many hardware companies.” I agree. In the past, the focus was on the machines and what they could do for humans.

Bloor continues, “Despite the fact that computers are only useful if you have applications, the money was made primarily from selling hardware, and the large and growing companies in that Age of Iron made money from the machines.” You can guess the moniker Bloor gives the next phase of IT history: “The Age of Software”. The volume of databases and applications available for organizations to buy exploded. And that got messy. Lots and lots of file types, formats, languages, and programs led to multiple versions of records and interoperability nightmares.

What’s next? Bloor suggests it’s the Age of Data. It’s about the data and the analytics it can provide us. This is the Cambrian explosion that will be one of the primary topics discussed at the Big Data & Integration Summit NYC 2013. Actian Chief Technologist Jim Falgout and I will present our views on emerging trends and lead a roundtable discussion with other industry leaders about the impact all of this will have on business. I invite you to join what promises to be a lively conversation and attend the Summit.

Based on feedback from industry leaders and customers, the Emprise Technologies and Actian teams have created a handful of sessions designed to deliver best practices that IT professionals can take home and use immediately to improve IT project success. These include “How to Win Business Buy-in for IT Projects”, “Avoiding the Pitfalls of Data Quality”, and “Creating Workflows That Ensure Project Success”. I hope you’ll come join us. If you can’t make it to New York, we’re planning to take the Big Data & Integration Summit on the road, so leave us your requested cities and topics in the comments below. We look forward to hearing from you.

Announcing! The Big Data & Integration Summit NYC 2013

Actian Corporation and Emprise Technologies are co-hosting The Big Data & Integration Summit on September 26, 2013 in NYC and invite CIOs and IT Directors to attend and join in the conversations. #BDISNYC13 This event is free and features a fast-paced agenda that includes these topics and more: 

Register Now for the Big Data & Integration Summit NYC 2013

Additionally, attendees will join our panel of experts for a round-table discussion on the Big Data & Integration challenges facing CIOs now. Talk with Actian Chief Technologist, Jim Falgout, about Hadoop, Big Data Analytics, and more.

As CEO of Emprise Technologies, I’ve seen just about every cause there is for integration project failure. Often, there is more than one issue slowing down the project, sometimes a confluence of events – a periodic “perfect storm” develops, which derails integration projects and causes failure. I’m teaming up with Actian’s Chief Technologist, Jim Falgout to share the secrets we’ve learned for ensuring data integration and big data project success.

Don’t miss out on the opportunity to be part of the Big Data & Integration Summit NYC 2013. Register Now! Do you have any topics to suggest for the Summit? Provide us with your comments below. This is YOUR Summit!

Summit Agenda

Register Now for the Big Data & Integration Summit NYC 2013


The Data Flow Architecture Two-Step

The Two Step Data Process

In his latest post on the Actian-hosted Data Integration blog, data management industry analyst Robin Bloor laid out his vision of data flow architecture. He wrote, “We organize software within networks of computers to run applications (i.e., provide capability) for the benefit of users and the organization as a whole. Exactly how we do this is determined by the workloads and the service levels we try to meet. Different applications have different workloads. This whole activity is complicated by the fact that, nowadays, most of these applications pass information or even commands to each other. For that reason, even though the computer hardware needed for most applications is not particularly expensive, we cannot build applications within silos in the way that we once did. It’s now about networks and grids of computers.”

Bloor said, “The natural outcome of successfully analyzing a collection of event data is the discovery of actionable knowledge.” He went on to say, “Data analysis is thus a two-step activity. The first step is knowledge discovery, which involves iterative analysis on mountains of data to discover useful knowledge. The second step is knowledge implementation, which may also involve on-going analytical activity on critical data but also involves the implementation of the knowledge.” Read more->
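Bloor's two steps can be illustrated with a toy example. This is not his implementation, just a minimal sketch with made-up event amounts: step one (knowledge discovery) derives a rule from historical data, and step two (knowledge implementation) applies that rule to new events as they arrive.

```python
# A minimal sketch of the two-step process on hypothetical event data.
def discover_threshold(historical_amounts, factor=2.0):
    """Knowledge discovery: derive a flagging threshold from past data."""
    mean = sum(historical_amounts) / len(historical_amounts)
    return mean * factor

def implement_rule(new_amounts, threshold):
    """Knowledge implementation: apply the discovered rule to new events."""
    return [a for a in new_amounts if a > threshold]

history = [100, 120, 90, 110, 80]          # iterative analysis happens here
threshold = discover_threshold(history)    # the "actionable knowledge"
flagged = implement_rule([150, 250, 90], threshold)
```

The separation matters: discovery is iterative and runs over mountains of historical data, while implementation is the ongoing, operational application of what was learned.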