
How to Prepare Data for Ingestion and Integration

By Brian Carlson
Garbage in, garbage out, as they say. But poor-quality data is no joke: today it costs the U.S. economy trillions.

There is an old saying that dates back to the early days of computing and has taken on new meaning for companies awash in unstructured customer data from multiple disparate sources: GIGO, more commonly known as “garbage in, garbage out.”

The term was first used in 1957 by U.S. Army mathematicians, who declared that “sloppily programmed” inputs inevitably lead to incorrect outputs. In modern language: the value you get out of a system is only as good as the data that goes into it. Poor-quality data is no joke; today it costs the U.S. economy up to $3.1 trillion annually.

As customers have become increasingly digital-first in their behaviors and networked devices have proliferated at the edge, data has exploded exponentially. Leveraging that data for business value and differentiation is how modern businesses will not just survive, but thrive. All of this new data resides in disparate silos, some digital and some physical, and needs to be brought together and integrated to be useful for things like improving the customer experience (CX) through personalization.

“Data is the fuel that powers many of the enterprise’s mission-critical engines, from business intelligence to predictive analytics; data science to machine learning. To be fully useful, data, like any fuel, must be abundant, readily available and clean,” said Moshe Kranc, CTO at Ness Digital Engineering.

The concept of GIGO is more relevant today than ever before. With data from multiple sources, each formatted differently or not formatted at all, simply dumping it all into a data warehouse only creates a bigger pool of garbage data that can't be leveraged for value.

Getting data prepared for ingestion and integration into a data management platform like a customer data platform (CDP) is critical to being able to show ROI and results from the implementation.

Related Article: What Is a Customer Data Platform (CDP)?

What Is Garbage Data?

Garbage data isn’t necessarily false data; it may simply be data that isn't accurate or relevant to your organization. Bad data may be duplicated, compiled incorrectly, or missing key elements. It may in fact contain false, incomplete or irrelevant information. Garbage data can seriously affect a business's performance and bottom line, so the importance of good practices and processes around data collection and data management cannot be overstated. A business that makes business intelligence (BI) decisions based on garbage data is making poor decisions, potentially wasting time, money and resources.

Regardless of the type of data warehouse or data management platform you plan to use, for data to be stored in a centralized database it needs to be ingested before it can be digested by the system.

The first step in getting data ready for ingestion is having the right skills and trained employees to be able to understand data ingestion and its related technologies and processes. These data management experts can ensure you have the right data pipeline developed to drive business value.

Related Article: What's the Road Ahead Look Like for CDPs?

What Is Data Ingestion?

Data ingestion refers to the process of transporting data from a variety of sources to a storage solution where that data can be accessed, stored, analyzed, and used. Data is typically stored in a data management solution like a CDP, data lake, or other type of centralized data warehouse. Data ingestion is a critical layer in producing the quality analytics needed to make data-driven decisions. All types of downstream reporting and analytics systems need a constant flow of high-quality, consistent and accessible data.

How much of a burden can data ingestion be on a company? Collecting and cleansing the data reportedly takes 60 percent to 80 percent of the scheduled time in any analytics project.


Data ingestion can be performed in a few ways. The most common form is batch processing, used when real-time data is not critical: data is grouped and sent to the appropriate destination system on a schedule. The other common form is real-time processing, or streaming ingestion, which involves no grouping; data is sourced, manipulated and loaded as soon as it is created. Real-time ingestion is more expensive than batch processing, but may be necessary for analytics applications that require continuously refreshed data.
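To make the distinction concrete, here is a minimal Python sketch of the two modes. The names fetch_new_records, load_to_warehouse and event_stream are hypothetical placeholders for illustration, not any particular vendor's API.

```python
import time

def batch_ingest(fetch_new_records, load_to_warehouse, interval_seconds=3600):
    """Batch: accumulate records and load them on a fixed schedule."""
    while True:
        records = fetch_new_records()      # e.g., query everything since the last run
        if records:
            load_to_warehouse(records)     # one bulk load per interval
        time.sleep(interval_seconds)

def streaming_ingest(event_stream, load_to_warehouse):
    """Streaming: transform and load each event as soon as it arrives."""
    for event in event_stream:             # e.g., a message-queue consumer loop
        load_to_warehouse([event])         # per-event load; lower latency, higher cost
```

The trade-off shows up directly in the structure: batch amortizes load cost over an interval, while streaming pays a small cost per event to keep the destination continuously fresh.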

“Today, data has gotten too large, both in size and variety, to be curated manually. You need to develop tools that automate the ingestion process wherever possible,” said Kranc in a recent CMSWire article.

One of the challenges facing data ingestion today is the explosion of data sources, whose volume, velocity and complexity have grown exponentially. With new sources constantly coming online from IoT devices and software platforms, the ingestion process becomes difficult to define and build repeatable processes around. One way to mitigate these challenges is to build out a robust analytics architecture that can manage such a high volume of complex data.

A further challenge for businesses: as data increases in volume and complexity, speed becomes an issue, especially when it comes to real-time data processing. Companies may want to look at advanced technologies like auto-scaling cloud data warehouses to help optimize the performance of data ingestion pipelines.

After you have wrangled all that disparate data with an appropriate ingestion strategy, you must integrate that data so it can be leveraged for customer and business value.

Related Article: Should You Unbundle Your CDP?

What Is Data Integration?

Data integration refers to the process of combining data from a variety of sources to achieve a single, unified view of the customer.

“For marketers, integrations are often the work assigned to the dev team, but this shouldn’t be the case,” said Maria Braune, Senior Product Strategist at Liveclicker. “Currently existing and potential new integrations offer pathways to new opportunities for marketers, and should be front and center as marketers look to become more relevant and use their data in smarter ways.”

As the modern customer has become more digitally savvy, modern corporations have reacted by improving their data collection and technology stack capabilities. Companies now need to integrate data from a variety of sources, from email tools like Mailchimp to payment tools like Stripe to analytics packages like Google Analytics. All these sources are data silos that need to be integrated.

Data integration methodology can be grouped into three general approaches. First is manual data integration, which is unfortunately exactly what it sounds like: manually and laboriously copy-pasting fields, uploading CSVs, and more. The next level is automated data integration using libraries and application programming interfaces (APIs). Finally, there is engineered data integration, which uses custom APIs and webhooks to create data flows and make integration easier, as in the sketch below.
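As a rough illustration of the automated level, the Python sketch below pulls records from a source system over a REST API using the requests library. The endpoint, field names and bearer-token scheme are assumptions for illustration, not a real vendor's documented API.

```python
import requests

def pull_contacts(api_base_url: str, api_key: str) -> list[dict]:
    """Fetch contact records from a (hypothetical) source system's REST API."""
    response = requests.get(
        f"{api_base_url}/contacts",                       # assumed endpoint
        headers={"Authorization": f"Bearer {api_key}"},   # assumed auth scheme
        timeout=30,
    )
    response.raise_for_status()        # fail loudly rather than ingest garbage
    return response.json()["contacts"] # assumed response shape
```

The point is the contrast with the manual level: once a pull like this is scheduled, the silo's data flows into the integration pipeline without anyone touching a spreadsheet.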

One of the most popular data integration techniques is called ETL, or Extract, Transform and Load. Extract refers to pulling the data from the source through connectors or APIs. Transform is the process of standardizing data values and enriching them with deeper information so they will be consistent once integrated. The load stage is when the data is loaded into the central database, where it can be used.
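Here is a minimal, self-contained Python sketch of the three stages, with a CSV export as the assumed source, "name" and "email" as assumed column names, and a SQLite table standing in for the central database; a real pipeline would swap in connectors for the actual systems.

```python
import csv
import sqlite3

def extract(path: str) -> list[dict]:
    """Extract: read raw rows from the source (here, a CSV export)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows: list[dict]) -> list[tuple]:
    """Transform: standardize values (trim whitespace, lowercase emails,
    drop rows with no email) so records are consistent once integrated."""
    return [
        (r["name"].strip(), r["email"].strip().lower())
        for r in rows
        if r.get("email")
    ]

def load(rows: list[tuple], db_path: str = "warehouse.db") -> None:
    """Load: write the cleaned rows into the central database."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS customers (name TEXT, email TEXT)")
    con.executemany("INSERT INTO customers VALUES (?, ?)", rows)
    con.commit()
    con.close()

# Usage: load(transform(extract("contacts_export.csv")))
```

Each stage is a separate function on purpose: the transform step is where garbage data gets caught, so keeping it isolated makes it easy to test and extend as new sources arrive.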

Looking Forward

Clean, or quality, data is the lifeblood of every modern company. Having data that is ready for ingestion and integration is a keystone step in ensuring your business is differentiating with up-to-date business intelligence on your customers. In fact, 95% of businesses cite the need to manage unstructured data as a core problem for their business.

GIGO, or garbage in, garbage out, a term from the very first days of computing, is more relevant today than ever. Quality data can give your business more accurate insights and business intelligence to make decisions from, optimize business processes, help you understand your customers more completely, and tailor messaging and experiences to customer needs. Bad, or garbage, data can infect all your decision-making with poor information.

About the Author

Brian Carlson

Brian Carlson is the Founder and President of RoC (Return on Content) Consulting, a digital content consulting and development firm. He has more than 20 years of experience as a digital leader and manager, specializing in digital transformation, content marketing, content management, content strategy, SEO and digital product development.
