Dr Mehmet Yildiz

A Simplified Guide To Big Data Lifecycle Management

2021-04-30

Big Data provides intelligence, enhances our capabilities, and creates new business opportunities.

Many business organisations use Big Data generated from various sources such as transaction systems, media, and streaming data from the Internet of Things.

In this post, I introduce the lifecycle management of Big Data at a high level and in simplified language, drawing on methods I have used in my data solutions. The key roles in this process are data architects, technical data specialists, data analysts, and data scientists.

Big Data architects and specialists start solutions by understanding the lifecycle. They engage in all phases of the lifecycle. Their roles and responsibilities may differ from stage to stage. However, they need to be on top of lifecycle management end to end.

Based on my experience, I introduce 12 distinct phases in the overall data lifecycle management, which can also apply to Big Data. I combined some relevant activities in a single phase to make it concise and easily understandable.

These phases may be implemented under different names in various data solution teams. There is no universal systematic approach to the Big Data lifecycle as the field is still evolving. For guiding purposes, I propose the following distinct phases in this area:

Phase 1: Foundations
Phase 2: Acquirement
Phase 3: Preparation
Phase 4: Input and Access
Phase 5: Processing
Phase 6: Output and Interpretation
Phase 7: Storage
Phase 8: Integration
Phase 9: Analytics
Phase 10: Consumption
Phase 11: Retention, Backup, and Archival
Phase 12: Destruction

These phases can be customised based on the need. They are not set in stone.

Foundations

In the data management process, the foundation phase includes various aspects. The most critical point in the foundation phase is understanding, capturing, analysing, and validating data requirements. Then comes the solution scope, including roles and responsibilities.

During the foundation phase, data architects prepare infrastructure and document technical and non-technical considerations. This document of understanding includes data rules in an organisation.

This phase requires a detailed plan facilitated ideally by a data project manager with substantial input from the Big Data solution architect and domain specialists.

A Big Data solution project definition report (PDR) can include planning, funding, commercials, risks, dependencies, issues, and resourcing. Project Managers author the PDR; however, the solution overview in this artefact is covered by the Big Data architects and specialists.

Data Acquirement

Data acquirement refers to collecting data.

Data can be obtained from various sources, both internal and external to the organisation. Sources can be structured, such as transfers from a data warehouse or transaction systems; semi-structured, such as Web or system logs; or unstructured, such as media files consisting of video, audio, or pictures.
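To make the three source types concrete, here is a minimal sketch in Python. The function names, the sample order data, and the log pattern are hypothetical and chosen only for illustration; real collection pipelines would use the connectors of the chosen platform.

```python
import csv
import io
import re

# Structured: rows with a fixed schema, e.g. an export from a transaction system.
def read_structured(csv_text):
    return list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: a Web log line with a loose but recognisable pattern.
# The pattern below is an assumption modelled on a common access-log layout.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) .* "(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3})'
)

def read_semi_structured(log_line):
    match = LOG_PATTERN.search(log_line)
    return match.groupdict() if match else None

# Unstructured: media content is kept as opaque bytes plus minimal metadata.
def read_unstructured(name, payload):
    return {"name": name, "size_bytes": len(payload)}

transactions = read_structured("order_id,amount\n1001,25.50\n1002,13.00\n")
log_entry = read_semi_structured(
    '203.0.113.7 - - [30/Apr/2021:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512'
)
media = read_unstructured("clip.mp4", b"\x00\x01\x02\x03")
```

The structured source yields named fields directly, the log line needs a pattern to recover its fields, and the media file offers nothing beyond its metadata until it is processed further.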

Even though various specialists conduct data collection with the help of administrators, the Big Data architects can have a substantial role in optimally facilitating this phase.

Data governance, security, privacy, and quality controls start with the data collection phase. Thus, the Big Data architects take technical and architectural leadership of this phase.

Data Preparation

In the data preparation phase, the collected raw data is cleaned.

In this phase, data is rigorously checked for any inconsistencies, errors, and duplicates. Any redundant, duplicated, incomplete, and incorrect data are removed. The goal is to have clean and useable datasets.

The Big Data solution architect facilitates this phase. However, data cleaning tasks can be performed by data specialists trained in data preparation and cleaning techniques.
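The cleaning step can be sketched as follows. This is an illustration only: the record layout, the `clean_records` helper, and the rule that an amount must be a finite, non-negative number are assumptions, not rules from any particular solution.

```python
import math

def clean_records(records, required_fields):
    """Drop incomplete, incorrect, and duplicated records, keeping first occurrences."""
    seen = set()
    cleaned = []
    for record in records:
        # Incomplete: a required field is missing or empty.
        if any(record.get(field) in (None, "") for field in required_fields):
            continue
        # Incorrect: in this sketch, an amount that is not a finite, non-negative number.
        try:
            amount = float(record["amount"])
        except (TypeError, ValueError):
            continue
        if not math.isfinite(amount) or amount < 0:
            continue
        # Duplicated: the same values for all required fields were already seen.
        key = tuple(record[field] for field in required_fields)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({**record, "amount": amount})
    return cleaned

raw = [
    {"id": "1", "amount": "25.50"},
    {"id": "1", "amount": "25.50"},  # duplicate
    {"id": "2", "amount": ""},       # incomplete
    {"id": "3", "amount": "abc"},    # incorrect
    {"id": "4", "amount": "13"},
]
clean = clean_records(raw, required_fields=("id", "amount"))
```

In practice, the validity rules would come from the data requirements captured in the foundation phase.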

Data Input and Access

Data input refers to sending data to planned target data repositories or systems. Some common target systems are a CRM (Customer Relationship Management) system, a data lake, and a data warehouse. In this phase, data specialists transform the raw data into a useable format.

Data access refers to retrieving data through various methods, such as relational databases, flat files, and NoSQL stores.

The Big Data solution architects lead the input and access phases. However, usually, a data specialist, with the help of database administrators, performs the input and access related tasks during this phase.
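As a small illustration of input and access, the sketch below transforms raw text rows into typed values, loads them into a relational table, and reads them back with a query. It uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for a real target repository; the `orders` table and its columns are hypothetical.

```python
import sqlite3

raw_rows = [("1001", "25.50"), ("1002", "13.00"), ("1003", "7.25")]

# Transform: convert raw text fields into typed values.
typed_rows = [(int(order_id), float(amount)) for order_id, amount in raw_rows]

# Input: load the transformed rows into a target repository.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL)")
connection.executemany(
    "INSERT INTO orders (order_id, amount) VALUES (?, ?)", typed_rows
)
connection.commit()

# Access: read the data back through a relational query.
large_orders = connection.execute(
    "SELECT order_id, amount FROM orders WHERE amount > ? ORDER BY order_id", (10,)
).fetchall()
connection.close()
```

The same pattern applies to other targets; only the connection and the load statement change.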

Data Processing

The data processing phase starts with the raw data. Data specialists convert it into a readable format, giving it form and context. After this activity, data analysts and data scientists can interpret the data by using data analytics tools.

They can use Big Data open-source processing tools such as Hadoop, MapReduce, Impala, Hive, Pig, and Spark SQL. HBase, a NoSQL store, is commonly used for real-time reads and writes, and Spark Streaming is a near real-time data processing tool.

Data processing also includes data annotation, data integration, data aggregation, and data representation.
Data annotation is labelling the data. Once data is labelled, it can be ready for machine learning.
Data integration aims to combine data from different sources and give consumers a unified view of it.
Data aggregation aims to compile data from databases into combined datasets to be used for data processing.
Data representation refers to the way data is processed, transmitted, and stored. These three essential functions depict the representation of data in the lifecycle.
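Integration and aggregation can be illustrated with a short sketch. The two sources (`crm_customers` and `web_orders`) and the helper names are hypothetical; at Big Data scale, the same logic would usually be expressed in a tool such as Hive or Spark SQL.

```python
from collections import defaultdict

# Two hypothetical sources: customer names from a CRM and orders from a website.
crm_customers = {"c1": "Alice", "c2": "Bob"}
web_orders = [
    {"customer_id": "c1", "amount": 20.0},
    {"customer_id": "c2", "amount": 5.0},
    {"customer_id": "c1", "amount": 10.0},
]

def integrate(customers, orders):
    """Combine both sources into one unified view keyed by customer name."""
    return [
        {"customer": customers.get(order["customer_id"], "unknown"),
         "amount": order["amount"]}
        for order in orders
    ]

def aggregate(rows):
    """Compile the unified rows into a total amount per customer."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["customer"]] += row["amount"]
    return dict(totals)

unified = integrate(crm_customers, web_orders)
totals = aggregate(unified)
```

Here `unified` is the integrated view, and `totals` is the aggregated dataset derived from it.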

Data Output and Interpretation

In the data output phase, the data is ready for consumption by the business users. Data specialists can transform data into useable formats such as plain text, graphs, processed images, and video files.

The output phase marks the data as ready for use and sends it to the next stage for storing. In some organisations, this phase is called data ingestion, which aims to import data for immediate or future use and to keep it in a database format.

The data ingestion process can be a real-time or batch process. Some commonly used data ingestion tools are Sqoop, Flume, and Spark Streaming.
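The difference between the two ingestion modes can be sketched in plain Python. This is a conceptual illustration only and does not use the APIs of Sqoop, Flume, or Spark Streaming; the function names and the list used as a sink are assumptions.

```python
def ingest_stream(events, sink):
    """Real-time style: write each event to the sink as soon as it arrives."""
    for event in events:
        sink.append(event)

def ingest_batches(events, sink, batch_size):
    """Batch style: collect events and write them to the sink in groups."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            sink.append(list(batch))
            batch.clear()
    # Write any remaining events that did not fill a complete batch.
    if batch:
        sink.append(list(batch))

stream_sink = []
batch_sink = []
ingest_stream(range(1, 6), stream_sink)
ingest_batches(range(1, 6), batch_sink, batch_size=2)
```

Real-time ingestion keeps latency low at the cost of many small writes, while batching trades latency for fewer and larger writes.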

Data Storage

Once the data output phase is completed, data specialists store data in designated storage units. These units are part of the data platform and infrastructure, which is designed with non-functional aspects such as capacity, scalability, security, compliance, performance, and availability in mind.

The data platform infrastructure can consist of storage area networks (SAN), network-attached storage (NAS), or direct-attached storage (DAS). Data and database administrators can manage stored data and allow access to the defined user groups.

Big Data storage includes technology stacks such as database clusters, relational data storage, and extended data storage.

File formats such as text and binary, and specialised structures such as SequenceFile, Avro, and Parquet, are considered in the data storage design phase.

Data Integration

In traditional models, once the data is stored, the process ends.

However, for Big Data, there may be a need to integrate stored data for various purposes.

Data integration is a complex process. Big Data architects design the use of various data connectors for the integration of Big Data solutions. There may be use cases and requirements for many connectors such as ODBC, JDBC, Kafka, DB2, Amazon S3, Netezza, Teradata, Oracle and many more based on the data sources used in the solution.

Some data models may require the integration of data lakes with a data warehouse or a data mart. There may also be application integration requirements.

For example, some integration activities may comprise integrating data with dashboards, Tableau, websites, or other data visualisation applications. This activity may overlap with the next phase, which is data analytics.

Data Analytics

Integrated data can be valuable and productive for data analytics.

Data analytics is a significant component of Big Data solutions. This phase is critical because of the business value generated by Big Data.

The commonly used tools for data analytics are Scala, Python, and R notebooks.
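As a very small example of what an analytics step can look like in a Python notebook, the sketch below summarises a series with the standard `statistics` module. The `daily_sales` figures are made up for illustration.

```python
import statistics

# Hypothetical daily sales figures for one week of trading.
daily_sales = [120.0, 135.0, 128.0, 150.0, 142.0]

summary = {
    "mean": statistics.mean(daily_sales),
    "median": statistics.median(daily_sales),
    "stdev": statistics.stdev(daily_sales),  # sample standard deviation
}
```

Real analytics work goes far beyond descriptive statistics, but summaries like this are often the first check on integrated data.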

There can be a team responsible for data analytics led by a chief data scientist. The data architect has a limited role in this phase. However, data architects must ensure the stages of the lifecycle are completed with rigour.

Data Consumption

Once the data analytics phase is completed, the data is turned into information ready for consumption. Consumers can be internal or external users.

Data consumption requires policies, rules, regulations, principles, and guidelines. The consumption can be based on a service provision process. Data governance bodies create rules for the provision of data.

The lead Big Data Solution Architect facilitates creating these policies, rules, principles and guidelines using an architectural framework.
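A consumption rule can be as simple as a table that maps consumer groups to the datasets they may use. The sketch below assumes hypothetical group and dataset names; a real solution would take these rules from the data governance body and enforce them in the data platform.

```python
# Hypothetical policy: which consumer groups may consume which datasets.
ACCESS_POLICY = {
    "internal_analyst": {"sales_summary", "customer_profiles"},
    "external_partner": {"sales_summary"},
}

def can_consume(consumer_group, dataset):
    """Return True only if the policy explicitly grants the dataset to the group."""
    return dataset in ACCESS_POLICY.get(consumer_group, set())

internal_ok = can_consume("internal_analyst", "customer_profiles")
external_ok = can_consume("external_partner", "customer_profiles")
```

Unknown groups and unlisted datasets are denied by default, which keeps the policy explicit.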

Retention, Backup, and Archival

Critical data must be backed up for protection. Backup is also an industry compliance requirement. There are established data backup strategies, techniques, methods, and tools.

The Big Data Solution Architect usually delegates the design of this phase to an infrastructure architect assisted by several data, database, storage, and recovery domain specialists.

Some data may need to be archived for a defined period for regulatory and business compliance reasons. The data retention strategy must be documented and approved by the governing body.

Data Destruction

There may be regulatory requirements to destroy a particular type of data after a period. These requirements may change based on the industries and organisations that own data.
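Retention, archival, and destruction rules can be expressed as a simple age-based decision. The sketch below is illustrative only: the one-year and seven-year periods and the `lifecycle_action` helper are assumptions, and actual periods depend on the regulations that apply to the organisation.

```python
from datetime import date, timedelta

# Hypothetical periods; real values come from the approved retention strategy.
ARCHIVE_AFTER = timedelta(days=365)
DESTROY_AFTER = timedelta(days=7 * 365)

def lifecycle_action(created_on, today):
    """Decide whether a record is retained, archived, or destroyed based on its age."""
    age = today - created_on
    if age >= DESTROY_AFTER:
        return "destroy"
    if age >= ARCHIVE_AFTER:
        return "archive"
    return "retain"

today = date(2021, 4, 30)
actions = [
    lifecycle_action(date(2021, 1, 1), today),
    lifecycle_action(date(2019, 4, 30), today),
    lifecycle_action(date(2013, 1, 1), today),
]
```

With these assumed periods, the three sample records are retained, archived, and destroyed respectively.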

Conclusion

Big Data lifecycle management is a recursive process. Each solution can use a specific lifecycle process.

Even though many solutions follow a chronological order for data lifecycle management, some phases may overlap and can be done in parallel.

The lifecycle proposed in this article is only a guideline. It can be customised based on the structure of the data solution, unique data platforms, data solution requirements, use cases, industry compliance, and dynamics of the departments in an organisation.
