InfoSight Partners
Business services that create value through focused insight from actionable information

The next big data management challenge

By Mark Albala

July, 2008

Introduction

The world of data management, including data warehousing and analytics, is in the midst of an upheaval.  The idea of storing every shred of information in a data warehouse, pre-organized and summarized for easy consumption, is about to be discarded and replaced by something much more intelligent than our current processes allow.

This article describes what that something is and what must be done to prepare for the brave new world of business intelligence and analytics.

Our short-term challenges

The disciplines of data warehousing, business intelligence and predictive analytics were originally architected to provide a small amount of insightful information to a small audience at a low frequency. Over the past twenty years we have removed these core architectural constraints through incremental enhancements, at a cost of complexity in process and technology that must now be unraveled, and in some instances replaced, to meet the medium- and long-range needs of organizational stewards.

Data volumes continue to grow, challenging every constraint

As stated in “The End of Theory” (Wired, July 2008), Google currently processes one petabyte (one million gigabytes) of information every 72 hours.  Much of this traffic is web based and comprises the electronic signatures that consumers leave as they click through the family of Google products.

RFID tags, minuscule radio transmitters that require no external power source to operate, are already beginning to appear in the physical world.  It is expected that, once their use becomes economically feasible, these transmitters will find their way into everything on our person (yes, cell phones, credit cards, frequent buyer fobs, clothing, etc.), and each will chirp an electronic signature that can be associated with a person.  Collected by receptors located in stores, cell phone towers, traffic and street lights and other locations, these signatures will make several orders of magnitude more data available about each person, if the data is collected, structured and used properly.  If RFID yields only two orders of magnitude more data than web logs do, then expect to process one petabyte of data roughly every 45 minutes (72 hours divided by 100 is about 43 minutes).
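The arithmetic behind that estimate can be checked with a short sketch. It assumes the baseline quoted above (one petabyte every 72 hours) and treats "two orders of magnitude" as exactly a factor of 100; the function and constant names are illustrative only.

```python
# Back-of-the-envelope check of the RFID volume estimate.
# Assumptions: a baseline of 1 petabyte per 72 hours (the figure quoted
# in the text) and an RFID volume exactly 100 times larger.

BASELINE_PETABYTES = 1.0
BASELINE_HOURS = 72.0
SCALE_FACTOR = 10 ** 2  # two orders of magnitude


def minutes_per_petabyte(petabytes: float, hours: float, scale: float) -> float:
    """Minutes needed to accumulate one petabyte when the baseline
    rate (petabytes per hour) is multiplied by `scale`."""
    rate_pb_per_hour = (petabytes / hours) * scale
    return 60.0 / rate_pb_per_hour


if __name__ == "__main__":
    minutes = minutes_per_petabyte(BASELINE_PETABYTES, BASELINE_HOURS, SCALE_FACTOR)
    print(f"One petabyte roughly every {minutes:.1f} minutes")
```

Under these assumptions the result is a little over 43 minutes, which is consistent with the rounded figure of 45 minutes used in the text.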

Other similarly volume-challenging data sources, such as Web 2.0 and the emerging Web 3.0, will:

  • bring your ability to synthesize, integrate, validate and publish information consumed within and into your organization to a screeching halt,
  • overwhelm your organizational stewards’ ability to make sense of the ensuing data deluge within meaningful timeframes,
  • and increase your exposure to the storage, computing power, energy and people costs necessary to attempt to keep pace with the data onslaught, costs that will increasingly become a noticeable component of organizational cash flow.

A consequence of data volumes

The initial concept of business intelligence managed monthly or weekly summarized data and promised that consumers of this data could find what they wanted within three or four clicks of the mouse.  With the original volumes and publication frequencies of data, this was very achievable.  Over the past twenty years, however, we have challenged every one of these foundational tenets and have figured out ways to accommodate much more data, published much more frequently, to a much wider audience with significantly more sophisticated needs.  Unfortunately, the underlying premises of our data publication and data management processes have not matured sufficiently to keep pace with these changes, and the original core architectural assumptions are still in place, buried under a mound of complexity.  Without some foundational enhancements, we will be unable to sustain the anticipated changes: behemoth data stores that are continually published and whose structure is continually enhanced.

One of the foundational changes needed is significantly more automated intelligence to manage the quality, lifecycle, mapping, synthesis and integration of data.

In the past we have tried, with limited success, to direct the focus of knowledge workers through intelligent agents and exception reporting techniques.  These and emerging focus-directing techniques will be mandatory going forward.

This time it's viability at stake

In 2004, Nicholas Carr stirred the technology community with his publication, “Does IT Matter?”.  In reality, for IT to matter, and for there to be any financial basis for further strategic investments in technology, actionable, relevant and trustworthy information must be delivered in a timely manner to those who can use it to impact the bottom line.  The speed of communications in a global economy requires the technology arm of companies to deliver such information at a pace many are ill equipped to sustain.

Data & Media Longevity Takes Center Stage

Historical media is readable for thousands of years, but much of the data created this century is not.  As we worry about business continuity and data longevity issues, the media onto which we store data is of tremendous importance.  To put this in perspective:

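Attributes are rated Positive, Neutral or Negative.

| Media | Age of media | Production | Storage | Retention | % 2010’s volume |
|---|---|---|---|---|---|
| Egyptian hieroglyphics | 3000 – 5000 years | Negative | Positive | Positive | <.1% |
| Pre-Gutenberg publication | 500 – 600 years | Negative | Neutral | Positive | <.1% |
| Gutenberg age publication | 300 – 500 years | Neutral | Positive | Positive | <.1% |
| Punch cards and paper tape | 40 – 60 years | Positive | Neutral | Positive | <1% |
| Magnetic drums | 40 – 50 years | Positive | Positive | Negative | <1% |
| Removable disk subsystems | 30 – 40 years | Positive | Neutral | Negative | <2% |
| Reel to reel tape | 5 – 15 years | Positive | Positive | Negative | <5% |
| Floppy disk | 5 – 15 years | Positive | Neutral | Negative | <5% |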

While the media used for historical publication of data are ill equipped to handle the volumes of data now commonplace, they did not suffer from the longevity issues of current media types.  Like 8-track tapes and 45 RPM records, CDs and DVDs stand a chance of becoming historical novelties that cannot be read when they are needed.

Data quality takes center stage

During the past 19 – 24 months, as a consequence of increased data volumes and shortened publication cycles, the adopted techniques for identifying and remediating data anomalies have become unable to work within the timeframes demanded by organizational stewards, and fixing them has become a major organizational initiative for many companies. Unfortunately, with the acceleration of available data and the demand for faster publication, the ineffectiveness of our antiquated data validation techniques is about to become much more noticeable. Effective validation is a cornerstone of publishing trustworthy data to knowledge workers.

A key question companies should be asking is what percentage of market opportunities were either not identified in time or lost because of poor data quality and the need to validate data prior to its strategic use.  As global information transparency is reached ever faster, the need to identify and act on short-lived market advantages will be a major investment justification for data stewardship programs, initially in information-intensive industries.

While most practitioners agree that publishing high-quality information is mandatory, many lack the ability to develop the sound return on investment model required to fund the necessary software and the people costs to define and administer a data stewardship program.  Recent surveys seem to indicate that organizations are gaining traction on their data quality initiatives, but there is a long way to go.  While compliance and regulatory obligations have increased awareness of data quality techniques and of the need for a data governance program, the technology and people investments required, and the inability of technologists to communicate the benefits of such programs, have hampered the successful adoption of many data governance programs.

Better data quality on its own is rarely an approvable investment strategy.  As with many other technology investments, communicating improved data quality without also communicating how it is used in business processes that improve organizational cash flow yields no perceived measurable benefit.

One of the key problems plaguing many organizations is that while data volumes have increased exponentially and are expected to continue to do so, the practices used to validate information have not kept pace.  Many organizations have realized the importance of validating the quality of published information not through long processes that spot check information across the varied published dimensions, but through an automated process that uses a baseline of expectations to identify what changed, coupled with a process and value proposition that engages those capable of quickly identifying and dispensing with real data anomalies.  Although this is currently being tackled through data stewardship, governance and master data management programs, there is still a long journey before our efforts are widely accepted as producing the de facto source of trustworthy information.
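The idea of validating against a baseline of expectations can be illustrated with a minimal sketch. This is not a description of any particular product: the names `Anomaly` and `find_anomalies`, the dictionary-based inputs and the 10% default tolerance are all assumptions made for illustration. Expected values per published dimension member are compared with the values about to be published, and only the items that moved more than the tolerance are surfaced, largest deviation first, so that a reviewer sees the most suspect data before anything else.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Anomaly:
    key: str
    expected: float
    actual: Optional[float]   # None when the value did not arrive at all
    relative_change: float    # |actual - expected| / |expected|


def find_anomalies(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.10,  # hypothetical threshold: 10% deviation
) -> List[Anomaly]:
    """Compare current values with baseline expectations and return the
    deviations that exceed `tolerance`, most suspect first.
    Keys that appear only in `current` are not checked in this sketch."""
    anomalies: List[Anomaly] = []
    for key, expected in baseline.items():
        actual = current.get(key)
        if actual is None:
            # An expected value that is missing is treated as maximally suspect.
            anomalies.append(Anomaly(key, expected, None, float("inf")))
            continue
        if expected == 0:
            change = 0.0 if actual == 0 else float("inf")
        else:
            change = abs(actual - expected) / abs(expected)
        if change > tolerance:
            anomalies.append(Anomaly(key, expected, actual, change))
    return sorted(anomalies, key=lambda a: a.relative_change, reverse=True)


if __name__ == "__main__":
    baseline = {"east": 100.0, "west": 200.0, "north": 50.0}
    current = {"east": 104.0, "west": 120.0}
    for anomaly in find_anomalies(baseline, current):
        print(anomaly)
```

In this example only "north" (missing) and "west" (down 40%) would be routed to a data steward; "east" moved 4% and passes silently, which is the point of the approach: people are engaged only where the data departs from expectations.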

Publication Processes

The data publication processes that have been adopted as best practices utilize a batch mentality to collect, synthesize, validate and publish information for consumption by knowledge workers. 

There are several challenges to the process generally followed today, these being:

| Common Complaints | Current State | Future Goal |
|---|---|---|
| The publication process is too often described as a black box with no visibility into the data lineage | A complex batch process seldom provides the audit trail stakeholders need to feel comfortable about data derivation | Stakeholders must be comfortable enough with the lineage of data that they do not need to validate it before its use |
| The sheer rapid growth of data warehouses makes it difficult, if possible at all, to validate data within prescribed publication windows | Many companies have armies of workers execute queries to validate that data has a level of cleanliness, but are unable to identify data issues not previously encountered | An early warning system that prioritizes suspect data and engages those capable of validating it and remediating actual defects |
| The global economy and the efficiency of the internet have shrunk the time available to take action before opportunities are common knowledge | The processes employed by many organizations are well equipped for limited amounts of data published infrequently, a description that is becoming less common | A continuous validation process that delivers highly reliable, trustworthy and relevant data and fosters the ability to take decisive collaborative action |
| The cost of validating data is mushrooming because of the rapidly accelerating volumes of data | The current processes are highly manual and have not been significantly upgraded since limited, infrequently published information was the norm | A highly automated process that employs people only when necessary |

The importance of Context

As the publication frequency of data increases, so does the importance of context.  Context is often locked in textual documents, which in many cases are indexed in a portal.  Unfortunately, in many organizations the portal is organized differently than the data warehouse, which frustrates knowledge workers who cannot use the context associated with the data they access.

To integrate context with data, the organization of these two sources of knowledge must be consistent.  While this sounds simple and logical, organizations must tackle real challenges before textual information published internally, through the supply chain and external to the organization can be organized consistently with the contents of the data warehouse.

The Value of Technology

The investment in technology has grown to be a significant cost component in many organizations.  This increased cost is due to the automation of many functions, which accelerates their execution to meet the demands of the global electronic community, and to the computing costs associated with storing the data used in these functions and consumed by knowledge workers.

The Environmental Protection Agency estimated that data centers accounted for 1.5% of overall U.S. electricity consumption in 2006.  With the escalating costs of energy, the cost of keeping data centers operating is similarly escalating.

Getting quite a bit of attention at about the same time was Nicholas Carr’s book asking if IT matters.  Clearly the accelerating IT investment matters only if it helps companies improve their profits by generating revenue or removing costs in ways not possible without the IT investment.

At the heart of the revenue stream is accurate, timely, relevant and trustworthy information that will positively impact your revenue value chain.  For the technology arm of a company to matter and to receive the attention and financing necessary to remain viable, it must be recognized as contributing significantly to the business processes that:

  • increase revenue,
  • enhance operational efficiencies,
  • support a multitude of critical applications that help predict the outcomes and reduce the risk of time-sensitive decisions, and
  • meet the demands of increased regulatory and compliance reporting, which has greatly shortened the delivery timeframes many organizations were accustomed to and which is required to remain viable in many industries.

Assuming that the readily available cost reductions have already been cherry-picked by IT over the past 20 years, technology investments must be justified through revenue enhancement programs, which is achievable only if the information is used before the playing field is leveled by global information transparency.  This requires that organizational stewards have both the ability and enough faith in the published information to take swift, fact-based action, and it requires technology to deliver just-in-time, relevant, trustworthy information that organizational stewards do not feel the need to validate prior to its use.

Business Continuity and the effect of communicated misfires

One consequence of the highly accelerated global communications infrastructure is that, while business continuity failures and data losses may not actually have increased, knowledge of them has: the global publicizing of business continuity misfires and data leakages has an adverse impact on those suffering them.  At countless organizations, the facts surrounding a leakage or loss of information have become public knowledge in electronic and print media before remediative action could be taken.  In many cases such communications have adversely affected the value of these organizations in the capital markets.

An unintended outcome of the global communications network is that companies must be able to identify and act on business continuity and data loss issues at a highly accelerated pace.  This requires highly tuned processes and a rock solid business continuity program to protect the data assets of an organization.  Expect the cost of business continuity programs to accelerate, furthering the need for companies to become very serious about their information lifecycle programs.

A roadmap for action

Companies need to map a course from their current tools and practices to one that delivers highly prioritized, just in time, relevant, trustworthy information to organizational stewards.

The target environment is one that is continually populated, continually enhanced to meet the data needs as dictated by the global marketplace, and continually culled to move priority information to the spotlight for consumption by knowledge workers.

The Publication Processes

The first great challenge is to redesign the publication processes from a process devised to run at discrete intervals to one that continuously serves data for consumption by knowledge workers.

The processes used to validate what is about to be published for consumption by knowledge workers must be much more automated than they currently are.  Intelligent profiling techniques and agents will determine, using a variety of sources, which data is valid, which data to publish for consumption by knowledge workers, which data to publish as supporting information, which data to place in short-term archives and which data to discard.
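As a rough illustration of this kind of automated triage, the sketch below routes each profiled data set to one of the four destinations named above. Everything in it is hypothetical: the `ProfiledRecord` fields, the single relevance score and the threshold values stand in for whatever a real profiling agent would produce, and routing invalid data straight to discard is a simplification (a real process would more likely engage a data steward first).

```python
from dataclasses import dataclass
from enum import Enum


class Disposition(Enum):
    PUBLISH = "publish for consumption"
    SUPPORTING = "publish as supporting information"
    ARCHIVE = "place in short-term archive"
    DISCARD = "discard"


@dataclass(frozen=True)
class ProfiledRecord:
    name: str
    is_valid: bool     # outcome of automated validation (assumed input)
    relevance: float   # 0.0 to 1.0, hypothetical score from a profiling agent


def triage(
    record: ProfiledRecord,
    publish_at: float = 0.8,   # illustrative thresholds, not recommendations
    support_at: float = 0.5,
    archive_at: float = 0.2,
) -> Disposition:
    """Decide where a profiled data set should go."""
    if not record.is_valid:
        # Simplification: invalid data is dropped rather than remediated.
        return Disposition.DISCARD
    if record.relevance >= publish_at:
        return Disposition.PUBLISH
    if record.relevance >= support_at:
        return Disposition.SUPPORTING
    if record.relevance >= archive_at:
        return Disposition.ARCHIVE
    return Disposition.DISCARD


if __name__ == "__main__":
    incoming = [
        ProfiledRecord("weekly sales by region", True, 0.92),
        ProfiledRecord("store foot traffic", True, 0.55),
        ProfiledRecord("legacy promotion codes", True, 0.25),
        ProfiledRecord("corrupted sensor feed", False, 0.90),
    ]
    for record in incoming:
        print(f"{record.name}: {triage(record).value}")
```

The design choice worth noting is that the decision is made continuously, per data set, as data arrives, rather than once per batch window.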

Data quality and the handling of data as an important organizational asset

Organizations should begin now to recognize data as an important asset, one that requires maintenance and care and whose value increases through programs devised to ensure its trustworthiness.  Recognizing data as an asset will assist in formulating program valuations and in maintaining the momentum required of organizational stewards participating in those programs.

Business Continuity and the handling of data as an important organizational asset

Thanks to the highly efficient global communications infrastructure, failures to protect data from leakage and business continuity misfires are now regularly identified and communicated globally.  Failure to protect data assets has, on more than a few occasions, had adverse effects on the valuation of an organization in the capital markets.

Companies must quantify their risk exposure and its impact on their market capitalization, and mitigate this risk through their business continuity and data protection programs.  Expect funding to protect the data assets of an organization to increase dramatically.

Keeping data crisp and relevant

What many organizations have difficulty doing today is keeping the data used to gain insight and derive fact-based actions crisp and relevant to the needs of the organization.  In order to publish information relevant to the enlightened enterprise, it is imperative that those publishing information understand:

  • what information is relevant to the organization,
  • what information was at one point relevant but is now merely interesting,
  • and what information that is not currently available is required to derive fact-based decisions.

A major role of those tasked with publishing information for use by knowledge workers will be to understand what is of importance to the organization, to quickly categorize information by its usefulness in the current decision-making environment, to manage its placement through an orchestrated information lifecycle, and to enrich data subjects that are insufficient to support currently relevant fact-based decisions.

The Discovery Processes

The data models accessed through the tools used by knowledge workers will be a much more fluid representation of data than is possible today.  Mapping new data into the data models knowledge workers use to gain insight will need to be a much simpler and faster process than it is today.

Architecture

The architecture supporting the publication and use of data warehouses has matured over the past 20 years and has grown into a rather complex environment.  It is time to simplify the complex infrastructure used to gain insight from data.

Delivering business intelligence, data warehousing and predictive analytics through an architecture that simplifies, rather than complicates, the discovery and publication processes will be central to the deliverables of technology organizations.  Expect to see the introduction of intelligent agents that take their cues from profiling techniques.

The Analytical tools

Today, the toolsets used to analyze data are built on an underlying architecture that once served limited amounts of data infrequently to a small population of users.  While the technology stack has been enhanced many times to accommodate greater data volumes, a larger user population and more frequent publication, this has come at the cost of significantly complicating the stack used for analysis.

To meet the demands of knowledge workers and be responsive to their needs over the next several years, the technology stack must be overhauled.  This overhaul has already begun to appear in the marketplace.  Both technology and process modifications are needed to implement what the industry is calling BI 2.0.

Assumptions used in this article

There are a few assumptions that are core to this writing, these being:

  • The amount of information available for analysis will continue to expand.
  • The costs of storing information will continue to be a significant administrative expense, with escalating people and power costs outstripping shrinking physical storage costs.
  • The time in which information transparency is reached is rapidly shrinking, thanks largely to the internet and the global communications network.
  • What is today an outlier practice, in terms of data volumes, data publication frequencies or leading-edge best practices generally, will be commonplace within five to seven years, just at the edge of the planning horizon.
