It seems like wherever we turn these days we are reminded that we live in a data-driven world. Whether it is recommended items on an e-commerce website, detailed directions from a GPS device, exhaustive statistics while watching a golf tournament, or getting cooking instructions from a smart speaker, we see data used in a variety of new applications.
This phenomenon is especially true in the retail industry. Years ago, retailers began harvesting information collected on their e-commerce sites. Through improved system logging and data collection techniques coupled with advanced analytics, retailers were able to develop a clear and comprehensive view of customer behavior on their websites. Retailers could see products browsed, selected, substituted, and ultimately purchased. They could see the navigation paths customers followed and better understand when and why customers didn’t buy from them. This visibility stands in stark contrast to the information retailers collect about customer behavior inside a brick-and-mortar store. In the physical store, most retailers know customers only by what is purchased. If customers enter and leave without buying, retailers are in the dark as to why.
Fast forward to today. Internet of Things (IoT) technologies enable retailers to collect information about shopping patterns, fixture interactions, and even associate engagement. Coupled with data captured through opt-in mobile applications and loyalty programs, retailers are gaining new insights into how customers shop their stores. This same data collection phenomenon is playing out across other industries such as manufacturing, financial services, insurance, and telecommunications.
However, most companies are ill-prepared to leverage new data in ways that drive business value. IT groups in these companies often try to stretch existing technologies to handle the new data types, or they build separate, disparate platforms to house the new data, resulting in substantial integration challenges. The best analogy that comes to mind is trying to erect a building without a blueprint.
Enter the modern data strategy. After years of trying, we have found that one size does not fit all when it comes to managing data. Think about all the technology phrases such as “single version of the truth,” “single view of the customer,” and “master data management.” In many cases, IT organizations tried to achieve these goals using physical instantiations of technology. Unfortunately, as data volumes grow and data types become more complex, it is readily apparent that multiple physical approaches are required to address the problems associated with the explosion of data.
In addition to physical data storage, there are many other considerations when developing an organizational data strategy to support the governance of today’s data-driven businesses. Some of these include:
- Governance Metrics and Measures – What are the desired outcomes from the data strategy? How will you measure your success?
- Data Update Frequency – What are the organization’s requirements for collecting and storing data? This question includes the volume and processing speed requirements for data updates.
- Data Decision Frequency – Does the business need to recognize and act on insights and patterns in real-time data?
- Data Retention and Archival – Does the organization have specific regulatory or compliance rules that dictate data retention? How often should data be archived?
- Data Volatility – Do organizational data requirements change often? Are new sources of information introduced on a relatively frequent basis? Are business models changing?
- Data Quality – Often coupled with master data management, data quality entails the need for real-time data cleansing, support for multiple physical instantiations of the same information, and data enrichment. Data quality is essential to support artificial intelligence and machine learning algorithms.
- Data Analytics – What forms of analytics are needed? Data requirements and capabilities vary significantly across descriptive, predictive, and prescriptive analytics. How critical is data lineage in building and understanding organizational metrics?
- Business Support – How do business users consume data? Do business units have their own independent data analysis functions/roles?
- Sensitive Data – Does the organization have data that requires restrictive access? Are there levels of sensitivity that would preclude the movement of data to outside entities (such as business partners or cloud providers)?
- Business Integration – Which critical use cases and business models depend on data? Will data services be part of a new commercialized service?
An organization’s data strategy must encompass all of these elements to support current and future business needs. The information collected from these questions drives the adoption of processes and the selection of the technologies to manage the organization’s data assets.
While organizational data strategies often vary based on business requirements, properly structured and forward-thinking data strategies share a common set of characteristics.
- Business-centric – Focusing data to drive desired business outcomes.
- Managed/governed – Providing controls that encourage proper data practices and are effective without being complicated and bureaucratic.
- Scalable – Recognizing data volumes are growing. The needs of the organization tomorrow will be different from the needs of today.
- Agile/flexible – Accommodating rapid changes to organization data needs.
- Secure – Safeguarding corporate data assets and managing privacy concerns of all organization constituents.
- Self-service – Providing mechanisms for line-of-business analysts and other ad hoc users to leverage data assets to drive business outcomes without requiring central IT intervention.
As mentioned previously, a data strategy must embrace the notion of multiple physical stores of data with a unified logical view that simplifies access and use. To understand this requirement, we need to consider the evolution of data management platforms.
Relational databases have long been the workhorse for organizations. To supply the scale and performance needed for business analytics, relational databases added new capabilities such as massively parallel and in-memory processing. Even with these improvements, the rigid table, row, and column structure of relational databases struggled to keep pace with rapidly changing and loosely structured data.
These struggles opened the door for NoSQL databases. NoSQL databases relax the rigid schema of the relational model and come in a variety of forms. One common NoSQL format is the Columnar database. Columnar, as the name suggests, orients data by columns, with each column containing a searchable key. Columnar stores are much more efficient for high input/output operations and are easily compressed to maximize storage.
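The column orientation idea can be sketched in a few lines of plain Python. The records and field names below are hypothetical, and a real columnar store adds compression, indexing, and on-disk layout on top of this basic shape:

```python
# Illustrative sketch (not a real database): the same records stored
# row-oriented vs. column-oriented.

rows = [
    {"id": 1, "product": "kettle", "price": 29.99},
    {"id": 2, "product": "toaster", "price": 44.50},
    {"id": 3, "product": "blender", "price": 61.00},
]

# Column orientation groups each field's values together, so a query
# that touches one field reads one contiguous, compressible list.
columns = {
    "id": [1, 2, 3],
    "product": ["kettle", "toaster", "blender"],
    "price": [29.99, 44.50, 61.00],
}

# Row store: must walk every full record to average one field.
avg_row = sum(r["price"] for r in rows) / len(rows)

# Column store: touches only the "price" column.
avg_col = sum(columns["price"]) / len(columns["price"])

assert avg_row == avg_col
```

This is why columnar stores shine for analytical scans: an aggregate over one column never has to read the bytes of the others.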
Another popular form of NoSQL is the Document store. Document stores hold self-describing records, commonly JSON, making them ideal for semi-structured data. The third form of NoSQL is Graph. Graph databases are organized around relationships. Unlike relational databases, which treat relationships as part of the database design (or schema), Graph treats relationships as data. This feature allows Graph datastores to traverse complex data relationships with ease.
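The relationships-as-data idea can be illustrated with a toy edge list in plain Python rather than a real graph database; the edges and names here are invented for the example. Because each relationship is an ordinary record, new kinds of edges can be added at runtime without a schema change:

```python
# Toy graph: each edge is a (subject, predicate, object) record — data,
# not schema. All names here are hypothetical.
edges = [
    ("alice", "bought", "kettle"),
    ("bob", "bought", "kettle"),
    ("bob", "bought", "toaster"),
]

def related(subject, predicate):
    """Follow edges whose subject and predicate match."""
    return [obj for s, p, obj in edges if s == subject and p == predicate]

def also_bought(product):
    """Traverse in reverse: who bought this product, and what else did they buy?"""
    buyers = [s for s, p, o in edges if p == "bought" and o == product]
    suggestions = set()
    for buyer in buyers:
        suggestions.update(related(buyer, "bought"))
    suggestions.discard(product)
    return suggestions

# also_bought("kettle") -> {"toaster"}
```

A real graph database executes this kind of multi-hop traversal natively, where a relational database would need a self-join per hop.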
The final form of NoSQL is the key-value database. Key-value stores are designed for read-heavy applications and leverage in-memory processing or solid-state disks (SSDs) to provide extreme performance.
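A minimal in-memory sketch of the key-value pattern, assuming a cache-style workload with per-key expiry; the class and method names are hypothetical, and a production store adds persistence, replication, and eviction policies on top of this interface:

```python
import time

class KVStore:
    """Toy in-memory key-value store with optional per-key TTL,
    mimicking the read-heavy, cache-like access pattern described above."""

    def __init__(self):
        self._data = {}

    def put(self, key, value, ttl_seconds=None):
        expires = time.monotonic() + ttl_seconds if ttl_seconds else None
        self._data[key] = (value, expires)

    def get(self, key, default=None):
        entry = self._data.get(key)
        if entry is None:
            return default
        value, expires = entry
        if expires is not None and time.monotonic() > expires:
            del self._data[key]  # lazily evict expired entries on read
            return default
        return value

store = KVStore()
store.put("session:42", {"user": "alice"}, ttl_seconds=30)
```

The appeal of this model is that every operation is a single hash lookup, which is what lets key-value systems deliver extreme read throughput.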
NoSQL’s flexibility and high performance at scale make it ideal for supporting the processing demands of what we commonly call Big Data. Of course, any discussion of Big Data naturally evolves into a discussion of Hadoop. Hadoop isn’t a single datastore but an ecosystem of utilities that provides high performance through massively parallel processing. At the core of Hadoop are the HDFS distributed file system and MapReduce, which spreads complex computations and queries across many servers. Hadoop tools, in conjunction with NoSQL, provide an environment capable of supporting complex, high-volume data requirements.
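The shape of a MapReduce computation can be shown with the classic word-count example in plain Python. A real Hadoop job runs each phase across many machines; here both phases run locally just to show the structure:

```python
from collections import Counter
from functools import reduce

# Word count, the canonical MapReduce example, sketched locally.
# The documents below are invented for illustration.
documents = [
    "data drives retail",
    "retail data grows",
]

# Map phase: each document is processed independently, emitting
# per-document word counts — this is the part that parallelizes.
mapped = [Counter(doc.split()) for doc in documents]

# Reduce phase: partial counts are merged. Because the merge is
# associative, reducers can also run in parallel and combine later.
totals = reduce(lambda a, b: a + b, mapped, Counter())

# totals["data"] -> 2
```

The reason this spreads across servers so well is exactly the property the comments note: the map step needs no coordination, and the reduce step can merge partial results in any order.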
Streaming data has become a common challenge for businesses as more decisions are made in real time. The need to analyze and act on streaming data has driven the rise of technologies such as Spark and Kafka. Like Hadoop, Spark is an ecosystem. Spark, an Apache project, processes data in memory across a cluster, providing the throughput needed to support high-volume data streams. Kafka, also an Apache project, is a distributed event-streaming platform; applications integrate with it through client libraries to inspect and act on data moving in real-time streams.
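The inspect-and-act pattern can be sketched as a sliding-window aggregation in plain Python. The event format, window size, and threshold below are hypothetical; engines such as Spark Structured Streaming and Kafka Streams perform this kind of computation at scale with fault tolerance:

```python
from collections import deque

WINDOW_SECONDS = 60  # hypothetical rolling window

def alert_on_spike(events, threshold=3):
    """Yield an alert whenever more than `threshold` events fall inside
    any rolling 60-second window. `events` is an iterable of
    (timestamp, payload) pairs, assumed ordered by timestamp."""
    window = deque()
    for ts, payload in events:
        window.append(ts)
        # Evict events that have aged out of the window.
        while window and ts - window[0] > WINDOW_SECONDS:
            window.popleft()
        if len(window) > threshold:
            yield (ts, f"{len(window)} events in the last minute")

# Hypothetical event stream: (seconds, payload).
stream = [(0, "a"), (10, "b"), (20, "c"), (30, "d"), (95, "e")]
alerts = list(alert_on_spike(stream))
# At t=30, four events fall inside the window, exceeding the threshold.
```

The essential point is that the decision is made as each event arrives, not after a batch job runs, which is what distinguishes streaming from the Hadoop-style processing described earlier.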
In a typical organization, a mix of these technologies will be used to achieve desired business outcomes. For example, relational databases will be used for transactional processing, Hadoop could support a centralized repository for data such as a Data Lake, and a version (or versions) of NoSQL could provide a high-performance environment for analytic model development and query.
Shaping these diverse technologies into a unified whole is the challenge of the data strategy and the resulting data architecture. Organizations that grasp these concepts and understand how to leverage their data assets to drive new business models and outcomes are those that will excel in today’s data-driven economy.