Data Debt > Tech Debt

Tech Debt is Just the Tip of the Iceberg – Welcome to the Deep Waters of Data Debt

Jan 23, 2024

In recent years, I've collaborated with numerous companies embarking on a journey to harness the vast amounts of data generated every second. What I've found is a landscape rife with terror and panic, further exacerbated in the last year by the hype surrounding Generative AI, especially the ChatGPT phenomenon.

The classic question has been:

How can we leverage Generative AI in our business?

The response tends to be less exhilarating than anticipated, primarily because the effective utilization of AI hinges on a critical element: DATA.

Understanding an organization's data maturity is essential in this context. Data maturity refers to an organization's capability to effectively:

- Manage Data
- Process Data
- Use Data

It includes various elements such as data quality, systems for data management, the extent of data integration throughout the organization, and the capability to utilize data for strategic insights and decision-making.

When an organization exhibits low data maturity, it typically faces several challenges. Firstly, there's often a lack of clarity regarding the data they possess, leading to underutilization of valuable assets. Secondly, the absence of robust data governance mechanisms results in unchecked data inflows and outflows, compromising data integrity and security. Inconsistencies in data further complicate analysis and decision-making processes. Additionally, such organizations usually incur significant costs in data consumption and sharing between systems, as inefficiencies and redundancies hamper smooth data integration.

I usually call this situation Data Debt.

We are all familiar with the term tech debt and its meaning in Software Engineering:

In software development, technical debt is the implied cost of future reworking required when choosing an easy but limited solution instead of a better approach that could take more time. [1]

In my role as a Software Engineer, I've both accumulated and addressed technical debt. It's a universal truth in software development: all software has some degree of technical debt. This is an inherent aspect of software as a dynamic entity, constantly evolving to meet changing needs and requirements. In certain projects, I've observed technical debt reach such critical levels that a thorough trade-off analysis recommended either a complete rebuild of the platform (Gergely Orosz describes an amazing story of this kind in The Pragmatic Engineer) or replacement with an off-the-shelf solution. While such measures might appear extreme or be perceived as a disregard for previous investments, drastic and sudden shifts in circumstances can sometimes leave no alternative.

As an Architect working in large, data-intensive systems, I've observed a crucial distinction in handling technical and data-related debts. In software systems, addressing technical debt often involves rebuilding or replacing the underlying architecture or code – essentially, the tools facilitating value exchange. However, when it comes to data debt, this approach is not feasible. Data itself embodies the value, and unlike tools, it cannot simply be replaced.

Data is the new oil [2]

For instance, consider an organization grappling with substantial data debt. The data, being the core asset, holds intrinsic value and insights crucial for the organization's operations and strategic decisions. If this data is plagued with inconsistencies, redundancies, or accessibility issues, it directly impacts the organization's ability to operate efficiently and make informed decisions. Unlike software components, data cannot be discarded and rewritten from scratch without losing valuable insights and historical context.

Therefore, the strategy to mitigate data debt requires a nuanced approach. It involves rigorous data governance, quality control, and the implementation of robust data management practices. These measures ensure that the data remains accurate, consistent, and accessible, thereby sustaining the organization's ability to leverage it for strategic advantage.
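To make "quality control" a little more concrete, here is a minimal sketch of rule-based record checks. Everything in it is an assumption for illustration: the `CustomerRecord` fields, the rule names, and the rules themselves are hypothetical and not taken from any real system. Real governance rules would come from the business, not from code.

```python
from dataclasses import dataclass
from typing import Callable, Optional


# Hypothetical customer record; the field names are illustrative assumptions.
@dataclass
class CustomerRecord:
    customer_id: str
    email: Optional[str]
    country: Optional[str]


# Each rule is a name plus a predicate that returns True when the record passes.
Rule = tuple[str, Callable[[CustomerRecord], bool]]

RULES: list[Rule] = [
    ("email_present", lambda r: bool(r.email)),
    # A missing email is reported by the rule above, so it passes here.
    ("email_has_at_sign", lambda r: not r.email or "@" in r.email),
    # Assumed convention: two-letter, upper-case country codes.
    ("country_is_iso2",
     lambda r: r.country is not None and len(r.country) == 2 and r.country.isupper()),
]


def failed_rules(record: CustomerRecord) -> list[str]:
    """Return the names of the rules this record violates."""
    return [name for name, check in RULES if not check(record)]


def quality_report(records: list[CustomerRecord]) -> dict[str, int]:
    """Count how many records fail each rule."""
    report = {name: 0 for name, _ in RULES}
    for record in records:
        for name in failed_rules(record):
            report[name] += 1
    return report
```

Keeping the rules in a plain list means they can be reviewed and versioned like any other governance artifact, and the resulting counts give a rough, trackable measure of how much debt is being paid down.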

Let me try to draw you an example scenario of how this works in the real world.

Scenario:

GreenTech, a mid-sized eco-friendly products company, has been operating a successful e-commerce platform for the past five years. The platform, built on a legacy e-commerce engine, has served its purpose but is now struggling to keep up with the evolving market demands and technological advancements. GreenTech's leadership decides to replace the outdated e-commerce engine with a more modern, scalable solution.

However, a significant challenge emerges as they go deeper into the migration process. GreenTech's data landscape is a complex web of storage systems spread across the organization. Over the years, data duplication, lack of integration, and insufficient documentation have led to a convoluted and inefficient data environment. This scenario presents two distinct types of debt:

1. E-commerce Engine Debt:

- Nature: This debt is primarily technology-focused, centred around the outdated e-commerce engine.

- Solution: Replacing the old engine with a new, modern system is straightforward. The new system offers improved scalability, flexibility, and better integration capabilities with contemporary tools and platforms.

- Challenge: While the technical aspect of this debt is addressable through replacement, ensuring seamless migration of existing data and integration with other company systems requires meticulous planning.

2. Data Landscape Debt:

- Nature: This debt is more intricate, stemming from years of ad-hoc data storage practices, resulting in duplicated, scattered, and poorly integrated data.

- Solution: Unlike the e-commerce engine, this debt cannot be resolved through a simple replacement. It requires a comprehensive audit of the existing data landscape, identification of data duplication, and development of a unified data management strategy.

- Approach: Implementing a central data warehouse or data lake can consolidate data storage. Additionally, employing data governance practices and documentation will ensure future data integrity and accessibility. This process involves not just technological changes, but also organizational and process adjustments.
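The audit step for the data landscape debt, identifying duplicated data across systems, can be sketched in a few lines. This is only an illustration under stated assumptions: the system names, the record shape (dictionaries with an `email` field), and the choice of a normalized email as the matching key are all hypothetical. A real audit would likely need fuzzier matching and more than one key.

```python
from collections import defaultdict


def normalize_email(email: str) -> str:
    """Trim whitespace and lower-case so trivially different spellings match."""
    return email.strip().lower()


def find_cross_system_duplicates(
    systems: dict[str, list[dict]],
) -> dict[str, list[str]]:
    """Map each normalized email to the systems holding a record for it,
    keeping only emails that appear in more than one system."""
    seen: dict[str, set[str]] = defaultdict(set)
    for system_name, records in systems.items():
        for record in records:
            email = record.get("email")
            if email:  # skip records with a missing or empty email
                seen[normalize_email(email)].add(system_name)
    return {
        email: sorted(names)
        for email, names in seen.items()
        if len(names) > 1
    }


# Hypothetical exports from three GreenTech systems.
systems = {
    "shop": [{"email": "Ana@Example.com "}, {"email": "bob@example.com"}],
    "crm": [{"email": "ana@example.com"}],
    "newsletter": [{"email": None}, {"email": "ANA@example.com"}],
}
duplicates = find_cross_system_duplicates(systems)
```

In this made-up data, the same customer appears in all three systems under slightly different spellings, which is exactly the kind of finding that feeds the unified data management strategy.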

Lesson Learned:

In GreenTech's scenario, while the e-commerce engine debt is resolved by a technology upgrade, the data landscape debt demands a more holistic approach. It requires a blend of technological solutions, process re-engineering, and cultural change within the organization to foster better data management practices.

When comparing technical debt and data debt in the GreenTech scenario, it's crucial to understand their distinct cost implications. Technical debt, often stemming from expedient but suboptimal design or coding choices, carries a cost primarily in terms of future code maintenance and potential refactoring. It's a debt of development shortcuts and compromises, which, while costly, is often contained within the realm of software development processes.

Data debt, on the other hand, is typically more onerous in terms of cost: monetary, time, effort, and human resources. This is because data debt includes not only the technological aspects but also the broader implications for business operations, decision-making, and strategic planning. Data debt can arise from issues such as poor data quality, lack of data governance, outdated data models, or inadequate data integration. Its resolution requires a multi-faceted approach involving not just technological fixes but also changes in organizational processes, data governance policies, and potentially, a cultural shift in how data is perceived and managed across the organization.

In the GreenTech scenario, addressing data debt could involve extensive data cleaning, implementing new data governance protocols, retraining staff, and perhaps even overhauling entire systems to ensure data integrity and relevance. The costs here are not just financial, but also involve significant time and effort in reorienting the organization’s data practices. The human resource investment is also substantial, as it may require specialized skills in data management, governance, and analysis, which are often in high demand.

Therefore, while both types of debt incur costs, data debt often proves more challenging and expensive to resolve, given its pervasive impact on the broader organizational ecosystem. It underscores the importance of proactive data management strategies to mitigate these costs and align data practices with the organization's evolving needs and goals.

Now imagine a company like GreenTech eager to implement Generative AI functionalities. As Architects, our duty extends beyond mere implementation. It is imperative to first conduct a thorough evaluation of the company’s existing data infrastructure and maturity. This involves assessing the quality, integration, and governance of their current data systems. Our role is to guide GreenTech in understanding that the successful deployment of Generative AI technologies relies heavily on the robustness of their data architecture. We need to advise them on the necessary steps to enhance their data management capabilities, ensuring that their data ecosystem is primed for the complexities and demands of advanced AI applications. This strategic approach not only aligns with technological advancements but also positions the company to harness the full potential of Generative AI, thereby driving meaningful and sustainable business outcomes.

In such a scenario, the initial focus should not be on selecting the AI model or determining the most suitable MLOps architecture. Instead, the starting point must be a fundamental reshaping of how data is collected, stored, managed, and consumed across the organization. This approach represents a significant shift in focus for executives, who must prioritize foundational data management strategies over being swayed by the prevailing hype around technologies like ChatGPT.

In my experience, this challenge is often overlooked in the rush to adopt new technologies. Yet, it is crucial to understand that the successful implementation of any advanced AI or machine learning solution, including generative AI, is deeply rooted in the organization's data infrastructure. Without a robust, efficient, and coherent data management framework, even the most advanced AI technologies will struggle to deliver their full potential.

Therefore, before going into the implementation of AI capabilities, it is essential for organizations to critically assess and, if necessary, overhaul their data practices. This involves ensuring data integrity, eliminating redundancies, integrating disparate data sources, and establishing clear governance and management protocols. Only with these foundational elements in place can organizations truly leverage the power of advanced AI technologies in a meaningful and sustainable way.
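One way to turn this assessment into something measurable is a simple readiness check over a dataset. The sketch below is an assumption-laden illustration, not a standard method: the two metrics (field completeness and key duplicate rate), the function names, and the default thresholds of 95% and 1% are all hypothetical and would need to be agreed with the business.

```python
def completeness(records: list[dict], field: str) -> float:
    """Share of records with a non-empty value for `field`."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)


def duplicate_rate(records: list[dict], key: str) -> float:
    """Share of keyed records whose key value repeats another record's."""
    values = [r[key] for r in records if r.get(key) not in (None, "")]
    if not values:
        return 0.0
    return 1 - len(set(values)) / len(values)


def readiness_gaps(
    records: list[dict],
    required_fields: list[str],
    key: str,
    min_completeness: float = 0.95,   # assumed threshold
    max_duplicate_rate: float = 0.01,  # assumed threshold
) -> list[str]:
    """List the reasons this dataset is not yet ready; empty means no gaps found."""
    gaps = []
    for field in required_fields:
        score = completeness(records, field)
        if score < min_completeness:
            gaps.append(f"{field}: completeness {score:.0%} is below {min_completeness:.0%}")
    rate = duplicate_rate(records, key)
    if rate > max_duplicate_rate:
        gaps.append(f"{key}: duplicate rate {rate:.0%} is above {max_duplicate_rate:.0%}")
    return gaps
```

An empty list does not prove the data is ready for AI workloads; it only shows that the specific checks chosen here did not find a problem.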

In conclusion, while the allure of rapidly implementing Generative AI functionalities like ChatGPT is undeniable, it's crucial to anchor our approach and expectations in the reality of data management and organizational readiness. I invite you to reflect on how your organization manages its data ecosystem and consider whether it's truly primed to leverage advanced AI technologies effectively. Share your thoughts, experiences, or any challenges you've faced in aligning your data strategies with emerging AI trends.

