
Beyond Enterprise Data Lineage: The Case for a Platform-Independent Data Catalog

Data lineage is one of the most requested capabilities in enterprise data governance. It is also one of the most misunderstood. Most organizations start by asking: how do we track where our data comes from and where it goes? They end up discovering a harder question: why can none of our existing tools answer that question across all of our systems?

The answer leads directly to the data catalog. Data lineage is not a standalone problem. It is one capability, an important one, inside a broader set of requirements that only a platform-independent, enterprise-wide data catalog can satisfy.

A couple of years ago, I wrote about open standards for data lineage, specifically OpenLineage and its integration with Apache Kafka and Flink. This article builds on that foundation, but shifts the frame: data lineage is the entry point, and an independent data catalog is the destination.

Data Lineage Is Not a Solved Problem. And Not a Standalone One.

OpenLineage is a valuable initiative. So is the Open Data Contract Standard (ODCS). Both are moving in the right direction. However, adoption is slow. Enterprise architecture does not wait for standards bodies. Most organizations cannot freeze implementation decisions for two or three years waiting for broad ecosystem support.

There is also the custom requirements problem. No standard covers every edge case in a complex enterprise. A custom SAP connector, a proprietary trading system, a legacy mainframe feed: these exist in almost every large company. Something needs to bridge them into whatever governance framework you are building.

And sometimes pragmatic and quick is good enough. If an interface is simple and low-risk, a lightweight custom connector or manually documented integration is the right call. Not everything needs to be standardized. In a world where GenAI and agentic AI are reshaping enterprise architecture at speed, over-engineering governance infrastructure is its own kind of risk. The goal is practical coverage, not perfect coverage.

But the deeper issue is this: even if you solve data lineage perfectly inside one platform, you have not solved it for the enterprise. And lineage alone does not answer the questions that compliance teams, business analysts, data product owners, or CDOs are actually asking.

What the Business Actually Needs: Beyond Lineage

Data lineage tends to get framed as an engineering requirement. It is not. It is an enterprise-wide business requirement, and different stakeholder groups need very different things from it.

Engineering teams need column-level lineage, automated capture, and dependency visibility before deploying schema changes. Manual documentation does not scale.

Compliance and legal teams need to answer regulators: where is personal data stored, who has access, where does it flow? GDPR, CCPA, HIPAA, and MiFID II each create concrete audit requirements. Data privacy officers specifically need field-level PII tracking to handle data subject access and deletion requests with confidence. Lineage here is a compliance artifact, not an engineering one.

Data consumers and business analysts need to know whether a number is trustworthy. What does this field mean, where does it come from, when was it last updated, has it passed quality checks? They will not read pipeline documentation. Context needs to be surfaced in plain language, attached to the data asset itself.

CDOs and governance teams need organization-wide visibility: ownership, quality scores, policy enforcement, and a single place to understand what data the organization has and whether it is governed properly.

Three Layers of Metadata: Technical, Business, and Operational

These requirements share a common thread. They all depend on metadata being captured, enriched, connected, and made accessible across systems. Following the DAMA DMBOK2 framework, that metadata falls into three distinct layers:

  • Technical metadata covers schemas, data types, table structures, and lineage graphs.
  • Business metadata covers ownership, glossary definitions, classifications, and data product descriptions.
  • Operational metadata covers runtime behavior: job execution status, data freshness, quality metrics, and SLA tracking.

This last layer is where standards like OpenLineage are specifically valuable, capturing lineage events at runtime from tools like Apache Flink. A data catalog that only handles technical and business metadata leaves governance teams blind to what is actually happening in production. You cannot satisfy enterprise governance requirements with any one layer alone.
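To make the three layers concrete, here is a minimal sketch of one catalog record that carries all of them, with the operational layer refreshed from an OpenLineage-style run event. The class names, the `apply_run_event` helper, and the asset and job names are hypothetical and not part of any catalog product. The event is a simplified subset of what a real OpenLineage event carries, and matching on dataset name alone is an assumption made for brevity.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class TechnicalMetadata:
    schema: dict[str, str]                                # column name -> data type
    upstream: list[str] = field(default_factory=list)     # lineage edges (asset ids)


@dataclass
class BusinessMetadata:
    owner: str
    glossary_terms: list[str] = field(default_factory=list)
    classification: str = "internal"


@dataclass
class OperationalMetadata:
    last_run_state: Optional[str] = None                  # e.g. "COMPLETE" or "FAIL"
    last_updated: Optional[datetime] = None


@dataclass
class CatalogAsset:
    asset_id: str
    technical: TechnicalMetadata
    business: BusinessMetadata
    operational: OperationalMetadata = field(default_factory=OperationalMetadata)


def apply_run_event(asset: CatalogAsset, event: dict) -> None:
    """Update the operational layer if the event writes to this asset."""
    output_names = {out["name"] for out in event.get("outputs", [])}
    if asset.asset_id not in output_names:
        return  # the run did not produce this asset
    asset.operational.last_run_state = event["eventType"]
    asset.operational.last_updated = datetime.fromisoformat(event["eventTime"])


# Hypothetical asset and a simplified OpenLineage-style run event.
asset = CatalogAsset(
    asset_id="orders.enriched",
    technical=TechnicalMetadata(
        schema={"order_id": "string", "amount": "double"},
        upstream=["orders.raw"],
    ),
    business=BusinessMetadata(owner="order-platform", glossary_terms=["Order"]),
)

event = {
    "eventType": "COMPLETE",
    "eventTime": "2025-01-15T10:30:00+00:00",
    "run": {"runId": "3f1c2a9e-0000-4000-8000-000000000001"},
    "job": {"namespace": "flink-prod", "name": "orders_enrichment"},
    "inputs": [{"namespace": "kafka://broker:9092", "name": "orders.raw"}],
    "outputs": [{"namespace": "kafka://broker:9092", "name": "orders.enriched"}],
}

apply_run_event(asset, event)
print(asset.operational.last_run_state, asset.operational.last_updated)
```

The point of the sketch is the shape, not the code: schema and lineage, ownership and glossary, and runtime status all hang off the same asset, so a consumer can see structure, meaning, and freshness in one place.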

A related dimension worth flagging: if your organization has moved toward a data mesh architecture with domain ownership of data products, lineage governance becomes a cross-domain coordination problem. The catalog layer needs to support domain-level ownership while still providing enterprise-wide visibility. That is another reason why a neutral, organization-wide catalog matters more than platform-specific tooling.

Why Vendor-Specific Lineage Falls Short

Confluent Cloud has solid data lineage for Kafka topics, Flink jobs, and connectors. With IBM’s acquisition of Confluent, how that capability evolves under IBM ownership remains to be seen, but within its current scope it works well. Snowflake has Horizon. Databricks has Unity Catalog. AWS offers DataZone. Each does a good job within its own platform boundary.

The problem is the boundary. Your enterprise is not one platform. It is Kafka plus Flink for streaming, Databricks or Snowflake for analytics, Iceberg for the data lake, SAP for ERP, Salesforce for CRM, Power BI or Tableau for reporting, and probably several custom applications built over the past decade. None of the vendor-specific lineage tools span this landscape.

A compliance officer asking where a customer’s personal data flows cannot answer that from Confluent’s lineage graph alone. An engineer asking whether a schema change in an upstream Kafka topic will break a downstream Snowflake report cannot answer that without switching tools. Lineage that stops at a platform boundary is partial lineage, and partial lineage creates false confidence.
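The false-confidence effect can be illustrated with a small impact analysis. The sketch below walks a lineage graph downstream from a Kafka topic twice: once limited to the streaming platform, once across all platforms. The edge list, asset names, and the `downstream` helper are hypothetical; a real catalog would serve this graph from its own metadata store.

```python
from collections import deque

# Hypothetical lineage edges: (upstream asset, downstream asset).
# Asset ids are prefixed with the platform that owns them.
EDGES = [
    ("kafka:orders.raw", "flink:orders_enrichment"),
    ("flink:orders_enrichment", "kafka:orders.enriched"),
    ("kafka:orders.enriched", "snowflake:analytics.orders"),
    ("snowflake:analytics.orders", "powerbi:revenue_report"),
]


def downstream(start, edges, platforms=None):
    """Return all assets reachable from `start`, breadth-first.

    If `platforms` is given, traversal skips assets outside those
    platforms, which mimics a lineage tool that only sees its own platform.
    """
    children = {}
    for src, dst in edges:
        children.setdefault(src, []).append(dst)

    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child in seen:
                continue
            if platforms is not None and child.split(":", 1)[0] not in platforms:
                continue  # invisible to a platform-scoped tool
            seen.add(child)
            queue.append(child)
    return sorted(seen - {start})


streaming_view = downstream("kafka:orders.raw", EDGES, platforms={"kafka", "flink"})
enterprise_view = downstream("kafka:orders.raw", EDGES)
missed = sorted(set(enterprise_view) - set(streaming_view))

print(streaming_view)  # what the streaming platform's lineage shows
print(missed)          # what a schema change would still break
```

In this toy graph the platform-scoped view looks complete and clean, while the Snowflake table and the report that depend on the topic never appear in it.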

Vendor-specific lineage is a foundation. It is not a solution. And it addresses only one of the business requirements listed above.

The Case for a Platform-Independent Data Catalog

The logical conclusion is a catalog layer that sits above all platforms, integrates with all of them, and is owned by none of them. That is what an independent data catalog provides.

The argument for independence: your enterprise architecture will change. Platforms get replaced, vendors get acquired, contracts expire. If your governance layer is tightly coupled to one vendor, every platform change becomes a governance migration project. There is also an incentive problem. A vendor’s built-in catalog naturally favors that vendor’s own platform. That is rational product development. An independent catalog has no such incentive.

The counterargument: independence comes with integration and operational cost. A catalog connecting twelve platforms needs twelve integrations, each requiring maintenance. Running an open-source catalog at scale requires dedicated ownership. It is not trivial. For smaller organizations with a limited technology footprint, vendor-specific tools may remain the pragmatic choice.

For large organizations running heterogeneous architectures across multiple clouds and platforms, independence is worth the overhead. The trend toward hybrid, multi-cloud environments makes that argument stronger every year.

DataHub as a Practical Reference Point

DataHub is one of the most widely adopted open-source data catalogs available today. Originally built at LinkedIn and open-sourced in 2020, it has since grown into an independent community project, formally separating from LinkedIn in 2025, with over 3,000 organizations running it in production.

Native support covers Apache Kafka, Apache Flink, Kafka Connect, Apache Iceberg, Snowflake, Databricks, Power BI, dbt, and various relational databases. It ingests OpenLineage events natively, so existing Flink lineage flows in without additional work. Column-level lineage is captured across all sources in a unified graph. Enterprise authentication and role-based access control are supported, closing a historical gap in open-source governance implementations that made tools like Marquez unsuitable for regulated enterprise use at scale.

For systems without a standard connector, the AI-assisted connector framework matters. A reader who implemented DataHub in a Kafka-heavy enterprise shared a concrete example: they used AI to scan a custom RFC-based SAP Java connector and auto-generated not just connector documentation but column-level lineage imported directly into the catalog. With tools like Claude Code, building a custom DataHub connector for a proprietary system is a realistic engineering task, not a multi-month project. That closes the gap between what standards cover and what your architecture actually requires.

The integration depth goes beyond connectivity. With Kafka, DataHub ingests not just topic names and schemas but field-level descriptions, example values, glossary links, and ownership metadata. A data consumer gets a view of a Kafka topic they can actually understand.

One concrete compliance workflow: a team running a GDPR audit searches DataHub for all assets containing a PII-tagged field, traces that field downstream across Kafka, Snowflake, and Power BI, identifies data owners, and exports the result for a regulatory report. That workflow requires a catalog that integrates all of those systems. No single-platform lineage tool can deliver it.
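The logic of that audit can be sketched in a few lines. This is not DataHub's API: the PII tags, field-level lineage, and ownership mappings below are hypothetical in-memory stand-ins for what a catalog would return, and `trace_pii` and `export_report` are illustrative helper names.

```python
import csv
import io

# Hypothetical catalog extract. Fields are identified as "asset.field".
PII_TAGGED = {"kafka:customers.email"}

FIELD_LINEAGE = {  # upstream field -> downstream fields
    "kafka:customers.email": ["snowflake:dim_customer.email"],
    "snowflake:dim_customer.email": ["powerbi:churn_report.customer_email"],
}

OWNERS = {  # asset -> owning team
    "kafka:customers": "crm-platform",
    "snowflake:dim_customer": "analytics-eng",
    "powerbi:churn_report": "bi-team",
}


def trace_pii(tagged, lineage):
    """Return the tagged fields plus every field derived from them."""
    found, stack = set(tagged), list(tagged)
    while stack:
        current = stack.pop()
        for nxt in lineage.get(current, []):
            if nxt not in found:
                found.add(nxt)
                stack.append(nxt)
    return sorted(found)


def export_report(fields, owners):
    """Render the audit result as CSV text with columns asset, field, owner."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["asset", "field", "owner"])
    for qualified in fields:
        asset, field_name = qualified.rsplit(".", 1)
        writer.writerow([asset, field_name, owners.get(asset, "UNKNOWN")])
    return buffer.getvalue()


report = export_report(trace_pii(PII_TAGGED, FIELD_LINEAGE), OWNERS)
print(report)
```

Each step depends on metadata from a different system: the tag sits on a Kafka field, the derived columns live in Snowflake and the reporting layer, and ownership is business metadata. The sketch only works because all three are in one graph.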

Market Context: The Data Catalog Landscape

The commercial market is consolidating fast. Peter Baumann’s research report “Data Catalogs: Market Perspective 2025/2026” documents 42 vendors in the space and tracks a significant acquisition wave: Salesforce acquired Informatica, Snowflake acquired SelectStar, ServiceNow acquired data.world, Atlassian acquired Secoda, Coalesce acquired CastorDoc. The common thread is AI positioning. Platform vendors are buying catalog capabilities to extend their governance story into AI workflows.

For any organization that wants its catalog to remain platform-neutral, this consolidation is a risk signal worth tracking. Each acquisition ties catalog capabilities more tightly to a specific platform ecosystem.

Baumann’s research also confirms something formal analyst coverage underrepresents: open-source catalogs like DataHub and OpenMetadata are regularly evaluated and adopted by large enterprises, despite rarely featuring in Gartner or Forrester rankings.

DataHub, covered in detail above, is the reference example in this article. The other credible open-source option is OpenMetadata. Where DataHub is built around a graph-based metadata model with a strong focus on discovery and lineage at scale, OpenMetadata takes a more schema-first approach, defining all metadata entities through a formal open metadata standard. Its community has grown particularly strong in the analytics engineering space. The choice between them comes down to your architecture, team background, and community fit.

Among commercial options, Collibra and Alation remain dominant for large enterprises. Microsoft Purview is relevant for Azure-heavy organizations. The hyperscalers continue to extend catalog capabilities but optimize for their own ecosystems rather than serving as neutral cross-platform layers.

Data Lineage Is the Question. The Data Catalog Is the Answer.

Data lineage is the question that leads you to the data catalog. Once you try to answer it seriously across compliance, engineering, business consumers, and governance teams, you quickly find that lineage alone is not enough, and platform-specific lineage is not enough either.

The practical architecture is layered: platform-native lineage for your core systems, combined with an independent catalog that integrates everything into a unified view. That catalog should be platform-independent, ideally open source or open core, and built to outlast the specific technology choices your organization makes today. Options like DataHub and OpenMetadata achieve this without locking you into any single analytics, streaming, or middleware vendor.

One development worth watching closely: MCP, the Model Context Protocol, is rapidly emerging as a standard for exposing data and tool context to AI agents. Data catalog vendors are beginning to add MCP support, and the direction is significant. A well-maintained catalog with rich metadata becomes a natural context source for agentic AI workflows, letting AI agents understand what data exists, who owns it, what it means, and whether it is trusted, before acting on it. MCP is still early in the data catalog space specifically, but it positions the catalog not just as a governance tool but as the metadata backbone for enterprise agentic AI. That is a compelling reason to invest in the catalog layer now, before agentic AI use cases outpace your metadata infrastructure.

Data lineage is not a solved problem. It is not a standalone problem either. It is the entry point into a broader conversation about how enterprises manage, govern, and trust their data at scale. The data catalog is where that conversation lands.

Kai Waehner

bridging the gap between technical innovation and business value for real-time data streaming and applied AI.
