How a Distributed Data Mesh can be both Data Centric and Event Driven
“By 2025, over a quarter of new cloud applications will use data-centric event-driven architectures rather than traditional code-centric ones, enabling better automation and business agility.”
— IDC Cloud Predictions 2021.
From Code Centric to Data Centric
This article is a follow-up to my previous article “From Code Centric to Data Centric”, so you may wish to refer to it while reading this new one. Please find below a couple of definitions from it.
Data Centric refers to an architecture where data is the primary and permanent asset, and applications come and go. In the data centric architecture, the data model precedes the implementation of any specific application and will be there and valid long after it is gone.
Therefore, with a Code Centric architecture the semantics of data are embedded in the code of the applications, while with a Data Centric architecture the semantics of data are managed as a key asset together with the data, independently of the application code.
Domain Driven Design supported by Data Centricity
I enjoyed reading Zhamak Dehghani’s article “How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh”, which proposes applying Domain Driven Design principles to data platforms. You may wish to refer to it as well while reading this article.
For sure, if both data producers and consumers are black box applications and organizations, then a third party needs to ask a producer for data in order to provide it to a consumer. This job is usually done by a central team in charge of the Data Lake or Warehouse, which also performs transformations, cleansing or correlations of data as needed.
Unfortunately, replacing large monolithic applications with micro-services doesn’t solve the problem; on the contrary, it could further complicate the situation. A micro-service is just a small portion of an application domain, a kind of smaller black box, so it becomes even more difficult to provide a meaningful and comprehensive view of the data of a full domain.
However, if applications or services are not developed as black boxes, as they are with a Code Centric approach, but on top of a shared data platform designed to be Data Centric, then the principles of Domain Driven Design can be applied more easily to build a Distributed Data Mesh.
In the real world, enterprise companies are quite far from providing “data as a product”, since teams in charge of application domains have totally different goals than providing “discoverable”, “addressable”, “trustworthy”, “self-describing”, “inter-operable” and “secure” data as a product to other application domains. A Data Centric approach can help here, by allowing new applications to be developed on top of a “common” data platform, shared by design at the enterprise level, and by decreasing the time and effort developers need just to provide someone else with high quality data in a comprehensive, secure and governed way.
Data Centricity can further support a Distributed Data Mesh architecture by improving “centralized governance”, especially when domains are not completely independent and disjoint but need to share portions of the data platform, collaborating over shared data both as consumers and as producers.
Not just Operational Producers, not just Analytical Consumers
I also enjoyed reading another article from the same author, “Data Mesh Principles and Logical Architecture”, and while I agree with most of its content, to be honest I disagree with the distinction she repeatedly draws between operational and analytical data management.
In reality, domains do not consume data only for analytic purposes, and similarly data producers are not only operational domains but also analytical ones. The boundaries between consumers and producers, and between operational and analytical, no longer make sense.
Analysts have invented new terms like “translytic” to describe the merging of “transactional” and “analytic” data management.
Mixing operational and analytic data management means that analytic queries are no longer run on static data but on continuously changing real-time data, with the risk of returning wrong results if the data infrastructure cannot guarantee consistency. In addition, operational workloads exploit analytical queries while accessing and updating data.
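As a hedged illustration of such a “translytic” access pattern (the schema and thresholds below are invented for the example), the sketch runs an analytic aggregate and the operational update it drives inside one transaction, so the decision is taken on a consistent view of changing data:

```python
# Sketch (hypothetical schema): an operational update that relies on an
# analytic aggregate inside the same transaction.
import sqlite3

db = sqlite3.connect("shop.db", isolation_level=None)  # manage transactions explicitly
db.execute("CREATE TABLE IF NOT EXISTS orders (customer_id INTEGER, amount REAL)")
db.execute("CREATE TABLE IF NOT EXISTS customers (id INTEGER PRIMARY KEY, tier TEXT)")

def refresh_tier(customer_id):
    db.execute("BEGIN IMMEDIATE")          # open a consistent write transaction
    total = db.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM orders WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()[0]                        # analytic aggregate...
    tier = "gold" if total > 10000 else "standard"
    db.execute(                            # ...driving an operational update
        "INSERT OR REPLACE INTO customers (id, tier) VALUES (?, ?)",
        (customer_id, tier),
    )
    db.execute("COMMIT")
```

If the aggregate were computed outside the transaction, on data changing underneath, the resulting tier could be wrong; this is the consistency requirement the paragraph above refers to.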
This is the reason why the Data Centric capability of running consistent queries on real-time changing data is becoming a must not only for operational but also for analytic purposes.
Centers of Data Gravity
A Distributed Data Mesh is a powerful approach for designing a distributed data platform, and the related organization, following Domain Driven Design principles. Whenever possible, the data infrastructure should also be distributed, although the design of the data platform should also carefully consider aspects like data gravity and consistency.
With the dramatic growth of data volumes, moving data is becoming more and more difficult, slow and expensive. It’s much easier to move applications to the data than to move data to the applications. Centers of data gravity “attract” not only applications but also other data that needs to be correlated, and they must be carefully considered while designing a data architecture.
It’s not just a matter of location, affecting for instance how to best leverage the public cloud with hybrid or multi-cloud architectures, but also of the technologies used to store and manage data, especially if data is supposed to be both updated and correlated in real time. This is why, for instance, it is sometimes not a good idea to use many distinct specialized database technologies: sooner or later someone will ask to correlate data across them.
To avoid being misunderstood: I’m not advocating a centralized and monolithic data platform, but better company-wide governance of the data platform, together with awareness of the impact of topics like data gravity, which is usually considered too late, only when moving a project from development to production.
Data Virtualization, the good and the bad
I’m noticing that many authors and analysts describe Data Virtualization as the “panacea” for any data management issue, including data gravity, especially for correlating data residing in different databases. I think we need to bring more clarity to this.
Data Virtualization is very useful for providing data abstraction through a logical data model well described by a data catalog. It works pretty well for key-based access to a few records, and it’s very useful for enforcing data access security controls.
But a large data cache is needed to achieve acceptable performance on significant volumes of data, so Data Virtualization cannot work with real-time changing data; on the other hand, reducing the cache would not protect the data sources from the additional workload coming from the data consumers.
Finally, large join queries and analytic queries that full-scan and correlate large amounts of data are simply not feasible because of the impact of data gravity, and queries across different data models are still not available.
Event Driven, the way
So, how can we build a Distributed Data Mesh without being hit by the data gravity issue? Let’s try to do it by leveraging an Event Driven architecture while keeping a Data Centric approach.
Ben Stopford’s article on the Data Dichotomy introduces the concept of sharing domain datasets through streams of events. Let’s see if this concept can be applied to build a distributed architecture even when application domains are not fully disjoint but share some data.
The idea seems quite promising, but since it is also based on Martin Fowler’s famous article on Event Sourcing, not surprisingly it is mostly focused on (micro)services and therefore very Code Centric, trying to “build Services leveraging a Backbone of Events”, as described in another article by Ben Stopford. To summarize the main concepts from that article, there are three ways services can interact with one another: Commands, Events and Queries.
A possible approach for sharing data between services is to leverage Event Streaming and the Single Writer Principle. In this way only one service owns the “master copy” of a specific dataset and is allowed to update it, while other services can only query a read-only copy in near real-time, as in the sketch below.
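To make the pattern concrete, here is a minimal, purely illustrative Python sketch using the kafka-python client (the topic name, the event fields and the in-memory store are my own assumptions, not taken from the referenced articles): one service is the single writer publishing keyed change events, and every other service only consumes them into a local read-only copy.

```python
# Minimal sketch of the Single Writer Principle over Kafka (hypothetical topic
# and field names). Only the "orders" service produces to the topic; every
# other service only consumes it into a local read-only copy.
import json
from kafka import KafkaProducer, KafkaConsumer

TOPIC = "orders.changelog"          # hypothetical compacted topic

# --- Owner service: the single writer of the "orders" dataset ---
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=str.encode,
    value_serializer=lambda v: json.dumps(v).encode(),
)

def update_order(order_id, new_state):
    # The owner is the only service allowed to publish changes.
    producer.send(TOPIC, key=order_id, value=new_state)
    producer.flush()

# --- Any other service: builds a near-real-time, read-only copy ---
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers="localhost:9092",
    key_deserializer=bytes.decode,
    value_deserializer=lambda v: json.loads(v.decode()),
    auto_offset_reset="earliest",
)

read_only_orders = {}               # local materialized view, never written back
for message in consumer:
    read_only_orders[message.key] = message.value
```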
The idea seems to work pretty well, but the pattern is too Code Centric, and the need to write a lot of code in the services to manage the streams of events over-complicates the implementation.
We have a smarter, Data Centric alternative for implementing the same pattern: transactional databases already manage transactional consistency by automatically storing all committed transactions in an Event Log, which can be considered a kind of immutable source of truth for transactions.
Databases are therefore very powerful tools for building an event driven architecture, allowing developers to write less code to implement and maintain the event streaming, thus moving from a Code Centric to a more Data Centric approach while also enforcing the consistency of services.
Databases are very useful not only for easily producing streams of events, but also for easily consuming the streams of events that update a near-real-time copy of the database. Indeed, databases make it possible to leverage the information provided by a stream of events not just as a stream of atomic values, but within the richer semantic context provided by the data model and the metadata of the near-real-time copy of the database, as in the sketch below.
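As an illustration, the following sketch applies a stream of change events to a local SQLite replica; the event format (loosely Debezium-like) and the table schema are assumptions made for the example, but it shows why landing the events in a database, rather than handling them as raw values in code, restores the full semantic context of the data model.

```python
# Sketch: applying a hypothetical, Debezium-like stream of change events to a
# local SQLite replica, so consumers can query the near-real-time copy with
# full SQL semantics instead of handling raw events one by one.
import sqlite3

replica = sqlite3.connect("customers_replica.db")
replica.execute(
    "CREATE TABLE IF NOT EXISTS customers (id INTEGER PRIMARY KEY, name TEXT, segment TEXT)"
)

def apply_change_event(event):
    """event is assumed to look like {'op': 'c'|'u'|'d', 'after': {...}, 'id': ...}."""
    if event["op"] in ("c", "u"):          # insert or update
        row = event["after"]
        replica.execute(
            "INSERT OR REPLACE INTO customers (id, name, segment) VALUES (?, ?, ?)",
            (row["id"], row["name"], row["segment"]),
        )
    elif event["op"] == "d":               # delete
        replica.execute("DELETE FROM customers WHERE id = ?", (event["id"],))
    replica.commit()

# The replica can now be joined with local domain tables in ordinary SQL,
# which is exactly the richer semantic context described above.
```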
As in the game of chess, you can consider either the list of moves or the current state of the chessboard. The list of moves can provide a lot of useful information, but the next move is mostly decided based on the current state of the chessboard.
Indeed, the next transaction can depend on the current state, since the updates of a transaction can depend on previous queries, as when checking the availability of money before paying: we don’t care whether the money was available in the past, we only care about how much is available now. On the other hand, if the goal is to estimate the churn propensity of a customer, then analyzing the stream of events is much more relevant than the current state.
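A tiny, purely illustrative sketch of the two views, with made-up account events:

```python
# Toy illustration (hypothetical data): the same event log serves two purposes.
# A payment decision needs only the current state, while a churn-style analysis
# looks at the shape of the event stream itself.
events = [
    {"ts": "2021-03-01", "type": "deposit",    "amount": 500},
    {"ts": "2021-03-10", "type": "withdrawal", "amount": 200},
    {"ts": "2021-03-20", "type": "withdrawal", "amount": 250},
]

# Current state: fold the events into a balance, like the chessboard position.
balance = sum(e["amount"] if e["type"] == "deposit" else -e["amount"] for e in events)
can_pay_100 = balance >= 100          # decision based only on the current state

# Event-stream view: a naive churn-style signal, like replaying the moves.
withdrawals = [e for e in events if e["type"] == "withdrawal"]
looks_like_churn = len(withdrawals) >= 2 and balance < 100
```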
Databases and Event Logs are not mutually exclusive, but two sides of the same coin. They are companions, and both are necessary for building a Data Centric and Event Driven architecture.
But databases should evolve to become the cornerstone of an Event Driven architecture, providing capabilities to leverage Event Logs for much richer semantic and functional purposes than just data replication for incremental backups, active disaster recovery, or rolling back a database to a specific timestamp.
Coming back to the actual design of a Distributed Data Mesh including many Data Domains, we should also try to group all the services of a specific Domain in order to simplify the architecture, and to provide a simple, Data Centric way to share data both between the services within the same domain and with other, external domains.
With this approach, every Domain can easily manage the Master Data shared across its services, and can also provide its Master Data to other domains in near-real-time. The same Domain can also consume, in a consistent way, the read-only near-real-time data provided by other domains.
This can be achieved while dramatically decreasing the amount of code developed to implement and maintain the event streaming needed to share data both inside and outside the domain, thus avoiding a Code Centric approach and implementing a Data Centric architecture.
By the way, just to avoid any misunderstanding, by “Domain Master Data” I don’t mean the “classic” concept of Master Data Management, but simply the master copy of the data to which the Single Writer Principle is applied.
To close, the best way to implement a Distributed Data Mesh is to build an Event Driven architecture leveraging a scalable event backbone like Kafka for handling events both at the service and at the database level, both inside and outside the domains. But the Event Driven architecture should also be Data Centric, relying as much as possible on the ability of databases to translate transactions into consistent streams of events and then back into a near-real-time copy of the database, dramatically decreasing the need to carve the logic of data into the code.
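For example, instead of hand-coding producers in every service, change data capture can be delegated to the database and a connector. The snippet below registers a Debezium-style Postgres connector through the Kafka Connect REST API; the hostnames, credentials and some configuration keys are illustrative assumptions and vary by connector version.

```python
# Illustrative only: registering a Debezium-style CDC connector via the Kafka
# Connect REST API, so the database's transaction log, not application code,
# produces the stream of events. Hosts, credentials and some keys are
# assumptions and depend on the connector version in use.
import requests

connector = {
    "name": "orders-domain-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "orders-db",
        "database.port": "5432",
        "database.user": "cdc_user",
        "database.password": "cdc_password",
        "database.dbname": "orders",
        "table.include.list": "public.orders,public.order_lines",
        "topic.prefix": "orders-domain",   # key name varies across Debezium versions
    },
}

response = requests.post("http://localhost:8083/connectors", json=connector)
response.raise_for_status()
# From here on, every committed transaction on the listed tables appears as an
# event on Kafka topics, ready to be consumed into near-real-time replicas by
# other domains.
```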
Trying to summarize
As I anticipated in the subtitle, the IDC Cloud Predictions for 2021 stated:
“By 2025, over a quarter of new cloud applications will use data-centric event-driven architectures rather than traditional code-centric ones, enabling better automation and business agility”.
When I read this sentence for the first time, I wondered what a “data-centric event-driven architecture” could actually look like.
I think that the kind of Distributed Data Mesh described in this article could be a good example of it.