Data In Motion is not just Data Streaming
Unless you are able to play chess without a chessboard…
Extracting value from data is not enough if that value has no real impact on the business. Even AI/ML models trained on historical data are quite useless if they are not deployed in production systems and applied to real-time data.
It’s not just a matter of analysing data or training AI/ML models; it’s also a matter of taking real-time actions to get results, leveraging real-time changing data!
But what does real-time changing data mean? Is it just data streaming?
Not only…
Let me give you a simple example, through the game of chess.
A chess game can easily be represented by a list of moves, as in the following list of the opening moves of the famous “Immortal” chess game, played by Adolf Anderssen and Lionel Kieseritzky on June 21, 1851:

1. e4 e5
2. f4 exf4
3. Bc4 Qh4+
4. Kf1 b5
The list of moves is a kind of stream of events that perfectly describes the chess game. However, a chess player needs to look at the chessboard in order to decide the next move; just looking at the list of events is not enough.
Therefore, although a stream of events can describe well how a system is changing, keeping an accurate and up-to-date view of the state of the system is key to deciding how to change it.
The stream of events is often the result of event sourcing, which describes the changing states of the system.
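The event-sourcing idea can be sketched in a few lines of Python: a stream of move events is folded into the current board state. This is a minimal illustration using simplified coordinate moves (no legality checks, castling, or en passant), not a full chess engine.

```python
# Minimal event-sourcing sketch: rebuild a chessboard state
# by replaying a stream of move events.

def initial_board():
    """Starting position as a dict mapping squares to piece codes."""
    board = {}
    back_rank = ["R", "N", "B", "Q", "K", "B", "N", "R"]
    for i, file in enumerate("abcdefgh"):
        board[file + "1"] = "w" + back_rank[i]  # White pieces
        board[file + "2"] = "wP"                # White pawns
        board[file + "7"] = "bP"                # Black pawns
        board[file + "8"] = "b" + back_rank[i]  # Black pieces
    return board

def apply_move(board, move):
    """Apply one move event; a capture simply overwrites the target square."""
    src, dst = move
    piece = board.pop(src)
    board[dst] = piece
    return board

def replay(events):
    """Fold the whole event stream into the current state of the system."""
    board = initial_board()
    for move in events:
        board = apply_move(board, move)
    return board

# The first moves of the "Immortal" game, as (from, to) coordinate pairs:
# 1. e4 e5  2. f4 exf4
events = [("e2", "e4"), ("e7", "e5"), ("f2", "f4"), ("e5", "f4")]
board = replay(events)
print(board["e4"])  # wP - White's king pawn, now on e4
print(board["f4"])  # bP - Black's pawn captured on f4
```

Note that the stream alone is the source of truth, but any decision about the position requires running the fold: that fold is exactly the "current state" the rest of this article is about.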
And even when two remote chess players are playing over a kind of distributed mesh of systems, streams of events can be used to keep the two remote systems, i.e. the two remote chessboards, up to date.
Now the difference between Data In Motion and Data Streaming is clear:
Data In Motion is the continuously changing data representing the state of the system, often changing in real time.
Data Streaming is just the list of events describing the changes of the system, which can sometimes be produced or consumed in real time as well.
Both are needed: although they describe the same system, continuously changing in real time, they are used for different purposes.
Data Streaming is more suitable for discovering trends, training AI/ML models on historical data, or looking for patterns in time series.
Data Streaming can also be used for taking real-time actions, but only if the data contained in the events provides enough information to decide what action to take.
Otherwise, deciding what the next action should be also requires the consistent state of the system at a specific point in time, and this is the biggest challenge.
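As a toy illustration of this point, consider validating an incoming move event: the event alone cannot tell whether it is applicable; that decision needs the state of the board at that instant. The function names below are assumptions made for the sketch.

```python
# An event by itself does not carry enough information to act on;
# deciding whether it is applicable requires the current state.

def can_apply(board, move):
    """A move event is only actionable against the current state:
    the source square must actually hold a piece right now."""
    src, _dst = move
    return src in board

current_state = {"e2": "wP", "e7": "bP"}   # consistent state at this instant

print(can_apply(current_state, ("e2", "e4")))  # True: a pawn sits on e2
print(can_apply(current_state, ("d2", "d4")))  # False: d2 is empty here
```

The same event, ("d2", "d4"), would be perfectly actionable against a different state; the stream and the state are both needed.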
By the way, this is also one of the main challenges of Data Lakes.
Data Lakes are well suited for managing and storing time series and streams of events, as well as for managing and storing bulk files and running massive batch transformations on top of them.
But they are not handy at all for providing the consistent state of a non-trivial system at a specific point in time, not to mention their total ineffectiveness at keeping the current state of a continuously changing system consistently up to date, especially when the system is changing based on many independent and asynchronous streams of events.
This is where the concept of Data In Motion becomes paramount.
Working on continuously changing data is tricky, and capabilities like transactional consistency and read-write concurrency are key to avoiding the provision of wrong data, as well as to achieving good performance and scalability.
This is why transactional databases are much more suitable for managing the continuously changing state of a system: they avoid serving inconsistent data without having to serialise all read and write accesses to the data.
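As a rough sketch of why transactions matter here, the snippet below keeps the board state in SQLite and applies a move as a single atomic transaction, so a reader never observes the piece on both squares or on neither. The schema and helper are illustrative assumptions, not a recommended design.

```python
# Keeping the current state of a continuously changing system in a
# transactional database: each move is applied atomically.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE board (square TEXT PRIMARY KEY, piece TEXT)")
conn.execute("INSERT INTO board VALUES ('e2', 'wP'), ('e7', 'bP')")
conn.commit()

def apply_move(conn, src, dst):
    """Apply one move event as a single transaction: both writes or neither."""
    with conn:  # commits on success, rolls back on any exception
        row = conn.execute(
            "SELECT piece FROM board WHERE square = ?", (src,)
        ).fetchone()
        if row is None:
            raise ValueError("no piece on " + src)
        conn.execute("DELETE FROM board WHERE square = ?", (src,))
        conn.execute(
            "INSERT OR REPLACE INTO board VALUES (?, ?)", (dst, row[0])
        )

apply_move(conn, "e2", "e4")
state = dict(conn.execute("SELECT square, piece FROM board"))
print(state)  # e2's pawn is now on e4; e7 is untouched
```

If the second write failed, the rollback would restore the pawn to its source square, so a consistent state is always the one exposed to readers.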
Making trained AI/ML models actionable is a good example of how important the ability to manage Data In Motion is becoming nowadays. Indeed, everyone is focusing on training AI/ML models, but what about deploying the trained models in production?
AI/ML models are usually trained on historical data, while real-time data is needed to make them actionable. And since data often comes from many different sources, the ability to manage and provide Data In Motion in a consistent and concurrent way is key, as I already described in the use case at the end of this article.