Ernst-Rudolf Töller
Author: Ernst-Rudolf Töller

Free-swimming in the „Data Lake“

„Data Lake“, more than a new buzzword?! 
In the future, data analysis will make more comprehensive use of the various data sources of a company. The own data sources should be considered together with publicly accessible data as a big whole as a so-called "Data Lake".

Gartner: "A data lake is a collection of storage instances of various data assets additional to the originating data sources. These assets are stored in a near-exact, or even exact, copy of the source format. The purpose of a data lake is to present an unrefined view of data to only the most highly skilled analysts, to help them explore their data refinement and analysis techniques independent of any of the system-of-record compromises that may exist in a traditional analytic data store (such as a data mart or data warehouse). (cf. www.gartner.com/it-glossary/data-lake)

The concept of "Data Lake" seems fascinating at first glance, but on closer inspection it raises considerable questions. Here is a small selection (not exhaustive):

  • Comprehensive and consistent metadata
    Sufficient and compatible metadata must be available for all sources involved. This also includes a comprehensive knowledge of the semantics of the data. Even basic terms such as 'Kontoʼ, 'transaction' etc. do not automatically have the same meaning in different systems.
  • Connection between 'foreign' data sources
    Connections between the tables of a database are predefined by the data structure. In addition, however, this also involves links between completely different, possibly heterogeneous data sources. Such connections are, on the other hand, not simply possible because, for example external data sources do not contain keys that match the typical key attributes such as 'customer number', 'vendor number', etc. from their own databases.
  • Accuracy and quality of the various sources
    Different data sources may be problematic in their actuality or very different in the quality of the data. The subject of data quality, which is always present in data analysis anyway, must be clarified for each data source involved. The question of the topicality of the data sources involved is of particular importance for data analysis. Here it is also important to avoid 'false positives'. Such cases can result from the fact that different data sources are updated differently.
  • System technology: large amounts of data, heterogeneous data structures and formats
    Gartner's proposed definition of a data-lake speaks of copies of the various data sources that are combined on their own platform. This also requires the creation of system-technical prerequisites. The requirements to the used technology can be substantial dependent on the data quantities and the cycles of the update of the various sources.

In many companies there is a rethinking towards new digital business models, if necessary also cross-company business models. This rethinking also means a new view of the company's data.

  • the own IT systems are only part of a more comprehensive system and data landscape.
  • not only your own systems should be used as a source of information for the company.
  • Additional data sources are to be connected without data being transferred to the company's own systems.

The idea of data-lake corresponds to this paradigm shift in IT. "Everything has to do with everything", that's of course the case, but it doesn't say anything about the really interesting relationships in detail. If you want to successfully implement new concepts for the cross-company connection of data, you need new, more in-depth approaches to solutions. In the following, we want to show what the use of data derivatives instead of conventional data attributes can contribute here.

Data Derivatives and Data Attributes

Today's ERP systems store information in a variety of tables and data attributes. If the meaning of the individual tables and attributes is still visible in the documentation of the systems, the overview of the most important constellations of the characteristics of two or more attributes becomes much more difficult. (This is also due to the fact that the number of possible values of a larger group of attributes can assume astronomical proportions.) Of course, this situation is not made easier by the use of other data sources.
Suitable data derivatives, on the other hand, automatically combine groups of data attributes. Dependencies between the individual attribute values can also be taken into account. Derivatives can also be aggregated to form a whole group of data records, e.g. from all documents of an account.

Connecting what belongs together

The limits of one' s own data are of course not the world' s limits for companies, but in many cases they form a noticeable threshold for new digital business processes. Overcoming such thresholds is of course becoming increasingly important in times of digitalization. The manual transfer of more and more data into one's own systems is often not an alternative for reasons of time and cost. The use of data derivatives opens up completely new possibilities here. They can be used to connect data that is not connected via conventional key relationships.

This creates a view of data that allows the idea of data-lake to become reality to a certain extent:

  • Data attributes are no longer considered individually, but are made visible as a new whole in their connections to each other.
  • Links between data are no longer restricted to the use of central key attributes such as 'material number, customer number', and so on.

We will be happy to show you in detail how you can break new ground in data analysis with dab:AnalyticIntelligence. For further information or a personal meeting, please contact us. We will get in touch with you immediately!

Comments (0)
Be the first who comments this blog entry.
Blog login

You are not logged in. Please log in to comment this blog entry.

go to Login