Free-swimming in the "Data Lake"
"Data Lake": more than just a new buzzword?
In the future, data analysis will make more comprehensive use of a company's various data sources. The company's own data sources, together with publicly accessible data, should be considered as one big whole, a so-called "Data Lake".
Gartner: "A data lake is a collection of storage instances of various data assets additional to the originating data sources. These assets are stored in a near-exact, or even exact, copy of the source format. The purpose of a data lake is to present an unrefined view of data to only the most highly skilled analysts, to help them explore their data refinement and analysis techniques independent of any of the system-of-record compromises that may exist in a traditional analytic data store (such as a data mart or data warehouse)." (cf. www.gartner.com/it-glossary/data-lake)
The concept of "Data Lake" seems fascinating at first glance, but on closer inspection it raises considerable questions. Here is a small selection (not exhaustive):
- Comprehensive and consistent metadata
Sufficient and compatible metadata must be available for all sources involved. This includes comprehensive knowledge of the semantics of the data. Even basic terms such as 'account', 'transaction', etc. do not automatically have the same meaning in different systems.
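To make this concrete, here is a minimal sketch (with hypothetical source and field names) of a metadata catalog that records what a field actually means in each source, so that ambiguous terms like 'account' are documented before data from the lake is combined:

```python
# Hypothetical metadata catalog: the same field name can carry different
# semantics in different systems, e.g. 'account' as a ledger account in
# the ERP system versus a customer login account in the web shop.
CATALOG = {
    ("erp", "account"): {"meaning": "general ledger account"},
    ("webshop", "account"): {"meaning": "customer login account"},
    ("erp", "amount"): {"meaning": "posted amount in EUR"},
}

def describe(source: str, field: str) -> str:
    """Look up the documented meaning of a field in a given source."""
    entry = CATALOG.get((source, field))
    if entry is None:
        raise KeyError(f"no metadata for {source}.{field}; document it first")
    return entry["meaning"]

print(describe("erp", "account"))      # general ledger account
print(describe("webshop", "account"))  # customer login account
```

Refusing to answer for undocumented fields (rather than silently guessing) is the point: a lake without such a catalog invites semantically wrong joins.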
- Connection between 'foreign' data sources
Connections between the tables of a database are predefined by the data structure. A data lake, however, also involves links between completely different, possibly heterogeneous data sources. Such connections are not readily available: external data sources, for example, do not contain keys that match typical key attributes such as 'customer number' or 'vendor number' from the company's own databases.
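One common workaround can be sketched as follows: when no shared key exists, records are linked via a derived match key built from descriptive attributes. The data below is hypothetical, and real projects would use proper record-linkage techniques; this only illustrates the idea.

```python
import re

def normalize(text: str) -> str:
    """Lowercase and strip everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())

internal = [  # own database: has customer numbers
    {"customer_no": "C-1001", "name": "Mueller GmbH", "city": "Hamburg"},
]
external = [  # external source: no customer numbers at all
    {"name": "MUELLER GMBH", "city": "hamburg", "rating": "AA"},
]

def link(internal_rows, external_rows):
    """Join records on a derived name+city key instead of a shared primary key."""
    index = {(normalize(r["name"]), normalize(r["city"])): r
             for r in internal_rows}
    matches = []
    for ext in external_rows:
        key = (normalize(ext["name"]), normalize(ext["city"]))
        if key in index:
            matches.append((index[key]["customer_no"], ext["rating"]))
    return matches

print(link(internal, external))  # [('C-1001', 'AA')]
```

Such derived keys are fragile (spelling variants, umlauts, relocations), which is exactly why cross-source linking deserves explicit design rather than ad-hoc joins.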
- Accuracy and quality of the various sources
Different data sources may be problematic in their timeliness or vary greatly in data quality. The subject of data quality, ever-present in data analysis anyway, must be clarified for each source involved. The timeliness of the sources is of particular importance for data analysis: it is essential to avoid 'false positives', which can arise when different data sources are updated at different times.
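A small sketch of this effect, using hypothetical records: the same balance disagrees between two sources, and whether that is a real finding or a likely 'false positive' depends on when each source was last updated.

```python
from datetime import datetime, timedelta

def compare(record_a, record_b, max_age=timedelta(days=1)):
    """Classify a cross-source difference, taking update timestamps into account."""
    now = datetime(2024, 3, 2)  # fixed 'now' so the example is reproducible
    stale = (now - record_a["updated"] > max_age or
             now - record_b["updated"] > max_age)
    if record_a["balance"] != record_b["balance"]:
        # A stale source may simply not have received the latest postings yet.
        return "check staleness first" if stale else "real difference"
    return "match"

erp = {"balance": 1200.0, "updated": datetime(2024, 3, 1, 23, 0)}
crm = {"balance": 1150.0, "updated": datetime(2024, 2, 20)}
print(compare(erp, crm))  # check staleness first
```

The threshold `max_age` is an assumption for illustration; in practice it would follow from each source's known update cycle.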
- System technology: large amounts of data, heterogeneous data structures and formats
Gartner's proposed definition of a data lake speaks of copies of the various data sources that are combined on a platform of their own. This also requires the appropriate system infrastructure to be put in place. Depending on the data volumes and the update cycles of the various sources, the requirements on the technology used can be substantial.
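The "near-exact copy" idea from Gartner's definition can be sketched minimally as follows. All names (`ingest`, the raw-zone layout, the metadata fields) are hypothetical; real lakes use object stores and catalogs, but the principle is the same: store the source payload byte-exactly and keep ingestion metadata alongside it, never mixed into the data.

```python
import json

def ingest(raw_zone: dict, source: str, payload: bytes, load_time: str) -> str:
    """Store an exact copy of a source payload plus a metadata side-car entry."""
    key = f"{source}/{load_time}"
    raw_zone[key] = payload                     # byte-exact copy of the source
    raw_zone[key + ".meta"] = json.dumps(       # ingestion metadata, kept separate
        {"source": source, "loaded_at": load_time, "bytes": len(payload)}
    )
    return key

lake: dict = {}
key = ingest(lake, "erp_export", b'{"konto": 4711}', "2024-03-01T06:00:00")
print(lake[key])            # b'{"konto": 4711}'
print(lake[key + ".meta"])  # {"source": "erp_export", ...}
```

Keeping the copy unrefined is what distinguishes the lake from a data warehouse, where data is transformed on the way in.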
In many companies, thinking is shifting towards new digital business models, possibly even cross-company ones. This rethinking also means a new view of the company's data:
- The company's own IT systems are only part of a more comprehensive system and data landscape.
- Not only the company's own systems should be used as sources of information.
- Additional data sources are to be connected without transferring their data into the company's own systems.
The idea of the data lake corresponds to this paradigm shift in IT. "Everything is connected to everything" is of course true, but it says nothing about the really interesting relationships in detail. Anyone who wants to successfully implement new concepts for connecting data across companies needs new, more in-depth solution approaches. In the following, we want to show what the use of data derivatives instead of conventional data attributes can contribute here.
We will be happy to show you in detail how you can break new ground in data analysis with dab:AnalyticIntelligence. For further information or a personal meeting, please contact us. We will get back to you promptly!