Scientific dataset analysis and synthesis using JeDaSS

Alsayed Algergawy¹, Hamdi Hamed¹, Birgitta König-Ries¹
¹ Heinz-Nixdorf Chair for Distributed Information Systems at Friedrich-Schiller-Universität Jena

V 15.7 in Groundwater Quality Development – Insights from Long-Term Studies in the Critical Zone

24.03.2022, 12:15-12:30, HS 2

To understand the critical zone, disciplines from various domains, including biology, geology, and chemistry, need to work together. In particular, data reflecting these different viewpoints need to be integrated. This is also true for the Collaborative Research Center AquaDiva, which investigates how environmental conditions and surface properties shape the structure, properties, and functions of the subsurface. For this purpose, large volumes of heterogeneous observational data are being collected, including, for example, data from groundwater wells, lysimeters, and drainage collectors on the chemical properties of groundwater and the microbial species occurring there, as well as climate data and soil data. Going from the analysis of these individual datasets to a joint investigation is challenging, though. Even finding and understanding datasets from another domain requires significant effort. This holds even for scientists working in other parts of the same project: they, too, need time to figure out the major theme of unfamiliar datasets. We believe that dataset analysis and summarization can provide a concise overview of an entire dataset. This makes it possible to restrict time-intensive in-depth exploration to datasets of potential interest.

To gain an understanding of the complex study system, there is a growing need for a summarized overview of this interlinked body of knowledge. To this end, we develop JeDaSS, the Jena Dataset Summarization and Synthesis tool, which semantically classifies the data attributes of tabular scientific datasets based on a combination of semantic web techniques and deep learning. This classification contributes to summarizing individual datasets, but also to linking them to others. We believe that figuring out the subject of a dataset is an important and basic step in data summarization. The proposed approach therefore categorizes a given dataset into a domain topic. With this topic, we then extract hidden links between different datasets in the repository. The approach has two main phases: 1) an offline phase that trains and builds a classification model using a supervised deep learning approach, and 2) an online phase that uses the pre-trained model to classify datasets into the learned categories. To demonstrate the applicability of the approach, we analyzed datasets from the AquaDiva Data Portal.
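The two-phase workflow described above can be illustrated with a minimal sketch. It is not the JeDaSS implementation: a simple token-count classifier stands in for the supervised deep learning model, the semantic web component is omitted, and all attribute names, categories, and dataset names are hypothetical examples rather than data from the AquaDiva Data Portal. The sketch only shows the flow from offline training, through online attribute classification and topic assignment, to linking datasets that share a topic.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical labelled attribute names used for the offline phase.
TRAINING = [
    ("ph value", "chemistry"),
    ("nitrate concentration", "chemistry"),
    ("sulfate concentration", "chemistry"),
    ("bacterial species", "microbiology"),
    ("otu abundance", "microbiology"),
    ("air temperature", "climate"),
    ("precipitation sum", "climate"),
]


def tokens(name):
    """Split an attribute name such as 'ph_value' into lowercase tokens."""
    return name.lower().replace("_", " ").split()


def train(examples):
    """Offline phase: count how often each token occurs under each category.

    This token-count model is a stand-in for the deep learning classifier.
    """
    model = defaultdict(Counter)
    for name, label in examples:
        model[label].update(tokens(name))
    return model


def classify_attribute(model, name):
    """Online phase: return the category with the largest token overlap,
    or None when no token of the attribute name was seen during training."""
    scores = {
        label: sum(counts[t] for t in tokens(name))
        for label, counts in model.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def dataset_topic(model, columns):
    """Assign a dataset the most frequent category among its attributes."""
    labels = [classify_attribute(model, c) for c in columns]
    votes = Counter(label for label in labels if label is not None)
    return votes.most_common(1)[0][0] if votes else None


def link_by_topic(topics):
    """Link every pair of datasets that was assigned the same topic."""
    by_topic = defaultdict(list)
    for dataset, topic in topics.items():
        if topic is not None:
            by_topic[topic].append(dataset)
    return sorted(
        pair
        for members in by_topic.values()
        for pair in combinations(sorted(members), 2)
    )


model = train(TRAINING)

# Hypothetical tabular datasets, reduced to their column headers.
datasets = {
    "well_chemistry": ["ph_value", "nitrate_concentration", "sample_date"],
    "lysimeter_ions": ["sulfate_concentration", "ph_value"],
    "weather_station": ["air_temperature", "precipitation_sum"],
}

topics = {name: dataset_topic(model, cols) for name, cols in datasets.items()}
links = link_by_topic(topics)
print(topics)
print(links)
```

In this toy setting the two chemistry datasets end up linked through their shared topic, while the climate dataset stays unlinked. A real system would replace the token counts with the trained model and would likely use richer evidence than column headers alone.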



Algergawy, A., Hamed, H., and König-Ries, B. (2021): Towards scientific data synthesis using deep learning and semantic web. In: Proceedings of ESWC. Springer, 54-59.

Vollmer, M., Golab, L., Böhm, K., and Srivastava, D. (2019): Informative summarization of numeric data. In: Proceedings of the 31st SSDBM, 97-108.


