Information and Data Management
TERRECO Cluster S-04
From 06/2013
Project leader: Gerhard Rambold
Abstract 2013: Infrastructure for long-term data storage, processing and archiving is a major issue for interdisciplinary projects in the environmental sciences. It is needed 1) for overall analysis and synthesis towards the end of a project and 2) for trans-project provision of data, specifically in the context of open and free access data policies. Interdisciplinary projects gather a broad range of data describing quantitative and qualitative observations, physical measurements, and social relationships determined from interviews. Observational data are gathered in biodiversity research, where the occurrence and abundance of specific organisms are recorded. The resulting data are necessarily linked to taxonomic information, i.e., names, as well as trait data and space-time coordinates. Hydrological, soil chemical, or ecophysiological analyses comprise georeferenced time series measurements and are linked with so-called 'metadata' describing the measurement conditions. Socio-economic interview or questionnaire data are less easily standardized and require highly flexible data entry, management and archiving structures. Finally, large-scale project-level metadata need to be considered, mainly concerning the management of projects and persons or agents, as well as the identities and traits of the observation/exploration sites.
Within TERRECO, all of the primary and secondary data types indicated above are gathered. A major challenge was to establish an appropriate data flow from data collection and primary analysis, via centralized storage, towards secondary data provision and long-term archiving. During the first phase, research students stored their observational and measurement records in spreadsheets or CSV-formatted text files. Socio-economic information was primarily recorded on questionnaire sheets and then transferred to spreadsheets. The TERRECO data flow architecture now includes four major data storage units with data transfer interfaces (numbering in brackets refers to the overview poster). A [1-3]: a data set collecting unit for primary data uptake and transformation into technical standard formats; B [4]: a data management platform ('Diversity Workbench') for data transformation, harmonisation, quality control, correction and enrichment with information from external thesauri or gazetteer web services, either during the project or on completion of a subproject, and for subsequent delivery to data provision portals and long-term archiving structures (data repositories); C [5]: a project-specific data provision platform ('BayEOS' and 'GeoServer') offering a wide spectrum of easy-to-use data analysis options (e.g., via R scripts) with access to TERRECO data from platforms A and B; D [6, 7]: data export interfaces that support various global XML-based data standards such as SDD and EML and ensure the final transfer of data from platforms A and B to institutional data repositories.
Via this data flow structure, TERRECO will be able to meet the requirements for consistent, quality-controlled, and up-to-date content data, and can fulfill the need of national data centers for a sustainable treatment of scientific data.
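As an illustration only, the first step of such a data flow (taking up spreadsheet or CSV records and bringing them into a consistent, quality-checked form) might be sketched in Python as follows. This is not the TERRECO implementation: the column names (taxon, date, lat, lon, count), the date format and the function name harmonise are assumptions made for the example, since the abstract does not describe the actual file layouts.

```python
import csv
import io
from datetime import datetime


def harmonise(csv_text):
    """Parse CSV text into validated records; return (records, rejected).

    Column names and the DD.MM.YYYY date format are hypothetical.
    """
    records, rejected = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            record = {
                "taxon": row["taxon"].strip(),
                # Normalise dates to ISO 8601 (YYYY-MM-DD).
                "date": datetime.strptime(row["date"].strip(), "%d.%m.%Y")
                .date()
                .isoformat(),
                "lat": float(row["lat"]),
                "lon": float(row["lon"]),
                "count": int(row["count"]),
            }
            # Basic quality control: a name must be present and the
            # coordinates must be valid decimal degrees.
            if not record["taxon"]:
                raise ValueError("missing taxon name")
            if not (-90 <= record["lat"] <= 90 and -180 <= record["lon"] <= 180):
                raise ValueError("coordinates out of range")
        except (KeyError, ValueError, TypeError, AttributeError) as err:
            # Keep rejected rows with the reason so they can be corrected later.
            rejected.append((row, str(err)))
            continue
        records.append(record)
    return records, rejected
```

Rows that fail a check are returned separately with the reason for rejection rather than being dropped, so that they can be corrected before the data are passed on to a management platform.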
Key words: data management, Diversity Workbench, data archive, XML data standard, SDD, EML