Example labels for a sample text snippet: Content Class: Resource Management; Verifiability Rating: 0.8
The training data (ca. 1000 samples), development data (ca. 300 samples), and evaluation data (ca. 500 samples) are constructed from publicly available German-language company reports indexed in the German Sustainability Code (Deutscher Nachhaltigkeitskodex, DNK).
DNK reports always follow the same structure, consisting of 20 sections, each corresponding to a reporting criterion (e.g. "Incentive Systems" or "Usage of Natural Resources"). Each criterion section not only deals with a separate topic, but also fulfills a particular communicative purpose, which is reflected in the hierarchical structure of the report outline.
One goal of this shared task is to determine the extent to which the texts from the different sections diverge not only in content but also in style and other linguistic properties.
Each input to be analyzed in Tasks A and B is a text snippet of 4 consecutive sentences. Text snippets are selected semi-automatically, based mostly on balanced random sampling, with some filtering steps to exclude structured data (such as tables) and personally identifiable information.
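For illustration, a minimal sketch of such a selection procedure is shown below (in Python). The function names, the table heuristic, and the per-section quota are hypothetical and do not reflect the organizers' actual pipeline.

```python
import random

def looks_like_table(snippet: str) -> bool:
    # Crude heuristic (an assumption): many digits or layout characters
    # suggest tabular rather than running text.
    digit_ratio = sum(ch.isdigit() for ch in snippet) / max(len(snippet), 1)
    return digit_ratio > 0.2 or "\t" in snippet or "|" in snippet

def balanced_sample(snippets_by_section, per_section, seed=0):
    # Draw roughly the same number of snippets from each DNK criterion
    # section, skipping snippets that look like structured data.
    rng = random.Random(seed)
    sampled = []
    for section, snippets in snippets_by_section.items():
        candidates = [s for s in snippets if not looks_like_table(s)]
        rng.shuffle(candidates)
        sampled.extend((section, s) for s in candidates[:per_section])
    return sampled
```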
The text snippets are preprocessed with a named entity recognition (NER) tool, and then checked manually for further personally identifiable information. Personally identifiable words or phrases are replaced by one of the tags below:
Location names (e.g. Berlin) and general terms for types of companies (e.g. Sparkassen) are not anonymized, unless they are part of the name of a specific organization (e.g. Stadtsparkasse Augsburg). Certain large governmental and non-governmental organizations that are referenced in their role of establishing sustainability reporting standards, such as laws and certificates, are not anonymized either.
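A minimal sketch of this kind of NER-based anonymization is given below, using spaCy's German model. The task's actual tool and the official tag inventory are not specified here, so the placeholder tags in the code are hypothetical.

```python
import spacy

nlp = spacy.load("de_core_news_sm")  # a German NER model; the task's actual tool is unspecified

# Hypothetical placeholder tags, NOT the official anonymization tags.
PLACEHOLDER = {"PER": "[PERSON]", "ORG": "[ORGANIZATION]"}

def anonymize(text: str) -> str:
    # Replace person and organization mentions; locations (e.g. Berlin)
    # are deliberately left untouched, mirroring the rule described above.
    doc = nlp(text)
    out = text
    # Replace from the end so earlier character offsets remain valid.
    for ent in sorted(doc.ents, key=lambda e: e.start_char, reverse=True):
        tag = PLACEHOLDER.get(ent.label_)
        if tag:
            out = out[:ent.start_char] + tag + out[ent.end_char:]
    return out
```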
Task A: The challenge is to assign a suitable content class to each text sample. The label for each instance is the name of the DNK reporting criterion section the text snippet was sampled from.
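As an illustration of the classification setup, a minimal baseline sketch is shown below. The toy snippets and labels are invented stand-ins for the released data; this is not an official baseline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for training snippets and their DNK criterion labels.
train_texts = [
    "Wir reduzieren unseren Wasserverbrauch kontinuierlich.",
    "Die Vergütung des Vorstands ist an Nachhaltigkeitsziele gekoppelt.",
]
train_labels = ["Usage of Natural Resources", "Incentive Systems"]

baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
baseline.fit(train_texts, train_labels)
print(baseline.predict(["Der Stromverbrauch sank im Berichtsjahr deutlich."]))
```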
Task B: The challenge is to rate the verifiability of the statement (e.g. goal or state description) made in the last sentence of each text snippet, with the previous sentences given as context for better understanding. We use a numerical score between 0.0 (not verifiable) and 1.0 (clearly verifiable), and predictions will be evaluated by their Kendall Tau-b rank correlation with human ratings.
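The ranking-based evaluation can be reproduced with SciPy, for example as sketched below; the score vectors are hypothetical.

```python
from scipy.stats import kendalltau

predicted = [0.9, 0.1, 0.5, 0.7, 0.3]  # hypothetical system scores
gold = [1.0, 0.0, 0.33, 0.67, 0.33]    # hypothetical human ratings
tau, p_value = kendalltau(predicted, gold, variant="b")  # Tau-b handles ties
print(f"Kendall Tau-b: {tau:.3f}")
```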
Task B ratings are derived from human annotation on a four-point scale:
The annotation was executed via paid crowdsourcing. We collected ~5 crowd annotations per sample and took the majority vote; in cases where the vote was tied, we computed the arithmetic mean of the tied values. In these cases, we also report the standard deviation over the tied values (in cases where there was a unique majority vote, the standard deviation is 0.0). The standard deviation is not strictly part of the shared task, but may be used by participants to gain insight into the uncertainty/difficulty of individual samples.
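A small sketch of this aggregation is given below, under the assumption that the four scale points are mapped to numeric values and that the reported figure is the population standard deviation; both conventions are assumptions, as are the example rating values.

```python
from collections import Counter
from statistics import mean, pstdev

def aggregate(ratings):
    # Majority vote over ~5 crowd ratings; on a tie, take the arithmetic
    # mean of the tied values and report their standard deviation
    # (0.0 when the majority vote is unique).
    counts = Counter(ratings)
    top = max(counts.values())
    tied = [value for value, count in counts.items() if count == top]
    if len(tied) == 1:
        return tied[0], 0.0
    return mean(tied), pstdev(tied)

print(aggregate([1.0, 1.0, 1.0, 0.67, 0.33]))   # unique majority -> (1.0, 0.0)
print(aggregate([1.0, 1.0, 0.67, 0.67, 0.33]))  # tie between 1.0 and 0.67
```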
IMPORTANT: Participants should not rely on the standard deviation (task_b_stdev) or the publication year of a sample to make their predictions, as this information may not be given in the final evaluation data. The evaluation data will also contain reports from the years 2022 and 2023, whereas the training and development data only span 2017-2021.
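For instance, these fields can simply be dropped before training. A minimal sketch assuming a JSON Lines release is shown below; only task_b_stdev is named above, the file name and the year column name are assumptions.

```python
import pandas as pd

# Hypothetical file and column names; only "task_b_stdev" is named in the
# task description, "year" and "train.jsonl" are assumptions.
train = pd.read_json("train.jsonl", lines=True)
train = train.drop(columns=["task_b_stdev", "year"], errors="ignore")
```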
Can I submit my results to special tracks?