Daleel 2026: Arabic Argumentative Discourse Mining Shared Task
Overview
Argument mining is increasingly important for understanding opinions, supporting discussions, enabling explainable AI, analyzing media, developing educational tools, and evaluating language models. However, Arabic remains under-resourced in this area, particularly for detailed, discourse-level argument mining. The primary aim of this shared task is to identify argumentative discourse units in two main forms of Arabic argumentative discourse, editorials and debates, and to classify their types.
The shared task focuses on the following types:
- Common ground: The unit states common knowledge, a self-evident fact, an accepted truth, or similar. It refers to general issues, not to specific events. Even if not known in advance, it will be accepted without proof or further support by all or nearly all possible readers.
- Assumption: The unit states an assumption, conclusion, judgment, or opinion of the author, a general observation, a possibly false fact, or similar. For readers to accept it, it is, or would need to be, supported by other units.
- Testimony: The unit gives evidence by stating or quoting that a proposition was made by some expert, authority, witness, group, organization, or similar.
- Statistics: The unit gives evidence by stating or quoting the results or conclusions of quantitative research, studies, empirical analyses of data, or similar. A reference may be given but is not required.
- Anecdote: The unit gives evidence by stating personal experience of the author, an anecdote, a concrete example, an instance, a specific event, or similar.
- Other: The unit adds little or nothing to the argumentative discourse, or it does not match any of the above classes.
The shared task is named Daleel after the Arabic word دليل, which can mean evidence, proof, indication, argument, or guide. The name reflects the aim of the task: to identify the discourse units that guide readers through Arabic arguments, including common ground, assumptions, testimony, statistics, anecdotes, and other argumentative functions.
Tasks
Task 1: Argumentative Discourse Unit Classification
This is a multi-label classification task: given a paragraph, predict the types of all argumentative discourse units present.
- Dataset: To be sent to participants upon registration
- Evaluation: Systems will be evaluated using F1.
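Example paragraph (Arabic original, followed by an English translation):
وهي بفعل "قانون القومية" الذي أقره الكنيست قبل السابع من أكتوبر/تشرين الأول بخمس سنوات، جزء لا يتجزأ من الأرض الموعودة لشعب الله المختار.. في تجسيد استفزازي للأيديولوجيا الدينية حين تصبح محركا للسياسة، أو تحل محلها.
Translation: By virtue of the "Nation-State Law", which the Knesset passed five years before October 7, it is an integral part of the land promised to God's chosen people, in a provocative embodiment of religious ideology when it becomes the driver of politics or takes its place.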
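The Task 1 description names F1 as the metric but does not say how it is averaged over labels. The sketch below is only an illustration of multi-label F1 under our own assumptions: each paragraph carries a set of labels, the label identifiers in `LABELS` are hypothetical spellings of the six types listed above, and both micro- and macro-averaged scores are shown because the official averaging is not stated here. The official evaluation script may differ.

```python
from typing import Dict, List, Set

# Hypothetical identifiers for the six unit types; the official label
# names in the released data may be spelled differently.
LABELS = ["common_ground", "assumption", "testimony",
          "statistics", "anecdote", "other"]


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_f1(gold: List[Set[str]], pred: List[Set[str]]) -> Dict[str, float]:
    """Micro- and macro-averaged F1 over per-paragraph label sets."""
    counts = {label: [0, 0, 0] for label in LABELS}  # tp, fp, fn
    for gold_set, pred_set in zip(gold, pred):
        for label in LABELS:
            if label in gold_set and label in pred_set:
                counts[label][0] += 1
            elif label in pred_set:
                counts[label][1] += 1
            elif label in gold_set:
                counts[label][2] += 1
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    per_label = [f1(*c) for c in counts.values()]
    return {"micro": f1(tp, fp, fn), "macro": sum(per_label) / len(LABELS)}


gold = [{"assumption", "testimony"}, {"statistics"}]
pred = [{"assumption"}, {"statistics", "anecdote"}]
scores = multilabel_f1(gold, pred)
print(scores)
```

Note that in this sketch a label that never occurs in either gold or predictions contributes 0.0 to the macro average, which is one of several possible conventions.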
Task 2: Argumentative Discourse Unit Detection
This is a sequence tagging task: given a paragraph, detect the argumentative discourse units present in it, along with the exact spans in which each unit appears.
- Dataset: To be sent to participants upon registration
- Evaluation: Systems will be evaluated using a modified F1 that gives credit for partial matches between gold and predicted spans.
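The exact definition of the modified F1 is not given here. As a rough illustration of what partial span matching can look like, the sketch below assumes spans are `(start, end, label)` triples with an exclusive end offset, and credits each span with the fraction of its length covered by the best-overlapping span of the same label on the other side. The span format, the function names, and the overlap-based credit are all assumptions of this example, not the official metric.

```python
from typing import List, Tuple

# Assumed span format: (start, end_exclusive, label).
Span = Tuple[int, int, str]


def overlap(a: Span, b: Span) -> int:
    """Number of positions shared by two spans (0 if they are disjoint)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def coverage(spans: List[Span], others: List[Span]) -> float:
    """Sum, over spans, of the best covered fraction by a same-label span."""
    total = 0.0
    for span in spans:
        length = span[1] - span[0]
        if length <= 0:
            continue  # an empty span earns no credit
        best = max((overlap(span, o) for o in others if o[2] == span[2]),
                   default=0)
        total += best / length
    return total


def partial_match_f1(gold: List[Span], pred: List[Span]) -> float:
    """Overlap-weighted F1: soft precision over pred, soft recall over gold."""
    precision = coverage(pred, gold) / len(pred) if pred else 0.0
    recall = coverage(gold, pred) / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold_spans = [(0, 10, "assumption"), (12, 20, "testimony")]
pred_spans = [(0, 5, "assumption"), (12, 20, "testimony")]
print(round(partial_match_f1(gold_spans, pred_spans), 4))  # 0.8571
```

In this example the first prediction covers only half of its gold span, so soft recall drops to 0.75 while soft precision stays at 1.0.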
Evaluation Settings
To support both controlled comparison and realistic system development, we will report results under two evaluation settings: in-domain and cross-domain.
In-domain setting
In the in-domain setting, systems are trained and evaluated on the same type of text. For example:
- Train on editorials and evaluate on editorials
- Train on debates and evaluate on debates
This setting measures how well systems perform when the training and test data come from the same domain.
Cross-domain setting
In the cross-domain setting, systems are trained on one type of text and evaluated on another. For example:
- Train on editorials and evaluate on debates
- Train on debates and evaluate on editorials
This setting measures how well systems generalize across different forms of Arabic argumentative discourse.
Results for the in-domain and cross-domain settings will be reported separately.
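The two settings can be enumerated as train/test pairs over the two text types. The following is a minimal sketch, assuming hypothetical domain identifiers `editorials` and `debates`; it only labels each pair and does not load any data.

```python
from itertools import product
from typing import Dict, List

# Hypothetical identifiers for the two text types in the shared task.
DOMAINS = ["editorials", "debates"]


def evaluation_settings(domains: List[str]) -> List[Dict[str, str]]:
    """List every train/test pair and label it in-domain or cross-domain."""
    return [
        {
            "train": train,
            "test": test,
            "setting": "in-domain" if train == test else "cross-domain",
        }
        for train, test in product(domains, repeat=2)
    ]


settings = evaluation_settings(DOMAINS)
for row in settings:
    print(f"train={row['train']:<10} test={row['test']:<10} {row['setting']}")
```

Keeping the setting label next to each pair makes it straightforward to report in-domain and cross-domain scores in separate tables.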
Resources and Methods Settings
Participants may submit systems under two resource settings: Closed or Open. These settings are intended to make the results easier to compare and to clearly distinguish controlled systems from systems that use additional data or larger models.
Closed Track
The closed track is intended for fair and controlled comparison between systems. In this track, participants may use:
- The training and development data provided by the organizers
- Publicly available pretrained models
- Open-weight LLMs with a maximum size of 70B parameters
- Standard NLP tools, such as tokenizers, segmenters, morphological analyzers, POS taggers, or NER tools
- Prompting, fine-tuning, rule-based methods, or hybrid methods
In this track, participants may not use:
- Additional labeled datasets for training
- Closed-weight or proprietary LLMs
- Open-weight LLMs larger than 70B parameters
- Manually created extra labeled training examples
- Any information from the final test labels
Participants submitting to the closed track must clearly describe all models, tools, prompts, preprocessing steps, and postprocessing steps used.
Open Track
The open track is intended for systems that use additional resources or larger models. In this track, participants may use:
- Additional datasets for training
- External resources
- Closed-weight or proprietary LLMs
- Open-weight LLMs of any size
- Retrieval-based methods
- Prompting, fine-tuning, data augmentation, translation, rule-based methods, or hybrid methods
Participants submitting to the open track must clearly report all external resources used, including datasets, model names and versions, prompts, training procedures, preprocessing steps, and postprocessing steps.
Results from the closed and open tracks will be reported separately.
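As a quick self-check before submitting, a team could encode the main track rules as a small function. The sketch below is illustrative only: the `SystemDescription` fields and the `eligible_track` helper are hypothetical, and it covers just three of the closed-track restrictions listed above (extra labeled data, proprietary LLMs, and the 70B parameter limit for open-weight LLMs). It is not an official eligibility checker.

```python
from dataclasses import dataclass

# Closed-track limit on open-weight LLM size, in billions of parameters.
MAX_CLOSED_PARAMS_B = 70.0


@dataclass
class SystemDescription:
    """Hypothetical summary of a submission; field names are illustrative."""
    uses_extra_labeled_data: bool      # includes manually created examples
    uses_proprietary_llm: bool         # closed-weight or proprietary LLM
    largest_open_llm_params_b: float   # 0.0 if no open-weight LLM is used


def eligible_track(system: SystemDescription) -> str:
    """Return 'closed' if the three encoded rules hold, otherwise 'open'."""
    closed_ok = (
        not system.uses_extra_labeled_data
        and not system.uses_proprietary_llm
        and system.largest_open_llm_params_b <= MAX_CLOSED_PARAMS_B
    )
    return "closed" if closed_ok else "open"


print(eligible_track(SystemDescription(False, False, 8.0)))   # closed
print(eligible_track(SystemDescription(False, False, 72.0)))  # open
```

A system of exactly 70B parameters passes this check, since the closed track states a maximum size of 70B and excludes only larger models.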
Prizes
Monetary prizes will be awarded to the authors of the top three system description papers submitted to the shared task. Papers will be evaluated based on the quality and originality of the proposed methodology.
- 1st place: $400
- 2nd place: $200
- 3rd place: $150
Tentative Timeline
- May 22, 2026: Task website, documentation, and registration form released
- June 5, 2026: Release of data, baselines, and evaluation scripts
- July 25, 2026: Registration deadline and release of final evaluation input data
- July 30, 2026: System submission deadline and final evaluation
- August 6, 2026: System description papers submission deadline
- August 13, 2026: Notification of acceptance
- August 22, 2026: Camera-ready submission of system papers (mandatory)
- October 24–29, 2026: ArabicNLP/EMNLP Conference 2026, Budapest, Hungary
All deadlines are 11:59 pm UTC-12 (“Anywhere on Earth”).
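Because "Anywhere on Earth" is twelve hours behind UTC, a deadline falls on the following calendar day in UTC. The short sketch below converts a deadline date to UTC using only the Python standard library; the helper name `aoe_deadline_in_utc` is our own, and the date used is the tentative system submission deadline from the list above.

```python
from datetime import datetime, timedelta, timezone

# "Anywhere on Earth" corresponds to a fixed offset of UTC-12.
AOE = timezone(timedelta(hours=-12), name="AoE")


def aoe_deadline_in_utc(year: int, month: int, day: int) -> datetime:
    """Return 11:59 pm AoE on the given date, expressed in UTC."""
    local_deadline = datetime(year, month, day, 23, 59, tzinfo=AOE)
    return local_deadline.astimezone(timezone.utc)


deadline_utc = aoe_deadline_in_utc(2026, 7, 30)
print(deadline_utc.isoformat())  # 2026-07-31T11:59:00+00:00
```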
Registration
To participate in the shared task and access the dataset, please complete the team registration form. Submit your system outputs via our dedicated CodaBench pages below.
CodaBench Shared Task Pages:
Organizers
- Sara Nabhani, University of Groningen, Netherlands
- Nahla Bassyouni, QatarDebate, Qatar
- Ali Al-Zawqari, Vrije Universiteit Brussel, Belgium
- Mohammad Khader, QatarDebate, Qatar
- Khalid Al-Khatib, University of Groningen, Netherlands
Contact Us
For inquiries, please reach out to the organizing team at:
arabic-argument-mining@argsbase.net