Daleel 2026: Arabic Argumentative Discourse Mining Shared Task

Overview

Argument mining is increasingly important for understanding opinions, supporting discussions, enabling explainable AI, analyzing media, developing educational tools, and evaluating language models. However, Arabic remains under-resourced in this area, particularly for detailed, discourse-level argument mining. The primary aim of this shared task is to identify argumentative discourse units in two main forms of Arabic argumentative discourse, editorials and debates, and to classify their types.

The shared task focuses on the following types:

Common ground: The unit states common knowledge, a self-evident fact, an accepted truth, or similar. It refers to general issues, not to specific events. Even if not known in advance, it will be accepted without proof or further support by all or nearly all possible readers.
Assumption: The unit states an assumption, conclusion, judgment, or opinion of the author, a general observation, possibly false fact, or similar. To make readers accept it, it is or it would need to be supported by other units.
Testimony: The unit gives evidence by stating or quoting that a proposition was made by some expert, authority, witness, group, organization, or similar.
Statistics: The unit gives evidence by stating or quoting the results or conclusions of quantitative research, studies, empirical analyses of data, or similar. A reference may but need not necessarily be given.
Anecdote: The unit gives evidence by stating personal experience of the author, an anecdote, a concrete example, an instance, a specific event, or similar.
Other: The unit does not or hardly adds to the argumentative discourse or it does not match any of the above classes.

The shared task is named Daleel after the Arabic word دليل, which can mean evidence, proof, indication, argument, or guide. The name reflects the aim of the task: to identify the discourse units that guide readers through Arabic arguments, including common ground, assumptions, testimony, statistics, anecdotes, and other argumentative functions.

Tasks

Task 1: Argumentative Discourse Unit Classification

This is a multi-label classification task: given a paragraph, predict the types of all argumentative discourse units present.

Dataset: To be sent to participants upon registration
Evaluation: Systems will be evaluated using F1.

Example from news editorials:

                    إسرائيل في زمن اليمين المتطرف المدعوم من يمين أمريكي، لا يقل تطرفا، تنتقل من دائرة الحديث عن "أراضٍ متنازع عليها" أو "متفاوض حولها"، إلى مربع إخراج هذه المناطق من دائرتي التفاوض والتنازع، فهي أرض إسرائيلية بالكامل، لا مطرح عليها لشيء اسمه الضفة الغربية، هناك "يهودا والسامرة" فقط.

                    وهي بفعل "قانون القومية" الذي أقره الكنيست قبل السابع من أكتوبر/تشرين الأول بخمس سنوات، جزء لا يتجزأ من الأرض الموعودة لشعب الله المختار.. في تجسيد استفزازي للأيديولوجيا الدينية حين تصبح محركا للسياسة، أو تحل محلها.
                
This example includes two types of argumentative discourse units: Assumption and Testimony.
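The task description does not state which F1 averaging is used, so the following Python sketch is only an illustration of how paragraph-level label sets could be scored. It computes micro- and macro-averaged F1 over the six unit types; the function names and the decision to average over all six labels (including labels with no support) are assumptions, not the official evaluation script.

```python
from typing import Dict, List, Set

# The six unit types named in the task description.
LABELS = ["Common ground", "Assumption", "Testimony",
          "Statistics", "Anecdote", "Other"]


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_f1(gold: List[Set[str]], pred: List[Set[str]]) -> Dict[str, float]:
    """Micro and macro F1 over per-paragraph label sets (illustrative only)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must contain the same number of paragraphs")
    counts = {label: {"tp": 0, "fp": 0, "fn": 0} for label in LABELS}
    for g, p in zip(gold, pred):
        for label in LABELS:
            if label in g and label in p:
                counts[label]["tp"] += 1
            elif label in p:
                counts[label]["fp"] += 1
            elif label in g:
                counts[label]["fn"] += 1
    # Macro: unweighted mean of per-label F1 (assumption: all six labels count).
    per_label = [f1_from_counts(**c) for c in counts.values()]
    # Micro: pool the counts across labels before computing F1.
    micro = f1_from_counts(
        sum(c["tp"] for c in counts.values()),
        sum(c["fp"] for c in counts.values()),
        sum(c["fn"] for c in counts.values()),
    )
    return {"micro": micro, "macro": sum(per_label) / len(per_label)}


# The editorial example above has the gold label set {Assumption, Testimony}.
gold = [{"Assumption", "Testimony"}, {"Other"}]
pred = [{"Assumption"}, {"Other", "Statistics"}]
scores = multilabel_f1(gold, pred)
print(scores)
```

Micro averaging favors frequent types, while macro averaging weights all types equally; which of the two (if either) the organizers use is not specified here.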

Task 2: Argumentative Discourse Unit Detection

This is a sequence tagging task: given a paragraph, detect the argumentative discourse units present in it and the exact span of each unit.

Dataset: To be sent to participants upon registration
Evaluation: Systems will be evaluated using a modified F1 that gives credit for partial matches between predicted spans and gold spans.
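The exact definition of the modified F1 is not given here. As a rough illustration of what partial-match scoring can look like, the Python sketch below credits each predicted span with its best overlap ratio (intersection over union of offsets) against gold spans of the same type. The span representation, the overlap measure, and the function names are all assumptions and may differ from the official scorer.

```python
from typing import List, Tuple

# Hypothetical span format: (start_offset, end_offset_exclusive, unit_type).
Span = Tuple[int, int, str]


def overlap_ratio(a: Span, b: Span) -> float:
    """Intersection over union of two spans; 0.0 if their types differ."""
    if a[2] != b[2]:
        return 0.0
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0


def partial_match_f1(gold: List[Span], pred: List[Span]) -> float:
    """F1 where each span earns partial credit from its best-overlapping match."""
    if not gold or not pred:
        # Assumption: an empty side scores 0.0, including the both-empty case.
        return 0.0
    # Precision: average best overlap of each predicted span with any gold span.
    precision = sum(max(overlap_ratio(p, g) for g in gold) for p in pred) / len(pred)
    # Recall: average best overlap of each gold span with any predicted span.
    recall = sum(max(overlap_ratio(g, p) for p in pred) for g in gold) / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold_spans = [(0, 10, "Assumption"), (10, 20, "Other")]
pred_spans = [(0, 5, "Assumption"), (10, 20, "Other")]
span_score = partial_match_f1(gold_spans, pred_spans)
print(span_score)
```

In this sketch a prediction that covers half of a gold unit earns half credit, whereas strict exact-match F1 would count it as both a false positive and a false negative.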

Example from debates:

                    أولا عندما نتحدث عن المجلس نحن نتحدث عن الدول المتقدمة تكنولوجيا 
                    و
                    لماذا نأخذ الدول المتقدمة تكنولوجيا و ليست الدول التي لا يوجد تكنولوجيا بها
                لأن، هؤلاء الدول التي لا توجد تكنولوجيا بها و لا تعتبر متقدمة تكنولوجيا ليست لديها خوف استبدال البشر بالتكنولوجيا أصلا، و
                 عندما نتحدث عن أنظمة التكنولوجيا و المعلومات نحن نتحدث عن العديد من التكنولوجيات التي الآن هي تعتبر التكنولوجيات الصاعدة في مجتمعنا ألا و هي مثلا مثل الذكاء الاصطناعي و الذي هو ال artificial intelligence الواقع الافتراضي و المدمج و المحسن augmented virtual and mixed reality  و الblock chain و أنظمة المعلومات و علوم البيانات أو الdata science.

                
This example includes four argumentative discourse units of three types:

Assumption
Common ground
Other

Evaluation Settings

To support both controlled comparison and realistic system development, we will report results under two evaluation settings: in-domain and cross-domain.

In-domain setting

In the in-domain setting, systems are trained and evaluated on the same type of text. For example:

Train on editorials and evaluate on editorials
Train on debates and evaluate on debates

This setting measures how well systems perform when the training and test data come from the same domain.

Cross-domain setting

In the cross-domain setting, systems are trained on one type of text and evaluated on another. For example:

Train on editorials and evaluate on debates
Train on debates and evaluate on editorials

This setting measures how well systems generalize across different forms of Arabic argumentative discourse.

Results for the in-domain and cross-domain settings will be reported separately.
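The two settings amount to a small grid of train/test pairs. The Python sketch below enumerates that grid for the two domains named above; the function name and the tuple layout are illustrative assumptions, not part of any released tooling.

```python
from itertools import product
from typing import List, Tuple

DOMAINS = ("editorials", "debates")


def evaluation_settings(domains: Tuple[str, ...] = DOMAINS) -> List[Tuple[str, str, str]]:
    """Return (train_domain, test_domain, setting) for every train/test pair."""
    return [
        (train, test, "in-domain" if train == test else "cross-domain")
        for train, test in product(domains, repeat=2)
    ]


settings = evaluation_settings()
for train, test, setting in settings:
    print(f"train on {train}, evaluate on {test}: {setting}")
```

With two domains this yields two in-domain and two cross-domain runs, matching the four combinations listed above.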

Resources and Methods Settings

Participants may submit systems under two resource settings: closed or open. These settings are intended to make the results easier to compare and to clearly distinguish controlled systems from systems that use additional data or larger models.

Closed Track

The closed track is intended for fair and controlled comparison between systems. In this track, participants may use:

The training and development data provided by the organizers
Publicly available pretrained models
Open-weight LLMs with a maximum size of 70B parameters
Standard NLP tools, such as tokenizers, segmenters, morphological analyzers, POS taggers, or NER tools
Prompting, fine-tuning, rule-based methods, or hybrid methods

In this track, participants may not use:

Additional labeled datasets for training
Closed-weight or proprietary LLMs
Open-weight LLMs larger than 70B parameters
Manually created extra labeled training examples
Any information from the final test labels

Participants submitting to the closed track must clearly describe all models, tools, prompts, preprocessing steps, and postprocessing steps used.
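The closed-track rules can be read as a checklist. The Python sketch below encodes three of them (open weights only, at most 70B parameters, no additional labeled data) as a self-check that a team might run before submitting. `SystemDescription` and its fields are hypothetical names introduced for illustration; the organizers do not prescribe any such format.

```python
from dataclasses import dataclass
from typing import List

# From the rules above: open-weight LLMs up to 70B parameters are allowed.
MAX_CLOSED_PARAMS_BILLION = 70.0


@dataclass
class SystemDescription:
    """Hypothetical summary of a submission, for illustration only."""
    open_weight: bool              # False for closed-weight or proprietary LLMs
    params_billion: float          # size of the largest model used
    uses_extra_labeled_data: bool  # any labeled data beyond the provided sets


def closed_track_violations(system: SystemDescription) -> List[str]:
    """List the closed-track rules (among those encoded here) a system breaks."""
    violations = []
    if not system.open_weight:
        violations.append("closed-weight or proprietary LLM")
    if system.params_billion > MAX_CLOSED_PARAMS_BILLION:
        violations.append("model larger than 70B parameters")
    if system.uses_extra_labeled_data:
        violations.append("additional labeled training data")
    return violations


candidate = SystemDescription(open_weight=True, params_billion=8.0,
                              uses_extra_labeled_data=False)
print(closed_track_violations(candidate) or "no violations found by this check")
```

An empty list only means that none of the encoded rules is broken; the remaining rules, such as not using any information from the final test labels, still need to be verified by hand.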

Open Track

The open track is intended for systems that use additional resources or larger models. In this track, participants may use:

Additional datasets for training
External resources
Closed-weight or proprietary LLMs
Open-weight LLMs of any size
Retrieval-based methods
Prompting, fine-tuning, data augmentation, translation, rule-based methods, or hybrid methods

Participants submitting to the open track must clearly report all external resources used, including datasets, model names and versions, prompts, training procedures, preprocessing steps, and postprocessing steps.

Results from the closed and open tracks will be reported separately.

Prizes

Monetary prizes will be awarded to the authors of the top three system description papers submitted to the shared task. Papers will be evaluated based on the quality and originality of the proposed methodology.

1st place: $400
2nd place: $200
3rd place: $150

Tentative Timeline

May 22, 2026: Task website, documentation, and registration form released
June 5, 2026: Release of data, baselines, and evaluation scripts
July 25, 2026: Registration deadline and release of final evaluation input data
July 30, 2026: System submission deadline and final evaluation
August 6, 2026: System description papers submission deadline
August 13, 2026: Notification of acceptance
August 22, 2026: Camera-ready submission of system papers (mandatory)
October 24–29, 2026: ArabicNLP/EMNLP Conference 2026, Budapest, Hungary

Registration

To participate in the shared task and access the dataset, please complete the team registration form. Submit your system outputs via our dedicated CodaBench pages below.

Team Registration Form

CodaBench Shared Task Pages:

Task 1

Open Track
Closed Track

Task 2

Open Track
Closed Track

Organizers

Sara Nabhani, University of Groningen, Netherlands
Nahla Bassyouni, QatarDebate, Qatar
Ali Al-Zawqari, Vrije Universiteit Brussel, Belgium
Mohammad Khader, QatarDebate, Qatar
Khalid Al-Khatib, University of Groningen, Netherlands