Beyond MCQ: Open-Ended Arabic LLM Evaluation Presented at LREC 2026

We are happy to share that our work on moving Arabic LLM evaluation beyond multiple-choice questions was presented at LREC 2026 in Palma, Mallorca.


🧭 Motivation

Many LLM benchmarks rely heavily on multiple-choice questions, but real users rarely interact with AI that way. They ask open-ended questions, switch dialects, expect context, and look for answers that reflect how language is actually used. Hunzalah Hassan Bhatti has been exploring this problem for Arabic-centric LLM evaluation, moving beyond MCQ accuracy toward open-ended benchmarking across English, Modern Standard Arabic, and multiple Arabic dialects.

🧪 Approach

The idea is practical and reusable: extend existing benchmark resources rather than building everything from scratch. The recipe includes:

  • Converting multiple-choice questions into open-ended questions
  • Involving native-speaker annotators for post-editing
  • Generating chain-of-thought rationales to fine-tune smaller models for step-by-step reasoning

The resulting dataset is parallel across multiple language varieties, so results on the same question can be compared directly from one variety to another.
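As a rough illustration of the first step in the recipe, the sketch below turns one multiple-choice item into an open-ended item by dropping the options and keeping the gold option as a reference answer. It is not the pipeline used in this work: the class names (`MCQItem`, `OpenEndedItem`), the `needs_post_edit` flag, and the sample question are all hypothetical, and a real conversion would likely also rephrase the question before native-speaker post-editing.

```python
from dataclasses import dataclass


@dataclass
class MCQItem:
    """Hypothetical multiple-choice record: a question, its options, and the gold index."""
    question: str
    options: list[str]
    answer_index: int


@dataclass
class OpenEndedItem:
    """Hypothetical open-ended record produced from an MCQ item."""
    question: str
    reference_answer: str
    # Assumption: every converted item is queued for native-speaker post-editing.
    needs_post_edit: bool = True


def to_open_ended(item: MCQItem) -> OpenEndedItem:
    """Drop the answer options and keep the gold option text as the reference answer."""
    if not 0 <= item.answer_index < len(item.options):
        raise ValueError("answer_index is out of range for the given options")
    return OpenEndedItem(
        question=item.question.strip(),
        reference_answer=item.options[item.answer_index].strip(),
    )


# Illustrative item only; it is not taken from the dataset.
example = MCQItem(
    question="What is the capital of Qatar?",
    options=["Doha", "Muscat", "Manama", "Riyadh"],
    answer_index=0,
)
converted = to_open_ended(example)
print(converted.question, "->", converted.reference_answer)
```

Keeping the reference answer alongside the question means a model's free-form response can later be judged against it, rather than against a fixed set of options.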

🔑 Key Takeaways

  • MCQ accuracy can make a model look stronger than it really is.
  • Cultural and linguistic competence becomes clearer when models must answer openly, especially across dialects.
  • Model capabilities can be improved by ingesting dialectal knowledge.

🔗 Resources



