LAraBench: Benchmarking Arabic AI with Large Language Models

At Qatar Computing Research Institute, we are committed to advancing Arabic Language Technologies to next levels. Although significant progress has been made, the journey ahead remains substantial.

With the advent of ChatGPT, many wonder whether Large Language Models (LLMs) have resolved the myriad challenges in the field, wondering what comes next. Our stance is clear, there is still a long way to go.

Our extensive research into benchmarking Arabic NLP and Speech tasks reveals significant insights, which has recently been accepted at EACL 2024

Interesting insights are:

  • Closed models, like GPT-4, are narrowing the performance gap with state-of-the-art (SOTA) benchmarks.
  • However, open models such as Bloom and Jais lag considerably behind the SOTA.
  • The majority of models struggle with tasks such as segmentation, Part-of-Speech (POS) tagging, and Named Entity Recognition (NER)
  • There is a gap in performance between MSA and SOTA

This study stands as the most comprehensive analysis to date for Arabic, encompassing 33 distinct tasks across 61 publicly available datasets, 98 experimental setups, and featuring 296K data points, 46 hours of speech, and 30 sentences for Text-to-Speech (TTS). An end-to-end comparison with SOTA.

We are confident this paper will be an invaluable asset to the community and an essential guide for those embarking on a career in Arabic NLP. Our work goes beyond mere benchmarking; we offer detailed descriptions and thorough accounts of both the tasks and datasets involved.

As a part of our commitment to support the community, we made all resources publicly available through our LLMeBench framework




Enjoy Reading This Article?

Here are some more articles you might like to read next:

  • ArAIEval Shared Task: Propagandistic Techniques Detection in Unimodal and Multimodal Arabic Content
  • Can GPT-4 Identify Propaganda?