NLP For Arabic Cheat Sheet by

Arabic Processing cheatsheet, NLP.

Arabic script

36 letters = 28 consonants + 6 hamza forms (ؤ أ إ آ ء ئ) + Ta marbuta ة + Alif maqsura ى
NLP Task - Orthographic Transliteration:
19 letter shapes, plus ك, the hamza, and the dots
Buckwalter transliteration model
Letter shapes: initial, medial, final, and isolated
NLP Task - Orthographic Normalization:
Cursive connected style
Encoding cleanup -> complex ligatures, ك
Remove tatweel and diacritics
Diacritics (vowel, nunation, shadda, dagger alif)
Letter normalization: أ إ آ --> ا
Digits (Western and Eastern) - written LTR
Handwriting - MADCAT
Punctuation [: . ! "] [؟ ،], Tatweel
Automatic Diacritization - only 1.5%
Typography, growing font library
Letters from other languages: Persian, Kurdish
Encoding: Unicode, ISO-8859, CP-1256
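
The normalization steps listed above (remove tatweel and diacritics, map أ إ آ to ا) can be sketched in Python roughly as follows. This is a minimal sketch, not a standard tool: the function name `normalize_arabic` and the exact set of code points treated as diacritics are assumptions chosen for illustration.

```python
import re

TATWEEL = "\u0640"                                 # ـ (kashida / elongation)
# Assumed diacritic set: tanween, short vowels, shadda, sukun, dagger alif
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
ALIF_VARIANTS = re.compile("[\u0622\u0623\u0625]")  # آ أ إ
BARE_ALIF = "\u0627"                               # ا


def normalize_arabic(text: str) -> str:
    """Strip tatweel and diacritics, then unify hamzated/madda alif to bare alif."""
    text = text.replace(TATWEEL, "")
    text = DIACRITICS.sub("", text)
    text = ALIF_VARIANTS.sub(BARE_ALIF, text)
    return text


if __name__ == "__main__":
    # العَيـن (with fatha and tatweel) and أحمد (with hamzated alif)
    print(normalize_arabic("\u0627\u0644\u0639\u064E\u064A\u0640\u0646"))
    print(normalize_arabic("\u0623\u062D\u0645\u062F"))
```

Normalization like this is lossy by design: it trades spelling detail for consistency, so it suits search and matching better than tasks that need the original diacritics.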

Challenges to Arabic NLP

Orthographic ambiguity
Orthographic inconsistency
Morphological Complexity
Dialect Variation
Annotated resource poverty

Arabic Phonology

Minimal pair /q/ vs /k/: قلب (heart) & كلب (dog)
MSA ق /q/ -> dialects: /q/, /k/, /ʔ/, /g/, /ʤ/, /ɢ/
MSA ث /θ/ -> dialects: /θ/, /t/, /s/
MSA ذ /ð/ -> dialects: /ð/, /d/, /z/
MSA ج /ʤ/ -> dialects: /ʤ/, /g/, /ʒ/

Orthography Connected to Phoneme

Orthography - Ambiguity

Optional Diacritization
Complex = No vowels? long vowels, initial
An Arabic word has on average:
12.3 analyses, 6.8 diacritizations, and 2.7 lemmas
Morpho-Phonemic Spelling issue 1:
Sun and moon letters (الشمسية والقمرية) || ة -> ه || عصا -> عصى
Morpho-Phonemic Spelling issue 2:
Tanween is pronounced as /n/; the alif after the plural waw (واو الجماعة) is silent
Standardization Issue
سوريا vs سورية || فيلم vs فلم || أفريقية vs أفريقيا
NLP Task:
Proper Name Transliteration
Qaddafi problem (Kadafi, Qadafi, ...)
Schwarzenegger problem (شوارزينجر, شوارزينغر)
Hassan Problem
Marie problem ---> lossy mapping to ماري
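
As a rough illustration of the Qaddafi problem above, one crude way to group Latin spelling variants of the same Arabic name is to reduce each spelling to a consonant skeleton. The rules below (merging q/g/kh/gh into k, dropping vowels, collapsing doubled letters) are hypothetical heuristics invented for this sketch, not an established transliteration-matching algorithm, and they will also merge names that are unrelated.

```python
import re


def skeleton(name: str) -> str:
    """Reduce a Latin-script name to a rough consonant skeleton (toy heuristic)."""
    s = name.lower()
    s = re.sub(r"kh|gh|q|g", "k", s)   # assumed: common Latin renderings of ق
    s = re.sub(r"dh", "d", s)          # assumed: dh and d treated alike
    s = re.sub(r"[aeiouy]", "", s)     # drop vowels, which vary the most
    s = re.sub(r"(.)\1+", r"\1", s)    # collapse doubled consonants
    return s


if __name__ == "__main__":
    for variant in ["Qaddafi", "Kadafi", "Qadafi"]:
        print(variant, "->", skeleton(variant))
```

All three spellings named above reduce to the same key, which is the point of the sketch; a real system would need a trained or linguistically informed matcher.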

Arabic Spelling

Hamzated Alif and Alif maqsura 11%
Penn Arabic Treebank
30% of words have errors (out of 2M words)
Qatar Arabic Language Bank
Arabic spelling errors are a big challenge
GIGO: Garbage In Garbage Out
Inconsistencies in Dialectal Arabic != standard
و + ب + أدلة + ها
العَيـن = eye, water spring, or Al Ain city
Spelling variants

Arabic Morphology

Morphological Complexity
A core word has many inflected forms
Gender(2), Number(3), Person(3), Aspect(3)
Tense particle (2), Mood(3), Voice(2),
Pronominal clitic (12), Conjunction clitic (3)
وسنقولها = /wasanaqūluhā/ = و + س + ن + قول + ها
English for comparison: go, went, going, gone, goes
Arabic POS tags: 22,400 tags
English POS tags: 48 tags
12.3 analyses and 2.7 lemmas per word
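Functional morphology: broken plural (جمع تكسير)
Form-based morphology: regular inflection (التصريف الطبيعي)
Prosody (علم العروض), Al-Farahidi (الفراهيدي) - prosodic writing (الكتابة العروضية)
Symbolic writing (الكتابة الرمزية) - /0//00
Tanween, shadda, vowel lengthening (الاشباع), and alif: هندن أخلل يحرمي هاذا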
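
The segmentation shown above for وسنقولها (و + س + ن + قول + ها) can be imitated with a toy rule-based splitter. This is only a sketch under strong assumptions: the clitic lists, the `min_stem` threshold, and the function name `segment` are all hypothetical, and a greedy splitter like this will wrongly cut many real words. Practical systems rely on a morphological analyzer with a lexicon.

```python
# Tiny illustrative clitic inventories (assumptions, far from complete)
CONJ = ("\u0648", "\u0641")                          # و (and), ف (so)
FUTURE = ("\u0633",)                                 # س (will)
PERSON = ("\u0646", "\u064A", "\u062A", "\u0623")    # ن ي ت أ (imperfect subject prefixes)
PRONOUNS = ("\u0647\u0627", "\u0647\u0645", "\u0646\u0627", "\u0647", "\u0643")  # ها هم نا ه ك


def segment(word: str, min_stem: int = 3) -> list[str]:
    """Greedily peel one prefix per slot and one pronoun suffix, keeping a stem."""
    parts = []
    for slot in (CONJ, FUTURE, PERSON):
        for prefix in slot:
            if word.startswith(prefix) and len(word) - len(prefix) >= min_stem:
                parts.append(prefix)
                word = word[len(prefix):]
                break
    suffix = None
    for pron in PRONOUNS:
        if word.endswith(pron) and len(word) - len(pron) >= min_stem:
            suffix = pron
            word = word[:-len(pron)]
            break
    parts.append(word)          # whatever remains is treated as the stem
    if suffix:
        parts.append(suffix)
    return parts


if __name__ == "__main__":
    # وسنقولها
    print(" + ".join(segment("\u0648\u0633\u0646\u0642\u0648\u0644\u0647\u0627")))
```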

Tools and Papers

State-of-the-art Arabic and Arabic Dialect processing
Multi-Arabic Dialect Applications and resources
Simplification of Arabic Masterpieces for Extensive
A conventional orthography for Dialectal Arabic
Arabic lite Stemmer

POS For Arabic

Stanford Arabic parser tagset
Morphological annotation Quranic Arabic corpus
Arabic MADA system tagset
POS tag set for Modern Standard Arabic
The APT tagger (Khoja, 2001) / hybrid
The Qutuf (Altabba et al., 2010) tagger
Al-Dahdah (1989)
TreeTagger by Schmid (1995) - 20+ languages
Noun (اِسْم) <Aisom>, Verb (فِعْل) <fiEol>, and Particle (حَرْف) <Harof>.
Each one of these categories has many subcategories.
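
The angle-bracket forms in the line above (Aisom, fiEol, Harof) follow the Buckwalter transliteration, a one-to-one ASCII mapping of Arabic characters. A minimal sketch of that mapping in Python is given below; the helper names `to_buckwalter` and `from_buckwalter` are assumptions for illustration, and the table covers only the basic letters and diacritics, without the extended letters used for Persian or Kurdish.

```python
# Buckwalter table built from contiguous Unicode ranges:
# (first code point, Buckwalter symbols for consecutive code points)
_RANGES = [
    (0x0621, "'|>&<}AbptvjHxd*rzs$SDTZEg"),  # hamza ء through ghain غ
    (0x0640, "_fqklmnhwYyFNKaui~o"),          # tatweel ـ through sukun
    (0x0670, "`{"),                           # dagger alif, alif wasla
]

AR2BW = {
    chr(start + offset): symbol
    for start, symbols in _RANGES
    for offset, symbol in enumerate(symbols)
}
BW2AR = {symbol: arabic for arabic, symbol in AR2BW.items()}


def to_buckwalter(text: str) -> str:
    """Map Arabic characters to Buckwalter symbols; leave other characters as-is."""
    return "".join(AR2BW.get(ch, ch) for ch in text)


def from_buckwalter(text: str) -> str:
    """Inverse mapping; assumes the input is pure Buckwalter, not mixed Latin text."""
    return "".join(BW2AR.get(ch, ch) for ch in text)


if __name__ == "__main__":
    # اِسْم , فِعْل , حَرْف
    for word in ("\u0627\u0650\u0633\u0652\u0645",
                 "\u0641\u0650\u0639\u0652\u0644",
                 "\u062D\u064E\u0631\u0652\u0641"):
        print(to_buckwalter(word))
```

Because the mapping is one-to-one, converting to Buckwalter and back returns the original string, which is why the scheme is convenient for tools that only handle ASCII.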
