NLP For Arabic Cheat Sheet

Arabic Processing cheatsheet, NLP.

Arabic script

36 letters = 28 consonants + 6 hamza forms (ؤ أ إ آ ء ئ) + ta marbuta (ة) + alif maqsura (ى)
NLP Task - Orthographic Transliteration:
19 letter shapes, plus ك, the hamza, and the dots
Buckwalter transliteration model
Letter shapes: initial, medial, final, and isolated
NLP Task - Orthographic Normalization:
Cursive, connected style
Encoding cleanup -> complex ligatures, ك
Tatweel and diacritic removal
Diacritics (vowels, nunation, shadda, dagger alif)
Letter normalization: أ إ آ -> ا
Digits (Western and Eastern) - LTR
Handwriting - MADCAT
Punctuation [: . !"] [؟،] Tatweel
Automatic Diacritization - Only 1.5%
Typography, growing font library
Letters from other languages: Persian, Kurdish
Encoding: Unicode, ISO-8859-6, CP-1256
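
The normalization steps listed above (tatweel and diacritic removal, alif normalization, digit handling) can be sketched in a few lines of Python. This is a minimal illustration, not a standard pipeline: the function name `normalize` and the exact set of rules are assumptions, and real systems often add further steps such as ة/ه or ى/ي handling.

```python
import re

# Code points are from the Unicode Arabic block; which ones to normalize is
# an illustrative choice, not a fixed standard.
TATWEEL = "\u0640"
# Fathatan through sukun (U+064B-U+0652) plus dagger alif (U+0670)
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
# Alif with hamza above, hamza below, and madda
ALIF_VARIANTS = re.compile("[\u0623\u0625\u0622]")
# Eastern Arabic digits U+0660-U+0669 mapped to Western digits
EASTERN_TO_WESTERN = str.maketrans(
    "".join(chr(0x0660 + i) for i in range(10)), "0123456789"
)


def normalize(text: str) -> str:
    """Apply a small set of orthographic normalizations to Arabic text."""
    text = text.replace(TATWEEL, "")           # remove elongation character
    text = DIACRITICS.sub("", text)            # strip optional diacritics
    text = ALIF_VARIANTS.sub("\u0627", text)   # map hamzated alifs to bare alif
    return text.translate(EASTERN_TO_WESTERN)  # unify digit sets
```

Normalization of this kind discards information (for example, the hamza on an alif), so it is usually applied for matching and indexing rather than for display.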

Challenges to Arabic NLP

Orthographic ambiguity
Orthographic inconsistency
Morphological Complexity
Dialect Variation
Annotated resource poverty

Arabic Phonology

Minimal pair: /k/ vs /g/ in كلب & قلب
ق: MSA /q/; dialectal /q/, /k/, /ʔ/, /g/, /ʤ/, /ɢ/
ث: MSA /θ/; dialectal /θ/, /t/, /s/
ذ: MSA /ð/; dialectal /ð/, /d/, /z/
ج: MSA /ʤ/; dialectal /ʤ/, /g/, /ʒ/

Orthography Connected to Phoneme

Orthography - Ambiguity

Optional Diacritization
Complex = No vowels? long vowels, initial
An Arabic word has on average:
12.3 analyses, 6.8 diacritizations, and 2.7 lemmas
Morpho-Phonemic Spelling issue 1:
Sun and moon letters (الشمسية والقمرية) || ة -> ه || عصا -> عصى
Morpho-Phonemic Spelling issue 2:
Tanween is pronounced as a nun; the alif after the plural waw (واو الجماعة) is not pronounced
Standardization Issue
سوريا / سورية || فيلم / فلم || أفريقية / أفريقيا
NLP Task:
Proper Name Transliteration
Qaddafi problem (Kadafi, Qadafi, ...)
Schwarzenegger problem (شوارزينجر, شوارزينغر)
Hassan problem
Marie problem -> variants are lost in ماري
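
One rough way to see why the Qaddafi problem is hard is to collapse romanized spellings to a consonant skeleton and compare the keys. The sketch below is only a heuristic written for this cheat sheet: the function `name_key` and its rewrite rules are hypothetical and are not taken from any transliteration tool.

```python
import re


def name_key(name: str) -> str:
    """Collapse a romanized name to a rough consonant skeleton (heuristic)."""
    s = name.lower()
    s = re.sub(r"[^a-z]", "", s)       # drop apostrophes, hyphens, spaces
    s = re.sub(r"kh|gh|q|g", "k", s)   # merge common renderings of one consonant
    s = re.sub(r"[aeiouy]", "", s)     # drop vowels, which vary the most
    s = re.sub(r"(.)\1+", r"\1", s)    # collapse doubled letters
    return s


variants = ["Qaddafi", "Kadafi", "Qadafi"]
keys = {name_key(v) for v in variants}  # all three collapse to one key
```

A key this coarse also merges names that should stay distinct, so it is better suited to generating match candidates than to deciding that two names are the same.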

Arabic Spelling

Hamzated Alif and Alif maqsura 11%
Penn Arabic Treebank
30% of words have errors (out of 2M words)
Qatar Arabic Language Bank
Arabic spelling errors are a big challenge
GIGO: Garbage In Garbage Out
Inconsistencies in Dialectal Arabic (non-standard spelling)
و+ ب+ أدلة+ ها
العَيـن = (eye, water spring, Al Ain city)
Spelling variants

Arabic Morphology

Morphological Complexity
A core word has many inflected forms
Gender (2), Number (3), Person (3), Aspect (3),
Tense particle (2), Mood (3), Voice (2),
Pronominal clitic (12), Conjunction clitic (3)
وسنقولها = /wasanaqūluhā/ = و+ س+ ن+ قول+ ها ("and we will say it")
English, by contrast: go, went, going, gone, goes
Arabic POS tags: 22,400 tags
English POS tags: 48 tags
12.3 analyses and 2.7 lemmas per word
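Functional morphology: broken plural (جمع تكسير)
Form-based morphology: regular inflection (التصريف الطبيعي)
Prosody (علم العروض), al-Farahidi - prosodic writing (الكتابة العروضية)
Symbolic writing (الكتابة الرمزية) - /0//00
Tanween, shadda, vowel lengthening (الإشباع), and alif: هندن أخلل يحرمي هاذا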
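
The segmentation of وسنقولها into و+ س+ ن+ قول+ ها shown above can be imitated with a toy greedy clitic stripper. This is a sketch under strong assumptions: the clitic lists, the minimum stem length, and the function `segment` are all illustrative, and a real analyzer relies on a lexicon and disambiguation rather than string matching.

```python
from typing import List

# Small, incomplete clitic inventories chosen for illustration only.
CONJUNCTIONS = ("و", "ف")
FUTURE = "س"
IMPERFECT_PREFIXES = ("أ", "ن", "ي", "ت")
# Ordered longest first so that longer suffixes win.
PRONOUN_SUFFIXES = ("هما", "هم", "هن", "ها", "كم", "نا", "ه", "ك", "ي")
MIN_STEM = 3  # assumed minimum number of letters left in the stem


def segment(word: str) -> List[str]:
    """Greedily split a word into proclitics, stem, and one enclitic."""
    parts: List[str] = []
    rest = word
    # Conjunction clitic
    if rest.startswith(CONJUNCTIONS) and len(rest) - 1 >= MIN_STEM:
        parts.append(rest[0])
        rest = rest[1:]
    # Future particle followed by an imperfective subject prefix
    if (rest.startswith(FUTURE) and len(rest) > 1
            and rest[1] in IMPERFECT_PREFIXES
            and len(rest) - 2 >= MIN_STEM):
        parts.extend([rest[0], rest[1]])
        rest = rest[2:]
    # Pronominal enclitic
    suffix = ""
    for candidate in PRONOUN_SUFFIXES:
        if rest.endswith(candidate) and len(rest) - len(candidate) >= MIN_STEM:
            suffix = candidate
            rest = rest[: -len(candidate)]
            break
    parts.append(rest)
    if suffix:
        parts.append(suffix)
    return parts
```

Calling `"+".join(segment("وسنقولها"))` reproduces the five-part split given above. The same greedy rules will over-segment words whose first or last letters merely look like clitics, which is one source of the high number of analyses per word.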

Tools and Papers

State-of-the-art Arabic and Arabic Dialect processing
Multi-Arabic Dialect Applications and resources
Simplification of Arabic Masterpieces for Extensive
A conventional orthography for Dialectal Arabic
Arabic lite Stemmer

POS For Arabic

Stanford Arabic parser tagset
Morphological annotation of the Quranic Arabic Corpus
Arabic MADA system tagset
POS tag set for Modern Standard Arabic
The APT tagger (Khoja, 2001) / hybrid
The Qutuf (Altabba et al., 2010) tagger
Al-Dahdah (1989)
TreeTagger (Schmid, 1995) - 20+ languages
Noun (اِسْم) <Aisom>, Verb (فِعْل) <fiEol>, and Particle (حَرْف) <Harof>.
Each one of these categories has many subcategories.
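
The bracketed forms above (Aisom, fiEol, Harof) follow the Buckwalter transliteration, which maps each Arabic character to one ASCII character. A minimal sketch of that mapping is given below; the function names are illustrative, and the table covers the core letters and diacritics only, not every extension used by individual tools.

```python
# Buckwalter mapping for the core Arabic letters and diacritics.
BUCKWALTER = {
    "\u0621": "'", "\u0622": "|", "\u0623": ">", "\u0624": "&",
    "\u0625": "<", "\u0626": "}", "\u0627": "A", "\u0628": "b",
    "\u0629": "p", "\u062A": "t", "\u062B": "v", "\u062C": "j",
    "\u062D": "H", "\u062E": "x", "\u062F": "d", "\u0630": "*",
    "\u0631": "r", "\u0632": "z", "\u0633": "s", "\u0634": "$",
    "\u0635": "S", "\u0636": "D", "\u0637": "T", "\u0638": "Z",
    "\u0639": "E", "\u063A": "g", "\u0640": "_", "\u0641": "f",
    "\u0642": "q", "\u0643": "k", "\u0644": "l", "\u0645": "m",
    "\u0646": "n", "\u0647": "h", "\u0648": "w", "\u0649": "Y",
    "\u064A": "y", "\u064B": "F", "\u064C": "N", "\u064D": "K",
    "\u064E": "a", "\u064F": "u", "\u0650": "i", "\u0651": "~",
    "\u0652": "o", "\u0670": "`", "\u0671": "{",
}
# Reverse table for converting Buckwalter ASCII back to Arabic script.
REVERSE = {ascii_ch: ar_ch for ar_ch, ascii_ch in BUCKWALTER.items()}


def to_buckwalter(text: str) -> str:
    """Transliterate Arabic script to Buckwalter; other characters pass through."""
    return "".join(BUCKWALTER.get(ch, ch) for ch in text)


def from_buckwalter(text: str) -> str:
    """Convert Buckwalter ASCII back to Arabic script."""
    return "".join(REVERSE.get(ch, ch) for ch in text)
```

Because the mapping is one-to-one, a round trip through both functions returns the original Arabic string, which is why this scheme is widely used in tagsets and treebanks. The reverse direction is only safe on pure Buckwalter input, since ordinary Latin text would also be converted.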

