Bangla NLP | Firoj Alam

CRBLP web archive: http://web.archive.org/web/20150427192539/http://crblp.bracu.ac.bd/
ResearchGate: http://researchgate.net/project/Bangla-Language-Processing-2
Some projects hosted on SourceForge for public access: http://sourceforge.net/projects/blp/
Research Papers: http://dspace.bracu.ac.bd/xmlui/handle/10361/101/browse?value=Alam%2C+Firoj&type=author

The Bangla Language Processing research was supported by PAN Localization: http://www.panl10n.net/center-for-research-on-bangla-language-processing-crblp-bangladesh/

Bangla Word-embedding Model

Joint work: https://github.com/cogniinsight/Word-embedding-model-for-Bangla

Bangla Text to Speech

Led the team for the Bangla Text to Speech project, which aimed to develop a Bangla TTS system using the open-source Festival engine developed by CSTR and the Festvox tools from the CMU speech group. The first version was publicly released on 19 February 2009 and is available for download. The project was named the most innovative project at BASIS Softexpo 2010 (http://www.softexpo.com.bd/about.php) and received a special award in the National E-Content and ICT4D Award 2010 (http://www.eaward.org.bd/). Developed the first audio version of the Bangla newspaper “Prothom-alo”. See news on The Daily Star.

This project involves the development of the following components:

Phoneme Inventory: Conducted acoustic analysis to identify the total number of phonemes in the Bangla language. A small speech database was developed for this purpose.
Text Normalization: Developed two text normalization tools using rule-based systems in Java and Scheme.
Letter to Sound (LTS): Developed an LTS system to handle unknown words and to create a pronunciation lexicon.
Pronunciation Lexicon: Developed both manual and automatic LTS-based lexicons, containing 92K entries.
Intonation Modeling: Conducted initial work on labeling a speech corpus for intonation modeling.
Diphone Database for TTS: Created a diphone database with 4355 diphones, including the design of nonsense sentences, professional recording, and labeling.
Speech Corpus: Developed a speech corpus for TTS, potentially applicable in ASR, with approximately 100K words, 18K unique words, and 10K sentences. Professional recording and sentence-level labeling were completed. http://sourceforge.net/projects/blp/files/Speech_Corpora/
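
To illustrate the kind of rule-based text normalization listed among the components above, here is a minimal Python sketch. It is not the project's Java or Scheme implementation, whose rules are not described here. The digit-by-digit reading rule, the `BANGLA_DIGIT_WORDS` table, and the function names are assumptions made for illustration only.

```python
import re

# Hypothetical lookup table: one spoken word per Bangla digit.
# The project's real normalization rules are not given in the source.
BANGLA_DIGIT_WORDS = {
    "০": "শূন্য", "১": "এক", "২": "দুই", "৩": "তিন", "৪": "চার",
    "৫": "পাঁচ", "৬": "ছয়", "৭": "সাত", "৮": "আট", "৯": "নয়",
}

# Matches any run of Bangla digits (U+09E6 to U+09EF).
DIGIT_RUN = re.compile("[০-৯]+")


def spell_digits(match):
    """Replace a run of digits with its digit-by-digit reading."""
    return " ".join(BANGLA_DIGIT_WORDS[d] for d in match.group(0))


def normalize(text):
    """Apply the digit rule, then collapse repeated whitespace."""
    text = DIGIT_RUN.sub(spell_digits, text)
    return " ".join(text.split())
```

A production normalizer would need many more rules (cardinal numbers, dates, currency, abbreviations). Reading digits one at a time is only the simplest possible choice and suits things like phone numbers rather than quantities.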

Related publications: See here

ASR

Speech Recognition System for Bangla: Contributed to the development of a domain-dependent ASR prototype for an agro-based information system using the Sphinx framework.

Others

CRBLP Converter: Developed software to convert Bangla documents in various TTF-based font encodings to Unicode. Contributed to font re-engineering.
Corpus Analysis & Corpus Collection: Developed a tool for extensive corpus analysis on word frequency distribution and contributed to corpus collection and tool development.
Localized URL: Designed and developed local URL data, including domain names, character sets, gTLD, and ccTLD, for the Bangla language.
Terminology System: Contributed to the development of a tool for searching, modifying, and adding new terminology.
CLDR: Contributed to the CLDR project from 2008 to 2010, involving data submission and moderation.
Microsoft Windows Vista & Office-2007 Localization Project: Participated in translating Windows Vista and Microsoft Office 2007 as part of the Microsoft localization project.
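
The word frequency analysis mentioned under Corpus Analysis above can be sketched in a few lines of Python. This is a hypothetical illustration, not the CRBLP tool itself. Whitespace tokenization and the function names are assumptions, and real Bangla text would also need punctuation handling.

```python
from collections import Counter


def word_frequencies(text):
    """Count how often each whitespace-separated token occurs."""
    return Counter(text.split())


def frequency_of_frequencies(counts):
    """Count how many distinct words occur once, twice, and so on."""
    return Counter(counts.values())


if __name__ == "__main__":
    sample = "আমি ভাত খাই আমি"
    counts = word_frequencies(sample)
    for word, n in counts.most_common():
        print(word, n)
    print(dict(frequency_of_frequencies(counts)))
```

The frequency-of-frequencies table gives a quick view of how many words are rare in a corpus, which is usually the first thing inspected in a word frequency distribution.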
