There has been a meteoric rise in the amount of multilingual content on the web. This is primarily due to social media sites such as Facebook and Twitter, as well as blogs, discussion forums, and reader responses to articles on traditional news sites. Language usage statistics indicate that Chinese is a very close second to English and could overtake it to become the dominant language on the web. Languages such as Arabic are also showing explosive growth. The availability of this content warrants a discussion of how such information can be effectively utilized. Such data can be mined for many purposes, including business-related competitive insight, e-commerce, and citizen response to current issues. This talk will begin with motivations for multilingual text mining, including commercial and societal applications, digital humanities applications such as semi-automated curation of online discussion forums, and, lastly, government applications, where the value proposition (benefits, costs and value) is different but equally compelling.
Several issues will be touched upon, beginning with the need to process native language rather than machine-translated text. In tasks such as sentiment or behaviour analysis, it can certainly be argued that a lot is lost in translation, since these tasks depend on subtle nuances in language usage. On the other hand, processing native language is challenging, since it requires a multitude of linguistic resources such as lexicons, grammars, translation dictionaries, and annotated data. This is especially true for “resource-poor languages” such as Urdu and Somali, which are spoken in parts of the world that currently receive considerable attention. The availability of content such as multilingual Wikipedia provides an opportunity to automatically generate the needed resources and to explore alternative techniques for language processing.
The rise of multilingual social media also leads to interesting developments such as code mixing and code switching, giving birth to “new” languages such as Hinglish, Urdish and Spanglish! These phenomena have both pros and cons, and they pose difficult challenges for automatic natural language processing. But there is also an opportunity to use crowd-sourcing to preserve languages and dialects that are gradually becoming extinct. It is worthwhile to explore frameworks for facilitating such efforts, which are currently very ad hoc. In summary, the availability of multilingual data provides new opportunities in a variety of applications, and effective mining could lead to better cross-cultural communication.
(i) Motivation for mining multilingual text.
(ii) The need for processing native language (vs. machine translated text).
(iii) Multilingual Social Media: challenges and opportunities, e.g., preserving languages and dialects.
Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP), 2011, Chiang Mai, Thailand, available at http://www.ijcnlp2011.org/proceeding/IJCNLP2011MMAIN/index.html.
S. Mukund, R. K. Srihari and E. Peterson, "An Information Extraction System for Urdu - A Resource-Poor Language," ACM Trans. Asian Lang. Inf. Process. 9(4): 15 (2010)
Proceedings of the 5th International Conference on Weblogs and Social Media (ICWSM 2011), Barcelona, Spain, available at http://www.aaai.org/Library/ICWSM/icwsm11contents.php