1. Project Background and Justification
Information technology is evolving into an effective tool for making information widely available online to communities at large. On the one hand, the increased use of ICT is enabling people across the globe to participate in the knowledge network; on the other hand, large populations in the rural areas of developing countries like Nepal are being deprived of the benefits of ICT. One of the main reasons for this seems to be the language barrier. For a person who knows neither spoken nor written English, working with computer systems that have English language interfaces is certainly a difficult task. Further, the fact that documentation, system manuals, help files etc. are available only in non-local languages (mostly English) makes it even more difficult for the rural population to start learning to use computers.
With efforts from some organizations in Nepal, local language computing is slowly gaining momentum. The International Development Research Centre (IDRC), Canada, through its Pan Asia Networking Program, and the National University of Computer and Emerging Sciences, Pakistan, through its Centre for Research in Urdu Language Processing (CRULP), have entered an initial three-year partnership with Madan Puraskar Pustakalaya, Kathmandu University and Tribhuvan University in Nepal to build capacity in regional institutions for local language computing. The scope of this work includes the localization of a Linux distribution (Nepalinux), OpenOffice and GnuCash (an open source accounting system). Similarly, in an effort to localize Microsoft Windows and Microsoft Office into Nepali, Microsoft Corp, USA and Unlimited Numedia Pvt Ltd, Nepal have signed an agreement to localize the Windows operating system and some of the applications from the Office 2003 desktop application suite.
These efforts, as they materialize, will hopefully establish a strong foundation for future language computing work in Nepal. These initiatives, however, have clearly left machine translation out of their scope. Technological support in the form of a machine translation system is of significant importance to Nepal. Such a translation system could be put to effective use in several sectors, such as education, administration, commerce and tourism.
As an initiative towards augmenting the ongoing activities in local language computing in Nepal, we propose to develop and implement a machine translation system for translating English language texts and web pages into Nepali. Named “DOBHASE”, meaning a translator in Nepali, this application will be an online utility for English to Nepali translation.
2. Objectives
2.1 General Objective
To bridge the existing digital divide by making information that is available in English accessible, in Nepali, to the larger non-English-speaking population of Nepal.
2.2 Specific Objectives
- To develop and implement an online machine translation system that can convert English text (including on-the-fly web page translation) into its corresponding Nepali representation.
- To study, design and implement the following in the Machine Translation System: Bilingual English-Nepali Dictionary, English Morphological Analyzer, English Parser, Nepali Morphological Analyzer, Nepali Generator, Transfer Rules, Web interface for the MT system, Integrated MT Engine.
3. Project Beneficiaries
- Potential computer users who do not know English (including those in the rural areas of Nepal where computers and the Internet are accessible through tele-centers) can use this MT system to translate English text into its corresponding Nepali representation.
- Users in administration, educational institutes, tourism etc. can use the MT system for English-Nepali translation, for example those in Village Development Committee (VDC) offices, Nepali-medium primary schools and the tourism industry.
- Organizations requiring translation of manuals and other documents into Nepali, for example development organizations and related NGOs/INGOs.
- The students and faculty members involved in the research and development will find a platform to explore and consolidate their knowledge of machine translation and local language computing at large. Further, this project could serve as a future reference as well as a platform for further research activities in language computing.
- The community at large will benefit from the translation system, as it will be available online as well as through a freely distributed software package.
4. Project Sustainability
The Language Processing Research Unit (LPRU) at the Department of Computer Science and Engineering, Kathmandu University will take up this project as one of its activities to boost research and development in the field of language technology and local language computing. The outcome of the project will be hosted online by LPRU for free access to all. Besides, further research and development based on the outcome of this project will be carried out through other activities at LPRU in the future.
5. Project Methodology
5.1 The Development Methodology
The MT system as a whole comprises two subsystems, the Interface and the Translation Engine, both of which will be hosted on the same machine. The web server receives the user’s request to translate English text through web pages. Server-side scripts (executed by the web server) will receive such requests and hand them to the main translation engine. The main translation engine will translate the English text into Nepali and send it back to the server-side scripts, which will ultimately send the output to the client/user machine.
Figure 5.1: “DOBHASE” Overall General Architecture
The user can also input the URL of an English web page that is to be translated into Nepali. In this case the web server fetches the requested English web page, extracts the English content and gives it to the Translation Engine. The Translation Engine translates it and gives it back to the web server. The web server then creates the web page in Nepali and sends it back to the user.
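The web page flow described above can be illustrated with a minimal Python sketch. It is not the project's implementation: the class name `PageTranslator`, the helper `translate_page` and the `translate` callback (a stand-in for the Translation Engine) are all hypothetical, and fetching the page from its URL is left out. The sketch only shows one way of separating visible English text from markup and rebuilding the page around the translated text.

```python
from html import escape
from html.parser import HTMLParser


class PageTranslator(HTMLParser):
    """Rebuild an HTML page, passing visible text through `translate`."""

    SKIP = {"script", "style"}  # contents of these tags are not translated

    def __init__(self, translate):
        super().__init__(convert_charrefs=True)
        self.translate = translate  # hypothetical stand-in for the MT engine
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        self.out.append(self.get_starttag_text())  # keep the tag verbatim

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # e.g. <br/>

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth or not data.strip():
            self.out.append(data)  # code or whitespace: leave untouched
        else:
            # Text arrives unescaped, so re-escape it after translation.
            self.out.append(escape(self.translate(data), quote=False))


def translate_page(html_text, translate):
    """Return `html_text` with its visible text replaced by translations."""
    parser = PageTranslator(translate)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    page = "<p>Hello <b>world</b></p><script>var a = 'Hello';</script>"
    # str.upper stands in for the real English-to-Nepali engine.
    print(translate_page(page, str.upper))
```

Translating each text node separately is a simplification: a sentence split by inline tags such as <b> would reach the engine in fragments, so a real system would likely need to group text by sentence first.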
The English-Nepali Machine Translation Engine will use the transfer-based machine translation approach. In this approach, the English text is analyzed first and its parse tree is generated. English morphological rules will be used for the morphological analysis and an English lexical functional grammar for the syntax analysis. Then, transfer rules will be applied to the analyzed English text. This will include both syntactic transfer and lexical transfer. There will be several syntactic transfer rules. Lexical transfer will mainly be guided by the English-Nepali bilingual lexicon. This lexicon will contain stem words only, because there will be a separate English morphological analyzer and Nepali morphological generator. After the transfer rules have been applied, the Nepali text will be generated using generation rules and the lexicon. Nepali morphological rules will be applied for proper Nepali word formation.
Figure 5.2: Architecture of “DOBHASE” Translation Engine using Transfer based approach.
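The analysis, transfer and generation stages described above can be illustrated with a toy Python sketch. Everything in it is an assumption made for illustration: the five-entry lexicon, the function names, the fixed subject-verb-object pattern that stands in for the lexical functional grammar parser, and the single present-tense suffix rule, which covers only the two verbs in the toy lexicon and is not a general rule of Nepali morphology.

```python
# Toy stem-only bilingual lexicon: English stem -> (category, Nepali stem).
# Illustrative entries only, not taken from the project's dictionary.
LEXICON = {
    "boy": ("N", "केटो"),
    "rice": ("N", "भात"),
    "book": ("N", "किताब"),
    "eat": ("V", "खा"),
    "read": ("V", "पढ्"),
}
ARTICLES = {"the", "a", "an"}  # dropped in this toy transfer


def analyze(word):
    """English morphological analysis: return (stem, features)."""
    w = word.lower()
    if w in LEXICON:
        return w, {}
    stem = w[:-1]
    if w.endswith("s") and LEXICON.get(stem, ("", ""))[0] == "V":
        return stem, {"person": 3, "number": "sg", "tense": "present"}
    raise KeyError(f"unknown word: {word}")


def parse(sentence):
    """Stand-in parser: accepts only subject-verb-object sentences."""
    content = [t for t in sentence.split() if t.lower() not in ARTICLES]
    if len(content) != 3:
        raise ValueError("toy parser handles only 'N V N' sentences")
    subj, verb, obj = (analyze(t) for t in content)
    return {"subj": subj, "verb": verb, "obj": obj}


def transfer(tree):
    """Syntactic transfer (SVO -> SOV) followed by lexical transfer."""
    ordered = [tree["subj"], tree["obj"], tree["verb"]]
    return [(LEXICON[stem][1], LEXICON[stem][0], feats)
            for stem, feats in ordered]


def generate(units):
    """Nepali generation with one toy present-tense suffix rule."""
    words = []
    for stem, category, feats in units:
        if category == "V" and feats.get("tense") == "present":
            # Toy rule: stems ending in a halant take one suffix form,
            # other stems take another. Valid for this lexicon only.
            stem += "छ" if stem.endswith("्") else "न्छ"
        words.append(stem)
    return " ".join(words)


def translate(sentence):
    return generate(transfer(parse(sentence)))


if __name__ == "__main__":
    print(translate("The boy eats rice"))
    print(translate("The boy reads a book"))
```

Because the lexicon holds stems only, inflection is removed on the English side by `analyze` and added back on the Nepali side by `generate`, which mirrors the division of labour between the morphological analyzer and the morphological generator.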
5.2 Working Methodology
The project will be located at the Language Processing Research Unit (LPRU) premises at Kathmandu University, Dhulikhel and at Madan Puraskar Pustakalaya (MPP), Lalitpur, Nepal. Most of the development work will be carried out at LPRU, whereas MPP will provide the linguistic and other expertise required. As far as human resources are concerned, LPRU will provide the System Analyst, one Senior Software Developer, two Junior Software Developers and the typist, and MPP will provide the Project Manager, Senior Linguist, Junior Linguist and one Senior Software Developer.
5.3 Budget
The project budget consists mainly of the remuneration (research allowances) for the personnel involved. The applicant and its partner have made a conscious effort to minimize the budget overheads and will bear the administrative and other miscellaneous expenses required for the project. The proposed budget for the applicant includes the remuneration for the System Analyst, one Senior Software Developer, two Junior Software Developers and one typist, and the cost of two personal computers and one printer. Similarly, the proposed budget for the partner comprises the remuneration for the Project Manager, Senior Linguist, Junior Linguist and Senior Software Developer, and the cost of one personal computer.
6. Project Time-line (Activity Based)
7. Project Outputs
The output of the project will be:
- Bilingual English-Nepali Dictionary
- English Morphological Analyzer
- English Parser
- Nepali Morphological Analyzer
- Nepali Generator
- Transfer Rules
- Web interface
- Integrated Machine Translation Engine
8. Project Monitoring
A sample text will be run against each prototype developed at the end of each iteration in order to evaluate it. The evaluations will be made from different perspectives, and their results will help to monitor and improve the project. The following evaluation metrics are to be used:
8.1 Quality Assessment
- Fidelity Test: This measures the extent to which the translated text contains the same information as the original.
- Clarity Test: This measures the ease with which a reader can understand the translation.
- Style Test: This measures the extent to which the translation uses language appropriate to its content and intention.
8.2 Error Analysis
During the evaluation of MT systems, the most useful practical information is in most instances obtained from error counting. It is an index of the amount of work required to correct ‘raw’ MT output to a standard considered acceptable as a translation. In a typical case, the reviser can count each addition or deletion of a word, each substitution of one word by another and each transposition of words in phrases, and calculate the percentage of corrected words (errors) in the whole text.
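The error counting described above can be approximated automatically once a revised translation is available. The Python sketch below is one possible reading, not a procedure prescribed by this proposal: it computes a word-level edit distance that counts additions, deletions, substitutions and transpositions of adjacent words (the optimal string alignment variant), and it assumes that the percentage is taken relative to the number of words in the raw MT output. The function names are hypothetical.

```python
def count_corrections(raw, revised):
    """Minimum number of word-level corrections turning `raw` into `revised`.

    Counted operations: addition, deletion, substitution, and transposition
    of two adjacent words (optimal string alignment distance).
    """
    a, b = raw.split(), revised.split()
    # d[i][j] = corrections needed to turn a[:i] into b[:j]
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i  # delete every remaining raw word
    for j in range(len(b) + 1):
        d[0][j] = j  # add every remaining revised word
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # addition
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]


def error_percentage(raw, revised):
    """Corrections as a percentage of the words in the raw MT output.

    The choice of denominator is an assumption of this sketch.
    """
    n_words = len(raw.split())
    if n_words == 0:
        return 0.0
    return 100.0 * count_corrections(raw, revised) / n_words


if __name__ == "__main__":
    raw_output = "he rice eats quickly"
    revised_output = "he eats rice"
    print(count_corrections(raw_output, revised_output))
    print(error_percentage(raw_output, revised_output))
```

A human reviser may group corrections differently from the minimal edit sequence, so such a count is best read as a lower bound on the revision effort.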
Some lexical errors are easily resolved by simple changes to the dictionaries, while others may have implications for grammatical rules and for a whole range of vocabulary items.