On today’s World Wide Web, people expect to read websites in their own language. Many Israeli websites now appear in Hebrew for Israelis, in English for the international community, and in other languages, such as Russian or Arabic. To make a website work properly in many languages, one must take care of a few issues. This article deals with the general issues of making a website multilingual, and next week’s article will discuss the specific issues of left-to-right vs. right-to-left text.
Character Sets
Consistent use of character sets will make sure that the local-language letters are displayed correctly. Inconsistent use of character sets will cause the browser to display question marks or gibberish instead.
One character set for displaying Hebrew and English characters is ISO 8859-8. This standard is the same as ASCII for the first 128 characters, and uses some of the remaining 128 characters for Hebrew. Most older Hebrew/English websites use this standard, and it is still an official ISO standard.
Another option is UTF-8, an encoding of Unicode. It uses one byte for ASCII characters (including the unaccented Latin letters), two bytes for characters such as Hebrew, Russian, or Arabic letters, and up to four bytes for other characters. This is the recommended character set because it allows one to have a single database that stores data for most common languages. To save a file as UTF-8, one must set the encoding in the editor. Some editors also add a Byte Order Mark (BOM) at the start of the file so that other editors will identify it as UTF-8. The BOM is optional, and PHP files are better saved without it, because the BOM bytes are sent to the browser as output before the script can set any header.
For the browser to identify the character set, it should be declared in the HTTP Content-Type header. In PHP, this can be done by adding the following line at the beginning of the page, before any output is sent:
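Assuming the page is served as UTF-8, a typical form of that line is the sketch below; for a page in ISO 8859-8 the charset value would change accordingly.

```php
<?php
// Must run before any output (including whitespace or a BOM) is sent.
header('Content-Type: text/html; charset=utf-8');
```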
Alternatively, one may use the following line within the head section of the HTML:
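Again assuming UTF-8, the usual form of this tag is shown below. HTML5 also accepts the shorter `<meta charset="utf-8">`.

```html
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
```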
Pages loaded by AJAX must also have a character set header to make sure the section dynamically produced is rendered correctly.
Additionally, it is important to make sure the database and administration software are also set up for the correct character set, and any third party code generator needs to be configured for the character set as well.
Dictionary for texts
When the same site is rendered in several languages, it is important that every text display appears in the selected language. This includes regular texts, buttons, popup messages, emails, and everything else. This requires that every text display retrieve the text from a data dictionary. There are a few possible ways of storing a text dictionary.
· In the database
The advantage is that the data dictionary is completely separated from the code, so it is easier to have non-programmers manage the text dictionary. The disadvantage is that database data is not generally treated as static, and a database administrator could accidentally erase the dictionary when erasing the test data.
· A formatted text file
This takes some extra work in setting up a format and developing functionality to read the file.
· A PHP file containing define statements
If one is doing it this way, it is advisable to have a naming convention to make it clear what is a language-dependent constant. Some frameworks, such as Smarty, use this method.
· A PHP file containing an array of definitions
The file would look something like this:
<?php
$dictionary_array = array
(
    'definition1' => 'text1',
    'definition2' => 'text2',
    // …
);
?>
Every text display would then need to check for the value in the array.
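The lookup itself can be wrapped in a small helper. The following is a minimal sketch, assuming PHP 7 or later; the function name `translate` and the sample keys are illustrative and not part of any library.

```php
<?php
// Hypothetical dictionary for the current language, in the array format above.
$dictionary_array = array(
    'greeting'      => 'שלום',
    'submit_button' => 'שלח',
);

/**
 * Return the text stored under $key. When the key is missing, return the
 * key itself, so the gap is visible on the page instead of causing an error.
 */
function translate(array $dictionary, string $key): string
{
    return array_key_exists($key, $dictionary) ? $dictionary[$key] : $key;
}

echo translate($dictionary_array, 'greeting'), "\n";      // the Hebrew text
echo translate($dictionary_array, 'missing_entry'), "\n"; // falls back to the key
```

Falling back to the key is one possible design choice; falling back to the default-language dictionary is another.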
Database data in multiple languages
There are two possibilities in making the data in the database work with multiple languages.
· Extra fields in tables with multilingual data
For example, if there is a field “city” which needs to be rendered in both Hebrew and English, the table would have a field “city_he” and another field “city_en”. This solution works well if the site has only two or three languages and there is no need to add more. Managing all the data within one table with a fixed number of fields is much easier. However, adding more languages requires modifying the database structure, which means the site can never be left entirely to non-DBAs to manage.
· An additional table for each table that has multilingual data
For example, if the table “cities” has the field “latitude” which is not language dependent and the field “name” which is different for each language, there would be 2 tables – “cities” and “cities_lang”. “cities” would contain the fields “id” and “latitude”, and “cities_lang” would contain the fields “id”, “city_id”, “lang”, and “name”. If the site is displayed in 5 languages, for each record in “cities”, there would be 5 records in “cities_lang”.
This makes queries more complicated, as an additional join is required, and it also makes more work each time one adds a record to “cities”. The advantage is that, when one adds more languages, there is no need to change the database structure or queries.
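The “cities” and “cities_lang” layout and its extra join might look roughly like the sketch below. It assumes PDO with the SQLite driver, using an in-memory database only to keep the example self-contained; the function name `cities_in_language` and the sample rows are illustrative.

```php
<?php
// In-memory SQLite keeps the sketch self-contained; a real site would
// connect to its own database instead.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Language-independent data lives in "cities".
$db->exec('CREATE TABLE cities (id INTEGER PRIMARY KEY, latitude REAL)');
// One row per city per language lives in "cities_lang".
$db->exec('CREATE TABLE cities_lang (
    id      INTEGER PRIMARY KEY,
    city_id INTEGER NOT NULL REFERENCES cities(id),
    lang    TEXT NOT NULL,
    name    TEXT NOT NULL,
    UNIQUE (city_id, lang)
)');

$db->exec("INSERT INTO cities (id, latitude) VALUES (1, 31.78)");
$db->exec("INSERT INTO cities_lang (city_id, lang, name) VALUES (1, 'en', 'Jerusalem')");
$db->exec("INSERT INTO cities_lang (city_id, lang, name) VALUES (1, 'he', 'ירושלים')");

/** Fetch every city together with its name in the requested language. */
function cities_in_language(PDO $db, string $lang): array
{
    $stmt = $db->prepare(
        'SELECT c.id, c.latitude, l.name
           FROM cities c
           JOIN cities_lang l ON l.city_id = c.id
          WHERE l.lang = :lang
          ORDER BY c.id'
    );
    $stmt->execute(array(':lang' => $lang));
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}

$rows = cities_in_language($db, 'he');
echo $rows[0]['name'], "\n";
```

Adding a sixth language then means inserting new rows into “cities_lang”; neither the tables nor the query change.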
Defining current language
Another issue in multilingual websites is defining the current language. Here are some possibilities to determine in what language to display the content.
· Selected language
If the user selects a language for viewing the website, the site should be displayed in that language. To keep the selection as the user moves between pages, the programmer should store the current language either in the session or in a cookie.
· The language of the browser
Usually, the settings of the browser contain one or more preferred languages, which the browser sends in the Accept-Language header. Below is some PHP code to detect whether one of the browser’s languages is a language that the website may be displayed in. This code uses the two-letter language codes of the ISO 639-1 standard.
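(Adapted from sample code on php.net)
<?php
// language_exists() is the site's own check for a supported language.
if (isset($_SERVER['HTTP_ACCEPT_LANGUAGE']))
{
    // Entries are comma-separated, e.g. "he,en-US;q=0.8,en;q=0.6".
    // Browsers normally list them in order of preference.
    $lang_list = explode(',', $_SERVER['HTTP_ACCEPT_LANGUAGE']);
    foreach ($lang_list as $lang)
    {
        $lang_parts = explode(';', $lang); // drop the ";q=..." weight
        $lang_parts = explode('-', strtolower(trim($lang_parts[0]))); // treat en-us as "en"
        if (language_exists($lang_parts[0]))
        {
            $selected_language = $lang_parts[0];
            break;
        }
    }
}
?>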
· Default language
When all else fails, there needs to be a default language to display the content in. If there is a large international audience, it will usually be English. If the site is intended for use within Israel only, it will usually be Hebrew.
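The three sources above can be combined into a single lookup that tries them in order. This is a minimal sketch, assuming PHP 7.1 or later; the function name `resolve_language`, the `lang` request parameter, and the `lang` session key are illustrative choices, not an established convention.

```php
<?php
/**
 * Pick the display language: the user's explicit choice first, then the
 * browser's Accept-Language header, then the site default.
 */
function resolve_language(array $supported, string $default, ?string $chosen, ?string $accept_header): string
{
    if ($chosen !== null && in_array($chosen, $supported, true)) {
        return $chosen;
    }
    if ($accept_header !== null) {
        foreach (explode(',', $accept_header) as $entry) {
            // "en-US;q=0.8" -> "en-us" -> "en"
            $tag = strtolower(trim(explode(';', $entry)[0]));
            $primary = explode('-', $tag)[0];
            if (in_array($primary, $supported, true)) {
                return $primary;
            }
        }
    }
    return $default;
}

// Typical wiring in a page; assumes session_start() was called earlier.
$supported = array('he', 'en', 'ru');
$chosen = $_GET['lang'] ?? $_SESSION['lang'] ?? null;
$lang = resolve_language($supported, 'en', $chosen, $_SERVER['HTTP_ACCEPT_LANGUAGE'] ?? null);
$_SESSION['lang'] = $lang; // remember the result for the following pages
```

Checking the chosen value against the supported list also keeps an arbitrary request parameter from being stored as the current language.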
Conclusion
We see here that, while it is much more work having one dynamic website displayed in many languages, if one follows these steps, one can have a website rendered in an unlimited number of languages.