Posting in Khmer, Part 2
In my previous post, I wrote down the lyrics of a song in Khmer. Before I could post it on this weblog, I had to resolve some technical issues, such as fonts. Some of them were resolved with the help of some of you out there. Thank you all for that.
For the last few days, I have been struggling with the following: a text written in Khmer appears as long sentences without word separators. In a Western language, we put ‘whitespace’ characters between the words. ‘Whitespace’ is the technical term for blanks, tabs, newlines, etc. At first sight, written Khmer does not do this. How would search engines like Google handle this? The basic elements a search engine works with are words (also called terms). Do they break these long sentences into words? Are there any rules one could use to develop a piece of software that does this?
I found a Java program (khwrdbrk.jar) that can do the job, or at least tries to. If you give this program a file with Unicode-encoded Khmer text, you get a Unicode output file. But when I opened the output file, the text looked exactly like the original text. Yet the output file was bigger than the original?! The reason is that this program does indeed insert a word break character between the words. This word break character is called the ZWSP (zero width space, U+200B). It is not visible: it has zero width! I tested the program on one of my texts. The result is not 100% correct: some word boundaries were not found.
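To make the "same look, bigger file" effect concrete, here is a small Python sketch. It is not how khwrdbrk.jar works internally (I do not know that); the word list is supplied by hand, and the helper names join_with_zwsp and split_on_zwsp are made up for this illustration. It only shows that a ZWSP between two words adds bytes without adding anything visible, and that the words can be recovered by splitting on it.

```python
# Illustration only: the word boundaries are supplied by hand,
# they are NOT found automatically as khwrdbrk.jar tries to do.
ZWSP = "\u200b"  # ZERO WIDTH SPACE, the invisible word break character


def join_with_zwsp(words):
    """Glue words together with a ZWSP between each pair (hypothetical helper)."""
    return ZWSP.join(words)


def split_on_zwsp(text):
    """Recover the words by splitting on ZWSP, dropping empty pieces."""
    return [piece for piece in text.split(ZWSP) if piece]


# The two words from the word play discussed in this post.
words = ["ឃើញ", "ដោះ"]

plain = "".join(words)          # written the usual way, no separator
marked = join_with_zwsp(words)  # same text with an invisible separator

# Both strings look identical on screen, but the second one is bigger.
size_plain = len(plain.encode("utf-8"))
size_marked = len(marked.encode("utf-8"))

print(plain, size_plain, "bytes")
print(marked, size_marked, "bytes")
print(split_on_zwsp(marked))
```

In UTF-8, each ZWSP takes three bytes, so every word boundary that gets marked makes the file a little bigger.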
I was so proud of my previous post, but now I have to admit that it is NOT what it should be. Time for a second try!
The lyrics from the previous post are from a song sung by a boy. That song is based on an older song, sung by a girl and made popular by the singer Oeun Sreymom. Her song is in turn the Khmer version of an even older Khmer Surin song.
The song is about a girl who saved some money and sold three chickens (without telling her mother), just to buy herself a new shirt (shirt: អាវ), which she (and the boys) like so much that she does not want to take it off (undress: ដោះ). It is this last word, which can also mean breast (breast: ដោះ), that led others to put new lyrics to the same music. This introduces a classic play on words based on the double meaning of “ឃើញដោះ”. These two words can be interpreted as “wearing a shirt that nobody has seen her take off” or “wearing a shirt that never shows her boobs”.
The text of the song, now with the necessary ZWSPs, goes as follows:
អាវថ្មីមិនខ្ចីដោះ
មានអាវមួយសន្សំលុយយូរខៃ
ស្រលាញ់ម្លេះទេ អាវថ្មីចេញម៉ូតស្រស់
ពាក់អោយកេដឹង ថាមិនដែលឃើញដោះ
ទៅនេះមកនោះ ពាក់តែអាវមួយហ្នឹង
អាវជិតកល្អត្រូវចិត្តស្រី
លក់មេមាន់បី មិនអោយម៉ែគត់ដឹង
ពាក់ដើររាល់ថ្ងៃ ប្រុសលួចសម្លឹង
ខ្លះស្ទើរភ្លឹកព្រលឹង សរសើរស្រលាញ់ខ្ញុំ
ពាក់អាវមិនដែលឃើញដោះ ពាក់អាវមិនដែលឃើញដោះ
គិតអីឬវាយ៉ាងណា យ៉ាងម៉េចស្រីង៉ា បងមិនដែលឃើញដោះ
អាវថ្មីខ្ញុំមិនខ្ចីដោះ អាវថ្មីខ្ញុំមិនខ្ចីដោះ
កំលោះនាំគ្នាចោមរោម បើសរសើរខ្ញុំ ខ្ញុំរិតតែលែងដោះ
រូបរាងស្រីទាំងសម្ដីវាចា
ឬកពាចរិយា អាវថ្មីឆើតស្រស់
មានអាវថ្មីមួយ ពាក់មិនខ្ចីដោះ
ប្រុសណាស្រណោះ ចូលដល់យាយតា
រូបបងប្រុសមិនយល់សោះចិត្តស្រី
ស្រុកសីវីល័យ ស្រីតែងខ្លូនសង្ហា
ពាក់មិនឃើញដោះ អាវនោះយ៉ាងណា
បើប្រុសសង្ហា ស្វែងយល់ខ្លូនឯង
ពាក់អាវមិនដែលឃើញដោះ ពាក់អាវមិនដែលឃើញដោះ
កើតអីឬវាយ៉ាងណា យ៉ាងម៉េចស្រីង៉ា បងមិនដែលឃើញដោះ
អាវថ្មីខ្ញុំមិនខ្ចីដោះ អាវថ្មីខ្ញុំមិនខ្ចីដោះ
កំលោះនាំគ្នាចោមរោម បើសរសើរខ្ញុំ ខ្ញុំរិតតែលែងដោះ
I checked it carefully: all ZWSPs are there! I will keep an eye on Google to see whether a search for one or more of the above words will hit this weblog.
I will correct the previous post asap.
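Because a ZWSP is invisible, checking by eye is not easy. Here is a rough Python sketch of a helper for that. The function names zwsp_report and lines_without_zwsp, and the two-line sample, are made up for this example. It can only tell whether a line contains any ZWSP at all and how many; it cannot tell whether they sit at the right word boundaries.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE


def zwsp_report(text):
    """Return a list of (line_number, zwsp_count) for every non-empty line."""
    report = []
    for number, line in enumerate(text.splitlines(), start=1):
        if line.strip():
            report.append((number, line.count(ZWSP)))
    return report


def lines_without_zwsp(text):
    """Return the numbers of the non-empty lines that contain no ZWSP at all."""
    return [number for number, count in zwsp_report(text) if count == 0]


# Hand-made sample: line 1 has a ZWSP between the two words, line 2 has none.
sample = "ឃើញ\u200bដោះ\nឃើញដោះ\n"

print(zwsp_report(sample))
print(lines_without_zwsp(sample))
```

A line reported without any ZWSP is a good candidate for a second look, although a line consisting of a single word would be reported too.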
April 11th, 2007 at 4:16 pm
I have corrected the previous post. All ZWSPs have been added to the Khmer part.
May 3rd, 2007 at 7:41 pm
Hey, are you Khmer, Khmer Belgian, or Belgian? Your Khmer language comprehension is excellent!
May 3rd, 2007 at 9:36 pm
I am a pure Belgian, born in Belgium, from Belgian parents…
My wife is from Cambodia. She left Srok Khmer in 1980, just after the war.
We have 3 children: Pisey (XX), Voleak (XY) and Bopha (XX) -or- Eline, Nicolas and Emily.
Every year since 2002, we go to Srok Khmer during the summer, when the kids are on holiday.
This year it will be the 7th time…
The postings are mine, with just some corrections by my wife!
(Guess what the XX and XY stand for…)
May 9th, 2007 at 6:49 pm
That’s cool!
OK, let me guess! XX is female, XY is PD (YY is male).