Splitting Thai words with Browser APIs

One of the difficulties for students learning the Thai language is the lack of spaces between words. For example, here is a simple sentence in Thai:


After some time, students start to recognize the shapes of the words and it becomes effortless to read. Before that time, however, it’s quite a struggle!

I found a neat trick yesterday for splitting Thai sentences into words with JavaScript in Chrome. There’s no native API for understanding Thai text, but it’s possible to piggyback on top of the browser’s text selection APIs.

When a word is double-clicked, the browser selects that word, and this selection is language-aware! This functionality is exposed in the non-standard Selection.modify API. While this API is present in every major browser, I’ve found it only works on Thai text in Chrome.

Selection.modify is a bit like old-school "turtle" graphics, where a pen is given a direction and a distance and draws a line in that direction for that distance. In our case, we’re "drawing" the selection of text. The API looks like this:

sel.modify(alter, direction, granularity)

For example, sel.modify("extend", "forward", "word") would extend the current selection forward by one word. By measuring the selection range after each call, we can get the indices of each word.
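The slicing step can be sketched on its own, separate from the browser. This is only an illustration: splitAtBoundaries and its nextBoundary callback are hypothetical names, not part of any browser API, and the offsets below are made up so the snippet runs anywhere.

```javascript
// Hypothetical helper (not a browser API): slice `text` into words using a
// callback that returns the offset where the next word ends.
function splitAtBoundaries(text, nextBoundary) {
  const words = [];
  let start = 0;
  while (start < text.length) {
    const end = nextBoundary(start);
    // Stop if the boundary source fails to advance, to avoid an infinite loop.
    if (!(end > start)) break;
    words.push(text.slice(start, end));
    start = end;
  }
  return words;
}

// Stand-in boundary source with made-up offsets, so the sketch runs
// outside a browser.
const sample = "one two three";
const offsets = [4, 8, 13];
let i = 0;
const words = splitAtBoundaries(sample, () => offsets[i++]);
console.log(words); // [ 'one ', 'two ', 'three' ]
```

In the browser, the callback would presumably call sel.modify("extend", "forward", "word") and return sel.focusOffset, assuming the selection starts collapsed at offset 0 and the text lives in a single text node.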

The actual code ends up being pretty short:

<div class="input">สวัสดีครับกินข้าวหรือยัง</div>
<button>↓ split ↓</button>
<div class="output"></div>
<script>
    document.querySelector('button').addEventListener('click', () => {
        const input = document.querySelector('.input');
        const output = document.querySelector('.output');
        const text = input.textContent;
        const sel = window.getSelection(); // our selection API

        // clear any previous output and set the selection range to [0, 0]
        output.textContent = '';
        sel.collapse(input, 0);
        let start = 0;
        let end = 0;

        // instruct the browser to select each word, then read the
        // selection and output it.
        while (end < text.length) {
            sel.modify('extend', 'forward', 'word');
            end = sel.focusOffset;

            // stop if the selection did not advance, to avoid looping forever
            if (end <= start) break;

            output.textContent += text.substring(start, end) + '  ';
            start = end;
        }
    }, false);
</script>

And here is the result:


I think this is a pretty neat trick!
