Writing a Cantonese input method add-on for fcitx5

@lokxii.bsky.social

Yet another article about me writing a tool I need for myself. This time I am writing a Cantonese input method add-on for fcitx5.

Why

About 6 months ago I bought an M2 Pro MacBook Pro and decided to install Asahi Linux on it. I spent a few days setting things up, replacing the default KDE desktop with Sway. I also wrote scripts[1] and built tools to speed up my daily workflow.

See, as you have probably noticed from the title, I am not from a European country, nor a citizen of the great United States. I needed to type Chinese (and Japanese, because I found myself having fun in a Japanese community). On Linux there are 2 main input method frameworks, ibus and fcitx. Both are preinstalled on the version of Asahi Linux I am using. For reasons I have already forgotten, probably no reason at all, I chose fcitx[2].

For inputting Japanese, I chose mozc because someone recommended it on Reddit. For Chinese, I chose rime because someone recommended it on Reddit. I AM NOT A REDDIT USER. The Reddit pages were just what Google gave me.

6 months passed. I was pretty satisfied with mozc, but not with rime. There are multiple ways to type Chinese. I don't know Congkit because I never really spent the time to learn it, and I could simply never get used to Jyutping. Cantonese romanization is very interesting in the sense that, unlike Mandarin, there are several romanization systems and no official standard. There are systems like Jyutping, Yale and Cantonese Pinyin. There are also unofficial systems created and used by students. I have been using this one website since primary school, and I have no idea what romanization system it uses. I want it on my system. I guess it is pretty obvious by now that I am porting this website to fcitx5. Surely it won't take long.

Code is documentation

A quick Google search led me to this article as a starting point for writing an fcitx5 add-on - it turns out this is the only document I could find about developing an fcitx5 add-on.

Following convention, the code examples in the article didn't compile. However, the article linked to a sample add-on that is simple enough for me to use as the basis. That doesn't mean I understood what I just copy-pasted.

Fcitx5 takes an object-oriented approach with the infamous factory pattern - wait, what is a factory? You see, my past programming experience is C[3], C++[4], Haskell[5], and Rust[6]. I only do procedural and functional programming, but I guess with my statistically average IQ I can work something out.

I broke down the project into various stages.

  1. Remove all the input method logic from the sample project
  2. Write my own input method engine logic
  3. Integrate IME logic into the add-on skeleton

I spent a whole day adding printfs to the sample logic. It turns out the add-on logic is pretty simple. You create an engine object that receives key events, pass them on to a state object attached to the input context, and finally, on each key press, generate a list of candidates and update the UI (I sketch this below). See? Code is documentation. I kid you not, the 335-line main file is way more informative than the chunk of text in the tutorial article. The problem is: HOW TF ARE YOU ALLOWED TO INHERIT FROM MULTIPLE PARENTS? Even Java doesn't let you do that.

class QuweiCandidateList : public fcitx::CandidateList,
                           public fcitx::PageableCandidateList,
                           public fcitx::CursorMovableCandidateList {

I never do OOP, but I could imagine the generated graph of class inheritance would be a total clusterfuck if we allow this kind of shit to happen everywhere.
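To make that flow concrete, here is roughly the shape of an engine add-on, based on my reading of the quwei sample. The fcitx types and the FCITX_ADDON_FACTORY macro follow that sample; everything named My* and the buffer handling are made up, so treat this as a sketch rather than the authoritative API.

#include <string>

#include <fcitx/addonfactory.h>
#include <fcitx/addonmanager.h>
#include <fcitx/inputcontext.h>
#include <fcitx/inputcontextproperty.h>
#include <fcitx/inputmethodengine.h>
#include <fcitx/instance.h>

// Per-input-context state: the romanization typed so far.
class MyState : public fcitx::InputContextProperty {
public:
    std::string buffer;
};

class MyEngine : public fcitx::InputMethodEngineV2 {
public:
    MyEngine(fcitx::Instance* instance)
        : instance_(instance),
          factory_([](fcitx::InputContext&) { return new MyState(); }) {
        // Attach a MyState to every input context.
        instance_->inputContextManager().registerProperty("myState", &factory_);
    }

    // Called on every key press while the engine is active.
    void keyEvent(
        const fcitx::InputMethodEntry&, fcitx::KeyEvent& event) override {
        auto* state = event.inputContext()->propertyFor(&factory_);
        // ...append the key to state->buffer, regenerate the candidate
        // list, push it to the input panel, then swallow the key:
        event.filterAndAccept();
    }

private:
    fcitx::Instance* instance_;
    fcitx::FactoryFor<MyState> factory_;
};

// The infamous factory: fcitx dlopens the add-on and asks this thing
// for an instance.
class MyEngineFactory : public fcitx::AddonFactory {
    fcitx::AddonInstance* create(fcitx::AddonManager* manager) override {
        return new MyEngine(manager->instance());
    }
};

FCITX_ADDON_FACTORY(MyEngineFactory);

The candidate list monstrosity quoted above is the part that actually renders the candidates; it is also where most of those 335 lines go.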

ANYWAYS

I guess the OOP inheritance mess isn't really a problem compared to the good old segmentation fault. Everyone knows array index arithmetic is difficult[7].

I spent 2 days finishing the skeleton of the fcitx5 add-on - what am I trying to build again?

The ACTUAL problem

How input method engines work is that you feed in an input string and the engine transforms it into a list of candidates in the output language. In this case, romanized Cantonese input is transformed into Chinese. The same pronunciation can correspond to many different Chinese characters, so we would like to input more words to form vocabulary, which limits the number of candidates.

More specifically, the engine does the following:

  1. Split the input string into a list of possibly incomplete pronunciations (people are lazy)
  2. Get a list of possible Chinese characters that corresponds to the first pronunciation
  3. For all the Chinese characters in the list, get all possible vocabularies
  4. Limit the candidate list by doing set intersection against possible Chinese characters of remaining words in the input string
  5. Order the candidates according to some criteria.

For example:

input: d
output: 2437 candidates

input: diu
output: 87 candidates

input: diun
words: diu n
output: 4 candidates

input: dnl
words: d n l
output: 5 candidates

input: dnlmo
words: d n l mo
output: 1 candidate

So step 1 is...

I forgot to mention step 0.

It's not a Directed Acyclic Graph. It's just a tree

Step 0 is reading the dictionaries. The original website fetches all codes and vocabularies on page load, so I kindly grabbed all the data I needed, did some processing, and turned it into usable CSV.

There are 743 pronunciations and 6186 Chinese words in the dictionary I am using. Since, as described above, I want to look up words with incomplete pronunciations, I immediately ruled out plain arrays and associative arrays for storing the dictionary. Imagine having to list all possible candidates with

dict |
    filter((key, words) => key.starts_with(splitted_input[0])) |
    map((_, words) => words) |
    flatten()
// You get the idea here

We will be doing a lot of lookups. Having O(n) time complexity may not be a good idea.

So instead, I am using a tree (obviously), where the keys are the pronunciations and elements are attached to nodes (not only leaves). The time complexity is probably still O(n), but we only traverse a significantly smaller subtree, which is way better than going through all 743 pronunciations[8].

(Figure: the pronunciation tree. Elements attached to nodes and keys are not shown.)

While I was hand-crafting the data structure, I thought it was a DAG (which sounds cool), so I named it DAGDict. But thinking about it at the time of writing this article, it is just a tree. Trees are DAGs, but it's just a tree.

For vocabularies, I just used a hash table with the key being the first character of the vocabulary.
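A minimal sketch of the whole idea - my own reconstruction, not the actual DAGDict source, so the node layout and the get_all signature are assumptions:

#include <map>
#include <string>
#include <string_view>
#include <vector>

// One node per character of the pronunciation. Words whose pronunciation
// ends here live on the node itself, not only on leaves.
struct Node {
    std::vector<std::string> elements;
    std::map<char, Node> children;
};

struct DAGDict { // it's just a tree
    Node root;

    void insert(std::string_view key, std::string word) {
        Node* n = &root;
        for (char c : key) {
            n = &n->children[c];
        }
        n->elements.push_back(std::move(word));
    }

    // Walk down to the node of the (possibly incomplete) pronunciation,
    // then collect every element in that subtree.
    std::vector<std::string> get_all(std::string_view prefix) const {
        const Node* n = &root;
        for (char c : prefix) {
            auto it = n->children.find(c);
            if (it == n->children.end()) {
                return {};
            }
            n = &it->second;
        }
        std::vector<std::string> out;
        collect(*n, out);
        return out;
    }

    static void collect(const Node& n, std::vector<std::string>& out) {
        out.insert(out.end(), n.elements.begin(), n.elements.end());
        for (const auto& [_, child] : n.children) {
            collect(child, out);
        }
    }
};

Looking up an incomplete pronunciation like d then only touches the d subtree instead of all 743 keys.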

Back to step 1

Step 1 is just parsing. I have done quite a bit of parsing before, so it isn't really that difficult. I have a set of possible initials that a word can start with. They act like keywords. The later part of the word (the final) is like an identifier.

The engine tries to find an initial. When it finds one, it greedily consumes the next characters, checking whether they still constitute a valid incomplete pronunciation. Sounds easy. Indeed, it is easy compared to parsing C headers, except I spent 2 hours getting it right, then realized I did not get it right at all once I actually used it. Tests don't exist anyways. There is only production.
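Roughly, the splitting step works like this - a sketch, not the real split_words: the initial set below is Jyutping-flavoured for illustration (the real romanization system differs), and is_valid_prefix stands in for a prefix check against the pronunciation tree.

#include <cstddef>
#include <set>
#include <string>
#include <string_view>
#include <vector>

// Illustrative initials only; the real romanization system differs.
static const std::set<std::string, std::less<>> initials = {
    "b", "c", "d", "f", "g", "gw", "h", "j", "k", "kw",
    "l", "m", "n", "ng", "p", "s", "t", "w", "z"};

// Hypothetical stand-in: in the real engine this asks the pronunciation
// tree whether any key starts with p.
bool is_valid_prefix(std::string_view p) {
    static const std::set<std::string, std::less<>> prons = {
        "da", "dong", "diu", "ngo", "lou", "mou"};
    for (const auto& k : prons) {
        if (std::string_view(k).substr(0, p.size()) == p) {
            return true;
        }
    }
    return false;
}

std::vector<std::string> split_words(std::string_view input) {
    std::vector<std::string> words;
    size_t i = 0;
    while (i < input.size()) {
        // Longest-match an initial (they act like keywords).
        size_t len =
            input.size() - i >= 2 && initials.count(input.substr(i, 2)) ? 2 : 1;
        std::string word(input.substr(i, len));
        i += len;
        // Greedily consume the final (identifier-like) while the result is
        // still a prefix of some valid pronunciation.
        while (i < input.size() && is_valid_prefix(word + input[i])) {
            word += input[i++];
        }
        words.push_back(std::move(word));
    }
    return words;
}

With the toy data above, split_words("dnlmo") gives ["d", "n", "l", "mo"], matching the example earlier.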

Rust-style programming in C++

I have been working on another project written in Rust for the past few months. Some people praise Rust for its safety, some people hate it for its complexity. I feel the complexity issue deeply, but one must admit Rust iterators are damn good.

The whole engine is essentially breaking the input into an array of strings and transforming that into another array of strings. Iterators and transformations! The best use case of Rust! The Rustacean in me kept whispering in my ear, telling me to use Rust.

I managed to stick to my guns because it would definitely be a huge pain in the ass to embed Rust in a CMake project that links against a system I don't understand. But I guess I could not resist my desire to write Rust-style iterator pipelines.

std::vector<std::string> IME::candidates(const std::string& input) {
    // TODO: replace wrong input... HOW?
    // Step 1: split the input into (possibly incomplete) pronunciations.
    auto words = split_words(input);
    if (words.empty()) {
        return {};
    }

    // Steps 2-3: every vocabulary whose first character matches the first
    // pronunciation, paired with its frequency.
    auto c1 = codes.get_all(words[0]) | rv::transform([&](const auto& head) {
                  return vocabs.contains(head)
                             ? vocabs[head]
                             : std::vector<std::pair<std::string, size_t>>();
              }) |
              flatten<std::pair<std::string, size_t>>();
    // Step 4: narrow down by intersecting the i-th character of each
    // vocabulary with the possible characters of the i-th pronunciation.
    for (size_t i = 1; i < words.size(); i++) {
        auto sub_candidate = codes.get_all(words[i]);
        c1 = c1 | rv::filter([&](const auto& p) {
                 return utf8::distance(p.first.cbegin(), p.first.cend()) > i;
             }) |
             rv::filter([&](const auto& p) {
                 // Slice out the i-th UTF-8 character of the vocabulary.
                 auto s = p.first.cbegin();
                 utf8::advance(s, i, p.first.cend());
                 auto e = s;
                 utf8::next(e, p.first.cend());
                 return std::find(
                            sub_candidate.cbegin(),
                            sub_candidate.cend(),
                            std::string_view(s, e)) != sub_candidate.cend();
             }) |
             ranges::to<std::vector>();
    }

    // Step 5 prep: how far the last pronunciation is from being complete
    // (0 = fully typed, ULONG_MAX = doesn't match at all).
    auto c2 = c1 | rv::transform([&](auto&& p) {
                  std::string last_char;
                  auto it = p.first.cend();
                  utf8::append(
                      utf8::prior(it, p.first.cbegin()),
                      std::back_inserter(last_char));
                  auto spell = dict[last_char];
                  auto dist = spell.starts_with(words.back())
                                  ? spell.length() - words.back().length()
                                  : ULONG_MAX;
                  return std::make_tuple(
                      std::move(p.first), std::move(p.second), dist);
              }) |
              ranges::to<std::vector>();
    c1.clear();

    // Step 5: sort by frequency, then (stable) by character count, then
    // (stable) by completion distance. The last sort wins, so distance
    // ends up as the primary criterion.
    std::sort(c2.begin(), c2.end(), [](const auto& l, const auto& r) {
        return std::get<1>(l) > std::get<1>(r);
    });
    std::stable_sort(c2.begin(), c2.end(), [](const auto& l, const auto& r) {
        return utf8::distance(std::get<0>(l).cbegin(), std::get<0>(l).cend()) <
               utf8::distance(std::get<0>(r).cbegin(), std::get<0>(r).cend());
    });
    std::stable_sort(c2.begin(), c2.end(), [](const auto& l, const auto& r) {
        return std::get<2>(l) < std::get<2>(r);
    });

    return c2 | rv::transform([](const auto& t) { return std::get<0>(t); }) |
           ranges::to<std::vector>();
}

Here is the absolute mess of C++23 ranges combined with UTF-8 string processing. Compiler errors sucked as usual when I didn't get the pipeline right. Average C++ template instantiation error. I absolutely hate the formatting too. I can't find a good set of configurations for clang-format.

The C++ STL, despite having so many containers and utility functions, cannot process UTF-8 strings. Maybe <codecvt> was the way, but it was deprecated in C++17, so I chose utfcpp. To be more precise, a folder named utf8 found in one of my old C++ projects, buried deep in the graveyard, that is probably utfcpp.
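For reference, these are the utfcpp primitives the pipeline above leans on. A toy example of my own (using the checked utf8.h API) that extracts the last code point of a std::string:

#include <cstdint>
#include <iterator>
#include <string>

#include "utf8.h" // utfcpp

// Returns the last UTF-8 encoded character of s as its own string.
std::string last_utf8_char(const std::string& s) {
    auto it = s.cend();
    // utf8::prior steps the iterator back one code point and returns it.
    uint32_t cp = utf8::prior(it, s.cbegin());
    std::string out;
    // utf8::append re-encodes the code point into UTF-8 bytes.
    utf8::append(cp, std::back_inserter(out));
    return out;
}

// last_utf8_char("你好") == "好", even though the string is 6 bytes long.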

CMake Hell

The integration step went pretty well. I copied the source files of the engine into the add-on project and included them in the CMakeLists.txt - the code compiled!

Until I decided to make the engine a standalone CMake project.

The amount of pain and ranting about CMake probably deserves its own blog post. The TL;DR is, I spent 2 hours digging through Stack Overflow and the CMake forums to find the one flag I needed to enable to make CMake compile the correct thing for ld to produce the final shared object. An average day of a programmer banging his head against the wall, trying to make something he doesn't understand work properly :)

Improving the candidate list

The candidate list sucks. I needed to input so many characters just to narrow the candidate list down to less than one page for the one Chinese character I wanted.

The problem is the ordering. There are 3 criteria that I want the candidates to be ordered by.

  1. The candidates should have exactly as many Chinese characters as the number of pronunciations I input. When I input d, I want the candidates on the first page to be single Chinese characters
  2. I want the more completed words at the top. When I input 'd', I want words pronounced da placed before words pronounced dong, because I have completed 50% of da but only 25% of dong
  3. I want the more frequently used words placed in front. When I input dai, "大" meaning big should be placed before "娣", a character I don't even know

The first 2 criteria are pretty simple. Simply attach more information to the candidates and sort them accordingly before returning the results. But for the third criterion, I simply don't have the data.
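The trick for ordering by multiple criteria, by the way, is chained stable sorts applied from the least important criterion to the most important one - each pass only reorders the ties left by the earlier passes. A tiny self-contained illustration with toy data (not the engine's real types):

#include <algorithm>
#include <cstddef>
#include <string>
#include <tuple>
#include <vector>

int main() {
    // (candidate, frequency, distance from completing the last syllable)
    std::vector<std::tuple<std::string, size_t, size_t>> c = {
        {"dong", 9, 2}, {"da", 3, 0}, {"daai", 7, 0}};

    // Least important criterion first: frequency, descending.
    std::sort(c.begin(), c.end(), [](const auto& l, const auto& r) {
        return std::get<1>(l) > std::get<1>(r);
    });
    // Most important criterion last: completion distance, ascending.
    // stable_sort keeps the frequency order among equal distances.
    std::stable_sort(c.begin(), c.end(), [](const auto& l, const auto& r) {
        return std::get<2>(l) < std::get<2>(r);
    });
    // Result: daai (dist 0, freq 7), da (dist 0, freq 3), dong (dist 2)
}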

Luckily, there is public domain vocabulary frequency data available online. Integrating it into the existing engine took me less than an hour, at the expense of the rich vocabulary set. I am the only user, so no one will notice the reduced vocabulary anyways.

There is probably some black magic in the original website that uses data not prefetched to improve the candidate list. But I am fairly satisfied with the candidate list now, so there is no reason for me to spend extra effort reverse engineering the JavaScript, I guess.

Conclusion

I spent a whole week, about 2 to 3 hours per day, working on this side project. I open sourced the code. However, this is a project written for myself only at the end of the day. I'm not sure if I will accept contributions at all. But one thing is for sure: the GitHub repo will have 0 stars till the fall of humanity.

Rust is great, procedural combined with functional programming is the king, C++23 sucks, fuck CMake.

Footnotes

  1. I just wrote a one-line script to convert this markdown into an html preview because markdown-preview.nvim refuses to work on my computer

  2. Can somebody tell me how to pronounce fcitx? Wiki says it’s [ˈfaɪtɪks] but I can’t read this

  3. Macro magics are fun

  4. C++ is just C with some useful convenient functions and constructs

  5. a monad is a monoid in the category of endofunctors, what's the problem?

  6. The skill issue language

  7. There are 2 hard problems in computer science: cache invalidation, naming things, and off-by-1 errors.

  8. No benchmarks performed
