
Translation models between English and Chinese

2025-11-02 #chinese #translations

I ran some tests comparing the output of different translation models on the English–Chinese language pair. All of these models are available for local offline use.

The goal is to find an ideal model to use on my own devices for translating this language pair.

Overview

I ran the tests in batches, each covering a different set of models. In each batch, I supplied every model with the same set of texts. A native Chinese speaker proficient in English evaluated the translations and ranked them from best to worst.

First batch of tests

My first set of subjects was LibreTranslate (with its current default model), Opus-MT, and NLLB-600M. All of these run perfectly fine on consumer hardware.
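These models can be driven from Python; below is a minimal sketch using the Hugging Face transformers library. The checkpoint names are the publicly listed Helsinki-NLP ones and are my assumption, not necessarily the exact harness used for these tests (LibreTranslate bundles its own models and runs as a separate service).

```python
# Minimal sketch: running Opus-MT locally via Hugging Face transformers.
# The checkpoint names below are the publicly listed Helsinki-NLP ones; they
# may differ from the exact builds used in these tests.

# Opus-MT publishes one checkpoint per translation direction, while NLLB is a
# single multilingual model steered by language codes at inference time.
OPUS_MT_CHECKPOINTS = {
    ("zh", "en"): "Helsinki-NLP/opus-mt-zh-en",
    ("en", "zh"): "Helsinki-NLP/opus-mt-en-zh",
}


def opus_mt_checkpoint(src: str, tgt: str) -> str:
    """Look up the Opus-MT checkpoint name for a translation direction."""
    return OPUS_MT_CHECKPOINTS[(src, tgt)]


def translate_opus_mt(text: str, src: str = "zh", tgt: str = "en") -> str:
    """Translate with Opus-MT; downloads the model on first use."""
    from transformers import pipeline  # imported lazily: heavy dependency

    translator = pipeline("translation", model=opus_mt_checkpoint(src, tgt))
    return translator(text)[0]["translation_text"]
```

NLLB follows the same pattern with a single multilingual checkpoint, e.g. `pipeline("translation", model="facebook/nllb-200-distilled-600M", src_lang="zho_Hans", tgt_lang="eng_Latn")`, using FLORES-200 language codes.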

Below is the verdict of the first batch of tests (I didn’t keep more granular results for the first batch).

English to Chinese:

Based on these results, the recommendations are:

Chinese to English:

Based on these results, the recommendations are:

Second batch of tests

For my second batch of comparisons, I used some newer (and also heavier) models:

For this second batch of tests, I translated from Chinese to English, and kept more detailed notes of the results.

Test 2.1

Opus-MT omits the subject in the second part, which really washes out the meaning of the sentence. MADLAD omits “too”, which was considered to change the original meaning too much. Tower was considered the best, since it retained the original order of the statements, which in turn maintained the melancholic tone.

The ranking for these is (from best to worst):

Test 2.2

The translation by deepseek.com is included here for reference.

Tower’s results here are truncated, which I later discovered was due to how I configured my test service. This model is much less straightforward to run than the others, and requires a lot of tuning and domain knowledge that I lack.

Test 2.3

Again, Tower’s result is truncated, due to the configuration I used for its output. Its results would have been considered the best of all five had it not been truncated.

Ranking for passing results: MADLAD-400, NLLB-3.3B. Failed: NLLB, Opus-MT.

Test 2.4

At this point I removed Tower from the list. It’s clear that it consistently provides the best results, but it requires an excessive amount of RAM (~25GB) and takes over 10 minutes to process these samples. Ideally, Tower needs to run on a GPU rather than a CPU, and needs much more fine-tuning than I was interested in doing for this experiment. It can technically run on consumer hardware, but not on most of my devices.

Ranking:

Test 2.5

During the initial run, Opus-MT only translated the first sentence. It turns out that this model only supports translating one sentence at a time. I worked around this by updating my test script to split the input into sentences, translate them individually, and then join the results again.
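The workaround looks roughly like this (the splitting rule shown here is a simplified sketch, not the exact code from my test script):

```python
import re


def split_sentences(text: str) -> list[str]:
    """Split on Chinese and Western sentence-final punctuation, keeping the
    punctuation attached to its sentence. A naive rule, but enough here."""
    parts = re.split(r"(?<=[。！？!?.])\s*", text)
    return [part for part in parts if part]


def translate_long(text: str, translate_one) -> str:
    """Work around single-sentence models (like Opus-MT) by translating each
    sentence separately and joining the results with spaces."""
    return " ".join(translate_one(s) for s in split_sentences(text))
```

Here `translate_one` is whatever callable wraps the model, e.g. a function driving the Opus-MT pipeline.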

All models passed this test, from best to worst: NLLB-3.3B, MADLAD-400, NLLB, Opus-MT.

Test 2.6

All models passed this test, from best to worst: MADLAD-400, Opus-MT/NLLB-3.3B (draw), NLLB.

Test 2.7

All models failed this test.

Third batch of tests

The third and final batch of tests focuses on the same models as the second batch, but translating English to Chinese.

Test 3.1

Ranking (all pass): Opus-MT, MADLAD-400, NLLB-3.3B, NLLB

Test 3.2

Input: This man is completely out of his sense. He speaks madness. Nonsense!

Ranking:

Test 3.3

Input: It is hard to tell which one he is quietly and deeply gazing at, the bloody hue of the sinking sun at the horizon, or that pale azure moon hanging in the early summer sky.

Ranking:

Conclusions

Tower consistently seems to provide the best translations, but is also in another category of resource requirements and speed. I find it unsuitable for modern consumer hardware.

Of the remaining models, all failed at least two tests. There is also a clear asymmetry depending on the direction of translation. When translating simple or technical text (i.e. not metaphors or poetry), all the models produce useful results.

Opus-MT stands out positively when translating English to Chinese, and does a generally okay job translating Chinese to English. It is by far the lightest and fastest of the models, and quite suitable for average consumer devices.

MADLAD-400 stands out in the opposite direction, translating Chinese to English, but can also yield imperfect results. This model is noticeably slower than Opus-MT, and uses a huge amount of RAM (30–40GB).

NLLB-3.3B does a pretty good job translating Chinese to English, but did rather poorly in all tests in the opposite direction. NLLB-3.3B uses around 13–16GB, while NLLB-600M uses around 2.5–3GB. This makes the former somewhat unsuitable for most devices (especially phones), while the latter can run on much more diverse hardware.

Have comments or want to discuss this topic?
Send an email to my public inbox: ~whynothugo/public-inbox@lists.sr.ht.
Or feel free to reply privately by email: hugo@whynothugo.nl.

— § —