
Translation models between English and Chinese

2025-11-02 #chinese #translations

I ran some tests comparing the output of different translation models on the English–Chinese language pair. All of these models are available for local offline use.

The goal is to find an ideal model to use on my own devices for translating this language pair.

Overview

I ran the tests in batches, each covering a different set of models. In each batch, I supplied every model with the same set of texts. A native Chinese speaker proficient in English evaluated the translations and ranked them from best to worst.

First batch of tests

My first set of subjects was LibreTranslate (with its current default model), Opus-MT, and NLLB-600M. All of these run perfectly fine on consumer hardware.
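These models can be driven from Python; below is a minimal sketch using the Hugging Face transformers library. The checkpoint names are the publicly listed Helsinki-NLP ones and are my assumption, not necessarily the exact harness used for these tests (LibreTranslate bundles its own models and runs as a separate service).

```python
# Minimal sketch: running Opus-MT locally via Hugging Face transformers.
# The checkpoint names below are the publicly listed Helsinki-NLP ones; they
# may differ from the exact builds used in these tests.

# Opus-MT publishes one checkpoint per translation direction, while NLLB is a
# single multilingual model steered by language codes at inference time.
OPUS_MT_CHECKPOINTS = {
    ("zh", "en"): "Helsinki-NLP/opus-mt-zh-en",
    ("en", "zh"): "Helsinki-NLP/opus-mt-en-zh",
}


def opus_mt_checkpoint(src: str, tgt: str) -> str:
    """Look up the Opus-MT checkpoint name for a translation direction."""
    return OPUS_MT_CHECKPOINTS[(src, tgt)]


def translate_opus_mt(text: str, src: str = "zh", tgt: str = "en") -> str:
    """Translate with Opus-MT; downloads the model on first use."""
    from transformers import pipeline  # imported lazily: heavy dependency

    translator = pipeline("translation", model=opus_mt_checkpoint(src, tgt))
    return translator(text)[0]["translation_text"]
```

NLLB follows the same pattern with a single multilingual checkpoint, e.g. `pipeline("translation", model="facebook/nllb-200-distilled-600M", src_lang="zho_Hans", tgt_lang="eng_Latn")`, using FLORES-200 language codes.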

Below is the verdict of the first batch of tests (I didn’t keep more granular results for the first batch).

English to Chinese:

Based on these results, the recommendations are:

Chinese to English:

Based on these results, the recommendations are:

Second batch of tests

For my second batch of comparisons, I used some newer (and also heavier) models:

For this second batch of tests, I translated from Chinese to English, and kept more detailed notes of the results.

Test 2.1

Opus-MT omits the subject in the second part, which really washes out the meaning of the sentence. MADLAD omits “too”, which was considered to change the original meaning too much. Tower was considered the best, since it retained the original order of the statements, which in turn maintained the melancholic tone.

The ranking for these is (from best to worst):

Test 2.2

The translation by deepseek.com is included here for reference.

Tower’s results here are truncated, which I later discovered was due to how I configured my test service. This model is much less straightforward to run than the others, and requires a lot of tuning and domain knowledge that I lack.

Test 2.3

Again, Tower’s result is truncated, due to the configuration I used for its output. Its results would have been considered the best of all five had it not been truncated.

Ranking for passing results: MADLAD-400, NLLB-3.3B. Failed: NLLB, Opus-MT.

Test 2.4

At this point I removed Tower from the list. It’s clear that it consistently provides the best results, but it requires an excessive amount of RAM (~25GB) and takes over 10 minutes to process these samples. Ideally, Tower needs to run on a GPU rather than a CPU, and needs much more fine-tuning than I was interested in doing for this experiment. It can technically run on consumer hardware, but not on most of my devices.

Ranking:

Test 2.5

During the initial run, Opus-MT only translated the first sentence. It turns out that this model only supports translating one sentence at a time. I worked around this by updating my test script to split the input into sentences, translate them individually, and then join the results again.
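The workaround looks roughly like this (the splitting rule shown here is a simplified sketch, not the exact code from my test script):

```python
import re


def split_sentences(text: str) -> list[str]:
    """Split on Chinese and Western sentence-final punctuation, keeping the
    punctuation attached to its sentence. A naive rule, but enough here."""
    parts = re.split(r"(?<=[。！？!?.])\s*", text)
    return [part for part in parts if part]


def translate_long(text: str, translate_one) -> str:
    """Work around single-sentence models (like Opus-MT) by translating each
    sentence separately and joining the results with spaces."""
    return " ".join(translate_one(s) for s in split_sentences(text))
```

Here `translate_one` is whatever callable wraps the model, e.g. a function driving the Opus-MT pipeline.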

All models passed this test, from best to worst: NLLB-3.3B, MADLAD-400, NLLB, Opus-MT.

Test 2.6

All models passed this test, from best to worst: MADLAD-400, Opus-MT/NLLB-3.3B (draw), NLLB.

Test 2.7

All models failed this test.

Third batch of tests

The third and final batch of tests focuses on the same models as the second batch, but translating English to Chinese.

Test 3.1

Ranking (all pass): Opus-MT, MADLAD-400, NLLB-3.3B, NLLB

Test 3.2

Input: This man is completely out of his sense. He speaks madness. Nonsense!

Ranking:

Test 3.3

Input: It is hard to tell which one he is quietly and deeply gazing at, the bloody hue of the sinking sun at the horizon, or that pale azure moon hanging in the early summer sky.

Ranking:

Conclusions

Tower consistently seems to provide the best translations, but is also in another category of resource requirements and speed. I find it unsuitable for modern consumer hardware.

Of the remaining models, all failed at least two tests. There is also a clear asymmetry depending on the direction of translation. When translating simple or technical text (i.e. not metaphors or poetry), all the models produce useful results.

Opus-MT stands out positively when translating English to Chinese, and does a generally okay job translating Chinese to English. It is by far the lightest and fastest of the models, and quite suitable for average consumer devices.

MADLAD-400 stands out in the opposite direction, translating Chinese to English, but can also yield imperfect results. This model is noticeably slower than Opus-MT, and uses a huge amount of RAM (30–40GB).

NLLB-3.3B does a pretty good job translating Chinese to English, but did rather poorly in all tests in the opposite direction. NLLB-3.3B uses around 13–16GB, while NLLB-600M uses around 2.5–3GB. This makes the former somewhat unsuitable for most devices (especially phones), while the latter can run on much more diverse hardware.

Have comments or want to discuss this topic?
Send an email to my public inbox: ~whynothugo/public-inbox@lists.sr.ht.
Or feel free to reply privately by email: hugo@whynothugo.nl.

— § —