
Follow-up on the TranslateGemma subtitle benchmark: human review of segments rated "clean" by MetricX-24 and COMETKiwi [D]

A few weeks ago I shared the results of a benchmark here comparing 6 LLMs on subtitle translation, scored with two reference-free QE metrics, MetricX-24 (~13B mT5-XXL) and COMETKiwi (~10.7B XLM-R-XXL), combined into a TQI index. Posting a follow-up because we ran a human review afterwards, and the result is worth discussing.

The original benchmark put TranslateGemma-12b first in every language pair. The natural question: are those high scores accurate, or are the metrics insensitive in their high-confidence zone? These metrics correlate well with human judgment at the population level (that's what they're trained for), but population-level correlation doesn't tell you whether the segments they call "clean" are actually clean.

So we ran the check directly. 21 English subtitle segments from one tutorial video. TranslateGemma's translations into 4 languages (ES, JA, TH, ZH-CN; Korean and Traditional Chinese got dropped). All 84 translations were chosen because they passed the dashboard clean-rule (MX < 5 AND CK ≥ 0.70) in all 4 languages simultaneously. Then full MQM annotation by professional linguists: Major/Minor severity, with categories covering accuracy (mistranslation, omission, addition, untranslated), fluency (grammar, punctuation, inconsistency), style, and terminology.
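For concreteness, the clean-rule selection can be sketched like this. A minimal Python sketch: the function names and the score-dictionary layout are my assumptions; only the two thresholds come from the dashboard rule described above.

```python
# Dashboard clean-rule thresholds (from the post).
MX_MAX = 5.0   # MetricX-24: lower is better, flag at >= 5
CK_MIN = 0.70  # COMETKiwi: higher is better, flag at < 0.70

def is_clean(mx: float, ck: float) -> bool:
    """Clean-rule for a single translation: MX < 5 AND CK >= 0.70."""
    return mx < MX_MAX and ck >= CK_MIN

def segments_clean_in_all_langs(scores):
    """scores: {segment_id: {lang: (mx, ck)}} (hypothetical layout).
    Keep only segments whose translations pass in every language."""
    return [
        seg for seg, by_lang in scores.items()
        if all(is_clean(mx, ck) for mx, ck in by_lang.values())
    ]
```

Applying `segments_clean_in_all_langs` to segment-level scores for all four target languages reproduces the "passed in all 4 languages simultaneously" selection.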

Results under the dashboard threshold:

  • Auto-flagged: 1/84
  • Human-flagged: 60/84 any-error, 13/84 Major-only
  • Metric-blindness rate ((auto-clean ∩ human-flagged) / auto-clean): 59/83 = 71% any-error, 12/83 = 14.5% Major-only
  • All 25 human-found Accuracy-class errors fell in the metric-blind quadrant. Zero overlap with the auto-flagged region (which contained one Style-category Major error).
  • Japanese carries 10 of 15 total mistranslations across the dataset, all metric-blind, despite having the highest mean COMETKiwi (0.863) of the four languages.
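The rates in the bullets above follow from simple counting; here is that arithmetic spelled out. The overlap of 1 between the auto-flagged and human-flagged sets is inferred from the note that the single auto-flagged segment contained a Major Style error.

```python
# Reproducing the metric-blindness arithmetic from the bullets above.
total        = 84
auto_flagged = 1
auto_clean   = total - auto_flagged   # 83 segments the metrics called clean

human_any    = 60   # any-severity errors found by linguists
human_major  = 13   # Major-only

# The one auto-flagged segment also carried a human-found Major (Style)
# error, so the blind quadrant is human-flagged minus that overlap.
blind_any    = human_any - 1          # 59
blind_major  = human_major - 1        # 12

rate_any   = blind_any / auto_clean   # 59/83 ~ 71%
rate_major = blind_major / auto_clean # 12/83 ~ 14.5%
```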

Caveat: small n, one model, one content set, so the numbers are directional rather than definitive.

Original thread: [link]
Full benchmark report: in comments.

submitted by /u/ritis88


Tagged with

#TranslateGemma
#subtitle translation
#benchmark
#MetricX-24
#COMETKiwi
#human review
#quality estimation metrics
#TQI index
#clean-rule
#MQM annotation