Fix space omission when translating into CJK languages (#956)
This PR fixes an issue where the target language tag was compared for
exact equality against the CJK language codes, when the check should be
whether the tag starts with the language code.

Before this PR, `zh-Hans` would not match. After this PR, it matches.
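
The difference between the two checks can be sketched in isolation. The following is a minimal illustration rather than the project's code: `matchesExactly`, `matchesPrefix`, and `checkExamples` are hypothetical names, and the tags are example inputs only.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical stand-in for the old check: the whole tag must equal the language code.
inline bool matchesExactly(std::string_view language, std::string_view tag) {
  return tag == language;
}

// Hypothetical stand-in for the new check: the tag only has to begin with the language code.
inline bool matchesPrefix(std::string_view language, std::string_view tag) {
  return tag.size() >= language.size() && tag.substr(0, language.size()) == language;
}

// Compares both checks on a few example tags.
inline void checkExamples() {
  assert(matchesExactly("zh", "zh"));        // a bare code matches either way
  assert(!matchesExactly("zh", "zh-Hans"));  // equality misses a tag with a script subtag
  assert(matchesPrefix("zh", "zh-Hans"));    // the prefix check accepts it
  assert(!matchesPrefix("zh", "en"));        // unrelated tags are still rejected
}
```

Calling `checkExamples()` in a build with assertions enabled exercises the `zh-Hans` case described above.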
nordzilla authored Dec 11, 2024
1 parent 0f2268f commit b2c788c
Showing 1 changed file with 10 additions and 3 deletions.
13 changes: 10 additions & 3 deletions inference/src/translator/annotation.cpp
@@ -34,6 +34,13 @@ void AnnotatedText::appendSentence(string_view prefix, std::vector<string_view>:
   annotation.token_begin_.push_back(offset);
 }
 
+/// A simple helper function to check if a string starts with a prefix.
+/// The std::string object only has a starts_with() method in C++20, which
+/// is not what we are currently compiling with.
+bool startsWith(string_view prefix, string_view str) {
+  return str.size() >= prefix.size() && prefix == str.substr(0, prefix.size());
+}
+
 bool AnnotatedText::shouldOmitSpaceBetweenSentences() const {
   if (targetLanguage_.empty()) {
     // The target language is not specified, so we should not make assumptions about
@@ -45,11 +52,11 @@ bool AnnotatedText::shouldOmitSpaceBetweenSentences() const {
   // More robustly handle which language tags should omit whitespace between sentences.
   return (
       // Japanese does not use space between sentences.
-      targetLanguage_ == "ja" ||
+      startsWith("ja", targetLanguage_) ||
       // Korean does not use space between sentences.
-      targetLanguage_ == "ko" ||
+      startsWith("ko", targetLanguage_) ||
       // Chinese does not use space between sentences.
-      targetLanguage_ == "zh"
+      startsWith("zh", targetLanguage_)
   );
 }
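
One consequence of a plain prefix test is that any tag beginning with the same two letters also matches; for example, the three-letter code `kok` (Konkani) begins with `ko`. If that ever matters, a stricter check could require the language code to be followed by a subtag separator. The sketch below is an assumption for illustration and not part of this commit: `hasPrimaryLanguage` is a hypothetical helper, and it assumes subtags are separated by `-` (it also accepts `_` as a convenience).

```cpp
#include <cassert>
#include <string_view>

// Hypothetical helper, not part of the commit: returns true when `tag` is exactly
// `language`, or begins with `language` immediately followed by a subtag separator.
inline bool hasPrimaryLanguage(std::string_view language, std::string_view tag) {
  // The tag must at least begin with the language code.
  if (tag.size() < language.size() || tag.substr(0, language.size()) != language) {
    return false;
  }
  // A bare language code such as "zh" matches as-is.
  if (tag.size() == language.size()) {
    return true;
  }
  // Otherwise the next character must start a new subtag, as in "zh-Hans".
  const char next = tag[language.size()];
  return next == '-' || next == '_';
}
```

With this variant, `hasPrimaryLanguage("zh", "zh-Hans")` is true while `hasPrimaryLanguage("ko", "kok")` is false. Whether the extra strictness is worthwhile depends on which tags the translator can actually receive.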
