[libmlocale] Handle bucketing for locales with conflicting data. Contributes to JB#46546

Some locales, such as Russian and Hungarian, have conflicting data:
the exemplar character index list includes both the unaccented and the
accented vowel, while the collation rules define no primary strength
difference between them.

Previously MLocale::indexBucket() returned the latter bucket for all
strings starting with either the unaccented or the accented character,
which was commonly wrong.

Added a special case that checks whether adjacent buckets are considered
equal by the collator and, in that case, maps to the latter one only
if the bucketed string starts with it. Other accented variants thus go
to the earlier character, which is assumed to be the base bucket.
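
The special case can be illustrated with a minimal, Qt-free sketch. It is
not the libmlocale implementation: primaryLess() is a hypothetical stand-in
for a primary-strength collator that only folds Russian Ё to Е, pickBucket()
is an illustrative name, input is assumed to be already uppercased, and the
final fallback is simplified.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical primary-strength key: fold Ё (U+0401) to Е (U+0415).
// A real collator (for example ICU) handles far more than this.
static char32_t primaryKey(char32_t c) {
    return c == U'\u0401' ? U'\u0415' : c;
}

// Hypothetical stand-in for coll(a, b): true if a sorts before b
// when only primary-level differences are considered.
static bool primaryLess(const std::u32string &a, const std::u32string &b) {
    const std::size_t n = std::min(a.size(), b.size());
    for (std::size_t i = 0; i < n; ++i) {
        const char32_t ka = primaryKey(a[i]);
        const char32_t kb = primaryKey(b[i]);
        if (ka != kb)
            return ka < kb;
    }
    return a.size() < b.size();
}

// Illustrative bucket lookup mirroring the special case described above.
// Assumes `str` is already uppercased (the original compares
// case-insensitively) and `buckets` is sorted and non-empty.
static std::u32string pickBucket(const std::u32string &str,
                                 const std::vector<std::u32string> &buckets) {
    assert(!buckets.empty());
    for (std::size_t i = 0; i < buckets.size(); ++i) {
        if (!primaryLess(str, buckets[i]))
            continue;
        if (i == 0)
            return str.substr(0, 1); // sorts before the first bucket
        // buckets[i-2] and buckets[i-1] are primary-equal: use the latter
        // only when the string literally starts with its label.
        if (i > 1 && !primaryLess(buckets[i - 2], buckets[i - 1])
            && str.compare(0, buckets[i - 1].size(), buckets[i - 1]) != 0) {
            return buckets[i - 2];
        }
        return buckets[i - 1];
    }
    // Simplification: the real function performs a further
    // primary-equality check before settling on the last bucket.
    return buckets.back();
}
```

With the buckets Д, Е, Ё, Ж, this sketch maps a string starting with Е to
the Е bucket and a string starting with Ё to the Ё bucket.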

Another option would have been to sanity-check the exemplar character
list and filter out characters that have no primary level difference
from the earlier one.
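
That alternative was not implemented in this commit. A rough sketch of what
such a filter might look like is given here; filterPrimaryDistinct() and
toyPrimaryLess() are hypothetical names, and the comparator is a toy that
only folds Ё to Е on the first code point of each label.

```cpp
#include <string>
#include <vector>

// Toy primary-strength comparison on the first code point only,
// folding Ё (U+0401) to Е (U+0415). Purely illustrative.
static bool toyPrimaryLess(const std::u32string &a, const std::u32string &b) {
    auto key = [](const std::u32string &s) -> char32_t {
        if (s.empty())
            return 0;
        return s[0] == U'\u0401' ? U'\u0415' : s[0];
    };
    return key(a) < key(b);
}

// Keep an exemplar only if it is primary-greater than the last kept one,
// so that no two adjacent buckets are primary-equal.
template <typename Less>
std::vector<std::u32string>
filterPrimaryDistinct(const std::vector<std::u32string> &exemplars, Less less) {
    std::vector<std::u32string> kept;
    for (const auto &label : exemplars) {
        if (kept.empty() || less(kept.back(), label))
            kept.push_back(label);
    }
    return kept;
}
```

The trade-off is that filtering would remove the accented bucket entirely,
whereas the approach taken in this commit keeps both buckets.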
pvuorela committed Oct 2, 2019
1 parent be31dba commit 83f306f
Showing 1 changed file with 17 additions and 9 deletions.
26 changes: 17 additions & 9 deletions src/mlocale.cpp
@@ -3971,21 +3971,29 @@ QString MLocale::indexBucket(const QString &str, const QStringList &buckets, con
         if (coll(strUpperCase, buckets[i])) {
             if (i == 0) {
                 return firstCharacter;
+            } else if (buckets.first() == QString::fromUtf8("")) { // stroke count sorting
+                return QString::number(i) + QString::fromUtf8("");
+            } else if (i > 1 && !coll(buckets[i-2], buckets[i-1])
+                       && !str.startsWith(buckets[i-1], Qt::CaseInsensitive)) {
+                // some locales have conflicting data as in exemplar characters containing accented variants
+                // of some letters while collation doesn't have primary level difference between them,
+                // for example hungarian short and long vowels, and russian Е/Ё.
+                // in such case return the earlier bucket for all strings that don't start with the latter
+                // To consider: do we need to handle even longer runs of primary level equal buckets?
+                return buckets[i-2];
             }
-            else {
-                if(buckets.first() == QString::fromUtf8("")) // stroke count sorting
-                    return QString::number(i)+QString::fromUtf8("");
-                else
-                    return buckets[i-1];
-            }
+
+            return buckets[i-1];
         }
     }
     // return the last bucket if any substring starting from the beginning compares
     // primary equal to the last bucket label:
-    for (int i = 0; i < strUpperCase.size(); ++i)
-        if(!coll(buckets.last(),strUpperCase.left(i+1))
-           && !coll(strUpperCase.left(i+1),buckets.last()))
+    for (int i = 0; i < strUpperCase.size(); ++i) {
+        if (!coll(buckets.last(),strUpperCase.left(i+1))
+            && !coll(strUpperCase.left(i+1), buckets.last())) {
             return buckets.last();
+        }
+    }
     // last resort, no appropriate bucket found:
     return firstCharacter;
 }