Java 8 - Locale lookup behavior


Introduced in Java 8, Locale.lookup(), based on RFC 4647, allows the user to find the best match from a collection of Locale objects according to a priority list of Locale.LanguageRange. I don't understand every corner case of this method. The following shows one particular case I would like an explanation for:

// Create a collection of Locale objects to search
Collection<Locale> locales = new ArrayList<>();
locales.add(Locale.forLanguageTag("en-GB"));
locales.add(Locale.forLanguageTag("en"));

// Express the user's preferences with a Language Priority List
String ranges = "en-US;q=1.0,en-GB;q=1.0";
List<Locale.LanguageRange> languageRanges = Locale.LanguageRange.parse(ranges);

// Find the BEST match, and return just one result
Locale result = Locale.lookup(languageRanges,locales);
System.out.println(result.toString());

This prints en, whereas I would intuitively have expected en-GB.

Note that:

  • if you have a range of "en-GB;q=1.0,en-US;q=1.0" (GB and US reversed), this will print en-GB,
  • and if you have a range of "en-US;q=0.9,en-GB;q=1.0" (GB has a higher priority than US), this will print en-GB.

Could someone explain the rationale behind this behavior?

3

There are 3 answers

1
Holger On BEST ANSWER

If you provide language alternatives with the same priority, the list order becomes significant. This becomes apparent when you inspect the parsed list of "en-US;q=1.0,en-GB;q=1.0". It contains two entries, representing "en-US;q=1.0" followed by "en-GB;q=1.0".
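To inspect the parsed list yourself, something like the following minimal sketch could be used. The class name ParsedRanges and the helper describe are made up for illustration; only Locale.LanguageRange.parse, getRange and getWeight come from the JDK.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, not part of the question's code.
class ParsedRanges {
    // Formats each parsed range as "range (weight)", in the order lookup will try them.
    static String describe(String ranges) {
        List<Locale.LanguageRange> parsed = Locale.LanguageRange.parse(ranges);
        StringBuilder sb = new StringBuilder();
        for (Locale.LanguageRange r : parsed) {
            if (sb.length() > 0) {
                sb.append(", ");
            }
            sb.append(r.getRange()).append(" (").append(r.getWeight()).append(")");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Equal weights: the written order is kept.
        System.out.println(describe("en-US;q=1.0,en-GB;q=1.0"));
        // Different weights: the higher weight is moved to the front.
        System.out.println(describe("en-US;q=0.9,en-GB;q=1.0"));
    }
}
```

Note that the ranges are stored in lower case, so getRange() reports en-us rather than en-US.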

See https://www.ietf.org/rfc/rfc4647.txt

3.4. Lookup

Lookup is used to select the single language tag that best matches the language priority list for a given request. When performing lookup, each language range in the language priority list is considered in turn, according to priority. … The first matching tag found, according to the user's priority, is considered the closest match and is the item returned. For example, if the language range is "de-ch", a lookup operation can produce content with the tags "de" or "de-CH" but never content with the tag "de-CH-1996". If no language tag matches the request, the "default" value is returned.

In the lookup scheme, the language range is progressively truncated from the end until a matching language tag is located. …

The last sentence describes what has already been illustrated by the example in the first paragraph, i.e. a language range of de-CH might match either de-CH or de. This lookup with fallback is performed for each item of the list, stopping at the first one for which a match is found.

In other words, specifying "en-US;q=1.0,en-GB;q=1.0" is like specifying "en-US,en,en-GB,en".
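A quick way to convince yourself of this equivalence is to run both range strings against the question's locales. This is only a sketch; the class name ExpandedRanges and the helper lookup are invented here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class comparing the original and the "expanded" range list.
class ExpandedRanges {
    // Runs Locale.lookup against the question's locales (en-GB and en)
    // and returns the language tag of the result, or null if nothing matched.
    static String lookup(String ranges) {
        List<Locale> locales = Arrays.asList(
                Locale.forLanguageTag("en-GB"),
                Locale.forLanguageTag("en"));
        Locale result = Locale.lookup(Locale.LanguageRange.parse(ranges), locales);
        return result == null ? null : result.toLanguageTag();
    }

    public static void main(String[] args) {
        System.out.println(lookup("en-US;q=1.0,en-GB;q=1.0")); // prints en
        System.out.println(lookup("en-US,en,en-GB,en"));       // prints en
    }
}
```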


Maybe what you want is filtering, see

3.3. Filtering

Filtering is used to select the set of language tags that matches a given language priority list. …

In filtering, each language range represents the least specific language tag (that is, the language tag with fewest number of subtags) that is an acceptable match.

Thus, given your original list of selectable locales

List<Locale> filtered = Locale.filter(
    Locale.LanguageRange.parse("en-US;q=1.0,en-GB;q=1.0"), locales);
System.out.println("filtered: "+filtered);

produces [en_GB].

whereas

Collection<Locale> locales = Arrays.asList(Locale.forLanguageTag("en"),
    Locale.forLanguageTag("en-GB"), Locale.forLanguageTag("en-US"));
List<Locale> filtered = Locale.filter(
    Locale.LanguageRange.parse("en-US;q=1.0,en-GB;q=1.0"), locales);
System.out.println("filtered: "+filtered);

produces [en_US, en_GB] (note the prioritized order and the absence of an en fallback). So depending on the context you may attempt to select from a filtered list first and only resort to lookup when the filtered list is empty.
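The "filter first, lookup as fallback" idea could be sketched as below. The class FilterThenLookup and the method bestMatch are hypothetical names; whether this strategy is appropriate depends on your application.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: prefer a filtered match, fall back to lookup.
class FilterThenLookup {
    static Locale bestMatch(List<Locale.LanguageRange> ranges, Collection<Locale> locales) {
        // Filtering only returns locales at least as specific as a requested range.
        List<Locale> filtered = Locale.filter(ranges, locales);
        if (!filtered.isEmpty()) {
            return filtered.get(0); // the list is ordered by priority
        }
        // Nothing matched exactly: allow truncation (en-US -> en) via lookup.
        return Locale.lookup(ranges, locales);
    }

    public static void main(String[] args) {
        List<Locale.LanguageRange> ranges =
                Locale.LanguageRange.parse("en-US;q=1.0,en-GB;q=1.0");

        Collection<Locale> both = Arrays.asList(
                Locale.forLanguageTag("en-GB"), Locale.forLanguageTag("en"));
        System.out.println(bestMatch(ranges, both).toLanguageTag());   // prints en-GB

        Collection<Locale> onlyEn = Collections.singletonList(Locale.forLanguageTag("en"));
        System.out.println(bestMatch(ranges, onlyEn).toLanguageTag()); // prints en
    }
}
```

With the question's locales this returns en-GB, and it still falls back to en when neither en-US nor en-GB is available.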

In any case, the behavior of Java's implementation is in line with the specification. As you already noted, changing the priority, or changing the order when the priorities are equal, changes the result, just as the specification says it should.

0
gontard On

Parsing each of the given range strings produces the following Language Priority Lists:

  • for "en-US;q=1.0,en-GB;q=1.0" the priority list is [en-us;q=1.0,en-gb;q=1.0]

  • for "en-GB;q=1.0,en-US;q=1.0" the priority list is [en-gb;q=1.0,en-us;q=1.0]

  • for "en-US;q=0.9,en-GB;q=1.0" the priority list is [en-gb;q=1.0,en-us;q=0.9]

The lookup method then walks this priority list until it finds a matching locale (according to RFC 4647):

  • for en-us;q=1.0,en-gb;q=1.0, the algorithm takes en-us;q=1.0 first, for which the best matching locale is en
  • for en-gb;q=1.0,en-us;q=1.0, the algorithm takes en-gb;q=1.0 first, for which the best matching locale is en-GB
  • for en-gb;q=1.0,en-us;q=0.9, the algorithm takes en-gb;q=1.0 first, for which the best matching locale is en-GB
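
The three cases can be checked side by side with a small program along these lines. The class name ThreeCases and the helper lookup are invented for this sketch; the locales are the ones from the question.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class running the three range strings from the question.
class ThreeCases {
    // Looks up the best match among en-GB and en for the given range string.
    static String lookup(String ranges) {
        List<Locale> locales = Arrays.asList(
                Locale.forLanguageTag("en-GB"),
                Locale.forLanguageTag("en"));
        Locale result = Locale.lookup(Locale.LanguageRange.parse(ranges), locales);
        return result == null ? null : result.toLanguageTag();
    }

    public static void main(String[] args) {
        System.out.println(lookup("en-US;q=1.0,en-GB;q=1.0")); // prints en
        System.out.println(lookup("en-GB;q=1.0,en-US;q=1.0")); // prints en-GB
        System.out.println(lookup("en-US;q=0.9,en-GB;q=1.0")); // prints en-GB
    }
}
```
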
0
Lukasz Wiktor On

The steps to get this result are as follows:

  1. does en-US match en-GB? → no
  2. does en-US match en? → no
  3. truncate en-US to en
  4. does en match en-GB? → no
  5. does en match en? → yes, matching tag found, return it
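
The steps above can be sketched as a simplified loop over plain strings. This is not the JDK's code: the class LookupSketch and its lookupTag method are illustrative only, and the sketch deliberately ignores wildcard ranges and the RFC 4647 rule about removing single-letter subtags together with the subtag that follows them.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Simplified illustration of RFC 4647 lookup; names are hypothetical.
class LookupSketch {
    // Returns the first tag matched by any range, trying each range in
    // priority order and truncating it from the end until something matches.
    static String lookupTag(List<String> ranges, List<String> tags) {
        for (String range : ranges) {
            String candidate = range.toLowerCase(Locale.ROOT);
            while (!candidate.isEmpty()) {
                // Steps 1-2 and 4-5: compare the current candidate with every tag.
                for (String tag : tags) {
                    if (tag.toLowerCase(Locale.ROOT).equals(candidate)) {
                        return tag;
                    }
                }
                // Step 3: drop the last subtag, e.g. en-us -> en.
                int lastHyphen = candidate.lastIndexOf('-');
                candidate = lastHyphen < 0 ? "" : candidate.substring(0, lastHyphen);
            }
        }
        return null; // no match; the caller would use its default
    }

    public static void main(String[] args) {
        List<String> tags = Arrays.asList("en-GB", "en");
        System.out.println(lookupTag(Arrays.asList("en-US", "en-GB"), tags)); // prints en
        System.out.println(lookupTag(Arrays.asList("en-GB", "en-US"), tags)); // prints en-GB
    }
}
```

The key point is that truncation of the first range is exhausted before the second range is even considered, which is why en wins over en-GB in the question's example.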

It works according to the RFC 4647:

3.4. Lookup

...

The first matching tag found, according to the user's priority, is considered the closest match and is the item returned.

...

In the lookup scheme, the language range is progressively truncated from the end until a matching language tag is located.

The core of the lookup algorithm is implemented in sun.util.locale.LocaleMatcher#lookupTag. You can check out the source code.