HashMap does not behave as expected for Chinese characters

China-中国,CN
Angola-安哥拉,AO
Afghanistan-阿富汗,AF
Albania-阿尔巴尼亚,AL
Algeria-阿尔及利亚,DZ
Andorra-安道尔共和国,AD
Anguilla-安圭拉岛,AI

In Java, I'm reading the above text from a file and creating a map where the keys will be the part before the comma and the values will be the region code after the comma.

Here is the code:

public static void main(String[] args) {

    BufferedReader br;
    Map<String,String>  mymap = new HashMap<String,String>();
    try {
        br = new BufferedReader(new InputStreamReader(new FileInputStream("C:/Users/IBM_ADMIN/Desktop/region_code_abbreviations_Chinese.csv"), "UTF-8"));
        String line;
        while ((line = br.readLine()) != null) {
           //System.out.println(line);
           String[] arr= line.split(",");
           mymap.put(arr[0], arr[1]);
        }

        br.close();
    } catch (IOException e) {
        System.out.println("Failed to read users file.");
    } finally {}

    for(String s: mymap.keySet()){
        System.out.println(s);
        if(s.equals("China-中国")){
            System.out.println("Got it");
            break;
        }
    }

    System.out.println("----------------");
    System.out.println("Returned from map  "+ mymap.get("China-中国"));

    mymap = new HashMap<String,String>();
    mymap.put("China-中国","Explicitly Put");
    System.out.println(mymap.get("China-中国"));
    System.out.println("done");
}

The output:

:
:
Egypt-埃及
Guyana-圭亚那
New Zealand-新西兰
China-中国
Indonesia-印度尼西亚
Laos-老挝
Chad-乍得
Korea-韩国
:
:
Returned from map  null
Explicitly Put
done

The map is loaded correctly, but when I look up "China-中国" in it, I do not get the value.

If I explicitly put "China-中国" in map, then it returns a value. Why is this happening?

There are 3 answers

4
wumpz (best answer)

Check whether your resource file is plain UTF-8 or UTF-8 with a BOM (byte order mark) at the start. A BOM would only interfere with the first value, though. If you change the test to a value from the middle of the file, do you get a value or not? If not, then the BOM is not the problem.
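To illustrate the BOM case, here is a minimal sketch that reads the same kind of data from an in-memory byte array instead of the real file. The class name BomKeyDemo and its helper methods are made up for this example, and the Chinese literals are written as \u escapes so the result does not depend on how the source file itself is saved. It relies on the fact that Java's UTF-8 decoder does not drop a leading BOM, so the first key ends up starting with U+FEFF.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

class BomKeyDemo {
    // Builds the map the same way the question does, but from raw bytes.
    static Map<String, String> load(byte[] data) {
        Map<String, String> map = new HashMap<>();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(data), StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                String[] parts = line.split(",");
                map.put(parts[0], parts[1]);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return map;
    }

    // Two sample lines, preceded by the UTF-8 BOM bytes EF BB BF.
    static byte[] sampleWithBom() {
        String text = "China-\u4E2D\u56FD,CN\nAngola-\u5B89\u54E5\u62C9,AO\n";
        byte[] body = text.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[body.length + 3];
        out[0] = (byte) 0xEF;
        out[1] = (byte) 0xBB;
        out[2] = (byte) 0xBF;
        System.arraycopy(body, 0, out, 3, body.length);
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> map = load(sampleWithBom());
        System.out.println(map.get("China-\u4E2D\u56FD"));        // null: the stored key starts with U+FEFF
        System.out.println(map.get("\uFEFFChina-\u4E2D\u56FD"));  // CN
        System.out.println(map.get("Angola-\u5B89\u54E5\u62C9")); // AO: later lines are unaffected
    }
}
```

If a lookup for a key from the middle of the real file also returns null, this is not what is happening and the source file encoding is the more likely cause.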

The second possibility is that your source code file is not saved as UTF-8, or the compiler does not read it as UTF-8. In that case the byte sequence of "China-中国" in your resource file and the one in your source code file are not the same, so you will not get a match. When you put the value explicitly, however, both the put and the get use the source code's byte sequence, so it is found.
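One way to check for this is to print the code points of the literal and compare them with the code points of the key read from the file. The sketch below is only an illustration: KeyCodePoints and dump are hypothetical names, and decoding the UTF-8 bytes as ISO-8859-1 merely stands in for whatever encoding the compiler might wrongly assume.

```java
import java.nio.charset.StandardCharsets;

class KeyCodePoints {
    // Renders every code point as U+XXXX so invisible or mis-decoded characters show up.
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // \u escapes keep this literal independent of the source file encoding.
        String expected = "China-\u4E2D\u56FD";

        // Simulates a compiler reading UTF-8 source bytes with the wrong charset.
        String misread = new String(expected.getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);

        System.out.println(dump(expected)); // U+0043 U+0068 U+0069 U+006E U+0061 U+002D U+4E2D U+56FD
        System.out.println(dump(misread));  // twelve code points instead of eight
        System.out.println(expected.equals(misread)); // false
    }
}
```

If the dump of the literal in your program differs from the dump of the key loaded from the file, compiling with javac -encoding UTF-8 or writing the literal with \u escapes should make the two agree.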

In fact this is not a problem with HashMap but with character or file encoding.

1
Patrick Parker

Since you are having a problem with the first value, I would check to see if the file starts with a BOM (Byte Order Mark).

If so, try stripping the BOM before processing.

See: Byte order mark screws up file reading in Java
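If you would rather not pull in a library, stripping the BOM by hand can be sketched as below. BomStrip and stripBom are hypothetical names, and the sketch assumes the file was decoded as UTF-8, so that a BOM shows up as the single character U+FEFF at the start of the first line.

```java
class BomStrip {
    // Removes one leading U+FEFF, if present; any other input is returned unchanged.
    static String stripBom(String line) {
        if (line != null && !line.isEmpty() && line.charAt(0) == '\uFEFF') {
            return line.substring(1);
        }
        return line;
    }

    public static void main(String[] args) {
        String first = "\uFEFFChina-\u4E2D\u56FD,CN"; // first line as read from a file with a BOM
        String cleaned = stripBom(first);
        System.out.println(cleaned.startsWith("China")); // true
        System.out.println(cleaned.split(",")[1]);       // CN
    }
}
```

In the question's loop it would be enough to call this on the first line only, before splitting it on the comma.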

1
Tom Grylls

You can use org.apache.commons.io.input.BOMInputStream from Apache Commons IO, which skips a leading UTF-8 BOM before the bytes reach the reader:

BufferedReader br = new BufferedReader(
        new InputStreamReader(new BOMInputStream(new FileInputStream("filepath")), "UTF-8"));