We use the Vision API's OCR feature (DOCUMENT_TEXT_DETECTION). Since around 9:00 AM (JST) on March 8, 2024, some Japanese (JA) text has been returned as old-form Japanese characters (kyūjitai) instead of the standard modern forms.
For example, the character "内" (nai) is returned as the old form "內" (nai). Not every character is affected, and the same character is sometimes still returned in its standard form.
We had not seen this issue before that date. In addition, documents that came back with old-form characters after March 8, 2024 continue to return a mix of old and new forms when we run recognition on them again.
We have checked the response locale. Initially, we thought that this issue only affected the "und" locale, but we have confirmed that it also occurs with the "ja" locale.
Has there been a recent change to the internal algorithm?
If there is any solution to this problem, please let us know.
Thank you in advance for your help.
Additional Information:
- Language: Japanese (JA)
- OS: Windows
- ENDPOINT: https://vision.googleapis.com/v1/images:annotate
- SDK: REST
Reproducible Body:
{
  "requests": [
    {
      "image": {
        "source": {
          "imageUri": "CLOUD_STORAGE_IMAGE_URI"
        }
      },
      "features": [
        {
          "type": "DOCUMENT_TEXT_DETECTION"
        }
      ]
    }
  ]
}
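For anyone who wants to reproduce this from a script, here is a minimal Python sketch of the same request. It also adds `imageContext.languageHints` with "ja", which is an existing field of the images:annotate request; whether the hint changes the old-form output is untested and only an assumption. The helper names `build_request` and `annotate`, and the way the access token is passed in, are hypothetical and not part of our actual setup.

```python
import json
import urllib.request

ENDPOINT = "https://vision.googleapis.com/v1/images:annotate"


def build_request(image_uri, language_hints=("ja",)):
    """Build the images:annotate body, optionally with language hints."""
    request = {
        "image": {"source": {"imageUri": image_uri}},
        "features": [{"type": "DOCUMENT_TEXT_DETECTION"}],
    }
    if language_hints:
        # Assumption: hinting "ja" may or may not affect which glyph forms come back.
        request["imageContext"] = {"languageHints": list(language_hints)}
    return {"requests": [request]}


def annotate(image_uri, access_token):
    """POST the request with an OAuth bearer token and return the parsed JSON."""
    body = json.dumps(build_request(image_uri)).encode("utf-8")
    req = urllib.request.Request(
        ENDPOINT,
        data=body,
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json; charset=utf-8",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Calling `build_request("CLOUD_STORAGE_IMAGE_URI", language_hints=())` produces the same request as the body above, which makes it easy to compare results with and without the hint.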
Expected Output:
内閣府
Actual Output:
內閣府
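The two strings look almost identical on screen, so it may help to compare them by code point. The sketch below does that, and also shows a possible client-side stopgap that maps old forms back to standard forms. The `OLD_TO_NEW` table is hypothetical and contains only the one pair reported here; a real table would need many more entries, and the mapping would also rewrite text that legitimately uses old forms (for example in personal names).

```python
def diff_chars(expected, actual):
    """Return (index, expected_char, actual_char) for each differing position."""
    return [
        (i, e, a)
        for i, (e, a) in enumerate(zip(expected, actual))
        if e != a
    ]


# Hypothetical, deliberately tiny table: old form -> standard form.
OLD_TO_NEW = {"內": "内"}  # U+5167 -> U+5185


def normalize(text):
    """Replace every character listed in OLD_TO_NEW with its standard form."""
    return text.translate(str.maketrans(OLD_TO_NEW))


expected = "内閣府"
actual = "內閣府"

for i, e, a in diff_chars(expected, actual):
    print(f"index {i}: expected U+{ord(e):04X}, got U+{ord(a):04X}")
# prints: index 0: expected U+5185, got U+5167

print(normalize(actual) == expected)
```

Note that Unicode normalization (NFC/NFKC) does not help here, because "內" and "内" are separate unified ideographs rather than compatibility variants of each other.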
Please refer to this page.
https://cloud.google.com/vision/docs/release-notes?hl=ja
The December 5 update listed there appears to have been rolled out to the stable model on or about March 8.
In our project as well, Japanese text is being misidentified as Chinese.