Google Sites API full-text search does not work for non-Western languages

In my JavaEE application, I'm using the Atom-based Google Sites API to retrieve content from a non-public Google Site. In essence, we're using the Google Site as a lightweight CMS, and from within the application I use the API to retrieve the site contents to feed my online help system. I've had this setup for a while and it's working without a hitch.

The issue

In my application, I need to add full-text search functionality to the online help system. I knew this feature request would come along at some point, so when deciding on Google Sites to host my content, I checked whether the Sites API supports full-text search. It does. For example, the following URL will search the entire site my-site for pages containing the keyword user.

https://sites.google.com/feeds/content/my.doma.in/my-site?q=user

This works, and gives me the expected result pages. But it does so only for content written in Western languages, or, more specifically, languages in which tokens/words are separated by whitespace and punctuation. When I run a similar search on my Japanese content, searching for the keyword ユーザー:

https://sites.google.com/feeds/content/my.doma.in/my-site?q=%E3%83%A6%E3%83%BC%E3%82%B6%E3%83%BC

I only get result pages in which the search term appears as a standalone string, i.e. delimited by either whitespace or punctuation. Since Japanese is written in scriptio continua (without spaces between words), this is not sufficient. Pages that contain, for example:

ご自身のユーザー基本情報の確認

will not show up in the results. So it seems that the search index that is used behind the scenes is created based on "Western" lexical rules, and that Japanese content is not correctly tokenized. However, when I search for the same keyword from the Google Site's Search this site field, I do get the correct results. I conclude that a correctly tokenized index exists, but it seems to be impossible to use it for an API-based search.
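To make the suspected behaviour concrete, here is a minimal Java sketch. It is only a model of my hypothesis, not Google's actual indexer: the `tokenize` and `matches` helpers are hypothetical names of my own, and the feed URL is the placeholder from above. It builds the percent-encoded query URL and shows that a tokenizer splitting only on whitespace and punctuation never produces ユーザー as a token for the sample page, even though the page clearly contains it as a substring.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Arrays;
import java.util.List;

class DelimiterTokenizerDemo {

    // Feed URL from the question; domain and site name are placeholders.
    static final String FEED =
            "https://sites.google.com/feeds/content/my.doma.in/my-site";

    /** Builds the full-text query URL, percent-encoding the keyword as UTF-8. */
    static String queryUrl(String keyword) throws UnsupportedEncodingException {
        return FEED + "?q=" + URLEncoder.encode(keyword, "UTF-8");
    }

    /**
     * ASSUMPTION: models an index that splits text on Unicode whitespace
     * and punctuation only, which is what the API results suggest.
     */
    static List<String> tokenize(String text) {
        return Arrays.asList(text.split("[\\p{IsWhite_Space}\\p{P}]+"));
    }

    /** A page "matches" only if the keyword is exactly one of its tokens. */
    static boolean matches(String pageText, String keyword) {
        return tokenize(pageText).contains(keyword);
    }

    public static void main(String[] args) throws Exception {
        String keyword = "ユーザー";
        String page = "ご自身のユーザー基本情報の確認";

        System.out.println(queryUrl(keyword));
        System.out.println("token match:     " + matches(page, keyword));
        System.out.println("substring match: " + page.contains(keyword));
    }
}
```

Under this model the whole sample sentence is a single token, so a token-level lookup misses it, while the same keyword surrounded by spaces would be found. That is consistent with what I observe from the API.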

What I've tried so far

To remedy this situation, these are the avenues that I've explored so far:

  • I've tried looking for language settings in Google Sites itself. There's a general UI language setting which was already set to Japanese and has no impact on the API query results. There are no per-page or per-template language settings to force the indexer/tokenizer's hand.
  • I've tried quoting the search string with double quotes ("ユーザー").
  • I've tried including wildcards (*ユーザー*).
  • I've tried adding language parameters to the URL that are common in other Google APIs: lang, hl (interface language), rl (results language), etc.
  • I've tried creating a Google Custom Search Engine, but it seems impossible to get it to work on a non-public Google Site.

So...

I'm quickly running out of ideas here. In a worst-case scenario, I will end up having to retrieve, tokenize, and index all of the content myself and make it searchable that way. Since this would require substantial effort, I would like to know if anyone has encountered the same issue and has found an acceptable workaround or solution.
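For reference, the do-it-yourself fallback could look roughly like the sketch below. It assumes the page texts have already been retrieved through the API, and it uses character bigrams instead of real Japanese word segmentation, which is a simplification I am swapping in to keep the example self-contained. The class and method names (`BigramIndex`, `addPage`, `search`) are hypothetical. Candidate pages are found by intersecting bigram postings and then confirmed with a plain substring check to discard false positives.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class BigramIndex {
    // bigram -> ids of the pages that contain it
    private final Map<String, Set<String>> postings = new HashMap<>();
    // page id -> full page text, kept for the final verification step
    private final Map<String, String> pages = new HashMap<>();

    /** Splits text into overlapping two-code-point substrings. */
    static List<String> bigrams(String text) {
        int[] cps = text.codePoints().toArray();
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 1 < cps.length; i++) {
            grams.add(new String(cps, i, 2));
        }
        return grams;
    }

    void addPage(String id, String text) {
        pages.put(id, text);
        for (String gram : bigrams(text)) {
            postings.computeIfAbsent(gram, k -> new HashSet<>()).add(id);
        }
    }

    /** Returns the ids of pages containing the query, sorted by id. */
    List<String> search(String query) {
        Set<String> candidates = null;
        for (String gram : bigrams(query)) {
            Set<String> ids = postings.getOrDefault(gram, Collections.<String>emptySet());
            if (candidates == null) {
                candidates = new TreeSet<>(ids);
            } else {
                candidates.retainAll(ids);
            }
        }
        if (candidates == null) {
            // Query shorter than two code points: fall back to scanning all pages.
            candidates = new TreeSet<>(pages.keySet());
        }
        List<String> hits = new ArrayList<>();
        for (String id : candidates) {
            // Bigrams may occur apart from each other, so verify the full string.
            if (pages.get(id).contains(query)) {
                hits.add(id);
            }
        }
        return hits;
    }
}
```

With the sample sentence from above indexed as one page, a search for ユーザー finds it regardless of surrounding whitespace. This gives substring-style matching only. Proper linguistic tokenization would need a morphological analyzer, which is part of why the effort is substantial.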


Update 1

I have yet to find an elegant solution for this issue, so I raised a defect on the Google Apps APIs issue tracker: https://code.google.com/a/google.com/p/apps-api-issues/issues/detail?id=3780

Update 2

After some going back and forth, Google's engineers have acknowledged that the problem indeed exists as described, and have "filed the issue internally". The defect ticket has been stuck in triaged state ever since. If you, like me, are interested in seeing this issue resolved, please take a moment to star/vote for it on Google's issue tracker.

There is 1 answer

Answer by Kostyantyn Didenko

I know how it feels to wait for someone else's support team to handle an API bug while your application is at risk of missing its deadlines. The issue you describe really does sound like a bug, so for a "clean" solution you will have to wait until the Google Sites team resolves it (I have already upvoted :) ), after which you will be able to simply use the search API.

In the meantime, however, I think you should try a workaround. I can suggest a different approach that will not meet your needs 100% but may be useful. For example, configure your site to expose an aggregated data feed to a feed processor with a rich search API. This could be an RSS feed with all articles from your Google Site, consumed by Feedly, which has a search API with good multi-language support (Search the content of a stream) along with strong authentication to protect your data privacy.

As an architect, I know that this is not a proper solution to the issue, but it once helped me build a fully searchable application aggregating data from 100+ different data sources using Russian and Ukrainian locales.

Good luck with your application development, and let me know if this solution helped you! :)