StarDict support for JavaScript and a Firefox OS App

I wrote a dictionary app in the spirit of GoldenDict (www.goldendict.org, also see Google Play Store for more information) for Firefox OS: http://tuxor1337.github.io/firedict and https://marketplace.firefox.com/app/firedict

Since apps for Firefox OS are based on HTML, CSS and JavaScript (WebAPI etc.), I had to write everything from scratch. First, I wrote a basic library for synchronous and asynchronous access to StarDict dictionaries in JavaScript: https://github.com/tuxor1337/stardict.js

Although the app can be called stable by now, overall performance is still sluggish. Some dictionaries have a word list of almost 1,000,000 entries. Indexing takes a really long time (up to several minutes per dictionary), and so does lookup. At the moment, the words are stored in an IndexedDB object store. Is there an alternative? With the current solution (words are accessed and inserted using binary search) the overall experience is pretty slow. Maybe it would be faster if IndexedDB had some kind of locale-aware sort support. Actually, I'm not even storing the terms themselves in the DB, only their offsets in the *.syn/*.idx file, hoping to save some memory that way. But with this configuration I obviously can't use any of IndexedDB's own sorting functionality.
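For illustration, here is a minimal sketch of the approach described above: keep only the entry offsets and run a binary search directly over the raw .idx bytes. It assumes the usual StarDict .idx layout (a NUL-terminated UTF-8 word followed by two 32-bit numbers for data offset and size). The names `collectOffsets`, `wordAt`, `lookup` and `defaultCompare` are made up for this example and are not part of stardict.js, and the simple lowercase comparison is only a stand-in for whatever ordering the dictionary was actually sorted with.

```javascript
// Sketch only: binary search over entry offsets into a StarDict .idx buffer.
// Assumption: each entry is <UTF-8 word> NUL <4-byte offset> <4-byte size>.
const decoder = new TextDecoder("utf-8");

// Decode the NUL-terminated word that starts at byte `offset`.
function wordAt(idxBytes, offset) {
  let end = offset;
  while (end < idxBytes.length && idxBytes[end] !== 0) end++;
  return decoder.decode(idxBytes.subarray(offset, end));
}

// Scan the buffer once and remember where every entry starts.
function collectOffsets(idxBytes) {
  const offsets = [];
  let pos = 0;
  while (pos < idxBytes.length) {
    offsets.push(pos);
    while (pos < idxBytes.length && idxBytes[pos] !== 0) pos++; // skip word
    pos += 1 + 8; // NUL byte + data offset (4 bytes) + data size (4 bytes)
  }
  return Uint32Array.from(offsets);
}

// Hypothetical ordering: plain comparison of lowercased strings.
function defaultCompare(a, b) {
  const x = a.toLowerCase();
  const y = b.toLowerCase();
  return x < y ? -1 : x > y ? 1 : 0;
}

// Return the entry offset of `term` in the .idx buffer, or -1 if absent.
function lookup(idxBytes, offsets, term, compare = defaultCompare) {
  let lo = 0;
  let hi = offsets.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >>> 1;
    const c = compare(wordAt(idxBytes, offsets[mid]), term);
    if (c === 0) return offsets[mid];
    if (c < 0) lo = mid + 1;
    else hi = mid - 1;
  }
  return -1;
}
```

Stored as a `Uint32Array`, the offsets cost 4 bytes per entry, and each lookup decodes only the handful of words that the binary search actually visits.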

Doing the sorting in memory is probably not the best idea either: on some devices (e.g. ZTE Open) the kernel now kills the app because it runs out of memory (OOM). A dictionary with more than 500,000 entries will definitely exceed 100 MB in memory. (At 500,000 entries, 100 MB is only 200 bytes per entry, and if you suppose the keyword strings are UTF-8, you'll exceed that immediately...)
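The estimate above can be written out as a quick calculation. The 200 bytes per entry is the figure from this question, not a measurement, and `estimateBytes` is just a throwaway helper for this example; the second number shows what the same entry count costs if only a 32-bit offset per entry is kept in a typed array.

```javascript
// Back-of-the-envelope memory estimate (sketch, figures from the question).
function estimateBytes(entryCount, bytesPerEntry) {
  return entryCount * bytesPerEntry;
}

const ENTRIES = 500000;

// 200 bytes per entry, the per-entry budget assumed in the question.
const asStrings = estimateBytes(ENTRIES, 200);

// One 32-bit offset per entry in a Uint32Array (4 bytes each).
const asOffsets = estimateBytes(ENTRIES, Uint32Array.BYTES_PER_ELEMENT);

console.log(asStrings / 1e6 + " MB vs. " + asOffsets / 1e6 + " MB");
```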

Feel free to contribute directly to the project on GitHub. Otherwise, I would be glad to hear your advice concerning the above issues.

There is 1 answer

Feng Dihai

I am working on a pure JavaScript implementation of an MDict parser (https://github.com/fengdh/mdict-js), similar to your stardict.js project. MDict is another popular dictionary format with rich content (embedded images, audio, CSS, etc.) that is widely supported on Windows, Linux, iOS, Android and Windows Phone. I have some ideas to share and hope you can apply them to improve stardict.js in the future.

An MDict dictionary file (mdx/mdd) divides keywords and records into (optionally compressed) blocks of around 2000 entries each, and also provides a keyword block index table and a record block index table for quick lookup. Because of this compact data structure, my MDict parser can scan the dictionary file directly, with only a small preloaded index table and no need for IndexedDB.

  • Each keyword block index looks like:

    {num_entries: ..,  // number of entries in this block
     first_word: ..,   // first keyword in this block
     last_word: ..,    // last keyword in this block
     comp_size: ..,    // compressed size
     decomp_size: ..,  // size after decompression
     offset: ..,       // offset in mdx file
     index: ..
    }
    
  • In a keyword block, each entry is a pair of [keyword, offset].

  • Each record block index looks like:

    {comp_size: ..,    // compressed size
     decomp_size: ..   // size after decompression
    }
    
  • Given a word, use binary search to locate the keyword block that may contain it.

  • Slice out that keyword block and load all keys in it, then filter for the matching key and get its record offset.

  • Use binary search to locate the record block containing the word's record.

  • Slice out that record block and retrieve the record (a definition as text, or a resource as an ArrayBuffer) directly.
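The steps above could be sketched roughly like this, using plain in-memory arrays instead of a real mdx file. This is not the actual mdict-js code: `loadKeyBlock` is a hypothetical callback standing in for "slice and decompress a keyword block", plain JavaScript string comparison stands in for MDict's own key ordering, and the sketch assumes that record offsets count bytes in the decompressed record data.

```javascript
// Sketch only: two-stage block lookup on in-memory index tables.

// Binary search for the keyword block whose [first_word, last_word]
// range may contain `word`. Returns the block info or null.
function findKeyBlock(keyBlockIndex, word) {
  let lo = 0;
  let hi = keyBlockIndex.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >>> 1;
    const block = keyBlockIndex[mid];
    if (word < block.first_word) hi = mid - 1;
    else if (word > block.last_word) lo = mid + 1;
    else return block;
  }
  return null;
}

// Turn per-block decomp_size values into cumulative start offsets.
function recordBlockStarts(recordBlockIndex) {
  const starts = [];
  let total = 0;
  for (const block of recordBlockIndex) {
    starts.push(total);
    total += block.decomp_size;
  }
  return starts;
}

// Binary search for the last block whose start is <= recordOffset.
function findRecordBlock(starts, recordOffset) {
  let lo = 0;
  let hi = starts.length - 1;
  let found = -1;
  while (lo <= hi) {
    const mid = (lo + hi) >>> 1;
    if (starts[mid] <= recordOffset) {
      found = mid;
      lo = mid + 1;
    } else {
      hi = mid - 1;
    }
  }
  return found;
}

// Resolve `word` to { recordBlock, offsetInBlock }, or null if not found.
function lookupWord(word, keyBlockIndex, recordBlockIndex, loadKeyBlock) {
  const keyBlock = findKeyBlock(keyBlockIndex, word);
  if (!keyBlock) return null;
  // Only this one block (around 2000 entries) is loaded and scanned.
  const entry = loadKeyBlock(keyBlock).find(([keyword]) => keyword === word);
  if (!entry) return null;
  const starts = recordBlockStarts(recordBlockIndex);
  const blockNo = findRecordBlock(starts, entry[1]);
  if (blockNo < 0) return null;
  return { recordBlock: blockNo, offsetInBlock: entry[1] - starts[blockNo] };
}
```

In a real parser the cumulative start offsets would be computed once when the index tables are read, not on every lookup.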

Since each block contains only around 2000 entries, looking up a word among 100K~1M dictionary entries takes less than 100 ms, which is quite decent for human interaction. mdict-js parses only the file header, so it is very fast and uses little memory.

In the same way, it is possible to retrieve a list of neighboring words for a given phrase, even with wildcards.
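As a rough sketch of that idea, applied to the sorted keys of a single loaded block: find the first key not smaller than the phrase and return the following keys, or translate a wildcard pattern into a regular expression and filter with it. The helper names are made up for this example, and the wildcard syntax (`*` for any run of characters, `?` for exactly one) is an assumption, not necessarily what mdict-js supports.

```javascript
// Sketch only: neighbor and wildcard search on a sorted array of keywords.

// Index of the first word that is >= phrase (classic lower bound).
function lowerBound(sortedWords, phrase) {
  let lo = 0;
  let hi = sortedWords.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (sortedWords[mid] < phrase) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

// Up to `count` words starting at the position where `phrase` would sort.
function neighbors(sortedWords, phrase, count) {
  const start = lowerBound(sortedWords, phrase);
  return sortedWords.slice(start, start + count);
}

// Assumed wildcard syntax: "*" = any run of characters, "?" = one character.
function wildcardToRegExp(pattern) {
  const escaped = pattern.replace(/[.+^${}()|[\]\\]/g, "\\$&");
  return new RegExp("^" + escaped.replace(/\*/g, ".*").replace(/\?/g, ".") + "$");
}

// All words in the array that match the wildcard pattern.
function matchWildcard(sortedWords, pattern) {
  const re = wildcardToRegExp(pattern);
  return sortedWords.filter((word) => re.test(word));
}
```

If the pattern starts with literal characters, that prefix can be used with `lowerBound` first, so only the blocks that can contain a match need to be scanned.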

Please take a look at my online demo here: http://fengdh.github.io/mdict-js/ (You have to choose a local MDict dictionary: an mdx file plus an optional mdd file.)