Efficient algorithm(s) to compress strings containing windows file paths

338 views Asked by At

I need to devise an efficient way to encode/decode multiple strings containing windows file paths e.g. C:\Users\Public\Documents\CompanyName\ApplicationName\VersionNumber\Filename.ext on an embedded system with limited long-term storage.

Currently, we take 3 chars and convert them to a single unique integer which we then store in one of the register locations. Since there are only ~500 locations for storage for the entire unit, it is quite obvious that using 1 register for 3 chars is not a good solution.

Application workflow:

  1. User selects a file on Windows PC.
  2. Filename is encoded as described above and sent to embedded system along with other information (not related to the filename) to the persisted storage.
  3. Operator chooses when to execute the information sent in step 2. It may be far in the future.
  4. Embedded system does operations.
  5. Information (including decoded filename) is sent back to Windows PC.
  6. Windows PC updates the file with the operation results.

Notes:

  1. Processing power (CPU) is not a constraint for this mixed system (Windows PC and embedded system). Currently, we do the encoding on the Windows PC and the decoding on the embedded system but this need not be the case.
  2. The Windows file paths are usually in 1 of a couple locations but the customer can change the default file location to whatever they desire and they often do.
  3. The revised algorithm will most likely be implemented in C++.

What would be some good algorithms to consider for this encoding/decoding?

Please let me know if i have forgotten any important details. I've tried to be as thorough as possible but compression is definitely not my expertise.

2

There are 2 answers

2
Saeed Amiri On

As there are many similar prefixes in the path names, you can use trie. This saves a lot of space and it is also fast for retrieval. There are many free implementations in the internet and implementing one is also simple.

Here is a bit more explanation why this is useful. Let Consider each file path as a single string. Many of those strings have common prefix, e.g. the string C:\Users\Public\Documents\ will appear very often. It maybe the case even you have something like

C:\Users\Public\Documents\file1
C:\Users\Public\Documents\file2
.....
C:\Users\Public\Documents\file10000

Then the whole prefix C:\Users\Public\Documents\file appears in many files and we do not need to save them all. But we also don't know how is the structure (as it is dynamic not static) so we cannot hard code to save the prefix x. But trie helps to maintain the whole strings in small space. e.g. in every very huge text search engine there is a trie like structure. Because they cannot save all the row texts as it is costly and needs a lot of hardware, and more important than that, it is hard to find an specific text among billions of lines of texts. Instead they make it compact with structure like trie.

There are other structures like Huffman coding which is relatively efficient for compacting huge database of strings, but in your particular case, I think you are not looking just for compacting your strings, but you want to be able to query and quickly find relevant information. So trie will helps.

1
johnr On

Use zlib with a dictionary. A better solution depends on knowing the space constraints for both data and program on the embedded side, the costs of updating the data and program, the frequency and size of updates, the correlation of updated to previous contents, etc.