Issue with ZstdNet library: "Src size is incorrect" exception


I am currently experimenting with the ZstdNet library for compressing small text using a pre-generated dictionary. While the compression works correctly in most cases, I encounter an exception with the message "Src size is incorrect" under certain circumstances.

I have created a minimal test to reproduce the issue:

var text = "bla bla, bla bla bla";
var bytes = Encoding.UTF8.GetBytes(text);
var dic = ZstdNet.DictBuilder.TrainFromBuffer(text.Split(' ').Select(Encoding.UTF8.GetBytes));

Any insights or suggestions?

1 Answer

Answered by VonC (accepted answer)

Check first if this is related to the warning included in the comment of that method:

/*! ZDICT_trainFromBuffer():
 *  Train a dictionary from an array of samples.
 *  Redirect towards ZDICT_optimizeTrainFromBuffer_fastCover() single-threaded, with d=8, steps=4,
 *  f=20, and accel=1.
 *  Samples must be stored concatenated in a single flat buffer `samplesBuffer`,
 *  supplied with an array of sizes `samplesSizes`, providing the size of each sample, in order.
 *  The resulting dictionary will be saved into `dictBuffer`.
 * @return: size of dictionary stored into `dictBuffer` (<= `dictBufferCapacity`)
 *          or an error code, which can be tested with ZDICT_isError().
 *  Note:  Dictionary training will fail if there are not enough samples to construct a
 *         dictionary, or if most of the samples are too small (< 8 bytes being the lower limit).
 *         If dictionary training fails, you should use zstd without a dictionary, as the dictionary
 *         would've been ineffective anyways. If you believe your samples would benefit from a dictionary
 *         please open an issue with details, and we can look into it.
 *  Note: ZDICT_trainFromBuffer()'s memory usage is about 6 MB.
 *  Tips: In general, a reasonable dictionary has a size of ~ 100 KB.
 *        It's possible to select smaller or larger size, just by specifying `dictBufferCapacity`.
 *        In general, it's recommended to provide a few thousands samples, though this can vary a lot.
 *        It's recommended that total size of all samples be about ~x100 times the target size of dictionary.
 */
ZDICTLIB_API size_t ZDICT_trainFromBuffer(void* dictBuffer, size_t dictBufferCapacity,
                                    const void* samplesBuffer,
                                    const size_t* samplesSizes, unsigned nbSamples);

(This is from the C library facebook/zstd, but the same idea applies to the C# wrapper skbkontur/ZstdNet.)

From that comment, the warning is:

Dictionary training will fail if there are not enough samples to construct a dictionary, or if most of the samples are too small (< 8 bytes being the lower limit).

If dictionary training fails, you should use zstd without a dictionary, as the dictionary would've been ineffective anyways.

If you believe your samples would benefit from a dictionary, please open an issue with details, and we can look into it.

That method expects a collection of byte arrays as training samples for dictionary creation, and the way you are generating these samples is not suitable here. In your example, `text.Split(' ')` produces samples such as "bla" (3 bytes) and "bla," (4 bytes), all of which fall below the 8-byte lower limit mentioned in the comment, so the trainer has no usable input.

So make sure the input text provides enough unique samples for dictionary training; a larger and more varied dataset may be necessary.
If your use case involves small or very specific text samples, try manually assembling a larger and more diverse set of samples for the training process:

// Make sure the samples are sufficiently diverse and numerous byte arrays for training
var samples = new List<byte[]>
{
    Encoding.UTF8.GetBytes("sample text 1"),
    Encoding.UTF8.GetBytes("sample text 2"),
    // Add more varied samples
};
var dic = ZstdNet.DictBuilder.TrainFromBuffer(samples);
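To illustrate the full round trip, here is a minimal sketch of training a dictionary and then compressing and decompressing with it, assuming the skbkontur/ZstdNet API (`DictBuilder.TrainFromBuffer`, `Compressor.Wrap`, `Decompressor.Unwrap`). The generated sample strings are hypothetical placeholders; each is well above the 8-byte lower limit, and there are enough of them for training to succeed:

```csharp
using System;
using System.Linq;
using System.Text;
using ZstdNet;

class Program
{
    static void Main()
    {
        // Hypothetical training set: many reasonably sized, structurally
        // similar samples (each well above the 8-byte lower limit).
        var samples = Enumerable.Range(0, 1000)
            .Select(i => Encoding.UTF8.GetBytes($"sample record number {i} with some shared structure"))
            .ToList();

        var dict = DictBuilder.TrainFromBuffer(samples);

        // Compress one small message using the trained dictionary.
        var input = Encoding.UTF8.GetBytes("sample record number 42 with some shared structure");
        using var compressor = new Compressor(new CompressionOptions(dict));
        var compressed = compressor.Wrap(input);

        // Decompression must use the same dictionary.
        using var decompressor = new Decompressor(new DecompressionOptions(dict));
        var roundTripped = decompressor.Unwrap(compressed);

        Console.WriteLine(Encoding.UTF8.GetString(roundTripped));
    }
}
```

Note that both sides must share the same dictionary bytes: a frame compressed with a dictionary cannot be decompressed without it.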