Compression of a database using FP-trees

I have a dataset of around 200 MB containing only integers. My goal is to use FP-trees to compress this dataset and reduce the output file size. This is for academic purposes, so I can't use compression software and am restricted to FP-trees. The size of the output file is determined by the count of integers in the output plus the size of the associated mapping table. To illustrate, consider these transactions:

    T1 - 1, 2, 3, 4, 5
    T2 - 1, 2, 3, 4, 6
    T3 - 1, 2, 3, 4, 5, 7
    T4 - 1, 2, 3, 4, 5, 6, 7

With a mapping like:

    {
        10: {1, 2, 3, 4},
        11: {5, 7},
    }

The transformed dataset becomes:

    T1 - 10, 5
    T2 - 10, 6
    T3 - 10, 11
    T4 - 10, 11, 6

However, many alternative mappings are possible, including overlapping ones:

    {10: {1, 2}, 11: {3, 4, 5}, 12: {1, 2, 3}, 13: {4, 5}}
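
Under either mapping, the objective I'm optimizing just counts integers. Here is a minimal sketch of the size computation; the exact accounting of the mapping table (one integer per key plus one per member) is my own assumption:

    def output_size(transactions, mapping):
        """Total integers in the transformed data plus the mapping table."""
        data_cost = sum(len(t) for t in transactions)
        table_cost = sum(1 + len(items) for items in mapping.values())
        return data_cost + table_cost

    original = [{1, 2, 3, 4, 5}, {1, 2, 3, 4, 6},
                {1, 2, 3, 4, 5, 7}, {1, 2, 3, 4, 5, 6, 7}]
    transformed = [{10, 5}, {10, 6}, {10, 11}, {10, 11, 6}]
    mapping = {10: {1, 2, 3, 4}, 11: {5, 7}}

    print(output_size(original, {}))          # 23 integers, no table
    print(output_size(transformed, mapping))  # 9 + 8 = 17 integers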

I've implemented the FP-tree algorithm and it correctly identifies frequent itemsets along with their support counts. However, because constructing an FP-tree over the full 200 MB dataset takes a long time, I've split the dataset into smaller sections and compress each one separately.
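
For context, the mining step produces frequent itemsets with their support counts; the same output can be reproduced with an off-the-shelf FP-growth such as mlxtend's (shown here only to make the setup concrete; the min_support value is arbitrary):

    import pandas as pd
    from mlxtend.preprocessing import TransactionEncoder
    from mlxtend.frequent_patterns import fpgrowth

    transactions = [[1, 2, 3, 4, 5], [1, 2, 3, 4, 6],
                    [1, 2, 3, 4, 5, 7], [1, 2, 3, 4, 5, 6, 7]]

    # One-hot encode the transactions into a boolean DataFrame.
    te = TransactionEncoder()
    df = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

    # Frequent itemsets present in at least half of the transactions.
    print(fpgrowth(df, min_support=0.5, use_colnames=True))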

My main objective is to rewrite each transaction as a set of mapping keys (plus any leftover items) so that the overall size shrinks. The difficulty lies in choosing which segments of each transaction to replace with a key: candidate itemsets overlap, and every new mapping entry adds to the table cost, so local choices interact. I'm looking for effective strategies or heuristics for making these selections, beyond the greedy baseline sketched below. Any insights or recommendations would be appreciated.
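
For concreteness, the naive baseline I have in mind is a single greedy pass: score each mined itemset by its estimated net saving (count × (size − 1) integers removed from transactions, minus size + 1 integers added to the table) and apply the best-scoring sets first. A rough, untested sketch, with all names illustrative:

    def greedy_compress(transactions, itemsets, next_key):
        """Greedily replace high-saving itemsets with fresh keys.

        transactions: list of sets of ints (mutated in place)
        itemsets: iterable of (frozenset, support_count) pairs
        next_key: first unused integer, assumed larger than any item
        """
        def saving(pair):
            s, count = pair
            # Each replacement shrinks a transaction by len(s) - 1;
            # the mapping table grows by len(s) + 1 (key + members).
            return count * (len(s) - 1) - (len(s) + 1)

        mapping = {}
        for s, count in sorted(itemsets, key=saving, reverse=True):
            if saving((s, count)) <= 0:
                break  # no remaining itemset can pay for its table entry
            replaced = False
            for t in transactions:
                if s <= t:        # itemset still fully present
                    t -= s
                    t.add(next_key)
                    replaced = True
            if replaced:
                mapping[next_key] = set(s)
                next_key += 1
        return mapping

The obvious weaknesses, and where I'd like better ideas, are that the support counts go stale after each replacement (earlier substitutions can destroy later itemsets), and that overlapping candidates like 10 and 12 in the alternative mapping above are resolved purely by initial score order.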
