C++ / Fast random access skipping in a large file

I have large files, each containing a small number of large datasets. Each dataset has a name and its size in bytes, which makes it possible to skip it and jump to the next one.

I want to build an index of the dataset names very quickly. An example file is about 21 MB and contains 88 datasets. Reading the 88 names with a std::ifstream, using seekg() to skip from one dataset to the next, takes about 1300 ms, which I would like to reduce.

So in effect, I'm reading 88 chunks of about 30 bytes each, at known positions in a 21 MB file, and it takes 1300 ms.
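
For reference, the skipping loop looks roughly like this (a minimal sketch; the record layout here, a 2-byte length-prefixed name followed by a 4-byte payload size, is just an assumption for illustration):

#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Sketch of the current index-building pass. The record layout
// (2-byte name length, name bytes, 4-byte dataset size) is assumed
// for illustration; the real format may differ.
std::vector<std::string> readNames(const std::string& path) {
    std::vector<std::string> names;
    std::ifstream in(path, std::ios::binary);
    std::uint16_t nameLen = 0;
    while (in.read(reinterpret_cast<char*>(&nameLen), sizeof nameLen)) {
        std::string name(nameLen, '\0');
        in.read(&name[0], nameLen);
        std::uint32_t dataSize = 0;
        in.read(reinterpret_cast<char*>(&dataSize), sizeof dataSize);
        names.push_back(std::move(name));
        in.seekg(dataSize, std::ios::cur); // skip the dataset payload
    }
    return names;
}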

Is there a way to improve this, or is it an OS and filesystem limitation? I'm running the test under Windows 7 64-bit.

I know that having a complete index at the beginning of the file would be better, but the file format does not provide one and we can't change it.

There are 3 answers

norca (Best Answer)

You could scan the file once and write your own header, with each dataset's name and offset, to a separate index file. Depending on your use case you can do this once at program start, or every time the big file changes. Before accessing the big file, a lookup in the small index file gives you the offset you need.
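
A minimal sketch of that idea, assuming you already have a way to collect (name, offset) pairs from a scan of the big file; the side-car format here (one offset, a tab, then the name per line) is made up for illustration:

#include <cstdint>
#include <fstream>
#include <map>
#include <string>

// Write the side-car index: one "offset<TAB>name" line per dataset.
// Rebuild it whenever the big file's timestamp changes.
void writeIndex(const std::string& indexPath,
                const std::map<std::string, std::uint64_t>& offsets) {
    std::ofstream out(indexPath);
    for (const auto& entry : offsets)
        out << entry.second << '\t' << entry.first << '\n';
}

// Load the index back; this is fast because the file is only a few KB.
std::map<std::string, std::uint64_t> readIndex(const std::string& indexPath) {
    std::map<std::string, std::uint64_t> offsets;
    std::ifstream in(indexPath);
    std::uint64_t offset;
    std::string name;
    while (in >> offset && in.get() == '\t' && std::getline(in, name))
        offsets[name] = offset;
    return offsets;
}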

Francis Cugler

You may be able to set up a buffered, queued reading process with multithreading. You could create a custom struct that stores a batch of data.

You said:

Each dataset contains a name and the dataset size in bytes, allowing to skip it and go to the next dataset.

Since opening and closing the file over and over is slow, you could read the file in one go into a buffer object and then parse it, or store it in batches. How easy the file is to parse also depends on whether you read it in text or binary mode. I'll demonstrate the latter approach: populating multiple batches while reading a buffer-sized amount of data from the file at a time.

Pseudo Code

struct Batch {
    std::string name;              // Name of the dataset
    unsigned size;                 // Size of the dataset in bytes
    unsigned indexOffset;          // Offset of the next read location
    bool empty = true;             // Whether this batch is empty or full
    std::vector<DataType> dataset; // Container of data (DataType is a placeholder)
};

std::vector<Batch> finishedBatches;

// This doesn't depend on the size of the datasets; it is just a buffer size
// for how much memory to digest per read from the file.
const unsigned bufferSize = 1 << 20; // set to your preference, e.g. 1 MB - 4 MB

void loadDataFromFile( const std::string& filename, unsigned bufferSize, std::vector<Batch>& batches ) {

    // Set ifstream's buffer size 

    // OpenFile For Reading and read in and upto buffer size

    // Spawn a different thread to populate the batches; while that batch is
    // loading data, read in that much buffered data again. You will need a
    // couple of local stack batches to work with, so that when one batch is
    // complete and you have reached the next index location in the file, you
    // can start filling another batch.

    // When a single batch is complete, push it into the vector of finished
    // batches, reset its flag, and clear its dataset so it can be reused.

    // Continue until you reach the end of the file.

}

This would be a two-threaded system: the main thread opens, reads, and seeks in the file, while a worker thread fills the batches, pushes finished batches into the container, and swaps in the next empty batch.
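
A stripped-down, single-threaded sketch of the read-it-all-then-parse core of this idea (the batching and threading above are omitted for brevity, and the record layout, a 2-byte name length, the name, then a 4-byte dataset size, is an assumption):

#include <cstdint>
#include <cstring>
#include <fstream>
#include <string>
#include <vector>

// Read the whole file into memory in one go, then walk the buffer.
std::vector<std::string> readNamesBuffered(const std::string& path) {
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    const std::streamsize fileSize = in.tellg();
    in.seekg(0);
    std::vector<char> buf(static_cast<std::size_t>(fileSize));
    in.read(buf.data(), fileSize);

    std::vector<std::string> names;
    std::size_t pos = 0;
    while (pos + 6 <= buf.size()) {            // 2-byte name length + 4-byte size
        std::uint16_t nameLen;
        std::memcpy(&nameLen, &buf[pos], 2);
        pos += 2;
        if (nameLen + 4u > buf.size() - pos) break;
        names.emplace_back(&buf[pos], nameLen);
        pos += nameLen;
        std::uint32_t dataSize;
        std::memcpy(&dataSize, &buf[pos], 4);
        pos += 4;
        if (dataSize > buf.size() - pos) break;
        pos += dataSize;                       // skip the payload in memory
    }
    return names;
}

For a 21 MB file this trades the 88 small seeks for one large sequential read, which is usually the faster pattern, especially on spinning disks.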

mascoj

You could use a memory-mapped file interface (I recommend Boost's implementation).

This maps the file into your process's virtual address space, so lookups avoid explicit seek/read calls: pages are faulted in on first access and then served from the OS page cache instead of going back to disk.
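
A minimal sketch using boost::iostreams::mapped_file_source; the record layout (2-byte name length, name, 4-byte dataset size) is the same assumption as elsewhere on this page:

#include <boost/iostreams/device/mapped_file.hpp>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Map the file and walk it through pointers. The OS pages data in on
// demand, so only the pages actually touched are read from disk.
std::vector<std::string> readNamesMapped(const std::string& path) {
    boost::iostreams::mapped_file_source file(path);
    const char* p = file.data();
    const char* end = p + file.size();

    std::vector<std::string> names;
    while (end - p >= 6) {                     // 2-byte name length + 4-byte size
        std::uint16_t nameLen;
        std::memcpy(&nameLen, p, 2);
        p += 2;
        if (end - p < nameLen + 4) break;
        names.emplace_back(p, nameLen);
        p += nameLen;
        std::uint32_t dataSize;
        std::memcpy(&dataSize, p, 4);
        p += 4;
        if (static_cast<std::uint64_t>(end - p) < dataSize) break;
        p += dataSize;                         // skip the payload; it is never touched
    }
    return names;
}

On Windows, Boost implements this on top of CreateFileMapping/MapViewOfFile, so after the first scan the kernel can keep the whole 21 MB in the page cache and repeat scans become memory-speed.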