Slicing a file with the Rabin-Karp algorithm


I've written a C program that's supposed to slice a file into chunks with the Rabin-Karp algorithm. It is an adaptation of a C# program that you can find here.

It seems to work, but one problem remains: the average chunk size is not what I expect.

Usage is as follows:

rabin Prime WindowSize BoundaryMarker File

where:

rabin is the name of the executable.

Prime is a large prime number, for instance 100007.

WindowSize is the size of the rolling window, for instance 48.

BoundaryMarker is the number of low-order bits of a fingerprint that must be 0 to mark a chunk boundary.

File is the file to process.

If I set BoundaryMarker to 13, I expect the average chunk size to be 8K, since a fingerprint should have its low 13 bits equal to zero with probability 2^-13, giving chunks of 2^13 = 8192 bytes on average. In fact, none of the chunks come out anywhere near 8K.

I'm having a hard time figuring out what's going wrong with my program. Can you help me?

Thanks.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>

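/* Circular buffer implementing the rolling window. */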
unsigned char* buffer;
int windowSize;
int writePointer = 0;
int readPointer = 0;
int dataSize = 0;

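/* Append a byte to the circular window buffer. */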
unsigned char PushChar(unsigned char c)
{ if (++writePointer >= windowSize) writePointer=0;
  buffer[writePointer]=c;
  dataSize++;
  return(c);
}

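/* Remove and return the oldest byte from the circular window buffer. */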
unsigned char PopChar(void)
{ if (++readPointer >= windowSize) readPointer=0;
  dataSize--;
  return(buffer[readPointer]);
}


int main(int argc, char *argv[])
{ int fd;
  unsigned char c;

  unsigned long Q;      /* prime modulus */
  unsigned long D=256;  /* base: number of possible byte values */
  unsigned long pow=1;  /* will hold D^(windowSize-1) mod Q */
  int i,k,boundary,boundaryMarker,index;
  unsigned char s;

  if (argc != 5) 
  { printf("\nUsage : rabin Prime WindowSize BoundaryMarker File\n\nwhere :\n");
    printf("Prime is a high prime number. For instance 100007\n\n");
    printf("WindowSize is the size of rolling window. For instance 48\n\n");
    printf("BoundaryMarker is the number of bits set to 0 in a fingerprint\n\n");
    printf("File is the file to process\n\n");
    return(1);
  }

  sscanf(argv[1],"%lu",&Q);
  sscanf(argv[2],"%d",&windowSize);
  sscanf(argv[3],"%d",&boundaryMarker);

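  /* boundary = 2^boundaryMarker - 1: a mask selecting the low boundaryMarker bits of a fingerprint. */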
  for(i=1,boundary=1;i<=boundaryMarker;i++) boundary=boundary*2;
  boundary --;

  //printf("Q = %lu windowSize = %d boundary = %d\n",Q,windowSize,boundary);

  if ((buffer=(unsigned char*) malloc (sizeof(unsigned char)*windowSize))==NULL) return(1);

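  /* pow = D^(windowSize-1) mod Q: the weight of the oldest byte in the window, needed to remove it from the rolling hash. */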
  for (k=1; k < windowSize; k++) pow=(pow*D)%Q;
  //printf("pow value %lu\n",pow);

  unsigned long sig=0;
  int lastIndex=0;

  if ((fd=open(argv[4],O_RDONLY))<0) exit(1);

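  /* Prime the window with the first windowSize bytes (the file is assumed to be at least that long) and compute the initial signature. */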
  for (i=0; i <windowSize; i++)
  { read(fd,&c,1);
    PushChar(c);
    sig=(sig*D + (unsigned long)c) %Q;
  }

  //printf("sig value = %lu\n",sig);

  index=0; lastIndex=0;

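  /* Slide the window one byte at a time: drop the oldest byte, push the new one, and update the Rabin fingerprint. */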
  while (read(fd,&c,1))
  { 
    s=PopChar();
    //printf("sig = ( %lu + %lu - %lu * %lu %% %lu ) %lu",sig,Q,pow,(unsigned long) s,Q,Q);
    sig = (sig + Q - pow*(unsigned long)s%Q)%Q;
    //printf(" = %lu\n",sig);
    s=PushChar(c);
    //printf("sig2 = ( %lu * %lu + %lu ) %% %lu",sig,D,(unsigned long) s,Q);
    sig = (sig*D + (unsigned long)s)%Q;
    //printf(" = %lu\n",sig);
    index++;
    if ((sig & boundary) == 0)
    { if (index - lastIndex >= 2048)
      { printf("sig & boundary = %lu & %d Index=%d chunk size=%d\n",sig,boundary,index,index-lastIndex);
        lastIndex=index;
      }
    }
    else if (index - lastIndex >= 65536)
    { printf("sig & boundary = %lu & %d Index=%d chunk size=%d\n",sig,boundary,index,index-lastIndex);
      lastIndex=index;
    }
  }
  printf("Index=%d chunk size=%d\n",index,index-lastIndex);

  close(fd);
  return 0;
}

There are 2 answers

Answer by lxgeek:

You can try adjusting the BoundaryMarker value to get different chunk lengths. I have used Rabin fingerprinting this way: github link. I think the chunk length actually depends on the contents.

Answer by Ilmari Karonen:

Running your code, with BoundaryMarker = 13, on a megabyte of random data gave me 104 chunks, for an average chunk size of 10082 bytes. That's not too far off from the expected 8192.

However, smaller BoundaryMarker values show a more noticeable bias; setting it to 10, for example, gave me an average chunk size of 3049 bytes, rather far from the expected 1024. And setting BoundaryMarker = 5 yielded an average chunk size of 2077 bytes, nowhere even near the expected size of 32 bytes.

Looking more closely at your code, the obvious cause of this bias is in the following code (reformatted for clarity):

if ((sig & boundary) == 0)
{ if (index - lastIndex >= 2048)
  { printf("sig & boundary = %lu & %d Index=%d chunk size=%d\n",sig,boundary,index,index-lastIndex);
    lastIndex=index;
  }
}
else if (index - lastIndex >= 65536)
{ printf("sig & boundary = %lu & %d Index=%d chunk size=%d\n",sig,boundary,index,index-lastIndex);
  lastIndex=index;
}

The if (index - lastIndex >= 2048) suppresses chunk boundaries that are less than 2048 bytes from the previous boundary, effectively merging chunks shorter than 2048 bytes with the following chunk. The else if (index - lastIndex >= 65536) check, meanwhile, forces an artificial chunk boundary to prevent any chunks from growing longer than 65536 bytes.

If this behavior (which forces all chunks to be at least 2048 and at most 65536 bytes long) isn't what you want, you can simply remove those checks, simplifying the code to just:

if ((sig & boundary) == 0)
{ printf("sig & boundary = %lu & %d Index=%d chunk size=%d\n",sig,boundary,index,index-lastIndex);
  lastIndex=index;
}

Indeed, making this change yields an average chunk size very close to 2^n bytes for BoundaryMarker = n, at least for n ≤ 12 or so.

For n = 13, there does seem to be noticeable downward bias, which I suspect to be caused by the fact that the prime 100007 is only about 12.2 times the boundary modulus 2^13. As the signature values are more or less randomly distributed modulo the prime, that extra 0.2 causes them to be slightly biased towards smaller values (including zero) when further reduced modulo 2^13.
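As a quick sanity check, here is a minimal standalone sketch (assuming, as above, that signatures are roughly uniform over [0, Q)) that counts how many residues pass the boundary test; with Q = 100007 and n = 13, the boundary probability comes out to 13/100007 rather than the ideal 1/8192, for an expected chunk size of about 7693 bytes instead of 8192:

#include <stdio.h>

int main(void)
{ unsigned long Q = 100007;             /* the prime modulus */
  int n = 13;                           /* BoundaryMarker */
  unsigned long mask = (1UL << n) - 1;  /* mask of the low n bits */
  unsigned long hits = 0, v;

  /* Assuming signatures are roughly uniform over [0, Q), count the
     values whose low n bits are all zero. */
  for (v = 0; v < Q; v++)
      if ((v & mask) == 0) hits++;

  /* The boundary probability is hits/Q, so the expected chunk size
     is Q/hits rather than the ideal 2^n. */
  printf("hits = %lu, expected chunk size = %.1f (ideal %lu)\n",
         hits, (double)Q/(double)hits, 1UL << n);
  return 0;
}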

This bias can be easily fixed by using a larger prime, such as 2^31 − 1 = 2147483647. Indeed, switching to this prime makes the average chunk size much closer to 8192.
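One caveat with a prime that large: the intermediate products pow*s and sig*D can approach 2^39, so this assumes unsigned long is 64 bits wide (as on typical 64-bit Linux systems). On platforms where unsigned long is only 32 bits, the signature arithmetic would need a 64-bit type such as unsigned long long to avoid overflow.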