Why is "size" wrong by a few digits for large data sets?

89 views Asked by At

EDIT: I have fixed the mistakes and the program returns correct output for all input. The main mistakes were:

  • fillGap-method was not implemented correctly. It moved the next element after the gap to fill the gap, even if that element didn't hash to that index. fillGap should only fill the empty space with an element that hashes to that index.

  • "size" was decremented incorrectly in the remove-method. My program reduced size even if the element didn't exist in the array. I could've used my contains-method, but didn't want to because I think it is inefficient. There are of course other ways to make sure the element exists in the array before reducing size, without using the contains-method.

Below is my correct code (it could probably be more efficient, but it works):


import java.io.File;
import java.io.FileNotFoundException;
import java.util.*;

class Hashing{

    protected int size = 0;
    protected Scanner fileScanner = null;
    protected int[] array;
    protected boolean isRehashing = false;

    public Hashing(){
        array = new int[5];
        emptySpots();
    }

    public int hashValue(int x){
        return (x % array.length + array.length) % array.length;
    }

    public double loadFactor(){
        return (double) size / array.length;
    }

    public void emptySpots(){
        for (int i = 0; i < array.length; i++){
            array[i] = Integer.MIN_VALUE;
        }
    }

    public boolean contains(int x){
        int hashValue = hashValue(x);
        if (array[hashValue] == Integer.MIN_VALUE){ // Empty spot
            return false;
        }
        else{ // Spot is not empty -- we explore
            if (array[hashValue] == x){ // x is at the index it hashes to
                return true;
            }
            // x must be at at an index after hashValue, or it does not exist and we return false
            int index = hashValue;
            while (array[index] != Integer.MIN_VALUE){
                index = (index + 1) % array.length;
                if (index == hashValue){
                    return false;
                }
                if (array[index] == x){
                    return true;
                }
            }
        }
        return false;
    }

    public void rehash(){
        System.out.println("rehash");
        isRehashing = true;
        int[] oldArray = array; 
        array = new int[oldArray.length * 2]; 
        emptySpots();
        for (int i = 0; i < oldArray.length; i++){
            if (oldArray[i] != Integer.MIN_VALUE){
                insert(oldArray[i]);
            }
        }
        isRehashing = false;
    }

    public void insert (int x){
        System.out.println("insert");
        if (contains(x)){ // We do not allow duplicates
            return;
        }
        int hashValue = hashValue(x);
        if (array[hashValue] == Integer.MIN_VALUE){ // The spot that x hashes to is empty
            array[hashValue] = x;
        }
        else{
            int index = hashValue;
            while (array[index] != Integer.MIN_VALUE){
                index = (index + 1) % array.length;
                if (array[index] == Integer.MIN_VALUE){
                    array[index] = x;
                    break;
                }
            }
        }
        if (!(isRehashing)){
            size++;
            if (loadFactor() > 0.7){ // If the array is more than 70% full, we rehash
                rehash();
            }
        }
    }

    public void remove(int x){
        int hashValue = hashValue(x);
        int index = hashValue;
        if (array[hashValue] == Integer.MIN_VALUE){ // There are no elements with this hashValue in the array
            return;
        }
        if (array[hashValue] == x){ // x is at the index it hashes to
            array[hashValue] = Integer.MIN_VALUE;
            size--; // Decrement size as element is removed
            fillGap(hashValue, hashValue);
            return;
        }
        // We have to iterate through the array and see if we can find x
        index = (index + 1) % array.length;
        while (index != hashValue){
            if (array[index] == x){
                array[index] = Integer.MIN_VALUE;
                size--; // Decrement size as element is removed
                fillGap(index, index);
                return;
            }
            index = (index + 1) % array.length;
        }
    }

    public void fillGap(int index, int hashValue){ // Recursive
        System.out.println("fillGap");
        int nextIndex = (index + 1) % array.length;
        if (array[nextIndex] == Integer.MIN_VALUE){ // The next spot is empty, so we return
            return;
        }
        if (hashValue(array[nextIndex]) == hashValue){ // Found a replacement at the next spot 
            array[hashValue] = array[nextIndex];
            array[nextIndex] = Integer.MIN_VALUE;
            fillGap(nextIndex, nextIndex);
        }
        else{
            fillGap(nextIndex, hashValue);
        }
    }

    public int size(){
        return size;
    }

    public void readFile(String fileName){
        try {
            fileScanner = new Scanner(new File(fileName));
        } 
        catch (FileNotFoundException e) {
            System.out.println("Could not read from file " + fileName);
            System.exit(-1);
        }
        while (fileScanner.hasNextLine()){
            String[] parts = fileScanner.nextLine().split(" ");
            if (parts[0].contains("insert")){
                insert(Integer.parseInt(parts[1]));
            }
            if (parts[0].contains("remove")){
                remove(Integer.parseInt(parts[1]));
            }
            if (parts[0].contains("contains")){
                System.out.println(contains(Integer.parseInt(parts[1])));
            }
            if (parts[0].contains("size")){
                System.out.println(size());
            }
        }
        fileScanner.close();
    }

    public static void main(String[] args){
        Hashing hashing  = new Hashing();
        if (args.length > 0){
            hashing.readFile(args[0]);
        }
    }
}


EDIT: I made several changes to the code below while trying to solve the issue, so the code below is not the same as the one I inserted here the first time. I still wanted to include a wrong version of my code, if it helps anyone.

I am writing a Linear Probing-program and testing it with input-files. I compare my output with output-files I've been provided, and my "size" is off by a few numbers.

This is my code:

import java.io.File;
import java.io.FileNotFoundException;
import java.util.*;

class Hashing{

    protected int size = 0;
    protected Scanner fileScanner = null;
    protected int[] array;
    protected boolean isRehashing = false;

    public Hashing(){
        array = new int[5];
        emptySpots();
    }

    public int hashValue(int x){
        return (x % array.length + array.length) % array.length;
    }

    public double loadFactor(){
        return (double) size / array.length;
    }

    public void emptySpots(){
        for (int i = 0; i < array.length; i++){
            array[i] = Integer.MIN_VALUE;
        }
    }

    public boolean contains(int x){
        int hashValue = hashValue(x);
        if (array[hashValue] == Integer.MIN_VALUE){ // Empty spot
            return false;
        }
        else{ // Spot is not empty -- we explore
            if (array[hashValue] == x){ // x is at the index it hashes to
                return true;
            }
            // x must be at at an index after hashValue, or it does not exist and we return false
            int index = hashValue;
            while (array[index] != Integer.MIN_VALUE){
                index = (index + 1) % array.length;
                if (index == hashValue){
                    return false;
                }
                if (array[index] == x){
                    return true;
                }
            }
        }
        return false;
    }

    public void rehash(){
        System.out.println("rehash");
        isRehashing = true;
        int[] oldArray = array; 
        array = new int[oldArray.length * 2]; 
        emptySpots();
        for (int i = 0; i < oldArray.length; i++){
            if (oldArray[i] != Integer.MIN_VALUE){
                insert(oldArray[i]);
            }
        }
        isRehashing = false;
    }

    public void insert (int x){
        System.out.println("insert");
        if (contains(x)){ // We do not allow duplicates
            return;
        }
        int hashValue = hashValue(x);
        if (array[hashValue] == Integer.MIN_VALUE){ // The spot that x hashes to is empty
            array[hashValue] = x;
        }
        else{
            int index = hashValue;
            while (array[index] != Integer.MIN_VALUE){
                index = (index + 1) % array.length;
                if (array[index] == Integer.MIN_VALUE){
                    array[index] = x;
                    break;
                }
            }
        }
        if (!(isRehashing)){
            size++; // Since rehashing calls upon insert but doesn't affect the size, we do not increase size when rehashing
            if (loadFactor() > 0.7){ // If the array is more than 70% full, we rehash
                rehash();
            }
        }
    }

    public void remove(int x){
        System.out.println("remove");
        int hashValue = hashValue(x);
        int index = hashValue;
        if (array[hashValue] == Integer.MIN_VALUE){ // There are no elements with this hashValue in the array
            return;
        }
        if (array[hashValue] == x){ // x is at the index it hashes to
            array[hashValue] = Integer.MIN_VALUE;
        }
        else{ // We have to iterate through the array and see if we can find x
            while (array[index] != Integer.MIN_VALUE){
                index = (index + 1) % array.length;
                if (index == hashValue){
                    return;
                }
                if (array[index] == x){
                    array[index] = Integer.MIN_VALUE;
                }
            }
        }
        size--;
        fillGap(index);
    }

    public void fillGap(int index){ // Recursive
        int nextIndex = (index + 1) % array.length;
        if (array[nextIndex] == Integer.MIN_VALUE){ // The next spot is empty, so we return
            return;
        }
        array[index] = array[nextIndex];
        array[nextIndex] = Integer.MIN_VALUE;
        fillGap(nextIndex);
    }
    

    public int size(){
        return size;
    }

    public void readFile(String fileName){
        try {
            fileScanner = new Scanner(new File(fileName));
        } 
        catch (FileNotFoundException e) {
            System.out.println("Could not read from file " + fileName);
            System.exit(-1);
        }
        while (fileScanner.hasNextLine()){
            String[] parts = fileScanner.nextLine().split(" ");
            if (parts[0].contains("insert")){
                insert(Integer.parseInt(parts[1]));
            }
            if (parts[0].contains("remove")){
                remove(Integer.parseInt(parts[1]));
            }
            if (parts[0].contains("contains")){
                System.out.println(contains(Integer.parseInt(parts[1])));
            }
            if (parts[0].contains("size")){
                System.out.println(size());
            }
        }
        fileScanner.close();
    }

    public static void main(String[] args){
        Hashing hashing  = new Hashing();
        if (args.length > 0){
            hashing.readFile(args[0]);
        }
        /*System.out.println("List:");
        int count = 0;
        for (int i = 0; i < hashing.array.length; i++){
            System.out.println(hashing.array[i]);
            if (hashing.array[i] != Integer.MIN_VALUE){
                count++;
            }
            
        }
        System.out.println("Count: " + count);*/
    }


}

1

There are 1 answers

2
Maks Verver On

I think remove() is implemented incorrectly: you clear the slot if the value is found, but ignore the fact that subsequent slots may contain values that hashed to the same or preceding slots. This makes these values unreachable in subsequent calls

Here's an example to help you understand the problem. Imagine you have a hash table with 3 slots, all initially empty ([-1, -1, -1]) and you insert two values, a and b, which both happen to hash to slot 0. Then after inserting, your hash table looks like this: [a, b, -1]. Now if you erase e.g. element a, which is in slot 0, the hash table becomes [-1, b, -1]. Since slot 0 is empty, value b is no longer findable. contains(b) will return false (incorrectly) and insert(b) will re-insert it into slot 0 (incorrectly), creating a hash table [b, b, -1] that contains the same value twice, which explains why the size is wrong.

The standard solutions to this problem are to either rewrite the empty slot with a tombstone value (e.g. -2) which doesn't terminate the search in contains() or insert(), or to remove the gap in the linear probing sequence by moving subsequent elements backward (e.g. when deleting a, [a, b, -1] becomes [b, -1, -1]).

For the first solution, you must ensure that tombstones are still counted towards the load factor. The second solution is possible but tricky to get right, because a linear probing sequence might contain elements that hash to different slots, and they don't even need to occur in order! So if this is a toy problem, you probably want to stick with tombstones since they're much easier to implement correctly.