Calculating Levenshtein distance between two strings

I'm executing the following Postgres query.

SELECT *  FROM description WHERE levenshtein(desci, 'Description text?') <= 6  LIMIT 10;

I'm using the following code to execute the above query.

public static boolean authQuestion(String question) throws SQLException {
    boolean isDescAvailable = false;
    Connection connection = null;
    try {
        connection = DbRes.getConnection();
        String query = "SELECT * FROM description WHERE levenshtein(desci, ?) <= 6";
        // Prepare the statement on the connection opened above
        // (the original used dbCon, which is not defined in this method).
        PreparedStatement checkStmt = connection.prepareStatement(query);
        checkStmt.setString(1, question);
        ResultSet rs = checkStmt.executeQuery();
        // One matching row is enough, so there is no need to loop.
        if (rs.next()) {
            isDescAvailable = true;
        }
    } catch (URISyntaxException e1) {
        e1.printStackTrace();
    } catch (SQLException sqle) {
        sqle.printStackTrace();
    } catch (Exception e) {
        // Report the failure; closing the connection is left to finally.
        e.printStackTrace();
    } finally {
        // finally always runs, so the connection is closed exactly once here.
        if (connection != null)
            connection.close();
    }
    return isDescAvailable;
}

I want to compute the edit distance between the input text and the values stored in the database, and fetch all rows that are 60 percent similar to the input. The above query doesn't work as expected, because it compares the distance against a fixed count of 6 rather than a percentage. How do I get the rows that have 60 percent similarity?

2

There are 2 answers

4
Bohemian On BEST ANSWER

Use this:

SELECT *
FROM description
WHERE 100 * (length(desci) - levenshtein(desci, ?))
         / length(desci) > 60

The Levenshtein distance is the minimum number of single-character edits (insertions, deletions or substitutions) needed for one string to become the other. Put simply, it's the number of letters that are different.

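The number of letters that are the same is then roughly length - levenshtein. Note that this can go negative when the input is much longer than the stored value, since the distance can then exceed length(desci).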

To express this as a fraction, divide by the length, i.e. (length - levenshtein) / length.

To express a fraction as a percentage, multiply by 100.

I perform the multiplication by 100 first to avoid integer division truncation problems.
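
To make the arithmetic above concrete, here is a minimal Java sketch of the same percentage calculation. It is only an illustration: the class and method names (SimilaritySketch, levenshtein, similarityPercent) are hypothetical, the distance is a plain unit-cost implementation written for this example rather than the Postgres function, and the empty-string guard is an added assumption (the SQL expression would divide by zero for an empty desci).

```java
final class SimilaritySketch {

    // Unit-cost Levenshtein distance using two rolling rows of the DP table.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int sub = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int del = prev[j] + 1;
                int ins = curr[j - 1] + 1;
                curr[j] = Math.min(sub, Math.min(del, ins));
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Mirrors 100 * (length - levenshtein) / length: multiply first, then integer-divide.
    static int similarityPercent(String stored, String input) {
        if (stored.isEmpty()) {
            // Assumption for this sketch only: avoid dividing by zero.
            return input.isEmpty() ? 100 : 0;
        }
        int same = stored.length() - levenshtein(stored, input);
        return 100 * same / stored.length();
    }
}
```

For "kitten" and "sitting" the distance is 3, so the similarity relative to "kitten" is 100 * 3 / 6 = 50. Dividing first, as in (6 - 3) / 6 * 100, would give 0 in integer arithmetic, which is the truncation problem the multiplication order avoids.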

0
Adam On

The most general version of the levenshtein function is:

levenshtein(text source, text target, int ins_cost, int del_cost, int sub_cost) returns int

Both source and target can be any non-null string, with a maximum of 255 characters. The cost parameters specify how much to charge for a character insertion, deletion, or substitution, respectively. You can omit the cost parameters by using the two-argument form levenshtein(source, target); in that case they all default to 1.

So, with the default cost parameters, the result you get is the total number of characters you need to change (by insertion, deletion, or substitution) in the source to get the target.
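
As a rough illustration of what the three cost parameters do, the following Java sketch computes a weighted edit distance with the same parameter order as the signature above. It is an assumption-laden teaching example, not the Postgres implementation: the class name WeightedLevenshteinSketch and its distance method are hypothetical, and no length limit is enforced.

```java
final class WeightedLevenshteinSketch {

    // Cheapest sequence of insertions, deletions and substitutions
    // that turns source into target, with a separate cost per operation.
    static int distance(String source, String target,
                        int insCost, int delCost, int subCost) {
        int n = source.length();
        int m = target.length();
        int[][] d = new int[n + 1][m + 1];

        // Reaching the empty target means deleting every source character.
        for (int i = 1; i <= n; i++) {
            d[i][0] = i * delCost;
        }
        // Starting from the empty source means inserting every target character.
        for (int j = 1; j <= m; j++) {
            d[0][j] = j * insCost;
        }
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                boolean same = source.charAt(i - 1) == target.charAt(j - 1);
                int sub = d[i - 1][j - 1] + (same ? 0 : subCost);
                int del = d[i - 1][j] + delCost;
                int ins = d[i][j - 1] + insCost;
                d[i][j] = Math.min(sub, Math.min(del, ins));
            }
        }
        return d[n][m];
    }
}
```

With all costs set to 1, distance("GUMBO", "GAMBOL", 1, 1, 1) is 2 (one substitution and one insertion). Raising the insertion cost to 2 makes the same pair cost 3, which shows how the parameters let you penalize one kind of edit more than another.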

If you need to calculate the percentage difference, you should divide the levenshtein function result by the length of your source text (or target length - according to your definition of the percentage difference).
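
The choice of denominator changes the result, which a small Java sketch can show. The class and method names here are hypothetical, and the distance is passed in as a plain number (for example, the value returned by the levenshtein function) so the example stays independent of any database.

```java
final class PercentDifferenceSketch {

    // Percentage difference = distance relative to a chosen reference length
    // (either the source length or the target length, depending on your definition).
    static double percentDifference(int distance, int referenceLength) {
        if (referenceLength <= 0) {
            throw new IllegalArgumentException("referenceLength must be positive");
        }
        // 100.0 forces floating-point arithmetic, so nothing is truncated.
        return 100.0 * distance / referenceLength;
    }
}
```

For a distance of 3 between "kitten" (6 characters) and "sitting" (7 characters), dividing by the source length gives 50 percent, while dividing by the target length gives about 42.9 percent, so it is worth fixing one definition and using it consistently.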