Comparing two Markdown files in Python


I have two Markdown files, say old_file.md and new_file.md. I want a similarity percentage for new_file.md relative to old_file.md, and also a list of all the content I missed when creating new_file.md.

Old_file.md:

10

For some cases a higher input temperature can be allowed when requested and checked by the manufacturer.

The outlet water temperature rise is from 5–10 K.

The minimum pressure and amount of cooling water for the basic construction of a water-cooled motor is shown in the following table. Please check the requirements for pressure and the amount of cooling water in the case of special constructions.

If the amount of water varies, its temperature rise will be inversely proportional to the flow rate.

### 9.2. Filling or draining cooling water

When filling, open the air vent plug on top of the motor (see figure 2). Let the cooling water flow into the motor until it comes out of the air gap. Close the air gap with a plug and seal the joint with sealing tape or strip. Filling must be done carefully so that no air is left in the motor’s cooling channels. Check for possible leaks after the piping and joints have been connected.

Emptying can be done with pressurized air. After emptying, the plugs must be re-fitted, and the seals of the joints must be checked.

## 10. Consendation drain holes

It is of special importance with water cooled motors that the condensation drain holes are located in the correct position (fig. 1). Check that the condensation drain holes face downwards, especially when the mounting arrangement differs from standard.

## 11. Water leakage detector

This motor is equipped with float type leakage detector in non-drive end (see figure 2 and 3). The leakage detector has a magnetic float switch. The magnetic float switch is positioned on a non-magnetic guide tube. When a specified water level is reached, the magnetic field produced by the magnet in the float actuates a reed switch (sealed contact) inside the guide tube.

This closes the electric circuit that transmits the alarm signal to the control board.

3GZF500725-144 EN 03-2023 | ABB IEC LV MOTORS

new_file.md:

<p>10</p><p>3GZF500725-144 EN 03-2023 | ABB IEC LV MOTORS</p><p>For some cases a higher input temperature can be allowed when requested and checked by
the manufacturer.</p><p>The outlet water temperature rise is from 5–10 K.</p><p>The minimum pressure and amount of cooling water for the basic construction of a water-
cooled motor is shown in the following table. Please check the requirements for pressure and
the amount of cooling water in the case of special constructions.</p><p>If the amount of water varies, its temperature rise will be inversely proportional to the flow
rate.</p><p>9.2.
Filling or draining cooling water</p><p>When filling, open the air vent plug on top of the motor (see figure 2). Let the cooling water
flow into the motor until it comes out of the air gap. Close the air gap with a plug and seal
the joint with sealing tape or strip. Filling must be done carefully so that no air is left in the
motor’s cooling channels. Check for possible leaks after the piping and joints have been con-
nected.</p><p>Emptying can be done with pressurized air. After emptying, the plugs must be re-fitted, and
the seals of the joints must be checked.</p><p>10.
Consendation drain holes</p><p>It is of special importance with water cooled motors that the condensation drain holes are lo-
cated in the correct position (fig. 1). Check that the condensation drain holes face down-
wards, especially when the mounting arrangement differs from standard.</p><p>11.
Water leakage detector</p><p>This motor is equipped with float type leakage detector in non-drive end (see figure 2 and 3).
The leakage detector has a magnetic float switch. The magnetic float switch is positioned on
a non-magnetic guide tube. When a specified water level is reached, the magnetic field pro-
duced by the magnet in the float actuates a reed switch (sealed contact) inside the guide
tube.</p><p>This closes the electric circuit that transmits the alarm signal to the control board.</p>

I am using difflib to compare the two files. Here is the code I am using:

from difflib import ndiff

def compare_files_take_5(gold_file_path, predicted_file_path, threshold=0.8):
  try:
    with open(gold_file_path, 'r') as gold_file, open(predicted_file_path, 'r') as predicted_file:
      gold_content = gold_file.read()
      predicted_content = predicted_file.read()

    differences = list(ndiff(gold_content, predicted_content))
    
    added_text = ''.join([diff[2:] for diff in differences if diff.startswith('+')])
    deleted_text = ''.join([diff[2:] for diff in differences if diff.startswith('-')])

    added_length = len(added_text)
    deleted_length = len(deleted_text)

    total_length = max(len(gold_content), len(predicted_content))

    similarity_ratio = 1 - (added_length + deleted_length) / total_length

    is_similar = similarity_ratio >= threshold

    return is_similar, similarity_ratio, added_length, deleted_length, added_text, deleted_text

  except Exception as e:
    print(f"Error: {e}")
    return False, 0, 0, 0, '', ''

# Example usage
gold_standard_file = "/Old_file.md"
predicted_file = "/New_file.md"

is_similar, similarity_ratio, added_length, deleted_length, added_text, deleted_text = compare_files_take_5(gold_standard_file, predicted_file)

print(f"Similarity Ratio: {similarity_ratio:.2%}")
print(f"Is Similar: {is_similar}")
print(f"Added Length: {added_length} characters")
print(f"Deleted Length: {deleted_length} characters")

print("\nAdded Text:")
print(added_text)

print("\nDeleted Text:")
print(deleted_text)

This is the output I got. It is clearly wrong, since the two files contain almost the same text:

Similarity Ratio: -91.44%
Is Similar: False
Added Length: 1925 characters
Deleted Length: 1854 characters

Added Text:
<p></><></p><p>For some cases a higher input temperature can be allowed when requested and checked by
the manufacturer.</p><p>The outlet water temperature rise is from 5–10 K.</p><p>The minimum pressure and amount of cooling water for the basic construction of a water-
cooled motor is shown in the following table. Please check the requirements for pressure and
the amount of cooling water in the case of special constructions.</p><p>If the amount of water varies, its temperature rise will be inversely proportional to the flow
rate.</p><p>9.2.
Filling or draining cooling water</p><p>When filling, open the air vent plug on top of the motor (see figure 2). Let the cooling water
flow into the motor until it comes out of the air gap. Close the air gap with a plug and seal
the joint with sealing tape or strip. Filling must be done carefully so that no air is left in the
motor’s cooling channels. Check for possible leaks after the piping and joints have been con-
nected.</p><p>Emptying can be done with pressurized air. After emptying, the plugs must be re-fitted, and
the seals of the joints must be checked.</p><p>10.
Consendation drain holes</p><p>It is of special importance with water cooled motors that the condensation drain holes are lo-
cated in the correct position (fig. 1). Check that the condensation drain holes face down-
wards, especially when the mounting arrangement differs from standard.</p><p>11.
Water leakage detector</p><p>This motor is equipped with float type leakage detector in non-drive end (see figure 2 and 3).
The leakage detector has a magnetic float switch. The magnetic float switch is positioned on
a non-magnetic guide tube. When a specified water level is reached, the magnetic field pro-
duced by the magnet in the float actuates a reed switch (sealed contact) inside the guide
tube.</p><p>This closes the electric circuit that transmits the alarm signal to the control board.</p>

Deleted Text:


For some cases a higher inut temerature can be allowed when requested and checked by the manufacturer.

The outlet water temperature rise is from 5–10 K.

The minimum pressure and amount of cooling water for the basic construction of a water-cooled motor is shown in the following table. Please check the requirements for pressure and the amount of cooling water in the case of special constructions.

If the amount of water varies, its temperature rise will be inversely proportional to the flow rate.

### 9.2. Filling or draining cooling water

When filling, open the air vent plug on top of the motor (see figure 2). Let the cooling water flow into the motor until it comes out of the air gap. Close the air gap with a plug and seal the joint with sealing tape or strip. Filling must be done carefully so that no air is left in the motor’s cooling channels. Check for possible leaks after the piping and joints have been connected.

Emptying can be done with pressurized air. After emptying, the plugs must be re-fitted, and the seals of the joints must be checked.

## 10. Consendation drain holes

It is of special importance with water cooled motors that the condensation drain holes are located in the correct position (fig. 1). Check that the condensation drain holes face downwards, especially when the mounting arrangement differs from standard.

## 11. Water leakage detector

This motor is equipped with float type leakage detector in non-drive end (see figure 2 and 3). The leakage detector has a magnetic float switch. The magnetic float switch is positioned on a non-magnetic guide tube. When a specified water level is reached, the magnetic field produced by the magnet in the float actuates a reed switch (sealed contact) inside the guide tube.

This closes the electric circuit that transmits the alarm signal to the control board.

Am I using the wrong approach here, or is there another way or library I could use to get the desired result?

Answer (by Tricotou):

Following up on my comment: after a closer look at the difflib documentation, it appears that difflib.ndiff expects you to split the input yourself. It takes two sequences of strings (typically lists of lines), so if you pass it two plain strings it compares them character by character. In your case, to compare word by word, split each string on whitespace with the .split() method.
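As a small illustration of that difference, here is a sketch with two made-up sentences (the `old` and `new` strings are hypothetical, not taken from your files). It passes the same text to ndiff once as plain strings and once as word lists:

```python
from difflib import ndiff

# Hypothetical sample sentences, only for illustration.
old = "the drain holes face downwards"
new = "the drain holes face upwards"

# Plain strings: ndiff iterates over them one character at a time.
char_diff = list(ndiff(old, new))

# Lists of words: each word is compared as a single unit.
word_diff = list(ndiff(old.split(), new.split()))

# Keep only the removed ("- ") and added ("+ ") entries; this also
# skips any "? " hint lines that ndiff may emit.
removed_words = [d[2:] for d in word_diff if d.startswith("- ")]
added_words = [d[2:] for d in word_diff if d.startswith("+ ")]

print(len(char_diff), "character-level entries")
print(len(word_diff), "word-level entries")
print("removed:", removed_words)
print("added:", added_words)
```

The character-level diff produces one entry per character, which is why the added and deleted text in the question looks like the whole file with a few letters missing.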

After these lines in your function:

gold_content = gold_file.read()
predicted_content = predicted_file.read()

add:

gold_content = gold_content.split()
predicted_content = predicted_content.split()

Now the final result makes much more sense (the words run together only because the code joins them with an empty string):

Added Text:
<p>10</p><p>3GZF500725-144EN03-2023|ABBIECLVMOTORS</p><p>Formanufacturer.</p><p>TheK.</p><p>Thewater-cooledconstructions.</p><p>Ifrate.</p><p>9.2.water</p><p>Whencon-nected.</p><p>Emptyingchecked.</p><p>10.holes</p><p>Itlo-cateddown-wards,standard.</p><p>11.detector</p><p>Thispro-ducedtube.</p><p>Thisboard.</p>

Deleted Text:
10Formanufacturer.TheK.Thewater-cooledconstructions.Ifrate.###9.2.waterWhenconnected.Emptyingchecked.##10.holesItlocateddownwards,standard.##11.detectorThisproducedtube.Thisboard.3GZF500725-144EN03-2023|ABBIECLVMOTORS
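The remaining differences come mostly from formatting: the `<p>` tags, the Markdown `#` markers, and words hyphenated across line breaks. Also note that the formula in the question can go negative, because the added and deleted lengths together can approach the sum of both file lengths while the divisor is only the longer of the two. A possible way around both issues, as a rough sketch: normalize both texts first, then use difflib.SequenceMatcher, whose ratio() always lies between 0 and 1. The helper names `normalize` and `compare_words` and the cleanup rules are my own assumptions, not something difflib provides, and the sample strings are shortened, made-up snippets.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Reduce Markdown/HTML text to a plain list of words (assumed cleanup rules)."""
    text = re.sub(r"<[^>]+>", " ", text)                 # drop HTML tags such as <p> and </p>
    text = re.sub(r"(?m)^[ \t]*#+[ \t]*", "", text)      # drop Markdown heading markers
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)   # rejoin words hyphenated across a line break
    return text.split()


def compare_words(old_text, new_text):
    """Return (similarity ratio, words missing from new, words only in new)."""
    old_words = normalize(old_text)
    new_words = normalize(new_text)
    matcher = SequenceMatcher(None, old_words, new_words, autojunk=False)

    missing, extra = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            missing.extend(old_words[i1:i2])   # present in old, absent in new
        if tag in ("insert", "replace"):
            extra.extend(new_words[j1:j2])     # present in new, absent in old
    return matcher.ratio(), missing, extra


# Hypothetical, shortened sample texts in the style of the two files.
old = "## 10. Drain holes\n\nCheck that the drain holes face downwards."
new = "<p>10.\nDrain holes</p><p>Check that the holes face down-\nwards.</p>"

ratio, missing, extra = compare_words(old, new)
print(f"Similarity: {ratio:.2%}")
print("Missing from new file:", missing)
print("Only in new file:", extra)
```

One limitation of this sketch: rejoining hyphenated line breaks also turns a genuinely hyphenated word such as "water-" followed by "cooled" into "watercooled", so a few words may still be reported as different. The comparison is also order-sensitive, so the footer line that moved from the end of the old file to the top of the new one would still show up as both missing and added.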