Forward slash "/" in string converted to "&#47", is that platform independent behaviour?


I have a Python script that reads an HTML file into lines, filters out the relevant lines, and saves those lines back as an HTML file. I had some problems until I figured out that a / in the page text was being converted to &#47 when stored as a string.

The source HTML that I'm parsing has the following line:

<h3 style="text-align:left">SYDNEY/KINGSFORD SMITH (YSSY)</h3>

which, after passing through file.readlines(), comes out as:

<h3 style='text-align:left'>SYDNEY&#47BANKSTOWN (YSBK)</h3>

This trips up BeautifulSoup, which gets confused by the "&" symbol and then mishandles all subsequent tags.

What I'd like to know is whether this replacement value "&#47" is platform independent.

It's not hard to run a .replace on each string before saving, which avoids the issue while I'm coding and testing on Windows, but will it still work if I deploy my script on a Linux server?

Here's what I have now, which works fine when run under Windows:

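def getHTML(self, html_source):
    # Read the whole source file into a list of lines.
    with open(html_source, 'r') as file:
        source_lines = file.readlines()
    relevant = False
    relevant_lines = []
    for line in source_lines:
        # Stop collecting once the table closes.
        if "</table>" in line:
            relevant = False
        # Start collecting at the line that names the airport.
        if self.airport in line:
            relevant = True
        if relevant:
            # Swap the slash entity for a space so the parser is not confused.
            line = line.replace("&#47", " ")
            relevant_lines.append(line)
    relevant_lines.append("</table>")
    # Assumes html_source ends in ".html" (the last 5 characters are stripped).
    filename = f"{html_source[:-5]}_{self.airport}.html"
    with open(filename, 'w') as file:
        file.writelines(relevant_lines)
    with open(filename, 'r') as file:
        relevant_html = file.read()
    return relevant_html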

Can anyone tell me, without my having to install a Linux virtual machine, whether this will work cross-platform? I tried to look for documentation on this, but all I could find was about ways to explicitly escape a / when entering a string, nothing about how to deal with / or other invalid characters when reading a source file into strings.

There is 1 answer

Marcel Preda (best answer)

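It should be OK everywhere; it is a standard. "&#47;" is the HTML numeric character reference for "/" (ASCII code 47), so it means the same thing on every platform. See https://www.w3schools.com/charsets/ref_html_ascii.asp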
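
To illustrate, here is a minimal sketch using only the standard library. Python's text-mode file reading does not turn "/" into an entity on any platform, so the "&#47" is presumably already present in the source file (an assumption, since the file itself is not shown). The helper names decode_entities and fix_slash are hypothetical, made up for this example.

```python
import html


def decode_entities(line):
    """Turn character references such as &#47 or &#47; back into characters."""
    # html.unescape follows the HTML5 rules, so it also accepts numeric
    # references that lack the trailing semicolon.
    return html.unescape(line)


def fix_slash(line):
    """Replace only the slash reference, with or without a semicolon."""
    # Handle the semicolon form first so no stray ";" is left behind.
    return line.replace("&#47;", "/").replace("&#47", "/")


# The line from the question, as it appears after reading the file.
raw = "<h3 style='text-align:left'>SYDNEY&#47BANKSTOWN (YSBK)</h3>"

decoded = decode_entities(raw)
fixed = fix_slash(raw)
print(decoded)
print(fixed)
```

Both calls give the same result for this line, and neither depends on the operating system. Note that html.unescape applied to a whole line of markup also decodes references like &lt; and &amp;, which could change the markup itself, so the targeted replace may be the safer choice when the lines are parsed as HTML afterwards.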