How can I determine the index of the same set of characters between two strings that are of different lengths?


I apologize up front for the title, I'm not sure how to word the question.

I am trying to find the index for a similar character or set of characters in two different, but similar strings.

  • String A: I <color=red><b>really</b></color> don't like spiders!
  • String B: I really don't like spiders!

The relevant text is the same, however A has some formatting while B does not. I got B by taking A and running a regex to find and replace all <contents> with an empty string.

Now let's say I have selected the character at index 9 in B, which is the letter d in the word don't. How can I then determine that the same d in string A, which is at index 35 (if I counted correctly), needs to be selected as well?

Edit: Possibly important information, these tags are for the rich text within Unity. Very similar to HTML in almost all regards.

Best answer, by poke:

As I already suggested in the comments, you should write your own parser for this format that keeps the formatting as metadata next to the text. For example, you could keep a simple list of string parts where each part represents consecutive text with the same formatting.

You could start with something as simple as this:

import re

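def parse(string):
    # re.split with a capturing group keeps the tags, so the result alternates
    # text, tag, text, ...; prepending None turns it into (tag, text) pairs.
    it = iter([None] + re.split('(<[^>]+>)', string))

    parsed = []
    curFormat = {}
    for fmt, text in zip(it, it):
        if fmt is None:
            # Leading text before the first tag has no formatting.
            curFormat = {}
        elif fmt.startswith('</'):
            # Closing tag: drop that format (ignore closers that were never opened).
            curFormat.pop(fmt[2:-1], None)
        else:
            # Opening tag, either <name=value> or a bare <name>.
            fmt = fmt[1:-1]
            if '=' in fmt:
                name, value = fmt.split('=', 1)
                curFormat[name] = value
            else:
                curFormat[fmt] = True

        if text != '':
            # Store a snapshot of the formats active for this run of text.
            parsed.append((text, list(curFormat.items())))

    return parsed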

For your text, this will give you the following result:

>>> text = "I <color=red><b>really</b></color> don't like spiders!"
>>> parsed = parse(text)
>>> parsed
[('I ', []), ('really', [('color', 'red'), ('b', True)]), (" don't like spiders!", [])]

As you can see, you get pairs of text and a list of formatting information for that particular part of the text. If you then want to get the underlying text, you can just join the first element of each pair:

>>> ''.join(t for t, fmt in parsed)
"I really don't like spiders!"

And on top of that, you can also create your own indexing method (note that this one is really crude):

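def index(parsed, start, length):
    output = ''
    for t, fmt in parsed:
        if start < 0:
            # The start has already been found: take the whole part.
            output += t
        elif start > len(t):
            # The start lies beyond this part: skip it.
            start -= len(t)
        else:
            # The start falls in (or at the end of) this part: take its tail.
            output += t[start:]
            start = -1
        if len(output) > length:
            return output[:length]
    return output

>>> index(parsed, 4, 5)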
'ally '
>>> index(parsed, 7, 6)
"y don'"
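
If all you need is the mapping the question asks about (index 9 in B to index 35 in A), a smaller sketch might be enough. The function name `plain_to_rich_index` is made up for this illustration, and it assumes that tags match the same `<[^>]+>` pattern used in `parse` above and that a `<` never appears as literal text:

```python
import re

# Assumption: a tag is anything between '<' and '>', as in parse() above.
TAG = re.compile(r'<[^>]+>')

def plain_to_rich_index(rich, plain_index):
    """Map an index in the tag-stripped text to the matching index in the tagged text."""
    plain_pos = 0  # number of plain-text characters passed so far
    rich_pos = 0   # position in the tagged string just after the last tag
    for match in TAG.finditer(rich):
        # Length of the plain text between the previous tag and this one.
        text_len = match.start() - rich_pos
        if plain_index < plain_pos + text_len:
            return rich_pos + (plain_index - plain_pos)
        plain_pos += text_len
        rich_pos = match.end()
    # The index may fall in the text after the last tag.
    if 0 <= plain_index - plain_pos < len(rich) - rich_pos:
        return rich_pos + (plain_index - plain_pos)
    raise IndexError('plain_index is out of range')

rich = "I <color=red><b>really</b></color> don't like spiders!"
print(plain_to_rich_index(rich, 9))  # 35, the d in don't
```

This only walks the tags once and keeps no formatting information, so it is a shortcut for the lookup rather than a replacement for the parsed structure.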

Finally, you can put all of this inside a custom type that implements the iterator protocol and the sequence protocol, so you can use it like a normal string.
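
A rough sketch of what such a type could look like follows. The class name `RichText` and the method `format_at` are invented for this example; it takes a list of (text, formatting) pairs in the shape that `parse` returns and simply delegates indexing to the joined plain text:

```python
class RichText:
    """Hypothetical wrapper around a list of (text, formatting) pairs."""

    def __init__(self, parts):
        self.parts = list(parts)

    def __str__(self):
        # The plain text is the concatenation of all text parts.
        return ''.join(text for text, fmt in self.parts)

    def __len__(self):
        return sum(len(text) for text, fmt in self.parts)

    def __getitem__(self, key):
        # Indexing and slicing behave like they do on the plain string.
        return str(self)[key]

    def __iter__(self):
        return iter(str(self))

    def format_at(self, index):
        """Return the formatting list that applies to the character at index."""
        if index < 0:
            index += len(self)
        for text, fmt in self.parts:
            if 0 <= index < len(text):
                return fmt
            index -= len(text)
        raise IndexError('index out of range')

parts = [('I ', []),
         ('really', [('color', 'red'), ('b', True)]),
         (" don't like spiders!", [])]
rt = RichText(parts)
print(rt[9])            # d
print(rt.format_at(4))  # [('color', 'red'), ('b', True)]
```

Rebuilding the plain string on every access is wasteful; a real implementation would probably cache it, but that is left out here to keep the sketch short.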