How can I determine the index of the same set of characters between two strings that are of different lengths?

Asked by Douglas Gaskell on 08 June 2015 at 20:31 · 109 views

I apologize up front for the title; I'm not sure how to word the question.

I am trying to find the index of the same character, or set of characters, in two different but similar strings.

String A: I <color=red><b>really</b></color> don't like spiders!
String B: I really don't like spiders!

The relevant text is the same; however, A has some formatting while B does not. I got B by taking A and running a regex to find and replace all <contents> with an empty string.

Now let's say I have selected the character at index 9 in B, which is the letter d in the word don't. How can I then determine that the same d in string A, which is at index 35 (if I counted correctly), also needs to be selected?

Edit: Possibly important information, these tags are for the rich text within Unity. Very similar to HTML in almost all regards.

**poke** · Accepted Answer · 2015-06-08T21:20:03+00:00

As I already suggested in the comments, you should write your own parser for this format that keeps the formatting as metadata next to the text. For example, you could keep a simple list of string parts where each part represents consecutive text with the same formatting.

You could start with something as simplistic as this:

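import re

def parse(string):
    # re.split with a capturing group keeps the tags, so the result alternates
    # text, tag, text, ...; the leading None pairs the first text with "no tag".
    it = iter([None] + re.split(r'(<[^>]+>)', string))

    parsed = []
    curFormat = {}
    # zip(it, it) consumes the iterator two items at a time: (tag, following text)
    for fmt, text in zip(it, it):
        if fmt is None:
            curFormat = {}
        elif fmt.startswith('</'):
            # closing tag: drop that format (pop tolerates a stray closing tag)
            fmt = fmt[2:-1]
            curFormat.pop(fmt, None)
        else:
            # opening tag, either <name=value> or <name>
            fmt = fmt[1:-1]
            if '=' in fmt:
                name, value = fmt.split('=', 1)
                curFormat[name] = value
            else:
                curFormat[fmt] = True

        # skip the empty strings that appear between adjacent tags
        if text != '':
            parsed.append((text, list(curFormat.items())))

    return parsed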

For your text, this will give you the following result:

>>> text = "I <color=red><b>really</b></color> don't like spiders!"
>>> parsed = parse(text)
>>> parsed
[('I ', []), ('really', [('color', 'red'), ('b', True)]), (" don't like spiders!", [])]

As you can see, you get pairs consisting of a piece of text and a list of the formatting that applies to that piece. If you then want to get the underlying text, you can just join the first element of each pair:

>>> ''.join(t for t, fmt in parsed)
"I really don't like spiders!"

And on top of that, you can also create your own indexing method (note that this one is really crude):

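def index(parsed, start, length):
    output = ''
    for t, fmt in parsed:
        if start < 0:
            # the start was already found, so take the whole part
            output += t
        elif start > len(t):
            # the start lies past this part; skip it and adjust the offset
            start -= len(t)
        else:
            # the start falls inside this part; take its tail
            output += t[start:]
            start = -1
        if len(output) > length:
            return output[:length]
    return output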

>>> index(parsed, 4, 5)
'ally '
>>> index(parsed, 7, 6)
"y don'"
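For the index question itself, one option that goes beyond the parser above is to record, for every visible character, where it sits in the tagged string. The following is only a minimal sketch: `build_index_map` is a hypothetical helper name, and it assumes that everything matching `<[^>]+>` is a tag and that tags never contain a `>` inside an attribute value.

```python
import re

TAG = re.compile(r'<[^>]+>')

def build_index_map(rich):
    """Return (plain, positions); positions[i] is the index in `rich`
    of the character plain[i]."""
    plain_chars = []
    positions = []
    pos = 0
    for match in TAG.finditer(rich):
        # Everything between the previous tag and this one is visible text.
        for i in range(pos, match.start()):
            plain_chars.append(rich[i])
            positions.append(i)
        pos = match.end()
    # Visible text after the last tag.
    for i in range(pos, len(rich)):
        plain_chars.append(rich[i])
        positions.append(i)
    return ''.join(plain_chars), positions

rich = "I <color=red><b>really</b></color> don't like spiders!"
plain, positions = build_index_map(rich)
print(plain[9], positions[9], rich[positions[9]])  # prints: d 35 d
```

With such a list, a selection at index 9 in the stripped string maps to `positions[9]` in the tagged one, which is 35 for the example strings from the question.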

Finally, you can put this all inside a custom type, which implements the iterator protocol and the sequence protocol, so you can use it like a normal string.
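A rough sketch of what such a type might look like is below. The class name `RichText` and the method `format_at` are made up for illustration, and the constructor simply takes a list of (text, formatting) pairs in the shape shown earlier. Deriving from `collections.abc.Sequence` supplies iteration, `in`, `index` and `count` once `__len__` and `__getitem__` exist.

```python
from collections.abc import Sequence

class RichText(Sequence):
    """Hypothetical wrapper around a list of (text, formatting) pairs."""

    def __init__(self, parsed):
        self.parts = list(parsed)
        # Cache the plain text so indexing and slicing behave like a str.
        self.text = ''.join(t for t, fmt in self.parts)

    def __len__(self):
        return len(self.text)

    def __getitem__(self, index):
        # Integers and slices are both delegated to the plain text.
        return self.text[index]

    def __str__(self):
        return self.text

    def format_at(self, index):
        """Return the formatting of the part that contains plain index `index`."""
        if index < 0:
            index += len(self.text)
        if not 0 <= index < len(self.text):
            raise IndexError(index)
        for t, fmt in self.parts:
            if index < len(t):
                return fmt
            index -= len(t)

parsed = [('I ', []),
          ('really', [('color', 'red'), ('b', True)]),
          (" don't like spiders!", [])]
rt = RichText(parsed)
print(rt[9], rt[2:8], rt.format_at(4))
```

Slicing here returns a plain `str`; a fuller version could return another `RichText` that keeps the formatting of the sliced range.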
