Python |S1 vector to string


I have a vector "char" of dtype |S1, as in the example below:

masked_array(data=[b'E', b'U', b'3', b'7', b'6', b'8', b' ', b' ', b' ', b' '],
             mask=False,
       fill_value=b'N/A',
            dtype='|S1')

I want to get the string it contains, in this example 'EU3768'.

This example is taken from a NetCDF file, read with the netCDF4 library.

Further question: Why is there a b in front of all single letters?

Thanks for your help :)


There are 2 answers

itprorh66 On

First of all, let's answer the most basic question: what is the meaning of the b in front of each letter? The b indicates that the value is a bytes object rather than a str. The data is stored as encoded bytes (the characters here are plain ASCII, which is also valid UTF-8), so to get a string back it must be decoded. With that as a preamble, the following code will do the trick.

I am assuming that you can extract data from the masked_array. Then perform the following operations:

# Convert the sequence of bytes to a list of strings
ds = [x.decode('utf-8') for x in data]

# Convert the list of strings to a single string and strip the padding spaces
sd = ''.join(ds).strip()

This could of course be performed in a single line of code as follows:

sd = ''.join(x.decode('utf-8') for x in data).strip()
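
For completeness, here is a minimal self-contained sketch of the same idea that starts from the masked array itself. It rebuilds the array from the question with NumPy rather than reading it from a file, and it assumes that any masked entries should simply be dropped (via compressed()) and that the bytes are ASCII/UTF-8 text. The names raw and text are only illustrative.

```python
import numpy as np

# Rebuild the array shown in the question (values copied from the post).
char = np.ma.masked_array(
    data=[b'E', b'U', b'3', b'7', b'6', b'8', b' ', b' ', b' ', b' '],
    mask=False,
    dtype='S1',
)

# compressed() returns the unmasked elements as a plain 1-D ndarray;
# tobytes() then concatenates the single-byte elements.
raw = char.compressed().tobytes()

# Assumption: the content is ASCII/UTF-8 text padded with trailing spaces.
text = raw.decode('utf-8').rstrip()
print(text)
```

Joining at the byte level first and decoding once avoids calling decode on every element.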
Sam Mason On

As an answer to your follow-up question: you might be able to let NumPy do some of the work by operating directly on the underlying bytes. For example, I can create a large array of similarly shaped rows with:

import numpy as np
from string import ascii_letters, digits

letters = np.array(list(ascii_letters + digits), dtype='S1')

v = np.random.choice(letters, (100_000, 10))

The first three rows of this look like:

[[b'W' b'B' b'W' b'4' b'O' b'B' b'A' b'4' b'Q' b'n']
 [b'I' b'I' b'T' b'u' b'K' b'K' b'U' b'a' b'r' b'r']
 [b'V' b'f' b'n' b'U' b'G' b'0' b'j' b'R' b'm' b'C']]

I can then convert these back to strings via some byte-level shenanigans:

[bytes.decode(s) for s in np.frombuffer(v, dtype='S10')]

The first three look like:

['WBW4OBA4Qn', 'IITuKKUarr', 'VfnUG0jRmC']

which hopefully makes sense. This takes ~20 ms, which is quicker than a version that joins each row in Python:

[b''.join(r).decode() for r in v]

which takes ~200 ms. This is still much faster than the version of the code you posted, so you may be able to access the netCDF data more efficiently.
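
As a rough sketch of how the byte-level approach could be applied to space-padded data like that in the question: the example below reinterprets each row of ten S1 elements as one S10 value, then decodes and strips the padding with the vectorised np.char helpers. The second code, 'LH0042', is made up for illustration, and the sketch assumes a C-contiguous array with a fixed row width of 10.

```python
import numpy as np

# Two space-padded rows of single bytes; 'LH0042' is a hypothetical value.
padded = np.array([b'EU3768    ', b'LH0042    '], dtype='S10')
v = padded.view('S1').reshape(2, 10)

# Reinterpret each row of ten S1 bytes as one S10 value (no per-row Python join).
joined = np.frombuffer(v, dtype='S10')

# Decode to str and strip the trailing padding, still vectorised.
names = np.char.rstrip(np.char.decode(joined, 'ascii')).tolist()
print(names)
```

Note that the S10 dtype only drops trailing null bytes on its own, so the explicit rstrip is what removes the space padding.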