Does PVS-Studio know about Unicode chars?

317 views Asked by At

This code produces Medium warnings at lines w/ return:

// Checks if the symbol defines two-symbols Unicode sequence
bool doubleSymbol(const char c) {
    static const char TWO_SYMBOLS_MASK = 0b110;
    return (c >> 5) == TWO_SYMBOLS_MASK;
}

// Checks if the symbol defines three-symbols Unicode sequence
bool tripleSymbol(const char c) {
    static const char THREE_SYMBOLS_MASK = 0b1110;
    return (c >> 4) == THREE_SYMBOLS_MASK;
}

// Checks if the symbol defines four-symbols Unicode sequence
bool quadrupleSymbol(const char c) {
    static const char FOUR_SYMBOLS_MASK = 0b11110;
    return (c >> 3) == FOUR_SYMBOLS_MASK;
}

PVS says that the expressions are always false (V547), but they actually aren't: char may be a part of Unicode symbol that is read to std::string! Here is the Unicode representation of symbols:
1 byte - 0xxx'xxxx - 7 bits
2 bytes - 110x'xxxx 10xx'xxxx - 11 bits
3 bytes - 1110'xxxx 10xx'xxxx 10xx'xxxx - 16 bits
4 bytes - 1111'0xxx 10xx'xxxx 10xx'xxxx 10xx'xxxx - 21 bits

The following code counts number of symbols in a Unicode text:

size_t symbolCount = 0;

std::string s;
while (getline(std::cin, s)) {
    for (size_t i = 0; i < s.size(); ++i) {
        const char c = s[i];
        ++symbolCount;
        if (doubleSymbol(c)) {
            i += 1;
        } else if (tripleSymbol(c)) {
            i += 2;
        } else if (quadrupleSymbol(c)) {
            i += 3;
        }
    }
}

std::cout << symbolCount << "\n";

For the Hello! input the output is 6 and for Привет, мир! is 12 — this is right!

Am I wrong or doesn't PVS know something? ;)

1

There are 1 answers

0
AndreyKarpov On BEST ANSWER

PVS-Studio analyzer knows that there are signed and unsigned char types. Whether signed/unsigned is used depends on compilation keys and PVS-Studio analyzer takes these keys into account.

I think this code is compiled, when char is of signed char type. Let's see what consequences it brings.

Let’s look only at the first case:

bool doubleSymbol(const char c) {
    static const char TWO_SYMBOLS_MASK = 0b110;
    return (c >> 5) == TWO_SYMBOLS_MASK;
}

If the value variable 'c' is less than or equal to 01111111, the condition will always be false, because during the shift the max value you can get is 011.

It means we are interested in only cases where the highest bit in the variable 'c' is equal to 1. As this variable is of signed char type, then the highest bit means that the variable stores a negative value. Before the shift, signed char becomes a signed int and the value continues to be negative.

Now let's see what the standard says about the right-shift of negative numbers:

The value of E1 >> E2 is E1 right-shifted E2 bit positions. If E1 has an unsigned type or if E1 has a signed type and a non-negative value, the value of the result is the integral part of the quotient of E1/2^E2. If E1 has a signed type and a negative value, the resulting value is implementation-defined.

Thus, the shift of a negative number to the left is implementation-defined. This means that the highest bits are filled with nulls or ones. Both will be correct.

PVS-Studio thinks that the highest bits are filled with ones. It has a full right to think so, because it is necessary to choose any implementation. So it turns out that the expression ((c) >> 5) will have a negative value if the highest bit in the variable 'c' is originally equal to 1. A negative number cannot be equal to TWO_SYMBOLS_MASK.

It turns out that from the viewpoint of PVS-Studio, the condition will always be false, and it correctly issues a warning V547.

In practice, the compiler may behave differently: the highest bits will be filled with 0 and then everything will work correctly.

In any case, it is necessary to fix the code, as it goes to the implementation-defined behavior of the compiler.

Code might be fixed as follows:

bool doubleSymbol(const unsigned char c) {
    static const char TWO_SYMBOLS_MASK = 0b110;
    return (c >> 5) == TWO_SYMBOLS_MASK;
}