C++ Avoiding down-casting or variants

I've been facing a design issue for a while:

  • I am parsing a source code string into a 1-dimensional array of token objects.

  • Depending on the type of a token (literal, symbol, identifier), it has some token-type-specific data. Literals have a value, symbols have a symbol type, and identifiers have a name.

  • I'm then building an abstract representation of the script defined in that source code string by analyzing this 1-dimensional array of tokens. The syntax analysis logic lives outside of these token objects.

My problem is that I need all my tokens, no matter their type, to be stored in a single array, because it seems easier to analyze and because I don't see any other way to do it. This requires a common type for all the different token types, either by creating a class hierarchy:

class token { token_type type; };
class identifier : public token { string name; };
class literal : public token { value val; };
class symbol : public token { symbol_type sym; };

... or by creating a variant:

class token
{
   token_type type;
   string name; // Only used when it is an identifier
   value val; // Only used when it is a literal
   symbol_type sym; // Only used when it is a symbol
};

The class hierarchy would be used as follows:

// Iterate over the token array
for( auto cur_token : tokens )
{
    // Do token-type-specific things depending on its type
    if( cur_token->type == E_SYMBOL )
    {
        switch( ((symbol *) cur_token)->sym )
        {
            // etc
        }
    }
}

But it has several problems:

  • The base token class has to know about its subclasses, which seems wrong.

  • It involves down-casting to access specific data depending on the type of a token, which I was told is wrong too.

The variant solution would be used in a similar way, without down-casting :

for( auto cur_token: tokens )
{
    if( cur_token->type == E_SYMBOL )
    {
        switch( cur_token->sym )
        {
            // etc
        }
    }
}

The problem with that second solution is that it mixes everything into a single class, which doesn't seem very clean to me: there are unused variables depending on the type of the token, and a class should represent a single kind of "thing".

Can you suggest another way to design this? I was told about the visitor pattern, but I can't see how I would use it in my case.

I would like to keep the ability to iterate over an array, because I might have to iterate in both directions, from a random position, and maybe multiple times.

Thank you.

There is 1 answer

Tony Delroy (best answer)

Option 1: "fat" type with some shared / some dedicated fields

Pick a set of data members that can be repurposed per token type to hold your "token-type-specific data": literals have a value, symbols have a symbol type, and identifiers have a name.

 struct Token
 {
    enum Type { Literal, Symbol, Identifier } type_;

    // fields repurposed per Token-Type
    std::string s_;  // text of literal or symbol or identifier

    // fields specific to one Token-Type
    enum Symbol_Id { A, B, C } symbol_id_;
 };

One problem with this is that the shared fields need names vague enough not to be actively misleading for any given token type, while the "specific" fields remain accessible, and subject to abuse, when the Token is of another type.
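To make that risk concrete, here is a minimal sketch of how the "fat" Token might be used. The `symbol_id()` accessor and the `count_symbols` helper are hypothetical additions for illustration, not part of the original answer; the accessor only catches misuse at run time, in debug builds.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Sketch of the "fat" Token from Option 1, with one guarded accessor added.
struct Token
{
    enum Type { Literal, Symbol, Identifier } type_;
    enum Symbol_Id { A, B, C };   // placeholder ids, as in the answer

    std::string s_;               // text of the literal, symbol or identifier
    Symbol_Id symbol_id_;         // only meaningful when type_ == Symbol

    // Hypothetical accessor: asserts (debug builds only) that the
    // field is being read from a token of the right type.
    Symbol_Id symbol_id() const
    {
        assert(type_ == Symbol && "symbol_id() called on a non-symbol token");
        return symbol_id_;
    }
};

// Hypothetical helper: count the symbol tokens carrying a given id.
int count_symbols(const std::vector<Token>& tokens, Token::Symbol_Id id)
{
    int n = 0;
    for (const Token& t : tokens)
        if (t.type_ == Token::Symbol && t.symbol_id() == id)
            ++n;
    return n;
}
```

Nothing stops other code from reading `symbol_id_` directly on a literal or identifier, which is exactly the weakness described above.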

Option 2: discriminated union - preferably nicely packaged for you, à la boost::variant<>:

struct Symbol { ... };
struct Identifier { ... };
struct Literal { ... };
typedef boost::variant<Symbol, Identifier, Literal> Token;
std::list<Token> tokens;

See the Boost.Variant tutorial for data retrieval options (boost::get and boost::apply_visitor).
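As a rough illustration of the two retrieval styles, the sketch below uses C++17's `std::variant` rather than `boost::variant` (a swap-in, chosen so the example needs no external library); `std::visit` and `std::get` play the roles of `boost::apply_visitor` and `boost::get`. The member lists of `Symbol`, `Identifier` and `Literal` are elided in the answer, so the fields here are assumptions, as are the `Describe`, `describe_all` and `symbol_of` names.

```cpp
#include <cassert>
#include <string>
#include <variant>
#include <vector>

// Assumed payloads; the real member lists are not given in the answer.
struct Symbol     { char sym; };
struct Identifier { std::string name; };
struct Literal    { int val; };

using Token = std::variant<Symbol, Identifier, Literal>;

// Visitor style: one overload per alternative, checked at compile time.
struct Describe
{
    std::string operator()(const Symbol& s) const     { return std::string("symbol ") + s.sym; }
    std::string operator()(const Identifier& i) const { return "identifier " + i.name; }
    std::string operator()(const Literal& l) const    { return "literal " + std::to_string(l.val); }
};

std::string describe(const Token& t) { return std::visit(Describe{}, t); }

// Tokens still live in one random-access array and can be walked freely.
std::vector<std::string> describe_all(const std::vector<Token>& tokens)
{
    std::vector<std::string> out;
    for (const Token& t : tokens)
        out.push_back(describe(t));
    return out;
}

// Direct access style, for when the caller has already checked the alternative.
char symbol_of(const Token& t)
{
    assert(std::holds_alternative<Symbol>(t));
    return std::get<Symbol>(t).sym;
}
```

With the visitor, forgetting to handle one of the alternatives is a compile error, which is the main advantage over a hand-written type field and switch.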

Option 3: OOD - classic Object Oriented approach:

Almost what you had, but crucially the Token type needs a virtual destructor.

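struct Token { virtual ~Token() = default; };
// Constructors are needed: a class with a virtual destructor is not an aggregate.
struct Identifier : Token { std::string name; explicit Identifier(std::string n) : name(std::move(n)) {} };
struct Literal : Token { value val; explicit Literal(value v) : val(std::move(v)) {} };
struct Symbol : Token { symbol_type sym; explicit Symbol(symbol_type s) : sym(s) {} };

std::vector<std::unique_ptr<Token>> tokens_;
tokens_.emplace_back(new Identifier(the_name));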

You don't need a "type" field as you can use C++'s RTTI to check whether a specific Token* addresses a specific derived type:

if (Literal* p = dynamic_cast<Literal*>(tokens_[0].get()))
{
    // it's a literal, so p->val can be used here
}
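Put together, the RTTI approach might look like the following sketch. The `int` and `char` payloads stand in for the question's `value` and `symbol_type`, and `sum_literals` is a hypothetical helper added only to show the cast inside a loop over the token array.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct Token { virtual ~Token() = default; };

struct Identifier : Token
{
    std::string name;
    explicit Identifier(std::string n) : name(std::move(n)) {}
};

struct Literal : Token
{
    int val;                      // stand-in for the question's `value`
    explicit Literal(int v) : val(v) {}
};

struct Symbol : Token
{
    char sym;                     // stand-in for the question's `symbol_type`
    explicit Symbol(char s) : sym(s) {}
};

using Tokens = std::vector<std::unique_ptr<Token>>;

// Hypothetical helper: add up the values of all literal tokens.
int sum_literals(const Tokens& tokens)
{
    int total = 0;
    for (const auto& tok : tokens)
    {
        assert(tok && "token array should not hold null pointers");
        // dynamic_cast yields nullptr when tok is not a Literal
        if (const Literal* p = dynamic_cast<const Literal*>(tok.get()))
            total += p->val;
    }
    return total;
}
```

The base class carries no type field and knows nothing about its subclasses; the check is done entirely by the cast.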

Your concerns were:

•The base token class has to know about its subclasses, which seems wrong.

Not necessary given RTTI.

•It involves down-casting to access specific data depending on the type of a token, which I was told is wrong too.

It often is: in OO design it's usually practical and desirable to create a base class API expressing a set of logical operations that the entire hierarchy can implement. In your case, though, that might require a "fat" interface, meaning lots of operations that, if they were in the API, would be confusing no-ops (i.e. do nothing) or would somehow (e.g. via return value or exception) have to report that the operation is not supported. For example, getting the symbol type classification isn't meaningful for non-symbols. Making it accessible only after a dynamic_cast is a little better than having it always accessible but only sometimes meaningful, as in "option 1", because after the cast there's compile-time checking of the usage.