Lexical analyser: how to identify the end of a token


I need a function that identifies the end of a token, so that I can save the token in an array and send it to my automaton for identification (operator, keyword, identifier).

The automaton works fine when I enter only one token, but when the input contains many tokens and spaces it doesn't work. I need this function to skip spaces, stop at the end of each token, and send each token in an array to my automaton function. I'm stuck.

I'm using C.

ex: ABC + D

: ABC token 1

: + token 2

: D token 3

ex2: ABC++D12*/z (ABC,+,+,D12,*,/,z) 7 tokens

ex3: AD ++ - C (AD,+,+,-,C) 5 tokens

Edit: I'm not using any tool, only C with a deterministic finite automaton.


There are 2 answers

Sierra On BEST ANSWER
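#include <string.h>
int lex(const char *lexeme);   /* the automaton; prototype assumed, puts() fits it too */

/* Send the pending lexeme, if any, to lex() and empty the buffer. */
static void flush(char lexeme[500],int *j){
    if(*j>0){ lex(lexeme); memset(lexeme,0,500); *j=0; }
}

void lirelexeme(char chaine[500]){
    int i,j=0,length=(int)strlen(chaine);
    char tc,tc2;
    char lexeme[500];memset(lexeme,0,500);

    for(i=0;i<length;i++){
        tc=chaine[i];    // current character
        tc2=chaine[i+1]; // next character ('\0' after the last one)
        if(tc==' ' || tc=='\n' || tc=='\t'){
            flush(lexeme,&j);   // whitespace ends the pending token
        }
        else if((tc==':' && tc2=='=') || (tc=='>' && tc2=='=') || (tc=='<' && tc2=='=') || (tc=='<' && tc2=='>')){  // e.g. := >= <= <>
            flush(lexeme,&j);   // an operator ends the pending token too
            lexeme[0]=tc;
            lexeme[1]=tc2;
            lex(lexeme);
            memset(lexeme,0,500);
            i++;    // skip tc2, it is already consumed
        }
        else if(strchr("+-*/=<>():;,",tc)){  // one-character operator (assumed set)
            flush(lexeme,&j);
            lexeme[0]=tc;
            lex(lexeme);
            memset(lexeme,0,500);
        }
        else if(j<499) lexeme[j++]=tc;  // anything else extends the pending token (assumed)
    }
    flush(lexeme,&j);   // last token of the input
}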

The function above splits the input into tokens. Use puts() instead of lex() to see the result.

Note: lex() is the lexical analyser function I wrote. It takes a token as its argument and returns its type (constant, identifier, keyword, arithmetic operator, logical operator, ...).

Malcolm McLean On

Assume comments are stripped in an earlier pass.

Now you hit either whitespace, a letter, a numeral, or a punctuation character.

Whitespace either isn't a token or is a dummy / null token the parser ignores.
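
The first character alone tells you which kind of token is starting. A minimal sketch of that dispatch, where the `start_kind` enum and `classify_start` are names made up for this illustration (literals such as `.5` that start with a dot are not handled):

```c
#include <ctype.h>

/* Hypothetical names, used only for this sketch. */
enum start_kind { START_END, START_SPACE, START_IDENT, START_NUMBER, START_PUNCT };

/* Decide what kind of token begins at character c. */
enum start_kind classify_start(char c)
{
    unsigned char u = (unsigned char)c;   /* <ctype.h> needs a non-negative value */

    if (c == '\0')               return START_END;
    if (isspace(u))              return START_SPACE;
    if (isalpha(u) || c == '_')  return START_IDENT;
    if (isdigit(u))              return START_NUMBER;
    return START_PUNCT;          /* operators, quotes, brackets, ... */
}
```

The lexer would call this at the start of every token and hand over to a scanner for that kind.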

A letter must be part of an identifier. This consists of a letter (or underscore, small curveball there) followed by zero or more letters or numerals. Whitespace or punctuation other than underscore terminates that token.
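
As a rough sketch, the identifier rule comes down to one loop. `ident_length` is a hypothetical helper, not something from the question's code:

```c
#include <ctype.h>
#include <stddef.h>

/* Return the length of the identifier starting at s, or 0 if s does not
   start with a letter or an underscore. */
size_t ident_length(const char *s)
{
    size_t n = 0;

    if (!isalpha((unsigned char)s[0]) && s[0] != '_')
        return 0;
    /* Letters, digits and underscores continue the identifier;
       anything else (space, operator, end of string) terminates it. */
    while (isalnum((unsigned char)s[n]) || s[n] == '_')
        n++;
    return n;
}
```

For the question's input `D12*/z` this stops at the `*`, so the token is `D12`.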

A numeral must be part of a number. The rules are a bit complex: a leading 0 means octal (obsolete), a leading 0x means hexadecimal, and a leading 1-9 means decimal. Suffixes are allowed, as is scientific notation. But arbitrary punctuation or whitespace terminates the numeral.
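
A simplified sketch of finding the end of a number. `number_length` is a hypothetical helper and is deliberately lenient: it does not validate octal digits, it accepts any run of `u`, `l` and `f` suffix letters, and it ignores hexadecimal floating-point literals.

```c
#include <ctype.h>
#include <stddef.h>

/* Return the length of the numeric literal starting at s, or 0 if s does
   not start with a digit. */
size_t number_length(const char *s)
{
    size_t n = 0;

    if (!isdigit((unsigned char)s[0]))
        return 0;

    if (s[0] == '0' && (s[1] == 'x' || s[1] == 'X') && isxdigit((unsigned char)s[2])) {
        n = 2;                                        /* skip the 0x prefix */
        while (isxdigit((unsigned char)s[n])) n++;
    } else {
        while (isdigit((unsigned char)s[n])) n++;     /* integer part */
        if (s[n] == '.') {                            /* optional fraction */
            n++;
            while (isdigit((unsigned char)s[n])) n++;
        }
        if (s[n] == 'e' || s[n] == 'E') {             /* optional exponent */
            size_t m = n + 1;
            if (s[m] == '+' || s[m] == '-') m++;
            if (isdigit((unsigned char)s[m])) {       /* only if digits follow */
                while (isdigit((unsigned char)s[m])) m++;
                n = m;
            }
        }
    }
    /* Optional suffix letters such as 42UL or 1.5f. */
    while (s[n] == 'u' || s[n] == 'U' || s[n] == 'l' || s[n] == 'L' ||
           s[n] == 'f' || s[n] == 'F')
        n++;
    return n;
}
```

Note that the sign inside an exponent (`1.5e-3`) belongs to the number, while a leading minus does not.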

There are fiddly little rules for unary -, ++, <=, += and other compounds. But these tokens don't have values attached to them. ++ is always ++.
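
One common way to handle the compounds is longest match: try the longer operators before the shorter ones. A sketch with a hypothetical `operator_length` helper and an illustrative subset of C's operators:

```c
#include <stddef.h>
#include <string.h>

/* Return the length of the operator starting at s, or 0 if there is none. */
size_t operator_length(const char *s)
{
    /* Longest operators first, so that "++" wins over "+".
       This table is an illustrative subset, not the full C set. */
    static const char *const ops[] = {
        "<<=", ">>=",
        "++", "--", "+=", "-=", "*=", "/=", "<=", ">=", "==", "!=",
        "&&", "||", "<<", ">>", "->",
        "+", "-", "*", "/", "%", "=", "<", ">", "!", "&", "|"
    };
    size_t i;

    for (i = 0; i < sizeof ops / sizeof ops[0]; i++) {
        size_t len = strlen(ops[i]);
        if (strncmp(s, ops[i], len) == 0)
            return len;
    }
    return 0;
}
```

The question's examples count `++` as two `+` tokens; removing the two-character entries from the table gives that behaviour. A lexer built this way emits `-` the same way in both roles and leaves the unary/binary decision to the parser.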

Strings are the next big problem, because quotes can be escaped.
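
A sketch of finding the closing quote while respecting escapes. `string_length` is a hypothetical helper; it treats an unterminated literal as an error and returns 0.

```c
#include <stddef.h>

/* Return the length of the string literal starting at s, including both
   quotes, or 0 if s does not start with '"' or the literal never closes. */
size_t string_length(const char *s)
{
    size_t n = 1;

    if (s[0] != '"')
        return 0;
    while (s[n] != '\0' && s[n] != '\n') {
        if (s[n] == '\\' && s[n + 1] != '\0') {
            n += 2;              /* skip the backslash and the escaped character */
            continue;
        }
        if (s[n] == '"')
            return n + 1;        /* closing quote found */
        n++;
    }
    return 0;                    /* ran off the line or the input */
}
```

Skipping two characters on a backslash is what keeps `\"` from ending the string early.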

But that's about it. It's not that hard to hand build a lexer for C source.
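
For the much smaller token set in the question (identifiers, numbers, one-character operators), a hand-built splitter might look like the sketch below. `tokenize` and `MAX_TOKEN_LEN` are names invented here, and every non-alphanumeric character is assumed to be a token of its own, which matches the question's expected output.

```c
#include <ctype.h>
#include <string.h>

#define MAX_TOKEN_LEN 32   /* assumed limit for this sketch */

/* Split src into tokens; returns the number of tokens stored (at most max). */
int tokenize(const char *src, char tokens[][MAX_TOKEN_LEN], int max)
{
    int count = 0;
    size_t i = 0;

    while (src[i] != '\0' && count < max) {
        size_t start, len;

        if (isspace((unsigned char)src[i])) {   /* whitespace is never a token */
            i++;
            continue;
        }
        start = i;
        if (isalnum((unsigned char)src[i]) || src[i] == '_') {
            /* identifier or number: runs until the first other character */
            while (isalnum((unsigned char)src[i]) || src[i] == '_')
                i++;
        } else {
            i++;                                /* one-character operator */
        }
        len = i - start;
        if (len >= MAX_TOKEN_LEN)
            len = MAX_TOKEN_LEN - 1;            /* truncate overlong tokens */
        memcpy(tokens[count], src + start, len);
        tokens[count][len] = '\0';
        count++;
    }
    return count;
}
```

Each entry of `tokens` can then be passed to the automaton one at a time.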

(See MiniBasic to understand how to write a simple but fully featured recursive descent parser for a simple language. https://sourceforge.net/projects/minibasic/files/?source=navbar )