Convert HTML table into plain text using Lex and Yacc

I have some HTML table code that needs to be converted into plain text using the Flex utility on Linux.
I've come up with a list of tokens in my .lex file, which are as follows:

    OPENTABLE       <table>
    CLOSETABLE      </table>
    OPENROW         <tr>
    CLOSEROW        </tr>
    OPENHEADING     <th>
    CLOSEHEADING    </th>
    OPENDATA        <td>
    CLOSEDATA       </td>
    STRING          [0-9a-zA-Z]*
    %%
    %%

My CFG (translation scheme included) for the HTML parser looks like this:

    TABLE     -->   OPENTABLE ROWLIST CLOSETABLE    ;
    ROWLIST   -->   ROWLIST ROW | ^                 ;
    ROW       -->   OPENROW DATALIST CLOSEROW       printf("\n");
    DATALIST  -->   DATALIST DATA | ^               ;
    DATA      -->   OPENDATA STRING CLOSEDATA       printf("%s\t", yytext);
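
For comparison, the translation scheme above can be sketched by hand in plain C, without Lex or Yacc. This is only an illustration under stated assumptions: `skip_ws`, `accept_tag`, `append`, and `table_to_text` are hypothetical names, not part of any library. The left-recursive ROWLIST and DATALIST productions are written as loops, whitespace between tokens is skipped, and the output goes into a caller-supplied buffer instead of being printed.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Skip leading whitespace and return the first non-blank position. */
static const char *skip_ws(const char *s)
{
    while (isspace((unsigned char)*s))
        s++;
    return s;
}

/* If the input at *p (after whitespace) starts with tag, consume it. */
static int accept_tag(const char **p, const char *tag)
{
    const char *s = skip_ws(*p);
    size_t n = strlen(tag);
    if (strncmp(s, tag, n) != 0)
        return 0;
    *p = s + n;
    return 1;
}

/* Append n bytes of s to the NUL-terminated buffer out if they fit. */
static void append(char *out, size_t cap, const char *s, size_t n)
{
    size_t len = strlen(out);
    if (len + n < cap) {
        memcpy(out + len, s, n);
        out[len + n] = '\0';
    }
}

/* TABLE --> OPENTABLE ROWLIST CLOSETABLE
 * Returns 0 on success, -1 if the input does not fit the grammar. */
int table_to_text(const char *html, char *out, size_t cap)
{
    const char *p = html;
    if (cap == 0)
        return -1;
    out[0] = '\0';
    if (!accept_tag(&p, "<table>"))
        return -1;
    while (accept_tag(&p, "<tr>")) {          /* ROWLIST: zero or more ROWs */
        while (accept_tag(&p, "<td>")) {      /* DATALIST: zero or more DATAs */
            const char *s = skip_ws(p);       /* STRING: [0-9a-zA-Z]* */
            const char *e = s;
            while (isalnum((unsigned char)*e))
                e++;
            append(out, cap, s, (size_t)(e - s));
            append(out, cap, "\t", 1);        /* DATA action: text, then tab */
            p = e;
            if (!accept_tag(&p, "</td>"))
                return -1;
        }
        if (!accept_tag(&p, "</tr>"))
            return -1;
        append(out, cap, "\n", 1);            /* ROW action: end the line */
    }
    return accept_tag(&p, "</table>") ? 0 : -1;
}
```

Like the grammar, this sketch handles only `<td>` cells; `<th>` cells would need an extra alternative inside the inner loop.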

I've seen some examples, but I don't understand what I should write in the rules section of my .lex file.

There is 1 answer

Best answer, by Syed Ali Hamza

I spent some time on the basics and figured it out. Flex's info page was a great help. Here is the required file. It works, but still needs improvement.

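    %option noyywrap
    %{
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Return a heap-allocated copy of str without its 4-character opening tag
       ("<td>" or "<th>") and its 5-character closing tag. The caller frees it.
       Every pattern that calls this matches at least 11 characters. */
    static char *substring(const char *str)
    {
        size_t n = strlen(str) - 9;
        char *str2 = malloc(n + 1);
        if (str2 == NULL) {
            perror("malloc");
            exit(EXIT_FAILURE);
        }
        memcpy(str2, str + 4, n);
        str2[n] = '\0';   /* terminate the copy */
        return str2;
    }

    /* Print one cell followed by a tab, then release the copy.
       Note: each "." in the cell patterns consumes exactly one character,
       so a cell must hold at least two characters to match. */
    static void print_cell(const char *match)
    {
        char *s = substring(match);
        printf("%s\t", s);
        free(s);
    }
    %}
    OPENTABLE "<table>"
    CLOSETABLE "</table>"
    OPENROW "<tr>"
    CLOSEROW "</tr>"
    OPENHEADING "<th>"
    CLOSEHEADING "</th>"
    OPENDATA "<td>"
    CLOSEDATA "</td>"
    STRING [a-zA-Z0-9]*
    %%
    {OPENDATA}.{STRING}.{CLOSEDATA}        print_cell(yytext);
    {OPENHEADING}.{STRING}.{CLOSEHEADING}  print_cell(yytext);
    {CLOSEROW}                             printf("\n");
    .|\n                                   ;
    %%
    int main(int argc, char **argv)
    {
        /* Read from the file named on the command line, else from stdin. */
        if (argc > 1) {
            yyin = fopen(argv[1], "r");
            if (yyin == NULL) {
                perror(argv[1]);
                return EXIT_FAILURE;
            }
        }
        yylex();
        return 0;
    }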
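
One of the improvements still needed is that the fixed offsets (4 and 5) depend on the exact tag lengths. A minimal sketch of an alternative, assuming the matched text always has the form `<tag>text</tag>`: the hypothetical helper `cell_text` below (not part of Flex) finds the first `>` and the last `<` instead of counting characters, and also trims surrounding whitespace.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a newly allocated, NUL-terminated copy of the
 * text between the first '>' and the last '<' of match, with surrounding
 * whitespace trimmed. Returns NULL if match is not of the form
 * <tag>text</tag> or if allocation fails. The caller frees the result. */
char *cell_text(const char *match)
{
    const char *start = strchr(match, '>');
    const char *end = strrchr(match, '<');
    if (start == NULL || end == NULL || end <= start)
        return NULL;

    start++;  /* step past the '>' that closes the opening tag */
    while (start < end && isspace((unsigned char)*start))
        start++;
    while (end > start && isspace((unsigned char)end[-1]))
        end--;

    size_t n = (size_t)(end - start);
    char *copy = malloc(n + 1);
    if (copy == NULL)
        return NULL;
    memcpy(copy, start, n);
    copy[n] = '\0';
    return copy;
}
```

In a Flex action it could be used in place of `substring`, for example `char *s = cell_text(yytext); if (s) { printf("%s\t", s); free(s); }`.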