How to search for a string pattern inside HTML, in C?

I need to search for titles (strings) inside an HTML file. For this, I used strstr to find the "li" tag, which contains the attribute title=", which in turn holds the string that I want.

For example, from the line below I need to get the name of the book inside title. However, I need all the titles inside the HTML body, which has hundreds of them.

<li><i><a href="/wiki/Animal_Farm" title="Animal Farm">A Revolução dos Bichos</a></i> (<a href="/wiki/1945" title="1945">1945</a>), de <a href="/wiki/George_Orwell" title="George Orwell">George Orwell</a>.</li>

I was trying to run a "for" loop, using strlen for its loop condition (the line length). Inside this loop, I used strstr to find title=" and then copy the string up to the closing quotation mark.

Something like this:

for (i=0, i<len, i++){
    if(strstr(array[i] == " title=\""){
        do{
    temp[i] = array[i];
          }while((strcmp(array[i], "\""));
    }
}

That's the point I struggled with. How do I extract strings inside strings using patterns (conditions)? Any suggestions?

Thank you in advance! Regards.

Answer by Jongware (best answer)

Parsing HTML "the right way" is far more complicated than checking for one string at a time. My code below gets more things wrong than right, but part of this is due to a lack of information.

Is your HTML well-formed? Can the title attribute contain the strings li or title, or stray < or > characters? Do you need to take into account that spaces may occur inside tags, such as < li >? Are all attributes written with double quotes ", or can there be single quotes ' as well?
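To illustrate the quoting question only (this is a separate sketch, not the program discussed below): a lookup that accepts either quote style could look roughly like this. The names same_prefix and attr_value are made up for the example, and it assumes no > appears inside an attribute value and that unquoted values can be ignored.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* ASCII case-insensitive test: do the first n bytes of a match b?
   Assumes n <= strlen(b), so a shorter a simply fails to match. */
int same_prefix(const char *a, const char *b, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (tolower((unsigned char)a[i]) != tolower((unsigned char)b[i]))
            return 0;
    return 1;
}

/*
 * Look inside one tag (tag points just past its '<') for name="..." or
 * name='...' and copy the value into out (truncated to outsz-1 bytes).
 * Returns 1 if the attribute was found, 0 otherwise.
 */
int attr_value(const char *tag, const char *name, char *out, size_t outsz)
{
    size_t nlen = strlen(name);
    const char *p;

    if (outsz == 0)
        return 0;
    for (p = tag; *p && *p != '>'; p++) {
        const char *v, *end;
        char quote;
        size_t len;

        /* An attribute name must be preceded by whitespace. */
        if (!isspace((unsigned char)*p) || !same_prefix(p + 1, name, nlen))
            continue;
        v = p + 1 + nlen;
        if (*v != '=')
            continue;
        quote = v[1];
        if (quote != '"' && quote != '\'')
            continue;               /* unquoted values are not handled */
        v += 2;                     /* first character of the value */
        end = strchr(v, quote);     /* matching closing quote */
        if (!end)
            return 0;
        len = (size_t)(end - v);
        if (len >= outsz)
            len = outsz - 1;
        memcpy(out, v, len);
        out[len] = '\0';
        return 1;
    }
    return 0;
}
```

Remembering which quote opened the value and searching for that same character is what lets one routine cover both styles; anything beyond that (entities, unquoted values, > inside values) is where a real parser earns its keep.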

My code shows the general idea of HTML parsing: hop from one < to the next and inspect the tag that follows it. But as you can see, it's ugly as hell and, while it "does the job", it's nigh on unmaintainable.

For a quick rush job within well-defined parameters, it'll probably do; for everything else, look for a general HTML parsing library, which will shield you from the caveats mentioned above and provide a user-friendly interface to elements and attributes.

#include <stdio.h>
#include <string.h>
#include <strings.h>   /* strncasecmp (POSIX) */
#include <ctype.h>

int main()
{
    char str[] = "<li><i><a href=\"/wiki/Animal_Farm\" title=\"Animal Farm\">A Revolução dos Bichos</a></i> (<a href=\"/wiki/1945\" title=\"1945\">1945</a>), de <a href=\"/wiki/George_Orwell\" title=\"George Orwell\">George Orwell</a>.</li>"
                "<li><i><a href=\"/wiki/Animal_Farm_II\" title=\"Animal Farm II: Return of the Hog\">A Revolução dos Bichos</a></i> (<a href=\"/wiki/1945\" title=\"1945\">1945</a>), de <a href=\"/wiki/George_Orwell\" title=\"George Orwell\">George Orwell</a>.</li>";
    char *html_walker;
    html_walker = str;
    do
    {
        html_walker = strstr(html_walker, "<");
        if (!html_walker)
            break;
        /* Is this "LI"? */
        if (!strncasecmp(html_walker+1, "LI", 2) &&
            !isalnum((unsigned char)html_walker[3]))
        {
            /* Yes. Scan following HTML entries for 'title' until we find an "</LI>" */
            do
            {
                /* an "</LI>" code. Bye. */
                if (*html_walker == '<')
                {
                    html_walker++;
                    /* html_walker already points past '<', so compare from offset 0 */
                    if (!strncasecmp(html_walker, "/LI", 3) &&
                        !isalnum((unsigned char)html_walker[3]))
                    {
                        while (*html_walker && *html_walker != '>')
                            html_walker++;
                        if (*html_walker == '>')
                            html_walker++;
                        break;
                    }
                    /* Not an "</LI>" code. Look for 'title' */
                    while (*html_walker && *html_walker != '>')
                    {
                        if (isspace ((unsigned char)*html_walker) &&
                            !strncasecmp(html_walker+1, "TITLE=\"", 7))
                        {
                            html_walker += 8;
                            printf ("title [");
                            while (*html_walker && *html_walker != '"')
                            {
                                printf ("%c", *html_walker);
                                html_walker++;
                            }
                            printf ("]\n"); fflush (stdout);
                            /* We found a title, so skip to next </LI> */
                            do
                            {
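                                char *next = strstr(html_walker, "<");
                                if (!next)
                                {
                                    /* No more tags: park on the terminating NUL so
                                       the later *html_walker tests stay valid */
                                    html_walker += strlen(html_walker);
                                    break;
                                }
                                html_walker = next;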
                                /* Is this "/LI"? */
                                if (!strncasecmp(html_walker+1, "/LI", 3) &&
                                    !isalnum((unsigned char)html_walker[4]))
                                    break;
                                html_walker++;
                            } while (html_walker && *html_walker);
                            break;
                        }
                        html_walker++;
                    }
                    if (*html_walker == '>')
                        html_walker++;
                } else
                {
                    html_walker++;
                }
            } while (*html_walker);
        } else
        {
            /* Skip forward to next '<' */
            html_walker++;
        }
    } while (html_walker && *html_walker);
    return 0;
}
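
For comparison, the idea from the question (find title=" with strstr, then copy up to the closing quote) can be written as a short loop. This is only a sketch, assuming every attribute uses double quotes and that the text title=" never shows up outside a real title attribute (it would also match something like subtitle="); next_title and print_titles are made-up names. Unlike the program above, it reports every title, not just the first one in each <li>.

```c
#include <stdio.h>
#include <string.h>

/*
 * Copy the next title="..." value found at or after *cursor into out
 * (at most outsz-1 bytes, always NUL-terminated) and advance *cursor
 * past the closing quote. Returns 1 on success, 0 when no further
 * complete title attribute exists.
 */
int next_title(const char **cursor, char *out, size_t outsz)
{
    static const char key[] = "title=\"";
    const char *start, *end;
    size_t len;

    if (outsz == 0)
        return 0;
    start = strstr(*cursor, key);
    if (!start)
        return 0;
    start += sizeof key - 1;        /* first character of the value */
    end = strchr(start, '"');       /* closing quote */
    if (!end)
        return 0;                   /* unterminated attribute: stop */
    len = (size_t)(end - start);
    if (len >= outsz)
        len = outsz - 1;            /* truncate rather than overflow */
    memcpy(out, start, len);
    out[len] = '\0';
    *cursor = end + 1;              /* resume after the closing quote */
    return 1;
}

/* Print every title value in html, one per line; returns how many were found. */
int print_titles(const char *html)
{
    char title[256];
    int count = 0;

    while (next_title(&html, title, sizeof title)) {
        printf("title [%s]\n", title);
        count++;
    }
    return count;
}
```

Calling print_titles(str) on the sample string should list each title attribute in order, including the ones on the year and author links. The key difference from the loop in the question is that strstr is called on the remaining text rather than compared character by character, and the cursor moves past each match so the same title is never found twice.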