I need to search for titles (strings) inside an HTML file. For this, I used `strstr` to find the `li` tag, which contains the attribute `title="`, which in turn holds the string that I want.
For example: using the array below, I need to get the name of the book, which is inside `title`. However, I need all the titles inside the HTML body, and there are hundreds of them.
<li><i><a href="/wiki/Animal_Farm" title="Animal Farm">A Revolução dos Bichos</a></i> (<a href="/wiki/1945" title="1945">1945</a>), de <a href="/wiki/George_Orwell" title="George Orwell">George Orwell</a>.</li>
I was trying to run a `for` loop, using `strlen` for its loop condition (the line length). Inside this loop, I used `strstr` to find `title="` and then copy the string up to the closing quotation mark.
Something like this:
for (i=0, i<len, i++){
if(strstr(array[i] == " title=\""){
do{
temp[i] = array[i];
}while((strcmp(array[i], "\""));
}
}
That's the point I struggled with. How can I get strings inside strings using patterns (conditions)? Any suggestions?
Thank you in advance! Regards.
HTML parsing "the right way" is way more complicated than checking for one string at a time. My code below gets more things wrong than right -- but part of this is due to a lack of information.
Is your HTML well-formed? Can the `title` attribute contain the strings `li` or `title`, or stray `<` or `>` characters? Do you need to take into account that spaces may occur inside tags, such as `< li >`? Are all attributes written with double quotes `"`, or can there be single quotes `'` as well?

My code shows the general idea of HTML parsing: hop from one `<` to the next and inspect the HTML command that follows it. But as you can see, it's ugly as hell and, while it "does the job", it's nigh on unmaintainable.

For a quick rush job within well-defined parameters, it'll probably do; for all others, look for a general HTML parsing library, which will shield you from the caveats mentioned above and provide a user-friendly interface to elements and attributes.
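
As a rough illustration of the hop-from-`<`-to-`<` idea, here is a minimal sketch, not a robust parser. It assumes the whole HTML text is in one NUL-terminated buffer, that tags are written without extra spaces (`<li>`, not `< li >`), that attributes use double quotes, and that no stray `<` or `>` appears inside attribute values. It also assumes the book name is the first `title` attribute inside each `<li>`, as in the sample line from the question. The names `extract_li_titles` and `MAX_TITLE` are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_TITLE 128  /* hypothetical limit; longer titles are truncated */

/* Stores the value of the first title="..." attribute found inside each
   <li>...</li> element of html into out[], one string per element.
   Returns the number of titles stored (at most max_titles). */
static size_t extract_li_titles(const char *html,
                                char out[][MAX_TITLE], size_t max_titles)
{
    static const char key[] = " title=\"";
    size_t count = 0;
    int in_li = 0;          /* inside an <li> whose title is still wanted */
    const char *p = html;

    assert(html != NULL && out != NULL);

    /* Hop from one '<' to the next and inspect the tag that follows. */
    while (count < max_titles && (p = strchr(p, '<')) != NULL) {
        const char *tag_end = strchr(p, '>');
        if (tag_end == NULL)
            break;          /* unterminated tag: give up */

        if (strncmp(p, "<li>", 4) == 0 || strncmp(p, "<li ", 4) == 0) {
            in_li = 1;
        } else if (strncmp(p, "</li>", 5) == 0) {
            in_li = 0;
        } else if (in_li) {
            /* Only accept a title attribute that lies inside this tag. */
            const char *attr = strstr(p, key);
            if (attr != NULL && attr < tag_end) {
                const char *value = attr + strlen(key);
                const char *quote = strchr(value, '"');
                if (quote != NULL && quote < tag_end) {
                    size_t len = (size_t)(quote - value);
                    if (len >= MAX_TITLE)
                        len = MAX_TITLE - 1;
                    memcpy(out[count], value, len);
                    out[count][len] = '\0';
                    count++;
                    in_li = 0;  /* skip the remaining titles of this <li> */
                }
            }
        }
        p = tag_end + 1;
    }
    return count;
}
```

To use it, read the file into a buffer, declare something like `char titles[500][MAX_TITLE];`, and call `extract_li_titles(buffer, titles, 500)`. For the sample line in the question this should store `Animal Farm` and skip the `1945` and `George Orwell` titles. If any of the assumptions above does not hold for your input, the sketch will silently miss or mangle titles, which is exactly why a real HTML parsing library is the safer choice.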