P.S. Suppose the file contains only a numeric entity for n, with nothing before or after it (to keep things as simple as possible). readCharacter() would return the correctly decoded 'n' character, but it would also have reached the end of the file. So getTagContent() would return the empty string, which is not the expected result.
P.S. 2 I found a solution, but it doesn't look very neat to me. The if inside the while loop of getTagContent() could look like this:
    if (ch == '<' || is.eof())
    {
        if (ch != EOF && ch != '<')
        {
            tagContent[i++] = ch;
        }
        break;
    }
I am trying to achieve the following:
We have an HTML tag content, e.g. let the tag be <th>some value</th>.
When I invoke getTagContent(), is.get() returns the 's' symbol, i.e. the first character of the content (I have handled that).
I would also like to handle character entity references, so that some value can be written with one or more of its characters replaced by numeric entities and still be read back as some value. That's what the readCharacter() method is for.
    char* getTagContent(std::istream& is, int maxTagContentLength)
    {
        char* tagContent = new char[maxTagContentLength + 1];
        int i = 0;
        char ch;
        while (true)
        {
            ch = readCharacter(is);
            if (ch == '<' || is.eof())
            {
                break;
            }
            tagContent[i++] = ch;
        }
        tagContent[i] = '\0';
        return tagContent;
    }

    char readCharacter(std::istream& is)
    {
        char ch = is.get();
        if (ch == '&' && is.peek() == '#')
        {
            is.get();
            char charEntityRef;
            int number = 0;
            while (true)
            {
                charEntityRef = is.get();
                if (is.eof())
                {
                    break;
                }
                if (!isDigit(charEntityRef))
                {
                    is.unget();
                    break;
                }
                number = number * 10 + charEntityRef - '0';
            }
            ch = (char)(number);
        }
        return ch;
    }
I came across a problem, though. Imagine the content is the string nineteen with its last n written as a numeric entity, and nothing after it. My code returns ninetee, without the last n. In the last iteration of the while loop in getTagContent(), readCharacter() does return exactly the missing 'n', but the eof bit is raised inside readCharacter() while it looks for more digits, so the character is never written to the result (we exit the loop through the break statement).
I don't see how to fix this without messing up the logic (e.g. we need to stop exactly when we meet an opening '<', as that's where the tag content ends and the closing tag probably follows).
There are many problems with your code:
- you are not handling EOF correctly.
- you are not handling the terminating ; at the end of an entity correctly. It is part of the entity and should not be put back into the input stream.
- you are handling only entities that are decimal codes, but not entities that are hex codes or names.
- you have a buffer overflow in getTagContent() if the content is more than maxTagContentLength characters in length.
- getTagContent() will break prematurely if the content contains an entity for '<' (like &#60;). You need to check if a read character is the terminating '<' at the end of the content before you check for any entities in the content.
With that said, try something more like this:
That being said, this is not a good way to parse HTML. You really should be using an actual HTML parser library. But if you can't or won't, then at least read the HTML into a larger memory buffer that you can tokenize more easily, instead of processing one char at a time.