My question is: what would be the best technology for detecting hierarchical or tree patterns?
I want to recognise the parts of an HTML page, for example: the user login menu, the navigation menu, the content body, the footer, etc.
I'm trying a grammar-based recogniser that I implemented myself in PHP, using a DOM parser (DOMDocument) to walk the HTML. I don't like the classical tools such as Lex and Yacc for this job, because they don't take the meaning of the HTML data into account.
I'm having trouble because of the variability in how the data can be represented visually in HTML. For example, a menu can be implemented with <ul><li><a href=#>Link1</a><li>Link2....</ul>, but that is only one possibility out of hundreds. It also depends on CSS and on JavaScript event handlers (onclick, onmouseover). It is also hard to tell a real menu from a fake one.
I was thinking about training a neural network, but all the examples I found are for linear data, not hierarchical data. I tried training some networks, but it's obvious that they lose the relationship information between the DOM tree elements. Or maybe I just don't know how to do it better.
My pattern-recognition grammar gives poor results because it doesn't tolerate possible "accidents" in the HTML and doesn't smooth the recognition: it's too strict (not fuzzy).
Any ideas?
One possible approach would be to have an array of many (10-20) different regexps or other detection methods, see which of them the candidate fragment matches, weight each one according to how often it is correct, and compare the weighted total to a threshold value. Or you could take the individual results and use a neural network to choose, if you like them.
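To make the weighted-detector idea concrete, here is a minimal sketch. It is in Python for brevity, but the same logic should port to PHP. The four detectors, their weights, and the 0.6 threshold are all made up for illustration; in practice the weights would come from measuring how often each detector is right on pages you have labelled by hand.

```python
import re

# Hypothetical detectors: (name, weight, predicate on an HTML fragment).
# The weights are illustrative guesses, not measured values.
DETECTORS = [
    ("has_list", 0.8,
     lambda html: re.search(r"<(ul|ol)\b", html, re.I) is not None),
    ("many_links", 0.9,
     lambda html: len(re.findall(r"<a\b", html, re.I)) >= 3),
    ("nav_hint", 0.7,
     lambda html: re.search(
         r"""(class|id)\s*=\s*["'][^"']*(nav|menu)""", html, re.I) is not None),
    ("short_text", 0.5,
     lambda html: len(re.sub(r"<[^>]+>", "", html).strip()) < 200),
]


def menu_score(html):
    """Return the weighted fraction of detectors that fire, from 0.0 to 1.0."""
    total = sum(weight for _, weight, _ in DETECTORS)
    hit = sum(weight for _, weight, test in DETECTORS if test(html))
    return hit / total


def looks_like_menu(html, threshold=0.6):
    """Fuzzy decision: no single detector has to match for a fragment to pass."""
    return menu_score(html) >= threshold


menu = ('<ul class="nav"><li><a href="#">Home</a>'
        '<li><a href="#">News</a><li><a href="#">About</a></ul>')
paragraph = "<p>" + "Lorem ipsum " * 30 + '<a href="#">link</a></p>'

print(looks_like_menu(menu))       # all four detectors fire
print(looks_like_menu(paragraph))  # none of the detectors fire
```

Because the decision is a weighted sum and not a strict grammar match, a fragment that misses one or two cues can still pass, which is the kind of tolerance for "accidents" the question asks for. The same per-detector results (0 or 1 each) could also be fed to a classifier as a feature vector instead of being summed.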