Trying to grab information in Child Link using WebHarvest


I would like to grab the information from each child link, but the program throws an error. My full config file is below. The error is:

Caused by: org.xml.sax.SAXParseException; lineNumber: 6; columnNumber: 724; Element type "t.length" must be followed by either attribute specifications, ">" or "/>".

<?xml version="1.0" encoding="UTF-8"?>

<config>
    <var-def name="webpage">
        <html-to-xml>
            <http url="http://www.thestar.com.my/business/" />
        </html-to-xml>
    </var-def>

    <loop item="TheStarBiz"  index="i">
        <list>
            <xpath expression="//div[@class='nine columns mobile3']">
                <var name="webpage"></var>
            </xpath>
        </list>

        <body>
            <var-def name="title">
                <xpath expression="(//p[@class='m'])/a/text()">
                    <var name="TheStarBiz"></var>
                </xpath>
            </var-def>

            <var-def name="link">
                <xpath expression="//p[@class='m']/a/@href">
                    <var name="TheStarBiz"></var>
                </xpath>
            </var-def>



            <var-def name="new_url">
                <xquery>
                    <xq-param name="TheStarBiz"><var name="TheStarBiz"/></xq-param>
                    <xq-expression><![CDATA[
                        declare variable $TheStarBiz as node() external;
                        let $url := data($TheStarBiz//p[@class='m']/a/@href)
                        return $url
                    ]]></xq-expression>
                </xquery>
            </var-def>

            <var-def name="new_page_content">
                <http url="${new_url}"/>
            </var-def>



            <var-def name="fulldesc">
                <xpath expression="//div[@class='story']"> 
                    <var name="new_page_content"/>
                </xpath>
            </var-def>


            <var-def name="textfile">
                <file action="append" type="text" path="C:\Users\jacey\Desktop\WebHarvest\test.txt">
                    <template>
                        ${title} ${sys.cr}${sys.lf} 
                        ${link} ${sys.cr}${sys.lf} 
                        ${new_page_content} ${sys.cr}${sys.lf}
                    </template>
                </file>
            </var-def>
        </body>
    </loop>
</config>

There is 1 answer

xdhmoore

For those who come after:

I had almost the same error, and it was caused by a snippet of JavaScript in the file being parsed:

blah...for(var o=0;o<t.length;o++)...blah

In hindsight it's kind of obvious: the XML parser reads the "<" in o<t.length as the start of an element named t.length, then fails because what follows is not an attribute, ">" or "/>". In our case, this happened because the endpoint was no longer returning XML but HTML. If the file you want really does contain JavaScript, you can wrap the JS in CDATA tags like so:

<script>
    /* <![CDATA[ */
        console.log(myJavaScriptCode < theBest);
    /* ]]> */
</script>
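
Applied to the config in the question, a likely cause is that new_page_content holds the raw HTML of the child page, which the xpath processor then tries to parse as XML. Here is a minimal sketch of one possible fix: wrap the http call in html-to-xml, as the question already does for the first page. It assumes the child pages are ordinary HTML and that new_url resolves to a single absolute URL; neither has been verified against the live site.

```xml
<var-def name="new_page_content">
    <!-- Convert the fetched HTML to well-formed XML before any xpath runs on it -->
    <html-to-xml>
        <!-- Assumption: ${new_url} is a single absolute URL -->
        <http url="${new_url}"/>
    </html-to-xml>
</var-def>

<var-def name="fulldesc">
    <!-- The xpath processor now receives cleaned XML instead of raw HTML -->
    <xpath expression="//div[@class='story']">
        <var name="new_page_content"/>
    </xpath>
</var-def>
```

If the href values turn out to be relative, or the xquery returns several links at once, new_url would need to be made absolute and processed one link at a time before it is passed to http.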