I want to collect pictures from Google image search, but I keep getting an error.
For example, the URL https://www.google.com/search?q=banana&hl=en&gws_rd=ssl&tbm=isch works fine in my browser, but Web-Harvest reports: the reference to entity "gws_rd" must end with the ';' delimiter.
I guess '&' is a special character in Web-Harvest, but I cannot find any information about it. Can you figure out why?
This is the code:
<var-def name="search" overwrite="false">banana</var-def>
<var-def name="url"><template>http://images.google.com/images?q=${search}&hl=en</template></var-def>
<var-def name="xml">
<html-to-xml>
<http url="${url}"/>
</html-to-xml>
</var-def>
<var-def name="largeImgUrl">
<xpath expression="//*[@id='irc_cc']/div[4]/div[1]/div/div[2]/div[1]/a/img">
<var name="xml"/>
</xpath>
</var-def>
From experience, you need to store the URL in a variable first, and then refer to that variable from within the http processor call.
EDIT
I notice you have pasted your code. Good.
1) Remember that all Web-Harvest config files are written in XML, and the ampersand (&) is a special character in XML: it starts an entity reference. That is why the parser reads the text after "&" (here "gws_rd") as an entity name and expects a closing ';'.
In Web-Harvest I normally avoid this issue by using CDATA sections within <template> or <code> blocks.
2) When using the Web-Harvest graphical interface, you can easily debug your XPath expressions. Run your code as normal, and then click the magnifying glass icon on the toolbar at the top. Then choose "xml" (the name of the variable you set). This opens a new window with a preview of your XML. Make sure the "view as" dropdown is set to xml.
You should now have an "xpath expression" box where you can test your XPath.
3) I strongly discourage writing XPath expressions that refer to numbered elements (e.g. div[4]/div[1]/div/div[2]/div[1]/). Any small change in the underlying page usually breaks the code. It is much better to select elements based on id or other attributes.
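To illustrate points 1) and 3), here is a rough sketch of how the config from the question might look with the URL wrapped in a CDATA section and a less position-dependent XPath. The XPath is only an assumption on my part: it keeps the id from the question but I have not checked it against the current Google page, and it may match more than one image.

```xml
<var-def name="search" overwrite="false">banana</var-def>

<!-- CDATA makes the XML parser treat '&' as plain text.
     Writing the ampersand as &amp; would also work. -->
<var-def name="url"><template><![CDATA[http://images.google.com/images?q=${search}&hl=en]]></template></var-def>

<var-def name="xml">
    <html-to-xml>
        <http url="${url}"/>
    </html-to-xml>
</var-def>

<!-- Hypothetical selector: anchors on the id and uses the descendant
     axis (//) instead of numbered div steps. Adjust to the real page. -->
<var-def name="largeImgUrl">
    <xpath expression="//*[@id='irc_cc']//a/img">
        <var name="xml"/>
    </xpath>
</var-def>
```

You can paste the expression into the XPath test box described in 2) to see what it actually matches before relying on it.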