I have a script that extracts the URLs from a webpage, loads each link, and extracts the data I require. However, on further investigation I found that Selenium isn't always opening the link as it should and is duplicating information. For example, given these links:
    link 1 - Title: ABC <-- within link, extract 123
    link 2 - Title: DEF <-- within link, extract 456
    link 3 - Title: GHI <-- within link, extract 789
    link 4 - Title: JKL <-- within link, extract 000
The output should be as follows:

    ABC, 123
    DEF, 456
    GHI, 789
    JKL, 000

However, the output I get is as follows:

    ABC, 123
    ABC, 123
    GHI, 789
    JKL, 000

This behavior seems to be random.
Here is the code:

    elems = driver.find_elements_by_xpath(alllinks)
    for elem in elems:
        links.append(elem.get_attribute("href"))

    for url in links:
        try:
            time.sleep(0.5)
            driver.get(url)
            time.sleep(2)
            # (the extraction code follows here; not shown)
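To narrow down whether the browser is actually on the requested page when the extraction runs, one option is to record what the driver reports after each `driver.get(url)`. The following is only a diagnostic sketch, not my actual script: the helper names `visit_and_record` and `find_mismatches` are made up for illustration, and it assumes the site does not redirect (a redirect would also show up as a mismatch). It only uses `driver.get`, `driver.current_url` and `driver.title`.

```python
import time


def visit_and_record(driver, urls, pause=2.0):
    """Visit each URL and record what the browser reports after loading.

    `driver` is assumed to be a Selenium WebDriver (anything exposing
    get(), current_url and title works). Hypothetical helper.
    """
    records = []
    for url in urls:
        driver.get(url)
        time.sleep(pause)  # same kind of fixed wait as in the snippet above
        records.append((url, driver.current_url, driver.title))
    return records


def find_mismatches(records):
    """Return the records whose reported URL differs from the requested one.

    Trailing slashes are ignored, since the browser may add or drop them.
    """
    return [r for r in records if r[0].rstrip("/") != r[1].rstrip("/")]
```

If `find_mismatches(visit_and_record(driver, links))` comes back empty while the extracted data is still duplicated, the navigation itself is probably fine and the stale values are more likely coming from the extraction step.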
Has anyone experienced this type of behavior?
An update on this: I scraped just the URLs three times and compared the results with each other. The URLs are unique and match what is on the site. From what I can see, the problem is in the way Selenium loads the URLs from the list.
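For anyone who wants to repeat that check, it can be sketched roughly as below. This is a simplified illustration rather than the exact code I ran: the function names `duplicated_urls` and `runs_agree` are hypothetical, and the URLs in the usage example are placeholders.

```python
from collections import Counter


def duplicated_urls(urls):
    """Return the URLs that occur more than once in a single scrape."""
    counts = Counter(urls)
    return [url for url, n in counts.items() if n > 1]


def runs_agree(runs):
    """Return True if every scrape produced the same URLs in the same order."""
    return all(run == runs[0] for run in runs[1:])


if __name__ == "__main__":
    # Placeholder data standing in for three separate scrapes of the link list.
    run1 = ["http://example.com/abc", "http://example.com/def"]
    run2 = list(run1)
    run3 = list(run1)
    print(duplicated_urls(run1))
    print(runs_agree([run1, run2, run3]))
```

An empty list from `duplicated_urls` together with `True` from `runs_agree` is what I mean by the URLs being unique and consistent across runs.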