I have a script that extracts the URLs from a webpage, loads each link, and extracts the data I require. However, on further investigation I find that Selenium isn't opening the links as it should and is duplicating information, e.g. as follows:

link 1 - Title:ABC <-- within link, extract 123
link 2 - Title:DEF <-- within link, extract 456
link 3 - Title:GHI <-- within link, extract 789
link 4 - Title:JKL <-- within link, extract 000

The output should be as follows:

ABC, 123
DEF, 456
GHI, 789
JKL, 000

However, the output I get is as follows:

ABC, 123
ABC, 123
GHI, 789
JKL, 000

This behavior seems to be random.

Here is the code:

elems = driver.find_elements_by_xpath(alllinks)
for elem in elems:
    links.append(elem.get_attribute("href"))
    for url in links:
        try:
            time.sleep(0.5)
            driver.get(url)
            time.sleep(2)
            # ... title and data extraction, then write to file ...
        except NoSuchElementException:
            continue

Has anyone experienced this type of behavior?

UPDATE:

An update on this: I have scraped just the URLs 3 times and compared the results with each other. The URLs are unique and extracted correctly from the site. From what I can see, it's something about the way Selenium loads the URLs from the array.

2 Answers

qbbq (accepted answer):

Thought I'd provide the answer to my own question. It wasn't a problem with Selenium, but a problem with my code. I have a try/except block that executes if an element exists; if it does, the variable is assigned the extracted text. However, if the element didn't exist, the variable would still contain the text from the previous loop iteration, and that stale text was written to file.

To circumvent this, I added a del of the variable at the end of the for loop. There might be more elegant ways of doing this, but it does solve my problem.
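To illustrate, here is a minimal, self-contained sketch of that failure mode. The per-page element lookup is simulated with dict access on made-up data (the original page structure isn't shown); resetting the variable at the top of each iteration has the same effect as the del at the bottom of the loop:

```python
# Simulated pages: the second page lacks the data element (hypothetical data).
pages = [
    {"title": "ABC", "data": "123"},
    {"title": "DEF"},                  # the "element" is missing on this page
    {"title": "GHI", "data": "789"},
]

def scrape_buggy(pages):
    results = []
    for page in pages:
        try:
            value = page["data"]       # stands in for find_element(...).text
        except KeyError:
            pass                       # bug: 'value' keeps the previous page's text
        results.append((page["title"], value))
    return results

def scrape_fixed(pages):
    results = []
    for page in pages:
        value = None                   # reset each iteration; a 'del value' at the
        try:                           # end of the loop catches the same mistake
            value = page["data"]
        except KeyError:
            pass
        results.append((page["title"], value))
    return results

print(scrape_buggy(pages))  # [('ABC', '123'), ('DEF', '123'), ('GHI', '789')]
print(scrape_fixed(pages))  # [('ABC', '123'), ('DEF', None), ('GHI', '789')]
```

With del instead of the None reset, the buggy append would raise a NameError on the DEF page rather than silently writing stale data, which is why it surfaces the problem.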

supputuri:

If I understand your query correctly, you should load only the latest href, rather than iterating through all collected hrefs each time.

# assumes: from selenium.webdriver.support.ui import WebDriverWait
#          from selenium.webdriver.support import expected_conditions as EC
#          wait = WebDriverWait(driver, 10)

# collect all hrefs up front: once driver.get() navigates away,
# the WebElement references in elems become stale
elems = driver.find_elements_by_xpath(alllinks)
links = [elem.get_attribute("href") for elem in elems]

previousTitle = ''
for url in links:
    driver.get(url)
    # make sure to wait until the title has changed
    # (no issue unless 2 urls have the same title)
    wait.until_not(EC.title_is(previousTitle))
    previousTitle = driver.title