I have HTML pages that contain image tags with two 'src' attributes, and I want to use BeautifulSoup to extract the first 'src' rather than the second.
For example, when I use BeautifulSoup as follows:
from bs4 import BeautifulSoup
# The markup must be a string; this tag carries two src attributes.
html_doc = '<img class="lazy" src="https://hips.hearstapps.com/hmg-prod.s3.amazonaws.com/images/dog-puppy-on-garden-royalty-free-image-1586966191.jpg?crop=1.00xw:0.669xh;0,0.190xh&resize=980:*" src="https://www.mdf.qa/media/catalog/product/cache/1/image/800x800/9df78eab33525d08d6e5fb8d27136e95/b/l/black_2.jpg"/>'
soup = BeautifulSoup(html_doc, 'html.parser')
bs_images = soup.find_all('img')
for bs_image in bs_images:
    attrs = bs_image.attrs
    image_path = attrs['src']
The path I get is the second src, "https://www.mdf.qa/media/catalog/product/cache/1/image/800x800/9df78eab33525d08d6e5fb8d27136e95/b/l/black_2.jpg", but I need the first src: https://hips.hearstapps.com/hmg-prod.s3.amazonaws.com/images/dog-puppy-on-garden-royalty-free-image-1586966191.jpg?crop=1.00xw:0.669xh;0,0.190xh&resize=980:*
It seems that BeautifulSoup overwrites the first src with the second, so the first src is not stored anywhere in the parsed tag. I would suggest using a regex for this problem.
Here is the link to the src match. With re.search we get only the first match (with re.findall we would get all matches).
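As a rough sketch of the regex approach, something like the following could work. The pattern names and the helper first_src_per_img are made up for illustration, and the patterns assume double-quoted attribute values with no ">" inside them, so this is not a general HTML parser:

```python
import re

# Sample tag from the question, with two src attributes.
html_doc = (
    '<img class="lazy" '
    'src="https://hips.hearstapps.com/hmg-prod.s3.amazonaws.com/images/dog-puppy-on-garden-royalty-free-image-1586966191.jpg?crop=1.00xw:0.669xh;0,0.190xh&resize=980:*" '
    'src="https://www.mdf.qa/media/catalog/product/cache/1/image/800x800/9df78eab33525d08d6e5fb8d27136e95/b/l/black_2.jpg"/>'
)

# Illustrative patterns (assumptions, not part of any library):
# IMG_TAG matches one whole <img ...> tag; SRC_ATTR matches a src="..."
# attribute preceded by whitespace, so data-src="..." is not picked up.
IMG_TAG = re.compile(r'<img\b[^>]*>', re.IGNORECASE)
SRC_ATTR = re.compile(r'\ssrc\s*=\s*"([^"]*)"', re.IGNORECASE)


def first_src_per_img(html):
    """Return the first src value of every <img> tag found in html."""
    results = []
    for tag in IMG_TAG.finditer(html):
        # search() stops at the first src inside this tag.
        match = SRC_ATTR.search(tag.group(0))
        if match:
            results.append(match.group(1))
    return results


first_srcs = first_src_per_img(html_doc)  # one entry per <img> tag
all_srcs = SRC_ATTR.findall(html_doc)     # every src value, in document order
print(first_srcs[0])
```

Searching inside each tag separately keeps the "first src" tied to its own img element when a page contains several images.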