How to use Python's BeautifulSoup html parser to get img tag src with 2 'src' attributes

Question

356 views Asked by Noam Baron At 27 December 2024 at 09:58

I have HTML pages that contain img tags with two 'src' attributes, and I want to use BeautifulSoup to extract the first 'src' rather than the second.

For example, when I use BeautifulSoup as follows:

from bs4 import BeautifulSoup

html_doc = '<img class="lazy" src="https://hips.hearstapps.com/hmg-prod.s3.amazonaws.com/images/dog-puppy-on-garden-royalty-free-image-1586966191.jpg?crop=1.00xw:0.669xh;0,0.190xh&resize=980:*"                        src="https://www.mdf.qa/media/catalog/product/cache/1/image/800x800/9df78eab33525d08d6e5fb8d27136e95/b/l/black_2.jpg"/>'

soup = BeautifulSoup(html_doc, 'html.parser')
bs_images = soup.find_all('img')
for bs_image in bs_images:
    attrs = bs_image.attrs
    image_path = attrs['src']

The path I get is the second src, "https://www.mdf.qa/media/catalog/product/cache/1/image/800x800/9df78eab33525d08d6e5fb8d27136e95/b/l/black_2.jpg", but I need the first src: https://hips.hearstapps.com/hmg-prod.s3.amazonaws.com/images/dog-puppy-on-garden-royalty-free-image-1586966191.jpg?crop=1.00xw:0.669xh;0,0.190xh&resize=980:*

Original Q&A

There is 1 answer

**joc** · Answer 1 · 2020-05-10T15:06:58+00:00

It seems that BeautifulSoup overwrites the first src with the second, so the first value is not stored anywhere in the parsed tag. I would suggest using a regex for this problem.

import re

html_doc = '<img class="lazy" src="https://hips.hearstapps.com/hmg-prod.s3.amazonaws.com/images/dog-puppy-on-garden-royalty-free-image-1586966191.jpg?crop=1.00xw:0.669xh;0,0.190xh&resize=980:*"                        src="https://www.mdf.qa/media/catalog/product/cache/1/image/800x800/9df78eab33525d08d6e5fb8d27136e95/b/l/black_2.jpg"/>'

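# Find every <img ...> tag, then take the first src="..." inside each one.
bs_images = re.findall(r'<img[^<>]+>', html_doc)
for bs_image in bs_images:
    match = re.search(r'src="([^"]+)"', bs_image)
    if match:  # skip <img> tags that have no src attribute
        image_path = match.group(1)
        print(image_path)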

The pattern src="([^"]+)" captures the value of a src attribute. re.search returns only the first match in each tag, whereas re.findall would return all of them.
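If a regex feels too fragile for real pages, the same "first src wins" idea can be sketched with the standard library's html.parser.HTMLParser, which passes attributes to handle_starttag as a list of (name, value) pairs in source order, duplicates included. This is only a sketch and not part of the original answer: the class name FirstSrcCollector and the shortened example.com URLs are made up for illustration.

```python
from html.parser import HTMLParser


class FirstSrcCollector(HTMLParser):
    """Collect the first src value of every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.first_srcs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs in source order, so
        # duplicate attributes are all still present at this point.
        if tag != "img":
            return
        for name, value in attrs:
            if name == "src":
                self.first_srcs.append(value)
                break  # stop at the first src; later duplicates are ignored


# Shortened placeholder URLs, not the ones from the question.
html_doc = (
    '<img class="lazy" src="https://example.com/first.jpg" '
    'src="https://example.com/second.jpg"/>'
)

collector = FirstSrcCollector()
collector.feed(html_doc)
collector.close()
print(collector.first_srcs)  # ['https://example.com/first.jpg']
```

Beautiful Soup's documentation also describes an on_duplicate_attribute argument for the html.parser builder (added in version 4.9.1), where the value 'ignore' keeps the first attribute value instead of the last. Whether that option is available depends on the installed version, so it is worth checking before relying on it.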
