I am trying to get all the images from an .mht file using the Nokogiri gem. But since the .mht file uses quoted-printable encoding, all the image tags I get back contain stray characters:
<img alt='3D"AFC-Logo' src="3D%22https://upload.=" width='3D"75"' height='3D"75"'>
<img src="3D%22https://en.wikipedia.org/static/images/footer/wikimedia-butto=" width='3D"88"' height='3D"31"' alt='3D"Wikimedia'>
<img src="3D%22https://en.wikipedia.org/static/images/footer/poweredby_mediawiki_8=" alt='3D"Powered' width='3D"88"' height='3D"31"'>
This is the link to that .mht file: https://drive.google.com/file/d/1DtbgrFyCEcggAk1nqpZSluNhRt-k3t95/view?usp=sharing
And below is the code that I am using to get all the images from the .mht file:
require "nokogiri"

def get_image_links(html)
  html_doc = Nokogiri::HTML(html)
  nodes = html_doc.xpath("//img[@src]")
  raise "No <img .../> tags!" if nodes.empty?

  # Collect the src of every <img>, printing each node for debugging
  nodes.inject([]) do |uris, node|
    puts node.to_s
    uris << node.attr('src').strip
  end.uniq
end

html = File.read("1646037951.mht")
image_links = get_image_links(html)
I have tried decoding it with .unpack('M').first, but it still does not work: it just returns the same result as above.
Or maybe Rails has something built in for this?
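For reference, here is a minimal sketch of the decoding I expected to work, run on a tiny made-up quoted-printable fragment rather than the real file. The sample string and the `example.org` URL are assumptions for illustration only. I also assume the decoding has to happen on the raw markup before Nokogiri parses it, which I have not confirmed for the actual .mht file.

```ruby
# Hypothetical quoted-printable fragment, made up for illustration;
# it is not taken from the real .mht file.
encoded = "<img src=3D\"https://example.org/static/images/logo=\n.png\" width=3D\"75\">"

# String#unpack1("M") decodes quoted-printable: "=3D" becomes "=" and a
# trailing "=" followed by a newline (a soft line break) is dropped.
decoded = encoded.unpack1("M")

puts decoded
```

On this fragment the `3D` prefixes and the dangling `=` at the line break disappear, which is the result I would like to get for the real image tags.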