Integrate web image from rss feed to Ruby

275 views Asked by At

It may be unclear but I'll do my best. I'm currently using dashing, the dashboard designer (sinatra based) with the RSS widget. The thing is that I am unable to get the little image before each RSS item:

<description>
&lt;img style='vertical-align:middle' src='http://pitre-web.tpg.ch/images?ligne=D' title='Perturbation Line D' alt='Perturbation Line D' /&gt;
&lt;br/&gt;&lt;br/&gt;21:03 - THEME - Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
</description>

I know the code looks a bit strange but on the webpage all the stuff until 21:03 is ignored. How can I integrate the small logo to the page or at least get the line number (it's a bus line -> here it's D) in order to integrate in plain text in my widget? I don't know if that helps, but I am using nogokiri to fetch the XML from the RSS feed. So what could i put there to fetch this piece of information?

summary = clean_html( news_item.xpath('description').text )

Thanks in advance :)

1

There are 1 answers

4
the Tin Man On BEST ANSWER

The content of the <description> tag is HTML-encoded, so it needs to be decoded back to HTML, then reparsed:

require 'nokogiri'

doc = Nokogiri::XML::DocumentFragment.parse(<<EOT)
<description>
&lt;img style='vertical-align:middle' src='http://pitre-web.tpg.ch/images?ligne=D' title='Perturbation Line D' alt='Perturbation Line D' /&gt;
&lt;br/&gt;&lt;br/&gt;21:03 - THEME - Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
</description>
EOT

This is how to locate the tag:

description_text = doc.at('description')

To access its content use:

description_text = doc.at('description').text 
# => "\n<img style='vertical-align:middle' src='http://pitre-web.tpg.ch/images?ligne=D' title='Perturbation Line D' alt='Perturbation Line D' />\n<br/><br/>21:03 - THEME - Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.\n"

To do something with that content:

description_doc = Nokogiri::HTML::DocumentFragment.parse(description_text)
description_doc.at('img')['src'] # => "http://pitre-web.tpg.ch/images?ligne=D"

The real XML doesn't match what was given in the question. Here's a better example showing what is being encountered:

<?xml version='1.0' encoding='UTF-8'?>
<rss>
  <channel>
    <title />
    <description />
    <item>
      <description>
&lt;img style='vertical-align:middle' src='http://pitre-web.tpg.ch/images?ligne=2' title='Perturbation Ligne 2' alt='Perturbation Ligne 2' /&gt;
      &lt;br/&gt;&lt;br/&gt;18:47 - Surcharge de trafic - Retard de 8 minutes entre Marbriers et Gen&amp;egrave;ve-Plage.
      </description>
    </item>
    <item>
      <description>
&lt;img style='vertical-align:middle' src='http://pitre-web.tpg.ch/images?ligne=19' title='Perturbation Ligne 19' alt='Perturbation Ligne 19' /&gt;
      &lt;br/&gt;&lt;br/&gt;18:43 - Cimeti&amp;egrave;re Saint-Georges - direction Vernier-Village - Incident &amp;agrave; bord du v&amp;eacute;hicule - Immobilisation du v&amp;eacute;hicule
      </description>
    </item>
    </channel>
</rss>

Based on that, here's code that works to extract the URLs:

require 'nokogiri'
doc = Nokogiri::XML(open('xml'))
img_srces = doc.search('item description').map{ |description|
  desc_doc = Nokogiri::HTML(description.text)
  desc_doc.at('img')['src']
}
img_srces
# => ["http://pitre-web.tpg.ch/images?ligne=2",
#     "http://pitre-web.tpg.ch/images?ligne=19"]