Storing Media RSS and iTunes podcast RSS feeds in the database

I want to be able to store Media RSS and iTunes podcast RSS feeds in the database. The requirement is that I don't want to miss ANY element or attribute in a feed. It would make sense to find the most common elements across feeds and store them in the database as separate columns. The catch is that a feed can contain feed-specific elements that aren't standard. I want to capture those too, but since I don't know in advance what they might be, I can't give them dedicated columns.
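
To make that concrete, here's a rough sketch of the kind of catch-all extraction I have in mind, using Nokogiri; the KNOWN_TAGS list and the idea of a key-value "extras" table are just illustrations, not a final schema:

```ruby
require 'nokogiri'

# Tags that already have dedicated columns; everything else is "extra".
# This list is illustrative, not exhaustive.
KNOWN_TAGS = %w[title link description pubDate guid enclosure category].freeze

# Collect every unrecognized child element of an <item>, along with its
# attributes, so it can be saved as rows in a key-value "extras" table.
def extra_elements(item_node)
  item_node.elements.each_with_object([]) do |el, extras|
    # Keep namespace prefixes (e.g. "itunes:duration", "media:content").
    name = el.namespace && el.namespace.prefix ? "#{el.namespace.prefix}:#{el.name}" : el.name
    next if KNOWN_TAGS.include?(name)
    extras << {
      name:       name,
      value:      el.text,
      attributes: Hash[el.attribute_nodes.map { |a| [a.name, a.value] }]
    }
  end
end
```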

Currently I have two tables, feeds and feed_entries. For RSS 2.0 tags like enclosures and categories, I have separate tables associated with feeds/feed_entries. I am using Feedzirra to parse the feeds. Feedzirra requires you to declare the elements you want parsed, so there is no way to know whether a feed contains elements beyond the ones Feedzirra understands.
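
For example (if I'm reading the Feedzirra docs right), extra elements have to be registered ahead of time, and anything not registered is silently dropped; itunes:duration here is just an illustration:

```ruby
require 'feedzirra'

# Elements beyond Feedzirra's defaults must be declared up front;
# unregistered tags simply never show up on the parsed entries.
Feedzirra::Feed.add_common_feed_entry_element('itunes:duration', :as => :duration)

feed = Feedzirra::Feed.fetch_and_parse('http://example.com/podcast.xml')
puts feed.entries.first.duration
```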

What would be the best way to store these feeds in the database without missing a single bit of information? (Dumping the whole feed into the database as-is won't work, since we want to query most of the attributes.) Which parser would be the best fit? Feedzirra was chosen for performance; however, getting all of the feed's data into the database is the priority.

Update

I'm using MySQL as the database.

There are 2 answers

Lyndon

Store the XML as a CLOB; most databases have XML processing extensions that let you include XPath-type queries as part of a SELECT statement.
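
For example, since you mention MySQL: it has had an ExtractValue() function since 5.1 that evaluates an XPath expression inside an ordinary SELECT. A rough sketch via the mysql2 gem, assuming a hypothetical raw_feeds(id, xml) table:

```ruby
require 'mysql2'

client = Mysql2::Client.new(host: 'localhost', username: 'app', database: 'feeds')

# ExtractValue(xml_frag, xpath) pulls matching text out of the stored
# document; raw_feeds(id, xml MEDIUMTEXT) is a hypothetical staging table.
results = client.query(<<-SQL)
  SELECT id,
         ExtractValue(xml, '/rss/channel/title') AS channel_title
  FROM raw_feeds
SQL

results.each { |row| puts "#{row['id']}: #{row['channel_title']}" }
```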

Otherwise, if your DBMS does not support XML querying, use your language's XPath implementation to query the CLOB. You will probably need to extract certain elements into table columns for speedy querying.
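
In Ruby that would be Nokogiri; the only gotcha is that namespaced elements (itunes:*, media:*) need the namespace mapping passed to the XPath call:

```ruby
require 'nokogiri'

# xml_from_db stands in for the CLOB fetched back out of the database.
doc = Nokogiri::XML(xml_from_db)

ns = { 'itunes' => 'http://www.itunes.com/dtds/podcast-1.0.dtd' }

# Pull every episode duration; namespaced tags need the mapping above.
doc.xpath('//item/itunes:duration', ns).each do |node|
  puts node.text
end
```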

the Tin Man

I modeled my database on feeds and entries too, and cross-mapped the fields for RSS, RDF, and Atom so I could capture the required data fields as a starting point. Then I added a few more for tagging and my own internal summarizations of the feed, plus some housekeeping and maintenance fields.
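
As an illustration only (these aren't my actual column names), the cross-mapping boils down to a table like this:

```ruby
# Which source tag feeds each normalized column, per feed format.
# Purely illustrative; dc:date assumes the Dublin Core extension
# that RDF feeds commonly use.
FIELD_MAP = {
  title:        { rss: 'title',       atom: 'title',   rdf: 'title'       },
  url:          { rss: 'link',        atom: 'link',    rdf: 'link'        },
  published_at: { rss: 'pubDate',     atom: 'updated', rdf: 'dc:date'     },
  summary:      { rss: 'description', atom: 'summary', rdf: 'description' }
}.freeze
```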

If you move away from Feedzirra, I'd recommend temporarily storing the actual feed XML in a staging table so you can post-process it with Nokogiri at your leisure. That way your HTTP process isn't bogged down parsing the text; it just retrieves content, files it away, and updates the record with the processing time so you know when to check again. The post-process can extract the feed information you want from the stored XML into the database, then delete the record. That means one process pulls in feeds periodically, as quickly as it can, while another basically runs in the background, chugging away.
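
A sketch of that second, background process, assuming hypothetical ActiveRecord models RawFeed (the staging table) and FeedEntry:

```ruby
require 'nokogiri'

# Helper: text of a child node, or nil if it's absent.
def text_at(node, xpath)
  child = node.at_xpath(xpath)
  child && child.text
end

# Background worker: drain the staging table, parse each stored document,
# file the entries away, then drop the staged record.
loop do
  RawFeed.where(processed: false).find_each do |raw|
    doc = Nokogiri::XML(raw.xml)
    doc.xpath('//item').each do |item|
      FeedEntry.create!(
        title:        text_at(item, 'title'),
        url:          text_at(item, 'link'),
        published_at: text_at(item, 'pubDate')
      )
    end
    raw.destroy # the staged XML has served its purpose
  end
  sleep 60 # idle before polling the staging table again
end
```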

Also, both Typhoeus/Hydra and HTTPClient can handle multiple HTTP requests nicely and are easy to set up.
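
For example, Hydra boils down to queue-then-run; feed_urls and stage_raw_feed here are stand-ins for your URL list and whatever does the filing-away step:

```ruby
require 'typhoeus'

hydra = Typhoeus::Hydra.new(max_concurrency: 10)

feed_urls.each do |url|
  request = Typhoeus::Request.new(url)
  request.on_complete do |response|
    # stage_raw_feed is a stand-in for writing the body to the staging table.
    stage_raw_feed(url, response.body) if response.success?
  end
  hydra.queue(request)
end

hydra.run # blocks until every queued request completes
```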