Why does an XPath search work in REXML but not in Hpricot?

I am using Rails 3.2 and Hpricot.

I’d like to find an XML element by the content of its child element and convert it to a Ruby object, which later shall be rendered.

In other words, I’d like to find the ‘vehicle’ element where its child ‘line_number’ content equals 1234.

This worked fine with REXML and the following XPath:

/gsip/vehicle[line_number[text()=1234]]

REXML is slow, so I switched to Hpricot, where the same XPath finds all vehicle elements, not only the one where ‘line_number’ equals 1234.

Why does this find all vehicles?

file_path = Rails.root.join('public','gsip','gsip-vehicle-data.xml')
q = "/gsip/vehicle[line_number[text()=#{params[:id]}]]"
@vehicle_data = { :date => Date.today - 10.years }   # initiate with very old date
xmldoc = File.read(file_path)
doc = Hpricot::XML(xmldoc)

doc.search(q) do |e|
  if e.at('line_number').innerText == params[:id]  # This line shouldn't be necessary?!
    logger.info( "#{e.at('pa_number').innerText} (#{e.at('line_number').innerText} from #{e.at('date').innerText})" )

    vehicle_date = Date.strptime(e.at('date').innerText, "%d.%m.%Y")
    #logger.info('date: ' + vehicle_date.to_s)

    if vehicle_date > @vehicle_data[:date]
      e.children.each do |n|
        logger.info("#{n.name} = #{n.innerText}")
        @vehicle_data[n.name] = n.innerText
      end
    end

  end
end

The REXML version finds the right vehicle, but is slow:

require 'rexml/document'

file_path = Rails.root.join('public','gsip','gsip-vehicle-data.xml')
q = "/gsip/vehicle[line_number[text()=#{params[:id]}]]"
@vehicle_data = { :date => Date.today - 10.years }   # initiate with very old date
xmldoc = REXML::Document.new(File.read(file_path))   # parse the XML file into a REXML document

REXML::XPath.each(xmldoc, q) { |e|
  #find the latest vehicle with given line_number
  vehicle_date = Date.strptime(REXML::XPath.first(e,'date').text, "%d.%m.%Y")

  if vehicle_date > @vehicle_data[:date]
    e.elements.each { |n|
      @vehicle_data[n.name] = n.text
    }
  end
}

My XML:

<gsip export_date="7/25/2012 12:04:27 PM" schema_version="1.01">
  <vehicle id="ABC">
    <date>02.07.2012</date>
    <line_number>1234</line_number>
    <pa_number>ABC</pa_number>
    <vin>VIN</vin>
    <my>2012</my>
  </vehicle>
  <vehicle id="ABD">
    <date>02.07.2012</date>
    <line_number>8348</line_number>
    <pa_number>ABD</pa_number>
    <vin>VIN</vin>
    <my>2012</my>
  </vehicle>
  <vehicle>
  ...
  </vehicle>
  ...
</gsip>
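
For anyone who wants to reproduce this, here is a minimal standalone sketch of the REXML lookup against a trimmed copy of the XML above. The `SAMPLE_XML` constant and the `matching_vehicle_ids` helper are names made up for this illustration and are not part of my application; the `Integer` conversion is an extra guard I assume is acceptable because line numbers are purely numeric.

```ruby
require 'rexml/document'

# Trimmed copy of the sample XML (first two vehicles only).
SAMPLE_XML = <<~XML
  <gsip export_date="7/25/2012 12:04:27 PM" schema_version="1.01">
    <vehicle id="ABC">
      <date>02.07.2012</date>
      <line_number>1234</line_number>
      <pa_number>ABC</pa_number>
    </vehicle>
    <vehicle id="ABD">
      <date>02.07.2012</date>
      <line_number>8348</line_number>
      <pa_number>ABD</pa_number>
    </vehicle>
  </gsip>
XML

# Hypothetical helper: returns the id attribute of every vehicle whose
# line_number child matches the given number.
def matching_vehicle_ids(xml, line_number)
  doc = REXML::Document.new(xml)
  # Integer(..., 10) raises for anything that is not a plain decimal number,
  # so only a number is ever interpolated into the XPath expression.
  q = "/gsip/vehicle[line_number[text()=#{Integer(line_number, 10)}]]"
  REXML::XPath.match(doc, q).map { |e| e.attributes['id'] }
end

p matching_vehicle_ids(SAMPLE_XML, '1234')
```

Swapping `REXML::Document` and `REXML::XPath.match` for the parser under test is enough to compare how each library treats the same predicate.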

UPDATE

My switch to Nokogiri:

My request time (on localhost) has gone down from 4 seconds to 250 ms. My XML file is 5.6 MB. Since it might be helpful for others, I have pasted my changes below:

class IncidentsController < ApplicationController
  require 'nokogiri'

  # ....

  def vehicle
    # helpful links: ==============================================================================
    # Some say Nokogiri is best:  http://nokogiri.org/
    # recursive link: http://stackoverflow.com/questions/11665126/why-xpath-search-works-in-rexml-but-not-with-hpricot
    # =============================================================================================

    # check if PA Number or Line Number is given:
    # (\A and \z anchor the whole string; ^ and $ would only anchor a single line)
    num = ''
    if params[:id] =~ /\A\d{4}\z/
      num = 'line_number'
    elsif params[:id] =~ /\A\w{6}\z/
      num = 'pa_number'
    elsif params[:id] =~ /\A\w{17}\z/
      num = 'vin'
    end

    # read Vehicle Data from XML File
    file_path = Rails.root.join('private','gsip','gsip-vehicle-data.xml')
    q = "/gsip/vehicle[#{num}/text()='#{params[:id]}']"

    @vehicle_data = { :date => Date.today - 10.years }   # initiate with very old date
    #logger.info("*** Find Vehicle Data in XML. xPath: #{q}")

    doc = Nokogiri::XML( File.read(file_path) )

    doc.xpath(q).each do |e|
      vehicle_date = Date.strptime(e.xpath('date').first.content, "%d.%m.%Y")
      #logger.info("Date: #{vehicle_date.to_s}")
      if vehicle_date > @vehicle_data[:date]
        # copy every child element into the result hash
        e.element_children.each do |n|
          @vehicle_data[n.name] = n.content
        end
      end
    end

    respond_to do |format|
      format.html { redirect_to connectors_path }
      format.json { render :json => @vehicle_data }
      format.xml { render :xml => @vehicle_data }
    end
  end

  # ...
end

I'm new to Rails, so further comments on my code are welcome!

There is 1 answer

Mark Thomas

Hpricot was wonderful when it first came on the scene because it introduced the CSS selector syntax to HTML parsing. However, it wasn't ever completely XPath compliant, particularly around XPath predicate syntax, which you are using.

I would suggest Nokogiri. This library is fast and well-maintained, and is fully XPath 1.0 compliant. With it you should be able to pull the vehicle:

doc.search('//vehicle[line_number[text()=1234]]')

Also, a slight simplification: you really don't need nested predicates. This will also identify the correct vehicle:

doc.search('//vehicle[line_number/text()=1234]')