Problems with text/csv Content-Encoding = UTF-8 in Ruby Mechanize


When attempting to load a CSV page that is served with a Content-Encoding of UTF-8, using Mechanize 2.5.1, I used the following code:

a.content_encoding_hooks << lambda{|httpagent, uri, response, body_io|
 response['Content-Encoding'] = 'none' if response['Content-Encoding'].to_s == 'UTF-8'
}
p4 = a.get(redirect_url, nil, ['accept-encoding' => 'UTF-8'])

but I find that the content encoding hook is not being called and I get the following error and traceback:

/Users/jackrg/.rbenv/versions/1.9.2-p290/lib/ruby/gems/1.9.1/gems/mechanize-2.5.1/lib/mechanize/http/agent.rb:787:in 'response_content_encoding': unsupported content-encoding: UTF-8 (Mechanize::Error)
    from /Users/jackrg/.rbenv/versions/1.9.2-p290/lib/ruby/gems/1.9.1/gems/mechanize-2.5.1/lib/mechanize/http/agent.rb:274:in 'fetch'
    from /Users/jackrg/.rbenv/versions/1.9.2-p290/lib/ruby/gems/1.9.1/gems/mechanize-2.5.1/lib/mechanize/http/agent.rb:949:in 'response_redirect'
    from /Users/jackrg/.rbenv/versions/1.9.2-p290/lib/ruby/gems/1.9.1/gems/mechanize-2.5.1/lib/mechanize/http/agent.rb:299:in 'fetch'
    from /Users/jackrg/.rbenv/versions/1.9.2-p290/lib/ruby/gems/1.9.1/gems/mechanize-2.5.1/lib/mechanize/http/agent.rb:949:in 'response_redirect'
    from /Users/jackrg/.rbenv/versions/1.9.2-p290/lib/ruby/gems/1.9.1/gems/mechanize-2.5.1/lib/mechanize/http/agent.rb:299:in 'fetch'
    from /Users/jackrg/.rbenv/versions/1.9.2-p290/lib/ruby/gems/1.9.1/gems/mechanize-2.5.1/lib/mechanize.rb:407:in 'get'
    from prototype/test1.rb:307:in `<main>'

Does anyone have an idea why the content hook code is not firing and why I am getting the error?

Answer by 7stud (best answer)

"but I find that the content encoding hook is not being called"

What makes you think that?

The error message references this code:

  def response_content_encoding response, body_io
    ...
    ...

    out_io = case response['Content-Encoding']
             when nil, 'none', '7bit', "" then
               body_io
             when 'deflate' then
               content_encoding_inflate body_io
             when 'gzip', 'x-gzip' then
               content_encoding_gunzip body_io
             else
               raise Mechanize::Error,
                 "unsupported content-encoding: #{response['Content-Encoding']}"

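So Mechanize only accepts a missing or empty Content-Encoding, or one of the values 'none', '7bit', 'deflate', 'gzip', or 'x-gzip'. Anything else, including 'UTF-8', raises the error you are seeing.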
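
To see why 'UTF-8' ends up in the else branch, here is a simplified stand-in for that dispatch. It is a sketch, not Mechanize's actual implementation: the method name decode_body is hypothetical, it lowercases the value first (the spec treats content-coding values as case-insensitive), and it relies only on Ruby's standard zlib and stringio libraries.

```ruby
require 'zlib'
require 'stringio'

# Hypothetical helper, not part of Mechanize: decode a response body
# according to its Content-Encoding value.
def decode_body(content_encoding, body_io)
  # Content-coding values are case-insensitive, so normalize first.
  case content_encoding.to_s.strip.downcase
  when '', 'none', '7bit'
    body_io.read                        # no content coding applied
  when 'deflate'
    Zlib::Inflate.inflate(body_io.read) # zlib-wrapped deflate data
  when 'gzip', 'x-gzip'
    Zlib::GzipReader.new(body_io).read
  else
    raise ArgumentError, "unsupported content-encoding: #{content_encoding}"
  end
end

csv = "id,name\n1,test\n"

# A real content coding round-trips back to the original CSV.
puts decode_body('deflate', StringIO.new(Zlib::Deflate.deflate(csv)))

# A charset name is not a content coding, so it is rejected.
begin
  decode_body('UTF-8', StringIO.new(csv))
rescue ArgumentError => e
  puts e.message
end
```

The point of the sketch is that a charset name such as 'UTF-8' names no decoding step at all, so any dispatcher built along these lines has nothing it can do with it.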

From the HTTP/1.1 spec:

14.11 Content-Encoding

The Content-Encoding entity-header field is used as a modifier to the media-type. When present, its value indicates what additional content codings have been applied to the entity-body, and thus what decoding mechanisms must be applied in order to obtain the media-type referenced by the Content-Type header field. Content-Encoding is primarily used to allow a document to be compressed without losing the identity of its underlying media type.

   Content-Encoding  = "Content-Encoding" ":" 1#content-coding

Content codings are defined in section 3.5. An example of its use is

   Content-Encoding: gzip

The content-coding is a characteristic of the entity identified by the Request-URI. Typically, the entity-body is stored with this encoding and is only decoded before rendering or analogous usage. However, a non-transparent proxy MAY modify the content-coding if the new coding is known to be acceptable to the recipient, unless the "no-transform" cache-control directive is present in the message.

... ...

3.5 Content Codings

Content coding values indicate an encoding transformation that has been or can be applied to an entity. Content codings are primarily used to allow a document to be compressed or otherwise usefully transformed without losing the identity of its underlying media type and without loss of information. Frequently, the entity is stored in coded form, transmitted directly, and only decoded by the recipient.

   content-coding   = token

All content-coding values are case-insensitive. HTTP/1.1 uses content-coding values in the Accept-Encoding (section 14.3) and Content-Encoding (section 14.11) header fields. Although the value describes the content-coding, what is more important is that it indicates what decoding mechanism will be required to remove the encoding.

The Internet Assigned Numbers Authority (IANA) acts as a registry for content-coding value tokens. Initially, the registry contains the following tokens:

gzip An encoding format produced by the file compression program "gzip" (GNU zip) as described in RFC 1952 [25]. This format is a Lempel-Ziv coding (LZ77) with a 32 bit CRC.

compress The encoding format produced by the common UNIX file compression program "compress". This format is an adaptive Lempel-Ziv-Welch coding (LZW).

    Use of program names for the identification of encoding formats
    is not desirable and is discouraged for future encodings. Their
    use here is representative of historical practice, not good
    design. For compatibility with previous implementations of HTTP,
    applications SHOULD consider "x-gzip" and "x-compress" to be
    equivalent to "gzip" and "compress" respectively.

deflate The "zlib" format defined in RFC 1950 [31] in combination with the "deflate" compression mechanism described in RFC 1951 [29].

identity The default (identity) encoding; the use of no transformation whatsoever. This content-coding is used only in the Accept-Encoding header, and SHOULD NOT be used in the Content-Encoding header.
http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.5

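In other words, an HTTP content encoding has nothing to do with ASCII vs. UTF-8 vs. Latin-1. Those are character encodings, which are declared in the charset parameter of the Content-Type header (for example, Content-Type: text/csv; charset=UTF-8).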
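
As a rough illustration of where the character encoding normally lives, the sketch below pulls a charset out of a Content-Type value and applies it to raw bytes. The helper name charset_from_content_type, the sample header, and the sample CSV bytes are all made up for this example, and the UTF-8 fallback is an assumption rather than anything the server promises.

```ruby
# Hypothetical helper: extract the charset parameter from a Content-Type value.
# Returns nil when no charset parameter is present.
def charset_from_content_type(content_type)
  content_type.to_s[/charset\s*=\s*"?([^;"\s]+)/i, 1]
end

# Raw bytes as they might arrive off the wire, tagged as binary.
raw = "name,city\nJos\xC3\xA9,M\xC3\xA1laga\n".b

charset = charset_from_content_type('text/csv; charset=UTF-8')

# Relabel the bytes with the declared charset; fall back to UTF-8 (an assumption).
text = raw.force_encoding(charset || 'UTF-8')

puts text.encoding
puts text.valid_encoding?
```

Note that force_encoding only relabels the bytes; it does not transcode them, which is the right operation when the bytes are already in the declared charset.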

In addition the source code for Mechanize::HTTP::Agent has this in it:

  # A list of hooks to call after retrieving a response.  Hooks are called with
  # the agent and the response returned.
  attr_reader :post_connect_hooks

  # A list of hooks to call before making a request.  Hooks are called with
  # the agent and the request to be performed.
  attr_reader :pre_connect_hooks

  # A list of hooks to call to handle the content-encoding of a request.
  attr_reader :content_encoding_hooks

So it doesn't even look like you are calling the right hook.

Here is an example I got to work:

require 'mechanize'

a = Mechanize.new

p a.content_encoding_hooks

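func = lambda do |agent, uri, resp, body_io|
  puts body_io.read
  body_io.rewind  # rewind so the body can be read again after this hook
  puts "The Content-Encoding is: #{resp['Content-Encoding']}"

  # Treat the bogus 'UTF-8' value as "no content coding applied".
  if resp['Content-Encoding'].to_s == 'UTF-8'
    resp['Content-Encoding'] = 'none'
  end

  puts "The Content-Encoding is now: #{resp['Content-Encoding']}"
end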

a.content_encoding_hooks << func

a.get(
  'http://localhost:8080/cgi-bin/myprog.rb',
  [],
  nil,
  "Accept-Encoding" => 'gzip, deflate'  #This is what Firefox always uses
)

myprog.rb:

#!/usr/bin/env ruby

require 'cgi'

cgi = CGI.new('html3')

headers = {
  "type" => 'text/html',
  "Content-Encoding" => "UTF-8",
}

cgi.out(headers) do
  cgi.html() do
    cgi.head{ cgi.title{"Content-Encoding Test"} } +
    cgi.body() do
      cgi.div(){ "The Accept-Encoding was: #{cgi.accept_encoding}" }
    end
  end
end

--output:--
[]
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN"><HTML><HEAD><TITLE>Content-Encoding Test</TITLE></HEAD><BODY><DIV>The Accept-Encoding was: gzip, deflate</DIV></BODY></HTML>
The Content-Encoding is: UTF-8
The Content-Encoding is now: none
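
The hook in the example above only matches the exact string 'UTF-8'. Since content-coding values are case-insensitive, a slightly more tolerant variant might look like the sketch below. The constant MISUSED_CHARSETS and its contents are assumptions for illustration, and a plain Hash stands in for the response object so the snippet runs without a server.

```ruby
# Charset names that a server might send as Content-Encoding by mistake.
# This list is an assumption; extend it to match what your server sends.
MISUSED_CHARSETS = %w[utf-8 utf8 iso-8859-1 us-ascii].freeze

# Same four-argument shape as the hook in the example above.
fix_charset_coding = lambda do |_agent, _uri, response, _body_io|
  coding = response['Content-Encoding'].to_s.strip.downcase
  response['Content-Encoding'] = 'none' if MISUSED_CHARSETS.include?(coding)
end

# Offline check, with a plain Hash standing in for the response object.
fake_response = { 'Content-Encoding' => 'utf-8' }
fix_charset_coding.call(nil, nil, fake_response, nil)
puts fake_response['Content-Encoding']

# Genuine content codings are left alone.
gzip_response = { 'Content-Encoding' => 'gzip' }
fix_charset_coding.call(nil, nil, gzip_response, nil)
puts gzip_response['Content-Encoding']
```

With a real agent it would be registered the same way as in the example: a.content_encoding_hooks << fix_charset_coding.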