How can I replace the ISBN with the Google Books ID in a MARC file, using Perl?

I've got a file with some book data in MARC format, in which some lines contain ISBNs. I'd like to replace each of these lines with the Google Books ID for that ISBN, if one exists. Here's the code so far, which just ends up removing the lines:

perl -pe "s#ISBN(.*)#$(wget --output-document=- --quiet --user-agent=Mozilla/5.0 \"http://books.google.com/books?jscmd=viewapi&bibkeys=\1\")#mg" < 5-${file} > 6-${file}

PS: Google are a bit fuzzy on the use of automated tools: The Books Data API recommends tools like curl / wget, but there are no instructions on how to avoid being blocked when using such tools. I'm also pretty sure I saw a clause in a ToS saying users can't send automated queries, but I can't find it again. This is discussed in their forum.

There are 2 answers

mob On BEST ANSWER

I think the OP is on the right track and could use a one-liner for this, and just needs to replace some bash-style syntax with the correct Perl syntax. I think this would work (newlines added for readability):

    perl -pe 's#ISBN(\w+)#qx(wget --output-document=- 
        --quiet --user-agent=Mozilla/5.0 
        "http://books.google.com/books\\?jscmd=viewapi\\&bibkeys=$1")#ge' \
        < 5-${file} > 6-${file}

You have to escape the ? and & characters in the URL (edit: double escaping seems to work).
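What makes the one-liner work is the /e modifier, which tells Perl to evaluate the replacement as code, so whatever qx(...) (or any other expression) returns becomes the replacement text. A minimal sketch of the same idea as a script, with the lookup passed in as a callback so it can be tried without any network access. The names `replace_isbns` and `%fake_ids` are hypothetical, and the ID values are placeholders, not real Google Books IDs:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Replace every "ISBN<word chars>" with whatever $lookup returns for the
# captured ISBN. /e evaluates the replacement as Perl code; /g repeats it
# for every match. If the lookup returns nothing, the original text is kept.
sub replace_isbns {
    my ($text, $lookup) = @_;
    $text =~ s{ISBN(\w+)}{
        my $isbn = $1;               # copy before calling other code
        my $id   = $lookup->($isbn);
        defined $id && length $id ? $id : "ISBN$isbn";
    }ge;
    return $text;
}

# Stub lookup table standing in for the real HTTP request (placeholder IDs).
my %fake_ids = ( '0060930314' => 'ID-A', '9780596520106' => 'ID-B' );

print replace_isbns( "ISBN0060930314\nISBN1234\n", sub { $fake_ids{ $_[0] } } );
```

Unlike the one-liner, this sketch leaves the ISBN in place when the lookup finds nothing, which seems closer to the "if it exists" requirement in the question; swap the fallback for an empty string if you would rather drop it.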

Sinan Ünür On

You end up having to lie about the user agent because you are violating Google's TOS. Don't do that.

Instead, use the Google Book Search API.

The code below is slightly hampered by my lack of familiarity with modules such as XML::Atom, Data::Feed, WWW::OpenSearch. However, it should provide a good starting point.

#!/usr/bin/perl

use strict;
use warnings;

use Business::ISBN qw( valid_isbn_checksum );
use LWP::Simple;
use XML::Simple;

while ( <> ) {
    s/ISBN:([0-9]+)/'Google Books ID:' . get_google_id_for_isbn($1)/ge;
    print;
}

use Carp;

sub make_google_books_query {
    sprintf 'http://books.google.com/books/feeds/volumes?q=isbn:%s', $_[0];
}

sub get_google_id_for_isbn {
    my ($isbn) = @_;

    my $google_id = eval {
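        # valid_isbn_checksum returns 0 for a bad checksum and undef for
        # other failures, so test for truth rather than definedness.
        valid_isbn_checksum($isbn)
            or croak "Invalid ISBN: $isbn";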

        my $query = make_google_books_query($isbn);
        my $xml = get $query;

        defined($xml)
            or croak "No response to <$query>";

        my $data = XMLin($xml, ForceArray => 1);
        my @ids = @{ $data->{entry}[0]{'dc:identifier'} };

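        # Accept the result only if one of the remaining identifiers is
        # our ISBN; grep avoids undef warnings when fewer than three exist.
        unless ( grep { $_ eq "ISBN:$isbn" } @ids[ 1 .. $#ids ] ) {
            croak "Invalid search results: '@ids'";
        }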

        $ids[0];
    };

    defined($google_id) ? $google_id : '';
}

Given a text file t.txt containing:

ISBN:0060930314
ISBN:9780596520106

it outputs:

Google Books ID:ioXFqlzsmK8C
Google Books ID:lNVHi3TunxsC
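
The XML feed URL used above belongs to an older version of the API and may no longer respond. Below is a rough sketch of the same lookup against the JSON-based Books API, using only core modules. Treat the endpoint URL and the response shape (`items[0].id`) as assumptions to check against the current documentation; `extract_volume_id` is a hypothetical helper name, and the sample response is a trimmed-down illustration, not captured output:

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTTP::Tiny;   # core since Perl 5.14; HTTPS also needs IO::Socket::SSL
use JSON::PP;     # core since Perl 5.14

# Assumed endpoint of the JSON-based Books API; verify before relying on it.
my $API = 'https://www.googleapis.com/books/v1/volumes';

# Pull the first volume ID out of a response body, or return nothing.
# Assumes the shape { "items": [ { "id": "...", ... }, ... ] }.
sub extract_volume_id {
    my ($json) = @_;
    my $data = eval { decode_json($json) };
    return unless ref($data) eq 'HASH';
    my $items = $data->{items};
    return unless ref($items) eq 'ARRAY' && @$items;
    return unless ref( $items->[0] ) eq 'HASH';
    return $items->[0]{id};
}

# Same contract as the XML version: the ID on success, '' on any failure.
sub get_google_id_for_isbn {
    my ($isbn) = @_;
    my $res = HTTP::Tiny->new( timeout => 10 )->get("$API?q=isbn:$isbn");
    return '' unless $res->{success};
    my $id = extract_volume_id( $res->{content} );
    return defined $id ? $id : '';
}

# Offline demonstration of the parsing step with a hand-written response.
my $sample = '{"totalItems":1,"items":[{"id":"ioXFqlzsmK8C"}]}';
print 'Google Books ID:', extract_volume_id($sample), "\n";
```

Because `get_google_id_for_isbn` keeps the same name and return convention, it should slot into the `while ( <> )` loop above in place of the XML-based version. Keeping the parsing in its own function makes it possible to test against saved responses without sending any queries.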