How to handle returned errors from applying isbnlib.meta with pandas


I'm using isbnlib.meta, which pulls metadata (book title, author, year, publisher, etc.) when you enter an ISBN. I have a dataframe with 482,000 ISBNs (column title: isbn13). When I run the function, I get an error like NotValidISBNError, which stops the code in its tracks. What I want is for the code to skip any row that raises an error and move on to the next one.

Here is my code now:

list_df[0]['publisher_isbnlib'] = list_df[0]['isbn13'].apply(lambda x: isbnlib.meta(x).get('Publisher', None))
list_df[0]['yearpublished_isbnlib'] = list_df[0]['isbn13'].apply(lambda x: isbnlib.meta(x).get('Year', None))
#list_df[0]['language_isbnlib'] = list_df[0]['isbn13'].apply(lambda x: isbnlib.meta(x).get('Language', None))
list_df[0]

list_df[0] is the first 20,000 rows, since I'm trying to chunk through the dataframe. I've just manually entered this code 24 times to handle each chunk.

I attempted try: and except:, but all that ends up happening is that the code stops and I don't get any metadata reported.

Traceback:

---------------------------------------------------------------------------
NotValidISBNError                         Traceback (most recent call last)
<ipython-input-39-a06c45d36355> in <module>
----> 1 df['meta'] = df.isbn.apply(isbnlib.meta)

e:\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
   4198             else:
   4199                 values = self.astype(object)._values
-> 4200                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   4201 
   4202         if len(mapped) and isinstance(mapped[0], Series):

pandas\_libs\lib.pyx in pandas._libs.lib.map_infer()

e:\Anaconda3\lib\site-packages\isbnlib\_ext.py in meta(isbn, service)
     23 def meta(isbn, service='default'):
     24     """Get metadata from Google Books ('goob'), Open Library ('openl'), ..."""
---> 25     return query(isbn, service) if isbn else {}
     26 
     27 

e:\Anaconda3\lib\site-packages\isbnlib\dev\_decorators.py in memoized_func(*args, **kwargs)
     22             return cch[key]
     23         else:
---> 24             value = func(*args, **kwargs)
     25             if value:
     26                 cch[key] = value

e:\Anaconda3\lib\site-packages\isbnlib\_metadata.py in query(isbn, service)
     18     if not ean:
     19         LOGGER.critical('%s is not a valid ISBN', isbn)
---> 20         raise NotValidISBNError(isbn)
     21     isbn = ean
     22     # only import when needed

NotValidISBNError: (abc) is not a valid ISBN

There are 2 answers

Trenton McKinney On BEST ANSWER
  • The current implementation for extracting ISBN metadata is very slow and inefficient.
    • As stated, there are 482,000 unique isbn values, and the data for each one is downloaded multiple times (once for each column, as the code is currently written).
  • It is better to download all the metadata at once, and then extract the data from the dict as a separate operation.
  • A try-except block is used to capture the error from invalid isbn values.
    • An empty dict, {}, is returned, because pd.json_normalize won't work with NaN or None.
    • It will be unnecessary to chunk the isbn column.
  • pd.json_normalize is used to expand the dict returned from .meta.
  • Use pandas.DataFrame.rename to rename columns, and pandas.DataFrame.drop to delete columns.
  • This implementation will be significantly faster than the current implementation, and will make far fewer requests to the API being used to get the meta data.
  • To extract values from lists, such as the 'Authors' column, use df_meta = df_meta.explode('Authors'); if there is more than one author, a new row will be created for each additional author in the list.
import pandas as pd  # version 1.1.3
import isbnlib  # version 3.10.3

# sample dataframe
df = pd.DataFrame({'isbn': ['9780446310789', 'abc', '9781491962299', '9781449355722']})

# function with try-except, for invalid isbn values
def get_meta(isbn: str) -> dict:
    # .apply passes one isbn value (a string) at a time, not the whole Series
    try:
        return isbnlib.meta(isbn)
    except isbnlib.NotValidISBNError:
        return {}


# get the meta data for each isbn or an empty dict
df['meta'] = df.isbn.apply(get_meta)

# df
            isbn                                                                                                                                                                                                                                                   meta
0  9780446310789                                                                                   {'ISBN-13': '9780446310789', 'Title': 'To Kill A Mockingbird', 'Authors': ['Harper Lee'], 'Publisher': 'Grand Central Publishing', 'Year': '1988', 'Language': 'en'}
1            abc                                                                                                                                                                                                                                                     {}
2  9781491962299  {'ISBN-13': '9781491962299', 'Title': 'Hands-On Machine Learning With Scikit-Learn And TensorFlow - Techniques And Tools To Build Learning Machines', 'Authors': ['Aurélien Géron'], 'Publisher': "O'Reilly Media", 'Year': '2017', 'Language': 'en'}
3  9781449355722                                                                                                                  {'ISBN-13': '9781449355722', 'Title': 'Learning Python', 'Authors': ['Mark Lutz'], 'Publisher': '', 'Year': '2013', 'Language': 'en'}

# extract all the dicts in the meta column
df = df.join(pd.json_normalize(df.meta)).drop(columns=['meta'])

# extract values from the lists in the Authors column
df = df.explode('Authors')

# df
            isbn        ISBN-13                                                                                                         Title         Authors                 Publisher  Year Language
0  9780446310789  9780446310789                                                                                         To Kill A Mockingbird      Harper Lee  Grand Central Publishing  1988       en
1            abc            NaN                                                                                                           NaN             NaN                       NaN   NaN      NaN
2  9781491962299  Hands-On Machine Learning With Scikit-Learn And TensorFlow - Techniques And Tools To Build Learning Machines  Aurélien Géron            O'Reilly Media  2017       en
3  9781449355722  9781449355722                                                                                               Learning Python       Mark Lutz                            2013       en
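
The rename and drop step mentioned in the bullets isn't shown in the code above. Here is a minimal sketch of what it could look like, using two of the metadata dicts from the output so it runs without calling the API. The target names publisher_isbnlib and yearpublished_isbnlib are taken from the question; dropping ISBN-13 is just an assumption about which columns are unwanted.

```python
import pandas as pd

# Two of the metadata dicts from the output above; {} is what get_meta
# returns for an invalid ISBN.
df = pd.DataFrame({
    'isbn': ['9780446310789', 'abc'],
    'meta': [
        {'ISBN-13': '9780446310789', 'Title': 'To Kill A Mockingbird',
         'Authors': ['Harper Lee'], 'Publisher': 'Grand Central Publishing',
         'Year': '1988', 'Language': 'en'},
        {},
    ],
})

# Expand the dicts into columns, then drop the original dict column.
df = df.join(pd.json_normalize(df['meta'].tolist())).drop(columns=['meta'])

# Rename to the column names used in the question (an assumption about the
# desired output), and drop the duplicate ISBN column.
df = df.rename(columns={'Publisher': 'publisher_isbnlib',
                        'Year': 'yearpublished_isbnlib'})
df = df.drop(columns=['ISBN-13'])

print(df[['isbn', 'publisher_isbnlib', 'yearpublished_isbnlib']])
```

The row for the invalid ISBN stays in the frame with NaN in the metadata columns, so it can be filtered out later with dropna if needed.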
Maximilian Press On

Hard to answer without seeing the code, but try/except should really be able to handle this.

I am not an expert here, but look at this code:

items = [0, 1, "a", 2, 3]

for item in items:
    try:
        print(item + 1)
    except TypeError:
        print(item, "is not integer")

If you try to do addition with a string, Python hates that and backs out with a TypeError. So you capture the TypeError using except and maybe report something about it. When I run this code, I get:

1
2
a is not integer  # exception handled!
3
4

You should be able to handle your exception with except NotValidISBNError, and then report whatever metadata you like.

You can get much more sophisticated with exception handling but that is the basic idea.
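
One possible reason a try/except appears to stop everything is that it wraps the whole loop (or the whole apply call), so the first bad value aborts all the remaining rows. That is only a guess, since the attempted code isn't shown. Here is a rough sketch of putting the try inside the per-item function instead. Note that fake_meta and the local NotValidISBNError class are hypothetical stand-ins so the example runs offline; they are not isbnlib's real implementation.

```python
class NotValidISBNError(Exception):
    """Local stand-in for isbnlib's exception of the same name (assumption)."""


def fake_meta(isbn):
    """Hypothetical stand-in for isbnlib.meta, so no network is needed."""
    known = {'9780446310789': {'Title': 'To Kill A Mockingbird', 'Year': '1988'}}
    # Crude validity check for the demo only: 13 digits or it is rejected.
    if not (isbn.isdigit() and len(isbn) == 13):
        raise NotValidISBNError(isbn)
    return known.get(isbn, {})


def safe_meta(isbn, fetch=fake_meta):
    """Look up one ISBN; return an empty dict instead of raising."""
    try:
        return fetch(isbn)
    except NotValidISBNError:
        return {}


isbns = ['9780446310789', 'abc', '9781491962299']

# The try/except lives inside safe_meta, so one bad value cannot stop the loop.
results = {isbn: safe_meta(isbn) for isbn in isbns}
years = [results[isbn].get('Year') for isbn in isbns]
print(years)
```

With the real library, the same shape should work by passing isbnlib.meta as fetch and catching isbnlib.NotValidISBNError, as the other answer does.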