Parsing Google Custom Search API for Elasticsearch Documents


After retrieving results from the Google Custom Search API and writing them to JSON, I want to parse that JSON into valid Elasticsearch documents. You can configure a parent-child relationship for nested results. However, this relationship does not seem to be inferred from the data structure itself. I've tried loading the data automatically, but got no results.
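For reference, recent Elasticsearch versions declare a parent-child relationship explicitly in the index mapping through a `join` field, which fits the observation that it is not inferred from the shape of the data. The following is only a sketch of such a mapping written as a Python dict. The field name `relation` and the relation names `result` and `person` are hypothetical and not taken from the question.

```python
# Hypothetical mapping: "result" documents are parents of "person" documents.
mapping = {
    "mappings": {
        "properties": {
            "relation": {
                "type": "join",
                "relations": {"result": "person"},
            }
        }
    }
}

# A child document names its relation and the id of its parent.
# (When indexing, a child must be routed to the same shard as its parent.)
child_doc = {
    "persona": "phone",
    "relation": {"name": "person", "parent": "result_one"},
}
```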

Below is some example input that doesn't include things like id or index, since I'm trying to focus on creating the correct data structure. I've tried modifying graph algorithms like depth-first search, but I'm running into problems with the different data structures (dicts, lists, and strings).

Here's some example input:

# mock data structure
google = {"content": "foo", 
          "results": {"result_one": {"persona": "phone",
                                     "personb":  "phone",
                                     "personc":  "phone"
                                    },
                      "result_two": ["thing1",
                                     "thing2",
                                     "thing3"
                                    ],
                      "result_three": "none"
                     },
          "query": ["Taylor Swift", "Bob Dole", "Rocketman"]
}

# correctly formatted documents for _source of elasticsearch entry
correct_documents = [
    {"content":"foo"},
    {"results": ["result_one", "result_two", "result_three"]},
    {"result_one": ["persona", "personb", "personc"]},
    {"persona": "phone"},
    {"personb": "phone"},
    {"personc": "phone"},
    {"result_two":["thing1","thing2","thing3"]},
    {"result_three": "none"},
    {"query": ["Taylor Swift", "Bob Dole", "Rocketman"]}
]
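To show where these documents would eventually go, here is a minimal sketch that wraps each one as an index action with `_index`, `_id`, and `_source` keys, which is the dict shape that the bulk helper of the Python Elasticsearch client accepts. The function name `to_bulk_actions`, the index name `google-results`, and the sequential ids are assumptions made for illustration only.

```python
def to_bulk_actions(documents, index_name):
    """Wrap each document as an index action with a sequential string _id."""
    return [
        {"_index": index_name, "_id": str(i), "_source": doc}
        for i, doc in enumerate(documents, start=1)
    ]


# Two of the documents from the expected output above.
documents = [{"content": "foo"}, {"result_three": "none"}]

# "google-results" is a hypothetical index name.
actions = to_bulk_actions(documents, "google-results")
```

Each action keeps the original document untouched under `_source`, so the parsing step and the indexing step stay independent.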

Here is my current approach, which is still a work in progress:

def recursive_dfs(graph, start, path=None):
    '''recursive depth first search from start'''
    # Avoid a mutable default argument; a new list is built on every call.
    path = (path or []) + [start]
    for node in graph[start]:
        if node not in path:
            path = recursive_dfs(graph, node, path)
    return path

def branching(google):
    """ Get branches as a starting point for dfs"""
    # Iterate over the items directly; dict.keys() cannot be indexed in Python 3.
    for branch, (key, value) in enumerate(google.items()):
        # isinstance() is needed here: `value is dict` is never true for a dict instance.
        if isinstance(value, dict):
            #recursive_dfs(google, value)
            pass
        else:
            print("branch {}: result {}\n".format(branch, value))

branching(google)

You can see that recursive_dfs() still needs to be modified to handle string and list data structures.

I'll keep going at this but if you have thoughts, suggestions, or solutions then I would very much appreciate it. Thanks for your time.


There is 1 answer

Bobby Mcd On BEST ANSWER

Here is a possible answer to your problem. It walks the dictionary depth first and appends one single-key document per key.

def myfunk(inHole, outHole):
    """Append one single-key document per key of inHole to outHole, depth first."""
    for key in inHole:
        value = inHole[key]
        if isinstance(value, dict):
            # A dict becomes a document listing its child keys, then its children are visited.
            outHole.append({key: list(value.keys())})
            myfunk(value, outHole)
        else:
            # Lists and scalar values are stored unchanged.
            outHole.append({key: value})
    # Return the list itself: list.sort() returns None, and dicts cannot be ordered in Python 3.
    return outHole

# Usage: documents = myfunk(google, [])
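As a cross-check, the same traversal can be sketched without recursion by keeping an explicit stack of iterators, which stays closer to the depth-first search idea from the question. This is an illustrative variant under the assumption of Python 3.7+ (where dicts preserve insertion order); the name `flatten_iterative` is hypothetical and not part of the original answer.

```python
def flatten_iterative(root):
    """Depth-first walk that emits one single-key document per key."""
    documents = []
    # Each stack entry is an iterator over the items of a dict being visited.
    stack = [iter(root.items())]
    while stack:
        try:
            key, value = next(stack[-1])
        except StopIteration:
            # This dict is exhausted; resume its parent.
            stack.pop()
            continue
        if isinstance(value, dict):
            # Parent document lists the child keys; children are visited next.
            documents.append({key: list(value)})
            stack.append(iter(value.items()))
        else:
            # Lists and scalar values are stored unchanged.
            documents.append({key: value})
    return documents


# Mock data and expected documents copied from the question.
google = {
    "content": "foo",
    "results": {
        "result_one": {"persona": "phone", "personb": "phone", "personc": "phone"},
        "result_two": ["thing1", "thing2", "thing3"],
        "result_three": "none",
    },
    "query": ["Taylor Swift", "Bob Dole", "Rocketman"],
}

correct_documents = [
    {"content": "foo"},
    {"results": ["result_one", "result_two", "result_three"]},
    {"result_one": ["persona", "personb", "personc"]},
    {"persona": "phone"},
    {"personb": "phone"},
    {"personc": "phone"},
    {"result_two": ["thing1", "thing2", "thing3"]},
    {"result_three": "none"},
    {"query": ["Taylor Swift", "Bob Dole", "Rocketman"]},
]

documents = flatten_iterative(google)
```

Because the documents are emitted in visiting order, no sorting step is needed to match the expected output from the question.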