Elasticsearch ingest pipeline: how to recursively modify values in a HashMap


Using an ingest pipeline, I want to iterate over a HashMap and remove underscores from all string values (where underscores exist), leaving underscores in the keys intact. Some values are arrays that must further be iterated over to do the same modification.

In the pipeline, I use a function to traverse and modify the values of a Collection view of the HashMap.

PUT /_ingest/pipeline/samples
{
    "description": "preprocessing of samples.json",
    "processors": [
        {
            "script": {
                "tag": "remove underscore from sample_tags values",
                "source": """
                    void findReplace(Collection collection) {
                    collection.forEach(element -> {
                        if (element instanceof String) {
                            element.replace('_',' ');
                        } else {
                            findReplace(element);
                        }
                        return true;
                        })
                    }

                    Collection samples = ctx.samples;
                    samples.forEach(sample -> { //sample.sample_tags is a HashMap
                        Collection sample_tags = sample.sample_tags.values();
                        findReplace(sample_tags);
                        return true;
                    })
                """
            }
        }
    ]
}

When I simulate the pipeline ingestion, I find the string values are not modified. Where am I going wrong?

POST /_ingest/pipeline/samples/_simulate
{
    "docs": [
        {
            "_index": "samples",
            "_id": "xUSU_3UB5CXFr25x7DcC",
            "_source": {
                "samples": [
                    {
                        "sample_tags": {
                            "Entry_A": [
                                "A_hyphentated-sample",
                                "sample1"
                            ],
                            "Entry_B": "A_multiple_underscore_example",
                            "Entry_C": [
                                        "sample2",
                                        "another_example_with_underscores"
                            ],
                            "Entry_E": "last_example"
                        }
                    }
                ]
            }
        }
    ]
}

Result:

{
  "docs" : [
    {
      "doc" : {
        "_index" : "samples",
        "_type" : "_doc",
        "_id" : "xUSU_3UB5CXFr25x7DcC",
        "_source" : {
          "samples" : [
            {
              "sample_tags" : {
                "Entry_E" : "last_example",
                "Entry_C" : [
                  "sample2",
                  "another_example_with_underscores"
                ],
                "Entry_B" : "A_multiple_underscore_example",
                "Entry_A" : [
                  "A_hyphentated-sample",
                  "sample1"
                ]
              }
            }
          ]
        },
        "_ingest" : {
          "timestamp" : "2020-12-01T17:29:52.3917165Z"
        }
      }
    }
  ]
}

There are 2 answers

Val (best answer)

Here is a modified version of your script that will work on the data you provided:

PUT /_ingest/pipeline/samples
{
  "description": "preprocessing of samples.json",
  "processors": [
    {
      "script": {
        "tag": "remove underscore from sample_tags values",
        "source": """
          String replaceString(String value) {
            return value.replace('_',' ');
          }
      
          void findReplace(Map map) {
            map.keySet().forEach(key -> {
              if (map[key] instanceof String) {
                  map[key] = replaceString(map[key]);
              } else {
                  map[key] = map[key].stream().map(this::replaceString).collect(Collectors.toList());
              }
            });
          }

          ctx.samples.forEach(sample -> {
              findReplace(sample.sample_tags);
              return true;
          });
          """
      }
    }
  ]
}

The result looks like this:

    {
      "samples" : [
        {
          "sample_tags" : {
            "Entry_E" : "last example",
            "Entry_C" : [
              "sample2",
              "another example with underscores"
            ],
            "Entry_B" : "A multiple underscore example",
            "Entry_A" : [
              "A hyphentated-sample",
              "sample1"
            ]
          }
        }
      ]
    }
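
For readers who want to try the same idea outside of Elasticsearch, here is a rough plain-Java sketch of the approach above: every new value is stored back under its key, and list values are rebuilt with a stream. This is Java rather than Painless, the class name FlatTagCleaner is made up for illustration, and like the script it assumes every value is either a String or a List of Strings.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical class name; a plain-Java analogue of the Painless script above.
public class FlatTagCleaner {
    static String replaceString(String value) {
        return value.replace('_', ' ');
    }

    // Assumes each value is a String or a List of Strings (as in the sample data).
    @SuppressWarnings("unchecked")
    static void findReplace(Map<String, Object> map) {
        // put() on an existing key only swaps the value. That is not a structural
        // modification, so iterating over keySet() at the same time is safe.
        for (String key : map.keySet()) {
            Object value = map.get(key);
            if (value instanceof String) {
                map.put(key, replaceString((String) value));
            } else {
                map.put(key, ((List<String>) value).stream()
                        .map(FlatTagCleaner::replaceString)
                        .collect(Collectors.toList()));
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Object> sampleTags = new HashMap<>();
        sampleTags.put("Entry_B", "A_multiple_underscore_example");
        sampleTags.put("Entry_C",
                new ArrayList<>(List.of("sample2", "another_example_with_underscores")));

        findReplace(sampleTags);

        System.out.println(sampleTags.get("Entry_B")); // A multiple underscore example
        System.out.println(sampleTags.get("Entry_C")); // [sample2, another example with underscores]
    }
}
```

The keys keep their underscores because only the values are passed through replaceString.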

Joe

You were on the right path, but replace returns a new string instead of changing the original one, and you never set those new values back onto the document context ctx, which is what the pipeline eventually returns. This means you need to keep track of where you are during iteration (the index for array lists, the key for hash maps, and so on for everything nested in between) so that you can assign each modified value to its position in the deeply nested context.
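
The difference between calling replace and actually storing its result can be reproduced in a few lines of plain Java (Painless strings are Java strings, so the same rule applies). This is only a minimal sketch; the class and method names are invented for the illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of any pipeline API.
public class WriteBackDemo {
    // Mirrors the question's script: replace() returns a new String,
    // and that result is thrown away.
    static void replaceAndDiscard(List<String> values) {
        for (String value : values) {
            value.replace('_', ' '); // result ignored; the list keeps the old string
        }
    }

    // Stores each result back at its index, so the list itself changes.
    static void replaceAndStore(List<String> values) {
        for (int i = 0; i < values.size(); i++) {
            values.set(i, values.get(i).replace('_', ' '));
        }
    }

    public static void main(String[] args) {
        List<String> tags = new ArrayList<>(List.of("last_example", "sample1"));

        replaceAndDiscard(tags);
        System.out.println(tags); // [last_example, sample1]

        replaceAndStore(tags);
        System.out.println(tags); // [last example, sample1]
    }
}
```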

Here's an example that takes care of strings and (string-only) array lists. You'll need to extend it to handle hash maps (and other types), and then perhaps extract the whole process into a separate function. AFAIK a function has a single declared return type, so a helper that hands back strings, lists and maps alike would have to be declared with the dynamic def type, which may make it a bit more challenging...

PUT /_ingest/pipeline/samples
{
  "description": "preprocessing of samples.json",
  "processors": [
    {
      "script": {
        "tag": "remove underscore from sample_tags values",
        "source": """
          ArrayList samples = ctx.samples;
        
          for (int i = 0; i < samples.size(); i++) {
              def sample = samples.get(i).sample_tags;
              
              for (def entry : sample.entrySet()) {
                  def key = entry.getKey();
                  def val = entry.getValue();
                  // default to the original value so unsupported types are left untouched
                  def replaced_val = val;
                  
                  if (val instanceof String) {
                    replaced_val = val.replace('_',' ');
                  } else if (val instanceof ArrayList) {
                    replaced_val = new ArrayList();
                    for (int j = 0; j < val.length; j++) {
                        replaced_val.add(val[j].replace('_',' ')); 
                    }
                  } 
                  // else if (val instanceof HashMap) {
                    // do your thing
                  // }
                  
                  // crucial part: write the new value back into sample_tags on ctx
                  ctx.samples[i].sample_tags[key] = replaced_val;
              }
          }
        """
      }
    }
  ]
}
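
As a sketch of the "separate function" idea, here is one way the recursion could look in plain Java, with Object playing the role that def would play in Painless. It is an illustration under assumptions, not a tested Painless script: the class name UnderscoreCleaner is made up, and the lists and maps are assumed to be mutable.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class; plain Java, not Painless.
public class UnderscoreCleaner {
    // Returns the cleaned form of any value: strings are replaced, lists and
    // maps are updated in place (recursively), anything else passes through.
    @SuppressWarnings("unchecked")
    static Object clean(Object value) {
        if (value instanceof String) {
            return ((String) value).replace('_', ' ');
        }
        if (value instanceof List) {
            List<Object> list = (List<Object>) value;
            list.replaceAll(UnderscoreCleaner::clean); // write each result back by index
            return list;
        }
        if (value instanceof Map) {
            Map<String, Object> map = (Map<String, Object>) value;
            map.replaceAll((key, nested) -> clean(nested)); // keys stay untouched
            return map;
        }
        return value; // numbers, booleans, null, ...
    }

    public static void main(String[] args) {
        Map<String, Object> sampleTags = new HashMap<>();
        sampleTags.put("Entry_A", new ArrayList<>(List.of("A_hyphentated-sample", "sample1")));
        sampleTags.put("Entry_B", "A_multiple_underscore_example");

        clean(sampleTags);

        System.out.println(sampleTags.get("Entry_A")); // [A hyphentated-sample, sample1]
        System.out.println(sampleTags.get("Entry_B")); // A multiple underscore example
    }
}
```

Because every branch returns its result to the caller, the caller is always the one that stores the value back, which is the same write-back step as in the script above.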