How to store relational data in elasticsearch

12.5k views Asked by At

What are the options to store relational data in elasticsearch. I know the following approaches

  1. Nested object :- I don't want to store data in nested format because I want to update the one document without changing the other document and if I use nested object then there will be repetition of child data in parent documents.

  2. Parent-child :- I don't want to store data in single index, but for using Parent-child data needs to be present in one index(different types). I know this restriction will be removed in future release as mentioned in https://github.com/elastic/elasticsearch/issues/15613 issue, but I want a solution that should work with 5.5 version.

Is there any other approach other then above.

3

There are 3 answers

0
AudioBubble On

There are two more approaches: Denormalization and running multiple queries for joins.

Denormalization will eat up some more space and increase your write time, but you will just need to run one query to retrieve your data, hence, your read time will improve. Since you don't want to store data in a single index, so joining might help you out.

6
Hatim Stovewala On

Nested Object is a perfect approach for it. There will be no repetition of child objects in parent document if you update the child objects correctly. I'm using the same approach for one of my use case where I need to maintain relational data of Master-Child One-to-Many relationship. I've written a Painless script for Update API to Add & Update existing nested child objects inside parent document without creating duplicates or repetitive entries.

Updated Answer:

Below is the structure of Parent-Child Nested Type document with embedded nested type documents "childs".

{
    "parent_id": 1,
    "parent_name": "ABC",
    "parent_number": 123,
    "parent_addr": "123 6th St. Melbourne, FL 32904"
    "childs": [
      {
        "child_id": 1,
        "child_name": "PQR",
        "child_number": 456,
        "child_age": 10
      },
      {
        "child_id": 2,
        "child_name": "XYZ",
        "child_number": 789,
        "child_age": 12
      },
      {
        "child_id": 3,
        "child_name": "QWE",
        "child_number": 234,
        "child_age": 16
      }

    ]   
}

Mapping would be as below:

PUT parent/
{
  "parent": {
    "mappings": {
      "parent": {
        "properties": {
          "parent_id": {
            "type": "long"
          },
          "parent_name": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "parent_number": {
            "type": "long"
          },
          "parent_addr": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "child_tickets": {
            "type": "nested",
            "properties": {
              "child_id": {
                "type": "long"
              },
              "child_name": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "child_number": {
                "type": "long"
              },
              "child_age": {
                "type": "long"
              }
            }
          }
        }
      }
    }
  }
}

In RDMS, these both entities(parent, child) are two different tables with One to Many relation between Parent -> Child. Parent's id is foreign key for Child's row. (id is must for both tables)

Now in Elasticsearch, to index the parent document we must have id to index it, in this case it is parent_id. Index Parent document Query(parent_id is the id which i was talking about and have index the document with id(_id) = 1):

POST parent/parent/1
{
    "parent_id": 1,
    "parent_name": "ABC",
    "parent_number": 123,
    "parent_addr": "123 6th St. Melbourne, FL 32904"
}

Now, adding child(s) to the parent. For that you will require child document which should have child id plus parent id. To add a child, parent id is must. Below is the update query to add new childs or update already present childs.

POST parent/parent/1/_update
{
    "script":{
    "lang":"painless",
    "inline":"if (!ctx._source.containsKey(\"childs\")) {
                ctx._source.childs = [];
                ctx._source.childs.add(params.child);
            } else {
                int flag=0;
                for(int i=0;i<ctx._source.childs.size();i++){
                    if(ctx._source.childs[i].child_id==params.child.child_id){
                        ctx._source.childs[i]=params.child;
                        flag++;
                    }
                }
                if(flag==0){
                    ctx._source.childs.add(params.child);
                }
            }",
    "params":{
        "child":{
                "child_id": 1,
                "child_name": "PQR",
                "child_number": 456,
                "child_age": 10
            }
        }
    }
}

Give it a shot. Cheers!

Let me know if you need anything else.

0
Nacho On

There are four mechanisms that can be used to provide relational data modelling support. Each has its pros and cons, making them useful for different situations... here's a summary:

Inner Object

  • Easy, fast, performant
  • Only applicable when one-to-one relationships are maintained
  • No need for special queries

Nested

  • Nested docs are stored in the same Lucene block as each other, which helps read/query performance. Reading a nested doc is faster than the equivalent parent/child.
  • Updating a single field in a nested document (parent or nested children) forces ES to reindex the entire nested document. This can be very expensive for large nested docs
  • "Cross referencing" nested documents is impossible
  • Best suited for data that does not change frequently

Parent/Child

  • Children are stored separately from the parent, but are routed to the same shard. So parent/children are slightly less performance on read/query than nested
  • Parent/child mappings have a bit extra memory overhead, since ES maintains a "join" list in memory
  • Updating a child doc does not affect the parent or any other children, which can potentially save a lot of indexing on large docs
  • Sorting/scoring can be difficult with Parent/Child since the Has
  • Child/Has Parent operations can be opaque at times

Denormalization

  • You get to manage all the relations yourself!
  • Most flexible, most administrative overhead
  • May be more or less performant depending on your setup

For more info, please go to: https://www.elastic.co/blog/managing-relations-inside-elasticsearch