Migrating a node label in gremlin tinkerpop


I'm migrating a graph from an old naming system to a new one and am looking for an efficient way to re-label every node that has a given label without losing any node properties, edges, or edge properties. Ideally, I'd like to write a function that can do this for any node label with any set of edge labels.

I've been able to copy over all the nodes with their properties intact easily enough, but migrating over the edges has been more of a pain. It's easy enough if I pull all the edges and re-add them one by one, but this approach is incredibly slow, as it requires a query per edge.

In trying to write this all as one query, I've run into two primary issues. The first is that it does not appear to be possible to dynamically apply the edge label. It appears to pass a traverser instead of a value if I try something like:

await g.V().hasLabel('nodeLabel').outE().as('e1').addE(__.select('e1').label())

I guess I could get around this by grabbing the edge list on the node type and dynamically constructing the query like this?:

query = query.outE().hasLabel(edgeLabels[i]).addE(edgeLabels[i]) etc

The second, more significant, issue is assigning the from and to vertices for all edges in one query. If I'm not careful, two things can go wrong with an edge that connects two of the nodes I'm relabeling. First, it may not be updated correctly and end up pointing to an old node. Second, it can get created twice (once from an inE() step and once from an outE() step). I have considered updating the edges after all the nodes have been copied (when I have a mapping from old node ids to new node ids), but it seems that this would require a lambda, which is beyond my current experience. Additionally, the documentation for lambdas is quite sparse, and it's not clear to me that a lambda would even accept a closure containing the map, so I'd like to avoid wrestling with it if possible. This is all relatively trivial to handle with hash maps on the client, which is what I currently do, but that quickly becomes slow, as it pushes the required number of queries to O(E), where E is the number of edges of the vertex label class.
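As a rough sketch of the hash-map bookkeeping described above, the function below plans the edge copies in memory so that each edge is copied once and both endpoints are remapped when they belong to relabeled nodes. The names EdgeRecord, PlannedEdge and planEdgeCopies are hypothetical and not part of gremlin-javascript. It assumes the edges were already fetched (for example with a single bothE() query over the old label) together with a map from old node ids to new node ids.

```typescript
// Hypothetical shape: only the fields the planner needs from each fetched edge.
interface EdgeRecord {
  id: string;
  label: string;
  outVId: string; // id of the source vertex
  inVId: string; // id of the target vertex
}

// Hypothetical shape: one edge that still has to be created on the new nodes.
interface PlannedEdge {
  sourceEdgeId: string; // the old edge whose properties should be copied
  label: string;
  outVId: string;
  inVId: string;
}

function planEdgeCopies(
  edges: EdgeRecord[],
  oldToNewId: Map<string, string>,
  edgeLabelReplacements: Map<string, string> = new Map<string, string>(),
): PlannedEdge[] {
  const seen = new Set<string>();
  const plan: PlannedEdge[] = [];

  for (const edge of edges) {
    // An edge between two relabeled nodes shows up once as an out-edge and
    // once as an in-edge; skipping repeats by edge id avoids the double copy.
    if (seen.has(edge.id)) {
      continue;
    }
    seen.add(edge.id);

    plan.push({
      sourceEdgeId: edge.id,
      label: edgeLabelReplacements.get(edge.label) ?? edge.label,
      // Remap BOTH endpoints, so no copy keeps pointing at an old node.
      outVId: oldToNewId.get(edge.outVId) ?? edge.outVId,
      inVId: oldToNewId.get(edge.inVId) ?? edge.inVId,
    });
  }

  return plan;
}
```

The planning step itself needs no queries. The remaining cost is in how many round trips are used to write the planned edges.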

For reference, I'm currently migrating a class with about 30k edges, and I estimate it will take about 2 hours.

I hope there is a faster way to do this. I was surprised at how difficult this was, given that this is an operation I assume people undertake with (sparse) regularity.

Below is my current working, but slow, approach:

async migrateNodes(
    selectToDuplicate: gremlin.process.GraphTraversal,
    newLabel: string,
    edgeLabelReplacements: Map<string, string> = new Map<string, string>(),
  ) {
    const originalAndDuplicateIds = (await selectToDuplicate
      .as('original')
      .addV(newLabel)
      .as('duplicate')
      .sideEffect(
        __.select('original')
          .properties()
          .unfold()
          .as('props')
          .select('duplicate')
          .property(__.select('props').key(), __.select('props').value()),
      )
      .project('original', 'duplicate')
      .by(__.select('original').id())
      .by(__.select('duplicate').id())
      .toList()) as Map<any, any>[];

    type OriginalAndDuplicateIds = {
      original: string;
      duplicate: string;
    };

    const originalAndDuplicateIdsParsed = originalAndDuplicateIds.map((ele) =>
      mapToObject<OriginalAndDuplicateIds>(ele),
    );

    await this.duplicateEdges(originalAndDuplicateIdsParsed, edgeLabelReplacements);
  }

  async duplicateEdges(
    originalAndDuplicateIds: any[],
    edgeLabelReplacements: Map<string, string> = new Map<string, string>(),
  ) {
    // original and duplicate projection
    let originalIds = [];
    let duplicateIds = [];
    let limit = pLimit(2000);
    let tasks = [];

    for (let i = 0; i < originalAndDuplicateIds.length; i++) {
      originalIds.push(originalAndDuplicateIds[i].original);
      duplicateIds.push(originalAndDuplicateIds[i].duplicate);
    }

    let seen = new Map<string, boolean>();

    for (let i = 0; i < originalAndDuplicateIds.length; i++) {
      const originalId = originalAndDuplicateIds[i].original;
      const duplicateId = originalAndDuplicateIds[i].duplicate;
      const outEdges = (await this.g.V(originalId).outE().toList()) as Edge[];
      const inEdges = (await this.g.V(originalId).inE().toList()) as Edge[];
      for (let edge of outEdges) {
        let inVId = edge.inV.id;
        if (originalIds.includes(inVId)) {
          inVId = duplicateIds[originalIds.indexOf(inVId)];
        }
        if (seen.has(edge.id)) {
          continue;
        }

        seen.set(edge.id, true);

        let label = edge.label;

        if (edgeLabelReplacements.has(label)) {
          label = edgeLabelReplacements.get(label) as string;
        }

        tasks.push(
          limit(() => {
            this.edgeCounter++;
            console.log('running task ' + this.edgeCounter);
            return this.g
              .E(edge.id)
              .as('e1')
              .V(duplicateId)
              .as('out')
              .V(inVId)
              .as('in')
              .addE(label)
              .from_(__.select('out'))
              .to(__.select('in'))
              .as('e2')
              .sideEffect(
                __.select('e1')
                  .properties()
                  .unfold()
                  .as('props')
                  .select('e2')
                  .property(__.select('props').key(), __.select('props').value()),
              )
              .iterate();
          }),
        );
      }

      for (let edge of inEdges) {
        let outVId = edge.outV.id;
        if (originalIds.includes(outVId)) {
          outVId = duplicateIds[originalIds.indexOf(outVId)];
        }

        if (seen.has(edge.id)) {
          continue;
        }

        seen.set(edge.id, true);

        let label = edge.label;

        if (edgeLabelReplacements.has(label)) {
          label = edgeLabelReplacements.get(label) as string;
        }

        tasks.push(
          limit(() => {
            this.edgeCounter++;
            console.log('running task ' + this.edgeCounter);
            return this.g
              .E(edge.id)
              .as('e1')
              .V(outVId)
              .as('out')
              .V(duplicateId)
              .as('in')
              .addE(label)
              .from_(__.select('out'))
              .to(__.select('in'))
              .as('e2')
              .sideEffect(
                __.select('e1')
                  .properties()
                  .unfold()
                  .as('props')
                  .select('e2')
                  .property(__.select('props').key(), __.select('props').value()),
              )
              .iterate();
          }),
        );
      }

      tasks.push(
        limit(() => {
          this.edgeCounter = 0;
          return;
        }),
      );
    }

    await Promise.all(tasks);
  }
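One way to cut the round trips in the approach above is to chain several addE steps into a single traversal and submit the edges in batches, so the number of queries drops from O(E) to roughly E divided by the batch size. The sketch below is not the poster's code. PlannedEdgeWrite, EdgeStart, ChainableTraversal, splitIntoBatches and chainEdgeWrites are hypothetical names, and the two interfaces are a structural stand-in for the few traversal methods used here, which gremlin-javascript's traversal objects are assumed to satisfy. It also assumes the edge properties were read on the client beforehand.

```typescript
// Hypothetical shape: an edge to create, with properties already read client-side.
interface PlannedEdgeWrite {
  label: string;
  outVId: string;
  inVId: string;
  properties: Record<string, unknown>;
}

// Structural stand-ins (assumption): anything that can start an edge write,
// and anything that can keep chaining steps after it.
interface EdgeStart {
  addE(label: string): ChainableTraversal;
}

interface ChainableTraversal extends EdgeStart {
  from_(vertex: unknown): ChainableTraversal;
  to(vertex: unknown): ChainableTraversal;
  property(key: string, value: unknown): ChainableTraversal;
  iterate(): Promise<unknown>;
}

// Split the planned edges into fixed-size batches (the last one may be shorter).
function splitIntoBatches<T>(items: T[], batchSize: number): T[][] {
  if (batchSize < 1) {
    throw new Error('batchSize must be at least 1');
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Chain one addE(...).from_(...).to(...).property(...) group per planned edge.
// Returns undefined for an empty batch, since there is nothing to submit.
function chainEdgeWrites(
  source: EdgeStart,
  batch: PlannedEdgeWrite[],
  vertexById: (id: string) => unknown,
): ChainableTraversal | undefined {
  let current: EdgeStart = source;
  let chained: ChainableTraversal | undefined;

  for (const edge of batch) {
    let step = current
      .addE(edge.label)
      .from_(vertexById(edge.outVId))
      .to(vertexById(edge.inVId));
    for (const key of Object.keys(edge.properties)) {
      step = step.property(key, edge.properties[key]);
    }
    chained = step;
    current = step;
  }

  return chained;
}
```

With gremlin-javascript, each batch might then be submitted with something like chainEdgeWrites(g, batch, (id) => __.V(id))?.iterate(). How long a single traversal the server accepts is something to test, so it seems safer to start with a small batch size and increase it gradually.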

There is 1 answer

bechbd

Unfortunately, Gremlin makes this difficult: labels are immutable, so you cannot change or remove an existing label. From your post, it is a bit unclear whether you are looking to relabel the nodes or the edges, so I have added information for both below.

If you are using Amazon Neptune, and you want to change the label for a node, you can use the interoperability between openCypher and Gremlin to change the label names using openCypher similar to this:

MATCH (n:airport {code: 'ANC'}) SET n:airport2 REMOVE n:airport RETURN n

If you are looking to change edge labels, then this will require you to reload or recreate all the edges, as edge labels cannot be changed. The best approach here would probably be to export the data following the approach below, modify the edge labels/ids, reload the edge data, then delete all the old edges.

https://docs.aws.amazon.com/neptune/latest/userguide/neptune-data-export.html
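The "modify the edge labels/ids" step of that export-and-reload approach could look roughly like the sketch below, applied to each exported edge row before reloading. It assumes the rows have already been parsed into objects keyed by column header, and that the system columns are named ~id, ~from, ~to and ~label as in Neptune's Gremlin CSV load format. EdgeCsvRow, rewriteExportedEdgeRow and the newEdgeId callback are hypothetical names for illustration.

```typescript
// One exported edge row keyed by column header (assumption: the `~id`, `~from`,
// `~to` and `~label` system columns follow Neptune's Gremlin CSV load format).
type EdgeCsvRow = Record<string, string>;

function rewriteExportedEdgeRow(
  row: EdgeCsvRow,
  vertexIdReplacements: Map<string, string>,
  edgeLabelReplacements: Map<string, string>,
  newEdgeId: (oldId: string) => string,
): EdgeCsvRow {
  return {
    // Property columns are passed through unchanged.
    ...row,
    // The old edges still exist at reload time, so each copy needs a fresh id.
    '~id': newEdgeId(row['~id']),
    // Endpoints change only if the vertex itself was given a new id.
    '~from': vertexIdReplacements.get(row['~from']) ?? row['~from'],
    '~to': vertexIdReplacements.get(row['~to']) ?? row['~to'],
    '~label': edgeLabelReplacements.get(row['~label']) ?? row['~label'],
  };
}
```

If the vertices keep their ids (for example because only the edge labels change), an empty map can be passed for vertexIdReplacements.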