Physical identity based alternative to Hashtbl.hash

676 views Asked by At

I'm trying to derive a Graphviz file describing a structured value. This is for diagnostic purposes so I want my graph to mirror the actual structure in memory as closely as possible. I'm using the below to map values to Graphviz vertices so that I can reuse a vertex when a value has two or more inbound references:

let same = (==)

module StateIdentity : Hashtbl.HashedType = struct
  type t = R.meta_t state
  let hash = Hashtbl.hash
  let equal = same
end

module StateHashtbl = Hashtbl.Make (StateIdentity)

The documentation for Hashtbl.hash suggests that it is suitable for use both when StateIdentity.equal = (=) and when StateIdentity.equal = (==) but I'd like to ensure that hash table access is as close to O(1) as possible so would rather not have Hashtbl.hash walking a (potentially large in this case) object graph on every lookup.

I know Ocaml moves references around, but is there an O(1) proxy for reference identity available in Ocaml?

The answer to Hashtable of mutable variable in Ocaml suggests not.

I'm loathe to attach serial numbers to states, since this is diagnostic code so any errors I make doing that have the potential to mask other bugs.

4

There are 4 answers

1
gasche On BEST ANSWER

If you are using the word "object" in the sense of OCaml's < ... > object types, then you can use Oo.id to get a unique integer identity for each instance. Otherwise the answer to "is there a general proxy for value identity" is "no". In this case my advice would be to start with Hashtbl.hash, evaluate whether it fits your need, and otherwise design your own hashing function.

You can also play with Hashtbl.hash_param (see documentation) to turn knob on value traversals during hashing. Note that the Hashtbl code uses linked lists for bucket of same-hash values, so having lots of hash conflicts will trigger linear search behavior. It may be better to move to other implementations using binary search trees for conflict buckets. But then again, you should evaluate your situation before moving to more complex (and with worse performances in the "good case") solutions.

1
Martin Jambon On

Like others, I think unique IDs are the way to go.

Unique IDs are not hard to generate safely. One solution is to use a so-called private record as follows. It prevents users of the module from copying the id field:

module type Intf =
sig
  type t = private {
    id : int;
    foo : string;
  }

  val create_t : foo: string -> t
end

module Impl : Intf =
struct
  type t = {
    id : int;
    foo : string;
  }

  let create_id =
    let n = ref 0 in
    fun () ->
      if !n = -1 then
        failwith "Out of unique IDs"
      else (
        incr n;
        !n
      )

  let create_t ~foo = {
    id = create_id ();
    foo
  }
end
1
Pierre Chambart On

Sorry for the ugly hack, but I made something like that some time ago.

The trick about that is to ensure that values won't be moved in memory after inserting in the table. There are two situations that can move values in memory: copy from the minor to the major heap and major heap compaction. That means that when you insert a value in the table, it must be in the major heap and between two operations on the table you must ensure that no compaction happened.

Checking that the value is in the minor heap can be done using the C function is_young, if it is the case, you can force the value to migrate to the major heap using Gc.minor ().

For the second problem, you can either completely deactivate compactions or rebuild the table on compactions. Disabling it can be done using

Gc.set { Gc.get () with Gc.max_overhead = max_int }

Detecting that a compaction happened can be done by comparing at each acces to the table the number returned by

( Gc.quick_stat () ).Gc.compactions

Notice that you must be disable the compaction before accessing the table. If you disable compaction you should also consider changing the allocation policy to avoid unbounded fragmentation of the heap.

Gc.set {(Gc.get ()) with Gc.allocation_policy = 1}

If you want something really ugly in old versions of OCaml (before 4.00) the compaction kept the value in the same order in memory, so you could implement a set or map based on physical address without worrying.

0
Jeffrey Scofield On

I've found it very tricky to use physical equality to do hashing. You certainly can't use something like the address of the value as your hash key, because (as you say) things get moved around by GC. Once you have a hash key, it seems like you can use physical equality to do comparisons as long as your values are mutable. If your values aren't mutable, OCaml doesn't guarantee much about the meaning of (==). In practical terms, immutable objects that are equal (=) can theoretically be merged into a single physical object if the OCaml compiler or runtime wishes (or vice versa).

When I work through the various possibilities, I usually end up putting a sequence number into my values when I need a unique id. As gasche says, you can use Oo.id if your values are actual OO-style objects.