Is it right to conform Hashable by only taking id into consideration?

3.7k views Asked by At

A lot of online example I have came across, when they try to conform to Hashable, they only take id as consideration. For instance https://www.raywenderlich.com/8241072-ios-tutorial-collection-view-and-diffable-data-source , https://medium.com/@JoyceMatos/hashable-protocols-in-swift-baf0cabeaebd , ...

/// Copyright (c) 2020 Razeware LLC
/// 
/// Permission is hereby granted, free of charge, to any person obtaining a copy
/// of this software and associated documentation files (the "Software"), to deal
/// in the Software without restriction, including without limitation the rights
/// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
/// copies of the Software, and to permit persons to whom the Software is
/// furnished to do so, subject to the following conditions:
/// 
/// The above copyright notice and this permission notice shall be included in
/// all copies or substantial portions of the Software.
/// 
/// Notwithstanding the foregoing, you may not use, copy, modify, merge, publish,
/// distribute, sublicense, create a derivative work, and/or sell copies of the
/// Software in any work that is designed, intended, or marketed for pedagogical or
/// instructional purposes related to programming, coding, application development,
/// or information technology.  Permission for such use, copying, modification,
/// merger, publication, distribution, sublicensing, creation of derivative works,
/// or sale is expressly withheld.
/// 
/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
/// THE SOFTWARE.

import UIKit

class Video: Hashable {
  var id = UUID()
  var title: String
  var thumbnail: UIImage?
  var lessonCount: Int
  var link: URL?
  
  init(title: String, thumbnail: UIImage? = nil, lessonCount: Int, link: URL?) {
    self.title = title
    self.thumbnail = thumbnail
    self.lessonCount = lessonCount
    self.link = link
  }
  // 1
  func hash(into hasher: inout Hasher) {
    // 2
    hasher.combine(id)
  }
  // 3
  static func == (lhs: Video, rhs: Video) -> Bool {
    lhs.id == rhs.id
  }
}

I was wondering, is that ever a correct way to conform Hashable? I thought we should take all class member variables, into consideration?

For instance, by using only id in func hash/ func ==, will yield the following misbehaviour.

We are going encounter 2 objects with different content, but func == will return true when comparing 2 objects with different content.

struct Dog: Hashable {
    let id = UUID()
    var name: String
    var age: Int
    
    init(name: String, age: Int) {
        self.name = name
        self.age = age
    }

    func hash(into hasher: inout Hasher) {
        hasher.combine(id)
    }

    static func == (lhs: Dog, rhs: Dog) -> Bool {
        lhs.id == rhs.id
    }
}


var dog0 = Dog(name: "dog", age: 1)
var dog1 = dog0

/*
 dog0 is -5743610764084706839, dog, 1
 dog1 is -5743610764084706839, dog, 1
 compare dog0 with dog1 is true
 */
print("dog0 is \(dog0.hashValue), \(dog0.name), \(dog0.age)")
print("dog1 is \(dog1.hashValue), \(dog1.name), \(dog1.age)")
print("compare dog0 with dog1 is \(dog0 == dog1)")


dog1.name = "another name"
dog1.age = 9

// Same id, but different content!

/*
 dog0 is -5743610764084706839, dog, 1
 dog1 is -5743610764084706839, another name, 9
 compare dog0 with dog1 is true
 */
print("dog0 is \(dog0.hashValue), \(dog0.name), \(dog0.age)")
print("dog1 is \(dog1.hashValue), \(dog1.name), \(dog1.age)")
print("compare dog0 with dog1 is \(dog0 == dog1)")

I was wondering, is it right to conform Hashable by only taking id into consideration?


p/s

I try to look from other languages like Java, on what is the general advice regarding hash code generation . This is what is being written in their popular Effective Java book.

Do not be tempted to exclude significant fields from the hash code computation to improve performance. While the resulting hash function may run faster, its poor quality may degrade hash tables’ performance to the point where they become unusable. In particular, the hash function may be confronted with a large collection of instances that differ mainly in regions you’ve chosen to ignore. If this happens, the hash function will map all these instances to a few hash codes, and programs that should run in linear time will instead run in quadratic time. This is not just a theoretical problem. Prior to Java 2, the String hash function used at most sixteen characters evenly spaced throughout the string, starting with the first character. For large collections of hierarchical names, such as URLs, this function displayed exactly the pathological behavior described earlier.

6

There are 6 answers

0
Dávid Pásztor On

Whether it is correct or not to only use a subset of properties for a Hashable conformance completely depends on your requirements.

If for a certain object, equality is really only defined by a single variable (or a subset of variables), than it is correct to use that subset of variables for the Hashable (and Equatable conformances).

However, if all properties of a type are required to decide whether two instances are equal or not, then you should use all properties.

0
Dhawal On

This is true. your code has one requirement for hashable, it compare only dog.id == dog1.id when you use dog == dog1.

if you want to check all field of struct then compare that field in == method.

static func == (lhs: Dog, rhs: Dog) -> Bool {
    lhs.id == rhs.id && lhs.name == rhs.name && lhs.age == rhs.age
}
15
Rob Napier On

TL;DR: This hash function is unnecessary, but legal, and arguably ideal. This == is incorrect, despite being common in tutorials, because it breaks substitutability which is required by Equatable, exactly as you suggest.

However, as matt notes, diffable data sources may require this anyway. That doesn't make it good, but it may make it necessary. (Do read all of matt's comments below. They provide a lot of important context. In reference specifically to diffable data sources, see his answer; I am not particularly familiar with diffable data sources.)


I suggest turning to the documentation, which lays this out.

First, Hashable:

Hashing a value means feeding its essential components into a hash function, represented by the Hasher type. Essential components are those that contribute to the type’s implementation of Equatable. Two instances that are equal must feed the same values to Hasher in hash(into:), in the same order.

The most important thing is that Hashable be consistent with Equatable. Two things must never be equal, but have different hashes.

The converse is not true. It is completely valid for two unequal things to have the same hash. In fact, that's a fundamental fact of hashing called the pigeonhole principle. A good hash improves performance by avoiding unnecessary equality checks. But the following hash(into:) function is always valid:

func hash(into hasher: inout Hasher) {
    hasher.combine(0)
}

This just means that every value has the same hash, and so the system will always call ==. This is bad for performance (and in server applications that can translate into a denial of service attack called hash flooding). But it is legal.

If that's legal, certainly just hashing id is legal.

But....

That brings us to Equatable and its docs, and the most important paragraph (emphasis added):

Equality implies substitutability—any two instances that compare equally can be used interchangeably in any code that depends on their values. To maintain substitutability, the == operator should take into account all visible aspects of an Equatable type. Exposing nonvalue aspects of Equatable types other than class identity is discouraged, and any that are exposed should be explicitly pointed out in documentation.

A value must only be considered equal if they can be substituted for each other in any context, and it will not impact the correctness of the program. Clearly in your example, that's not true. In fact, it's never going to be true for a type with mutable public properties (despite many tutorials that get this wrong). So your == is incorrect. But your hash function is fine, arguably ideal. Its goal is to be a quick check for non-equality that minimizes collisions. If the ids are the same, you still have to check the rest of the values, but if they're different, you know it's not going to be equal.

If your Dog type were immutable (name and age were let rather than var), it might be acceptable to implement == this way. It's impossible to set the id by hand, so it would be impossible to get two values with the same id but different values. But I wouldn't do that unless you could show a significant performance boost. It hangs correctness on too subtle a requirement. For example, if an extension added an init that allowed setting id directly, it would make your == invalid. That's too fragile IMO.

How about private mutable state? As long as that is only for performance purposes (memoization/caching), then it's fine to leave out of == (and hash). But if that internal state can influence externally visible behavior, than it needs to be part of ==.

The good news is, most of the time you don't need to worry. Swift's automatic implementations handle this for you correctly out of the box, and compare all properties. So in your Dog example, the best solution is to just remove the methods (I'm sure you're aware of that; just stating it for folks reading along). Whenever possible, I highly recommend using the default conformances for Hashable and avoid writing your own.

But in cases where you have to implement your own, the rules are simple:

  • Two equal values must be perfectly substitutable in all cases without impacting correctness (though a substitution may impact performance)
  • Two equal values must always have the same hash

The guidelines are also fairly simple: Hashing should be fast, while minimizing collisions.


The one argument I've seen for these incorrect implementations of == is to try to make Set work nicely. IMO, this is a misuse of Set and Equatable, and is not promised to work in expected ways (if you insert a duplicate value with the same identifier, but different properties, it's undefined which of the values will be in the collection). You should not twist Equatable around wanting to use a specific data structure. You should use the data structure that matches your meaning.

In the common case, the right tool is Dictionary as [ID: Value]. It expresses what you really mean: a mapping between an ID and a single value for that ID, rather than an unordered bag of unique values.

There is likely a memory cost in using a Dictionary rather than a Set (since you have to duplicate the ID). But you should only try to work around that after proving there's a problem to be solved.


Also, see matt's comment below. I have not spent a lot of time with the new diffable data sources. I remember when I first saw them that I was concerned they might be misusing of Equatable. If that's true, then you may have to misuse Equatable to use them, and that would explain some tutorials that do it this way. That doesn't make it good Swift, but it may be required by Apple frameworks.


As I've studied Apple's code more (see matt's answer for many), I've noticed that they all follow the rule I discussed above: they are immutable and you cannot set the UUID during init. This construction makes it impossible for two values to have the same id but other values be different, so checking the id is always sufficient. But if you make the values mutable, or you allow the id to be anything other than let id = UUID(), then this construction becomes dangerous.

0
Sven On

That is completely fine. There is only one requirement for Hashable: If a == b then a.hashValue == b.hashValue must also be true. This is fulfilled here, so your struct will work as a dictionary key or as a set member.

Note that this also is fulfilled, if your hash(into:) doesn’t combine any data (or only constant data) into the hasher. This will make hash table lookups slow, but they will still work.

Another option is to compare all fields in your == implementation but only use a subset of them for hashing in hash(into:). That still follows the rules (the other way around is not allowed of course). This may be useful as a performance optimization, but it also may hurt performance. Depends on the distribution of the data you are hashing.

0
matt On

It is fine to have a type with multiple properties, including a UUID, where the conformance to Hashable and Equatable depends solely on the UUID and not on any of the other properties. Apple uses this pattern in their own code. Download Apple's example code from here:

https://docs-assets.developer.apple.com/published/6840986f9a/ImplementingModernCollectionViews.zip

Look at the WiFiController.Network struct, the MountainsController.Mountain struct, the OutlineViewController.OutlineItem class, and the InsertionSortArray.SortNode struct. They all do exactly this same thing. So, all of this code is by Apple:


struct Network: Hashable {
    let name: String
    let identifier = UUID()

    func hash(into hasher: inout Hasher) {
        hasher.combine(identifier)
    }
    static func == (lhs: Network, rhs: Network) -> Bool {
        return lhs.identifier == rhs.identifier
    }
}

struct Mountain: Hashable {
    let name: String
    let height: Int
    let identifier = UUID()
    func hash(into hasher: inout Hasher) {
        hasher.combine(identifier)
    }
    static func == (lhs: Mountain, rhs: Mountain) -> Bool {
        return lhs.identifier == rhs.identifier
    }
    func contains(_ filter: String?) -> Bool {
        guard let filterText = filter else { return true }
        if filterText.isEmpty { return true }
        let lowercasedFilter = filterText.lowercased()
        return name.lowercased().contains(lowercasedFilter)
    }
}

class OutlineItem: Hashable {
    let title: String
    let subitems: [OutlineItem]
    let outlineViewController: UIViewController.Type?

    init(title: String,
         viewController: UIViewController.Type? = nil,
         subitems: [OutlineItem] = []) {
        self.title = title
        self.subitems = subitems
        self.outlineViewController = viewController
    }
    func hash(into hasher: inout Hasher) {
        hasher.combine(identifier)
    }
    static func == (lhs: OutlineItem, rhs: OutlineItem) -> Bool {
        return lhs.identifier == rhs.identifier
    }
    private let identifier = UUID()
}

struct SortNode: Hashable {
    let value: Int
    let color: UIColor

    init(value: Int, maxValue: Int) {
        self.value = value
        let hue = CGFloat(value) / CGFloat(maxValue)
        self.color = UIColor(hue: hue, saturation: 1.0, brightness: 1.0, alpha: 1.0)
    }
    private let identifier = UUID()
    func hash(into hasher: inout Hasher) {
        hasher.combine(identifier)
    }
    static func == (lhs: SortNode, rhs: SortNode) -> Bool {
        return lhs.identifier == rhs.identifier
    }
}
0
MH175 On

Your suspicion is correct. The question of whether it's right (as you put it) is a matter of the domain. Excellent explanations have been given about the technicalities of hashing and equality.

Since your question has touched several times on DiffableDataSource you should be very cautious about designing your domain logic to suit the needs of the UI framework. Doing so violates the dependency inversion and open-closed principles.

Create a local data structure to use as the data source's item identifier, and copy only the properties you intentionally decide should trigger a reload.


typealias DataSource = UICollectionViewDiffableDataSource<Int, DogItem>

struct DogItem: Hashable {
    var name: String // the cell will reload whenever the name changes
}

vs

struct DogItem: Hashable {
    var id: Dog.ID // the cell will never change until explicitly told
}

As was mentioned, a common solution is to use a dictionary that maps an id to a value:

var items = [UUID:Item]()

However this might allow you to map the wrong value:

var item1 = Item(id: UUID())
var item2 = Item(id: UUID())
items[item1.id] = item2

Instead, create a reusable hashing wrapper data structure to explicitly use the identifier.

struct IDMap<T: Identifiable & Hashable> {

    private var _items = [T.ID : T]()

    init() { }

    subscript(index: T.ID) -> T? {
        get { return _items[index] }
    }
    
    mutating func insert(_ item: T) {
        _items[item.id] = item
    }
    
    mutating func remove(_ item: T) {
        _items.removeValue(forKey: item.id)
    }
    
    var items: Set<T> {
        return Set(_items.values)
    }


}