I was wondering, why in the union-find algorithm - you merge two trees by their height - attaching the smaller one to the taller one (in the simpler variant without path compression).
Would it be a worse approach if you merged them by their count of elements - attaching the tree with less elements to the tree with more elements?
It's safe to merge either by height or by number of nodes.
When merging by height, a tree with n nodes in it will have height at most O(log n). To see this, note that at least one node is required to get a tree of height zero. From there, the only way to get a tree of height h + 1 is to merge together two trees of height h. This means that the number of nodes required to get height h, which we'll denote by N(h), is given by
This recurrence solves to N(h) ≥ 2h, which means that with n total nodes, we would get a maximum possible height of about log n. This upper bound is tight, since if you start with 2h independent trees and repeatedly pairwise merge them, you'd produce a single tree of height h.
When merging by number of nodes, we can similarly show that a tree of n nodes will have height at most O(log n). We can use a similar argument. Let M(h) denote the minimum number of nodes necessary to create a tree of height h when merging by number of nodes. Clearly, M(0) = 1. So, what would have to happen to create a tree of height h + 1? This would require merging together two trees of height h together, and so we get that
This, as before, solves to M(h) ≥ 2h, so not only is the upper bound also O(log n), it's the exact same upper bound we had before. Consequently, it's safe to union either by height or by number of nodes.
So why use height rather than number of nodes? One argument is to consider the number of bits of information required to store the height information. If each node stores the number of nodes below it, we need O(log n) bits per node to write out the size information. On the other hand, if each node only stores the height it's at, a number which is O(log n), we only need O(log log n) bits per node to write everything out, which is exponentially less space. Moreover, the optimization of using path compression and union-by-rank is specifically designed to use height information rather than size information, so without repeating the (frighteningly hard) runtime analysis we can't conclude whether we still get the same time bound.
Hope this helps!