# Disjoint-set data structure

Given a set of elements, it is often useful to break them up or partition them into a number of separate, nonoverlapping sets. A disjoint-set data structure is a data structure that keeps track of such a partitioning. A union-find algorithm is an algorithm that performs two useful operations on such a data structure:

• Find: Determine which set a particular element is in. Also useful for determining if two elements are in the same set.
• Union: Combine or merge two sets into a single set.

Because it supports these two operations, a disjoint-set data structure is sometimes called a merge-find set. The other important operation, MakeSet, which makes a set containing only a given element (a singleton), is generally trivial. With these three operations, many practical partitioning problems can be solved (see the Applications section).

In order to define these operations more precisely, we need some way of representing the sets. One common approach is to select a fixed element of each set, called its representative, to represent the set as a whole. Then, Find(x) returns the representative of the set that x belongs to, and Union takes two set representatives as its arguments.

Perhaps the simplest approach to creating a disjoint-set data structure is to create a linked list for each set. We choose the element at the head of the list as the representative.

MakeSet is obvious, creating a list of one element. Union simply appends one list to the other, a constant-time operation if each list also keeps a pointer to its tail. Unfortunately, with this implementation Find requires Ω(n), or linear, time in the worst case.

We can avoid this by including in each linked list node a pointer to the head of the list; then Find takes constant time. However, we've now ruined the time of Union, which has to go through the elements of the list being appended to make them point to the head of the new combined list, requiring Ω(n) time.

We can ameliorate this by always appending the shorter list to the longer one, a rule called the weighted-union heuristic. To apply it efficiently, we also keep track of the length of each list as we perform operations. With this heuristic, a sequence of m MakeSet, Union, and Find operations on n elements requires O(m + n log n) time. To make any further progress, we need to start over with a different data structure.
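The list-based scheme with the weighted-union heuristic can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the names `ListSet` and `WeightedListSets` are made up for this example, a Python list stands in for the linked list, and a dictionary plays the role of the per-node pointer to the head of the list.

```python
class ListSet:
    """One set; a Python list stands in for the linked list (an assumption)."""

    def __init__(self, first):
        self.members = [first]  # members[0] is the representative (the "head")


class WeightedListSets:
    """Hypothetical container illustrating the weighted-union heuristic."""

    def __init__(self):
        # element -> ListSet that contains it (the "pointer to the head")
        self.owner = {}

    def make_set(self, x):
        self.owner[x] = ListSet(x)

    def find(self, x):
        # Constant time: follow the pointer, read the head.
        return self.owner[x].members[0]

    def union(self, x, y):
        a, b = self.owner[x], self.owner[y]
        if a is b:
            return  # already in the same set
        if len(a.members) < len(b.members):
            a, b = b, a  # make `a` the longer list
        for item in b.members:
            self.owner[item] = a  # only the shorter list is re-pointed
        a.members.extend(b.members)


sets = WeightedListSets()
for name in "abcde":
    sets.make_set(name)
sets.union("a", "b")
sets.union("c", "d")
sets.union("a", "c")
print(sets.find("d") == sets.find("b"))  # True
```

Because an element is re-pointed only when its list is the shorter of the two, its list at least doubles in length each time, so each element is re-pointed at most log n times.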

## Disjoint-set forests

We now turn to disjoint-set forests, a data structure in which each set is represented by a tree whose nodes each hold a reference to their parent node. Disjoint-set forests were first described by Bernard A. Galler and Michael J. Fischer in 1964,[1] although their precise analysis took years.

In a disjoint-set forest, the representative of each set is the root of that set's tree. Find simply follows parent nodes until it reaches the root. Union combines two trees into one by attaching the root of one to the root of the other. One way of implementing these might be:

```
function MakeSet(x)
    x.parent := null

function Find(x)
    if x.parent == null
        return x
    return Find(x.parent)

function Union(x, y)
    xRoot := Find(x)
    yRoot := Find(y)
    if xRoot == yRoot
        return              // already in the same set; linking would create a cycle
    xRoot.parent := yRoot
```

In this naive form, this approach is no better than the linked-list approach, because the tree it creates can be highly unbalanced, but it can be enhanced in two ways.

The first way, called union by rank, is to always attach the smaller tree to the root of the larger tree, rather than vice versa. To evaluate which tree is larger, we use a simple heuristic called rank: one-element trees have a rank of zero, and whenever two trees of the same rank are unioned together, the rank of the result is one greater. This technique alone yields a running time of O(log n) per Union or Find operation, even in the worst case, because a tree of rank r contains at least 2^r elements and so no tree is deeper than log n. Here are the improved `MakeSet` and `Union`:

```
function MakeSet(x)
    x.parent := null
    x.rank := 0

function Union(x, y)
    xRoot := Find(x)
    yRoot := Find(y)
    if xRoot == yRoot
        return              // already in the same set; nothing to merge
    if xRoot.rank > yRoot.rank
        yRoot.parent := xRoot
    else if xRoot.rank < yRoot.rank
        xRoot.parent := yRoot
    else
        yRoot.parent := xRoot
        xRoot.rank := xRoot.rank + 1
```

The second improvement, called path compression, is a way of flattening the structure of the tree whenever we use Find on it. The idea is that each node we visit on our way to a root node may as well be attached directly to the root node; they all share the same representative. To effect this, we make one traversal up to the root node, to find out what it is, and then make another traversal, making this root node the immediate parent of all nodes along the path. The resulting tree is much flatter, speeding up future operations not only on these elements but on those referencing them, directly or indirectly. Here is the improved `Find`:

```
function Find(x)
    if x.parent == null
        return x
    x.parent := Find(x.parent)
    return x.parent
```
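For readers who want something runnable, the following Python sketch combines union by rank with path compression. It departs from the pseudocode in a few ways that should be read as assumptions of this example: nodes are dictionary keys rather than objects, a root points to itself instead of to null, and Find is written iteratively with the two traversals described above so that long paths cannot exhaust the recursion limit. The class name `DisjointSetForest` is made up for illustration.

```python
class DisjointSetForest:
    """Illustrative disjoint-set forest: union by rank plus path compression."""

    def __init__(self):
        self.parent = {}
        self.rank = {}

    def make_set(self, x):
        self.parent[x] = x  # assumption: a root is its own parent (not null)
        self.rank[x] = 0

    def find(self, x):
        # First traversal: walk up to the root.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Second traversal: attach every node on the path directly to the root.
        while self.parent[x] != root:
            next_node = self.parent[x]
            self.parent[x] = root
            x = next_node
        return root

    def union(self, x, y):
        x_root, y_root = self.find(x), self.find(y)
        if x_root == y_root:
            return  # already in the same set
        if self.rank[x_root] < self.rank[y_root]:
            x_root, y_root = y_root, x_root  # x_root now has the larger rank
        self.parent[y_root] = x_root
        if self.rank[x_root] == self.rank[y_root]:
            self.rank[x_root] += 1


forest = DisjointSetForest()
for i in range(6):
    forest.make_set(i)
forest.union(0, 1)
forest.union(2, 3)
forest.union(1, 3)
print(forest.find(0) == forest.find(2))  # True
print(forest.find(4) == forest.find(0))  # False
```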

These two techniques complement each other; applied together, the amortized time per operation is only O(α(n)), where α(n) is the inverse of the function f(n) = A(n,n), and A is the extremely fast-growing Ackermann function. Because A grows so quickly, its inverse α(n) is less than 5 for all remotely practical values of n [2]. Thus, the amortized running time per operation is effectively a small constant.
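To get a feel for how slowly α grows, the sketch below evaluates the two-argument Ackermann function for small inputs. It assumes one common convention, α(n) = the smallest k with A(k, k) ≥ n; other texts define the inverse slightly differently. The helper `inverse_ackermann` is hypothetical and is only meaningful for n up to A(4, 4), which already exceeds any input a real program could hold.

```python
def ackermann(m, n):
    """Two-argument Ackermann function; only feasible for very small m."""
    if m == 0:
        return n + 1
    if n == 0:
        return ackermann(m - 1, 1)
    return ackermann(m - 1, ackermann(m, n - 1))


def inverse_ackermann(n):
    """Smallest k with A(k, k) >= n, under the convention assumed above.

    A(4, 4) is far too large to compute, so any n beyond A(3, 3)
    is simply reported as 4 (valid for every practical n).
    """
    for k in range(4):
        if ackermann(k, k) >= n:
            return k
    return 4


print([ackermann(k, k) for k in range(4)])  # [1, 3, 7, 61]
print(inverse_ackermann(10**18))            # 4
```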

In fact, we can't get better than this: Fredman and Saks showed in 1989 that Ω(α(n)) words must be accessed by any disjoint-set data structure per operation on average.

## Applications

Disjoint-set data structures arise naturally in many applications, particularly where some kind of partitioning or equivalence relation is involved, and this section discusses some of them.

### Tracking the connected components of an undirected graph

Suppose we have an undirected graph and we want to efficiently make queries regarding the connected components of that graph, such as:

• Are two vertices of the graph in the same connected component?
• List all vertices of the graph in a particular component.
• How many connected components are there?

If the graph is static (not changing), we can simply use breadth-first search to associate a component with each vertex. However, if we want to keep track of these components while adding additional vertices and edges to the graph, a disjoint-set data structure is much more efficient.

We assume the graph is empty initially. Each time we add a vertex, we use MakeSet to make a set containing only that vertex. Each time we add an edge, we use Union to union the sets of the two vertices incident to that edge. Now, each set will contain the vertices of a single connected component, and we can use Find to determine which connected component a particular vertex is in, or whether two vertices are in the same connected component.
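The procedure just described might look like this in Python. The class and method names are invented for this sketch, and for brevity it uses path halving (a simpler relative of path compression) without the rank heuristic; it is not how any particular library implements the feature.

```python
class IncrementalComponents:
    """Tracks connected components while vertices and edges are added."""

    def __init__(self):
        self.parent = {}
        self.count = 0  # current number of connected components

    def add_vertex(self, v):
        if v not in self.parent:
            self.parent[v] = v  # MakeSet: a new singleton component
            self.count += 1

    def _find(self, v):
        while self.parent[v] != v:
            # Path halving: point v at its grandparent, then step there.
            self.parent[v] = self.parent[self.parent[v]]
            v = self.parent[v]
        return v

    def add_edge(self, u, v):
        root_u, root_v = self._find(u), self._find(v)
        if root_u != root_v:
            self.parent[root_u] = root_v  # Union of the two components
            self.count -= 1

    def same_component(self, u, v):
        return self._find(u) == self._find(v)

    def component_of(self, v):
        # Linear scan; fine for a sketch, slow for large graphs.
        root = self._find(v)
        return [u for u in self.parent if self._find(u) == root]


graph = IncrementalComponents()
for vertex in range(1, 6):
    graph.add_vertex(vertex)
graph.add_edge(1, 2)
graph.add_edge(2, 3)
graph.add_edge(4, 5)
print(graph.same_component(1, 3))     # True
print(graph.count)                    # 2
print(sorted(graph.component_of(1)))  # [1, 2, 3]
```

Keeping a running `count` answers the "how many components" query in constant time, while listing a component still requires a scan in this simple form.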

This technique is used by the Boost Graph Library to implement its Incremental Connected Components functionality.

Note that this scheme doesn't support deletion of edges. Even without path compression or the rank heuristic, deletion is not easy to add, although more complex schemes have been designed that can handle this type of update.

### Computing shorelines of a terrain

When computing the contours of a 3D surface, one of the first steps is to compute the "shorelines," which surround local minima or "lake bottoms." We imagine we are sweeping a plane, which we refer to as the "water level," from below the surface upwards. We will form a series of contour lines as we move upwards, categorized by which local minima they contain. In the end, we will have a single contour containing all local minima.

Whenever the water level rises just above a new local minimum, it creates a small "lake," a new contour line that surrounds the local minimum; this is done with the MakeSet operation.

As the water level continues to rise, it may touch a saddle point, or "pass." When we reach such a pass, we follow the steepest downhill route from it on each side until we arrive at a local minimum. We use Find to determine which contours surround these two local minima, then use Union to combine them. Eventually, all contours will be combined into one, and we are done.
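As a rough illustration of the flooding idea, the sketch below works on a one-dimensional height profile rather than a 3D surface, which is a deliberate simplification: on a line, a "pass" is just a sample whose two neighbours are already under water in different lakes. The function name `merge_lakes` and the assumption that all heights are distinct are specific to this example.

```python
def merge_lakes(heights):
    """Flood a 1D profile from below; report each merge of two lakes.

    Returns (pass_index, deeper_bottom, shallower_bottom) tuples in the
    order in which the rising water level joins the lakes.
    """
    parent = {}  # only flooded samples appear here

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    merges = []
    # Raise the water level: visit samples from lowest to highest.
    for i in sorted(range(len(heights)), key=heights.__getitem__):
        roots = {find(j) for j in (i - 1, i + 1) if j in parent}
        if not roots:
            parent[i] = i  # MakeSet: a new lake around a local minimum
            continue
        # Keep the deepest lake bottom as the representative.
        ordered = sorted(roots, key=heights.__getitem__)
        parent[i] = ordered[0]
        if len(ordered) == 2:
            parent[ordered[1]] = ordered[0]  # Union at the pass
            merges.append((i, ordered[0], ordered[1]))
    return merges


print(merge_lakes([3, 1, 4, 0, 5, 2, 6]))  # [(2, 3, 1), (4, 3, 5)]
```

In the example, the lakes at indices 3 and 1 meet at the pass at index 2, and the combined lake later absorbs the lake at index 5 through the pass at index 4.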

### Classifying a set of atoms into molecules or fragments

In computational chemistry, collisions involving the fragmentation of large molecules can be simulated using molecular dynamics. The result is a list of atoms and their positions. In the analysis, the union-find algorithm can be used to classify these atoms into fragments. Each atom is initially considered to be part of its own fragment. The Find step usually consists of testing the distance between pairs of atoms, though other criteria, such as the electronic charge between the atoms, could be used. The Union step then merges two fragments together. In the end, the sizes and characteristics of each fragment can be analyzed. [3]
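A distance-based version of this classification could be sketched as follows. The function `fragment_sizes`, the `cutoff` parameter, and the all-pairs loop are assumptions made for illustration; a real analysis would choose the bonding criterion and the neighbour search to suit the simulation.

```python
import math
from collections import Counter


def fragment_sizes(positions, cutoff):
    """Group atoms closer than `cutoff` into fragments; return their sizes."""
    parent = list(range(len(positions)))  # each atom starts as its own fragment

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Assumed criterion: two atoms are bonded if they lie within `cutoff`.
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            if math.dist(positions[i], positions[j]) <= cutoff:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_i] = root_j  # Union the two fragments

    sizes = Counter(find(i) for i in range(len(positions)))
    return sorted(sizes.values(), reverse=True)


atoms = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (10, 0, 0), (10.5, 0, 0), (50, 0, 0)]
print(fragment_sizes(atoms, cutoff=1.2))  # [3, 2, 1]
```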

## History

While the ideas used in disjoint-set forests have long been familiar, Robert Tarjan was the first to prove the upper bound (and a restricted version of the lower bound) in terms of the inverse Ackermann function. Until then, the best bound on the time per operation, proven by Hopcroft and Ullman, was O(log* n), the iterated logarithm of n, another slowly growing function (but not quite as slow as the inverse Ackermann function). Tarjan and van Leeuwen also developed one-pass Find algorithms that are more efficient in practice. The algorithm was made well-known by the popular textbook Introduction to Algorithms.[4]
