Applications of Heap

The obvious application (and the original motivation) of heap is heapsort:
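For concreteness, here is a minimal heapsort sketch in Python's heapq (heapify once, then pop the minimum \(n\) times):

```python
import heapq

def heapsort(a):
    """Sort a list by heapifying it, then popping the minimum n times."""
    h = list(a)
    heapq.heapify(h)   # O(n) bottom-up heap construction
    # n pops, each O(log n)
    return [heapq.heappop(h) for _ in range(len(h))]
```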

Total time: \(O(n \log n)\). Note that even if you don't know heapify and replace it with \(n\) pushes (\(O(n\log n)\) instead of \(O(n)\)), it doesn't change the overall complexity.

However, heaps can also be used in many other much more interesting scenarios in algorithm design. Here we showcase some classical examples.

\(k\)-way mergesort

We all know mergesort uses binary division. But what if we divide the array into \(k\) parts, recursively mergesort the \(k\) subarrays, and finally combine the \(k\) sorted sublists? This is known as \(k\)-way mergesort, a very interesting generalization of classical mergesort, which is the special case \(k=2\).

Again, like classical mergesort, combine is where most of the work lies. OK, let's generalize the "two-pointers" idea to "\(k\) pointers". But while comparing two numbers is trivial, how do you find the smallest of \(k\) numbers? If you take \(O(k)\) time at every step, the combine costs \(O(nk)\), which is too slow. Notice that most numbers remain unchanged between steps (only the winner of the previous round is replaced by its successor in that sublist), so a full scan wastes much time on repeated comparisons.

So we use a heap instead!
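Here is a sketch of the combine step with Python's heapq: the heap holds one (value, which sublist, position) entry per sublist, and each step replaces the popped minimum by its successor in that sublist:

```python
import heapq

def kway_merge(lists):
    """Merge k sorted lists (n total elements) in O(k + n log k) time."""
    # Seed the heap with the head of each non-empty list: O(k) heapify.
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    out = []
    while heap:
        x, i, j = heap[0]                # current minimum among the k heads
        out.append(x)
        if j + 1 < len(lists[i]):
            # Replace the minimum by its successor in the same sublist: O(log k).
            heapq.heapreplace(heap, (lists[i][j + 1], i, j + 1))
        else:
            heapq.heappop(heap)          # this sublist is exhausted
    return out
```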

So the total time for the combine step is \(O(k + n\log k)=O(n\log k)\): \(O(k)\) to heapify the \(k\) heads, plus \(n\) heap operations of \(O(\log k)\) each (using \(k \le n\)).

Caveat: Notice that if \(k \geq n\), our \(k\)-way mergesort becomes heapsort! So the two extreme special cases of \(k\)-way mergesort are classical mergesort (\(k=2\)) and heapsort (\(k=n\)).

Now divide+combine (the non-recursive parts, or the work at each level) is \(O(k+n\log k)=O(n\log k)\). Now we get back to the overall time using recurrence:

\[ T(n) = k T(n/k) + O(n\log k)\]

We’ll still use the “recursion tree” method to expand the recursion:
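The tree has \(\log_k n\) levels, and the work at each level sums to \(O(n\log k)\), so:

\[ T(n) = O(n\log k) \cdot \log_k n = O\!\left(n\log k \cdot \frac{\log n}{\log k}\right) = O(n \log n) \]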

This is a remarkable result: the runtime of \(k\)-way mergesort does not depend on \(k\)!

Alternative method: Using the Master Method with \(a=b=k\), we have \(n^{\log_b a}=n\) and \(f(n)=O(n\log k)=O(n)\) for constant \(k\), so the above recurrence falls into case 2, giving \(T(n)=O(n\log n)\).

Team selection problem

Another problem similar to \(k\)-way mergesort is the team selection problem: the United States has \(n=50\) states, and each state has selected its (sorted) top \(k\) tennis players. Now we need to select the overall top \(k\) players to form team USA (for the Olympics). How would you do that as fast as possible?

As in \(k\)-way mergesort, build an initial heap of size \(n\) from the best player of each state. Then repeatedly pop the overall best and push that state's next player (or use heapreplace), until you have popped \(k\) players.

Time: \(O(n + k\log n)\), because the heap size is bounded by \(n\).
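A sketch in Python, assuming (hypothetically) that states is a list of \(n\) sorted lists, each holding one state's top players with smaller score meaning better:

```python
import heapq

def team_usa(states, k):
    """Pick the overall top k from n sorted lists: O(n + k log n)."""
    # Seed the heap with each state's best player: O(n) heapify.
    heap = [(players[0], i, 0) for i, players in enumerate(states) if players]
    heapq.heapify(heap)
    team = []
    while len(team) < k and heap:
        score, i, j = heap[0]
        team.append(score)
        if j + 1 < len(states[i]):
            # Replace the winner by the next player from the same state: O(log n).
            heapq.heapreplace(heap, (states[i][j + 1], i, j + 1))
        else:
            heapq.heappop(heap)
    return team
```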

Can you make it even faster? If \(k \ll n\), the key observation is that the vast majority of states will have no representative on team USA: if a state's best player doesn't rank in the top \(k\) among the \(n\) state leaders, then no player from that state has any chance of making team USA. For example, if \(k=5\), team USA will likely have (even multiple) players from big states like California and New York, and nobody from most other states. This suggests narrowing the initial heap down to just the top \(k\) (best among the best) of the \(n\) leaders: use quickselect to find the \(k\)th best leader (expected \(O(n)\)), then scan the \(n\) leaders once more to keep the top \(k\). Now the initial heap has just \(k\) players, and because the heap size is bounded by \(k\), the total time improves to:

\[ O(n + k + k\log k)=O(n+k\log k)\]

which is slightly faster than \(O(n+k\log n)\).
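A sketch of the improved version; since Python's standard library has no quickselect, the helper below is one standard expected-\(O(n)\) selection routine:

```python
import heapq
import random

def quickselect(a, k):
    """Return the k smallest elements of a (unordered), expected O(len(a))."""
    if len(a) <= k:
        return a
    pivot = random.choice(a)
    less = [x for x in a if x < pivot]
    equal = [x for x in a if x == pivot]
    if k <= len(less):
        return quickselect(less, k)
    if k <= len(less) + len(equal):
        return less + equal[:k - len(less)]
    greater = [x for x in a if x > pivot]
    return less + equal + quickselect(greater, k - len(less) - len(equal))

def team_usa_fast(states, k):
    """Pick the overall top k: O(n + k log k) expected time."""
    # Only states whose leader ranks in the top k can contribute to team USA.
    leaders = [(players[0], i) for i, players in enumerate(states) if players]
    top_states = [i for _, i in quickselect(leaders, k)]   # expected O(n)
    # Run the pop/push loop over a heap of size at most k.
    heap = [(states[i][0], i, 0) for i in top_states]
    heapq.heapify(heap)                                    # O(k)
    team = []
    while len(team) < k and heap:
        score, i, j = heap[0]
        team.append(score)
        if j + 1 < len(states[i]):
            heapq.heapreplace(heap, (states[i][j + 1], i, j + 1))  # O(log k)
        else:
            heapq.heappop(heap)
    return team
```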

\(n\)-best pairs problem

A slightly more involved problem is the \(n\)-best pairs problem. Given two unsorted lists \(A\) and \(B\), each with \(n\) integers, their cross-product (or Cartesian product) contains \(n^2\) pairs:

\[ A\times B = \{ (x, y) \mid x \in A, y \in B \} \]

How to select the \(n\) smallest pairs from \(A\times B\)? Let’s say we compare pairs by their sums:

\[ (x,y) < (x',y') \text{ iff. } x+y < x'+y', \text{ or } x+y = x'+y' \text{ and } y<y' \]

i.e., between two pairs, the one with the smaller sum is considered smaller; in case of a tie, the pair with the smaller second component wins (actually you can define this relation arbitrarily, as long as it's monotonic). For example:

>>> a, b = [4, 1, 5, 3], [2, 6, 3, 4]
>>> nbest(a, b) 
[(1, 2), (1, 3), (3, 2), (1, 4)]

Let’s start with the most obvious idea, and gradually improve it.

Caveat: if a successor is already in the heap, don't push it twice. This means you need a hash-based data structure such as a Python set to check in \(O(1)\) time whether some pair has already been pushed.
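One way to realize this idea (a sketch, not the only possible implementation): sort both lists, start from the top-left corner \((a_0, b_0)\), and repeatedly pop the best pair and push its two successors, using a set to enforce the caveat above:

```python
import heapq

def nbest(a, b):
    """Return the n smallest pairs of A x B, ordered by (x+y, y)."""
    a, b = sorted(a), sorted(b)
    n = len(a)                               # assume len(a) == len(b) == n
    # Frontier starts at the top-left corner (smallest pair).
    heap = [(a[0] + b[0], b[0], 0, 0)]       # key: (sum, y); payload: (i, j)
    seen = {(0, 0)}                          # don't push a cell twice
    out = []
    while len(out) < n:
        _, _, i, j = heapq.heappop(heap)
        out.append((a[i], b[j]))
        # Push the two successors (down and right), if new and in range.
        for i2, j2 in ((i + 1, j), (i, j + 1)):
            if i2 < n and j2 < n and (i2, j2) not in seen:
                seen.add((i2, j2))
                heapq.heappush(heap, (a[i2] + b[j2], b[j2], i2, j2))
    return out
```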

Here is a picture:

(Figure: nbest problem: the heap (PQ) is the frontier.)

You can imagine in a flooding zone, water level keeps rising. Initially, water will only cover the top-left corner (lowest area) and gradually cover more and more cells. Those covered in water are already popped from the heap, and the “waterfront”, i.e., the frontier of expansion, is the current heap, which marks the boundary between those already popped and those never pushed (dry area). In the end, you can see that among \(n^2\) cells, most are never explored (not even computed), i.e., in the dry area, and only \(n\) are popped, i.e., submerged in water, and \(n\) are in the frontier. That’s why this algorithm is so efficient.

Alternative method: Instead of starting with just the top-left corner (\((a_0, b_0)\)), you can also start with all the first column \(\{(a_0, b_0), (a_1, b_0), \ldots, (a_{n-1}, b_0)\}\), and then you just need to pop/push (or heapreplace) instead of pop one and push two. Note that this method is much more similar to team selection (each \(a_i\) is a “state”, with its sorted best players being \((a_i, b_0), (a_i, b_1), \ldots\)). In this case, a does not need to be sorted (but b must be sorted; or vice versa if you start with the first row). The other small advantage is that you don’t need to maintain a set to check if some pair is already pushed. Total time: \(O(n\log n + n + n\log n) = O(n\log n)\); the first term is sort b, the second is heapify, and the third is \(n\) heappops. Same runtime, just not as pretty (or symmetric) as the above method, but may be a bit easier to implement.
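A sketch of this alternative (same comparison as above; note that only b is sorted):

```python
import heapq

def nbest_alt(a, b):
    """n smallest pairs: seed with the whole first column, then heapreplace."""
    b = sorted(b)                            # only b needs to be sorted
    n = len(a)
    # One entry per a[i], each initially paired with its best partner b[0].
    heap = [(x + b[0], b[0], i, 0) for i, x in enumerate(a)]
    heapq.heapify(heap)                      # O(n)
    out = []
    for _ in range(n):
        _, _, i, j = heap[0]
        out.append((a[i], b[j]))
        if j + 1 < n:
            # Replace the winner by its successor in the same row: O(log n).
            heapq.heapreplace(heap, (a[i] + b[j + 1], b[j + 1], i, j + 1))
        else:
            heapq.heappop(heap)
    return out
```

No seen-set is needed here: each pop pushes at most one successor, and each row contributes at most one frontier entry at a time.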

\(k\)-smallest numbers in a data stream
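A standard approach, sketched here under the assumption that we want the \(k\) smallest items seen so far at any point: keep a max-heap of size \(k\), evicting the current worst whenever a smaller item arrives. Since Python's heapq is a min-heap, we store negated values:

```python
import heapq

def k_smallest_stream(stream, k):
    """Maintain the k smallest items seen so far, in O(k) space.

    heapq is a min-heap, so we store -x to simulate a max-heap: the
    largest of the k kept items sits at the root, ready to be evicted.
    """
    heap = []                                # stores -x for each kept x
    for x in stream:
        if len(heap) < k:
            heapq.heappush(heap, -x)
        elif -heap[0] > x:                   # x beats the current worst
            heapq.heapreplace(heap, -x)      # O(log k) per item
    return sorted(-v for v in heap)
```

Each item costs \(O(\log k)\), so \(n\) stream items take \(O(n \log k)\) time total, without ever storing more than \(k\) items.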

Historical Notes

The \(n\)-best problem is taken from my \(k\)-best parsing paper (Huang and Chiang, 2005).