Binary Search

Sunday 04 November 2012

A basic and common problem is finding an element in a sorted array. If nothing is known about the distribution of values then the best you can do is an O(log n) search, and the most common is the binary search.
The idea is to split the array in two, then check which half the element you're after is in. Repeat until you find the value. Although simple to understand, binary search has subtleties that often lead to errors in implementation.
Go's standard library implements a binary search that takes a function for examining the array elements as an argument. If the element you're looking for is not found in the array, the index where it should be inserted is returned. This allows the function to be as general as possible, since determining whether the element exists is left to the caller. This also means that all decisions in the code are two-way (< or >=), rather than three-way (<, >, ==), which may improve performance.
Implementing your own (for whatever bizarre reason, although I'll accept "performance") is a good exercise in careful programming. Let's search for a specific integer in a sorted []int:
func isPresent(a []int, target int) bool{
Instead of concentrating on "finding a value equal to the target", let's shift our focus to "the first value in the array greater than or equal target". Obviously if the target is there, this will be the same thing, and if it's not there we'll find something else! This also gives us the advantage that it will return the first value, rather than an arbitrary one, and removes the terrible prospect of doing a linear time search for the first such value (which defeats the whole point of a binary search).
    low := 0
    high := len(a)
We define the limits of our search as zero and the length of our array. Note that this upper bound is not a valid index (since indices start at zero, the highest valid index is length-1). So whatever values that we do access must not equal the upper bound. In Go this would cause a panic, but in C or C++ the behaviour is undefined, possibly leading to all kinds of difficult to diagnose bugs. The lower bound is inside the array.
    for low < high {
Now we loop until the bounds converge. We can be sure that if we exit this loop normally, then low >= high. Of course, we still want to make sure that we definitely exit and don't go into an infinite loop.
        mid := low + (high-low)/2
Ah, the midpoint. Here we pick an index somewhere between our two bounds. We carefully avoid integer overflow (which is well defined in Go, but not what we want). high-low must be positive (because we survived the previous line's loop condition), and dividing it by 2 makes it non-negative. So mid >= low. Dividing a positive number by 2 makes it strictly smaller, so we can say high > mid. Together, we can say low <= mid < high and the midpoint never equals the upper bound!
        if a[mid] < target {
            low = mid+1
We test the midpoint. Remember we're looking for something greater than or equal. If the midpoint is less than the target we can exclude it from the search, and since the lower bound is an index that is accessible, we move it just past the midpoint.
        } else {
            high = mid
Contrast that to what happens here! The midpoint was found to be >= the target, which means that it potentially is the target! We set the upper bound to the midpoint, but no lower.
We can now see that each time round the loop we decrease the distance between the bounds. When the if clause is executed the lower bound must increase (since mid >= low). When the else clause is executed the upper bound must decrease (since mid < high). Therefore the loop must terminate at some point.
    return low < len(a) && a[low] == target
We have to check that we're not accessing our upper bound, which would happen if we were looking for a target greater than any in the array. This also guards against empty arrays. Then we check if the index we've found actually contains the target. If it doesn't it represents where it should be inserted.
Binary search makes a good example of the benefits of careful thinking. Beware of bugs in the above code...

Previous (Logging Levels)
Next (Using Package Heap)