CSC152 2005S, Class 51: Faster Sorts

Admin:
* Friday: Lab

Overview:
* Applying divide-and-conquer techniques
* Strategy one: Merge sort
* A variant: Bottom-up merge sort
* Analysis
* Strategy two: Quicksort
* Analysis

/Questions/

* How many steps does it take to shift elements right?
  * Sam's claim: O(n) steps
  * Even though we're using only one vector command, the underlying
    implementation is O(n)
* If we were using linked nodes, would insertion sort be faster (because
  we can insert in O(1))?
  * Insertion sort: Repeatedly insert an element into the correct place
    in the sorted portion
  * How do we get the next element to insert? We use a cursor. It takes
    only one step to advance the cursor.
  * Because we're using linked nodes, it takes O(n) to find the correct
    place to insert the value. Whoops.

/Your Question for Today: How might you apply "divide and conquer" to
the problem of searching?/

* Typical divide-and-conquer strategy:
  * Divide the data into two parts
  * Solve the problem for each part
  * Combine the solutions
* Divide-and-conquer for binary search of elements indexed lb to ub in
  an array:
  * Divide the data into two parts: lb to mid, mid to ub
  * Solve the problem for each part (if the goal value belongs in the
    first half, recurse on the first half; "it's not there" is the
    solution for the second half)
  * Combine the solutions: use the solution from whichever half you
    recursed on
  * (A sketch appears after these notes.)
* Divide-and-conquer for "sort elements indexed from lb to ub":
  * Divide the data into two parts (lb to mid, mid to ub)
  * Solve the problem for each part (sort each half)
  * Combine the solutions (merge the sorted halves)

      [ 02 10 07 03 10 05 08 05 05 06 03 10 10 06 ]
    split
      [ 02 10 07 03 10 05 08 ]  [ 05 05 06 03 10 10 06 ]
    sort each half
      [ 02 03 05 07 08 10 10 ]  [ 03 05 05 06 06 10 10 ]
    compare the first element of each sorted subarray and grab the smaller
      [ 02 ]  [ 03 05 07 08 10 10 ]  [ 03 05 05 06 06 10 10 ]
    compare the first element of each sorted subarray and grab the smaller
      [ 02 03 ]  [ 05 07 08 10 10 ]  [ 03 05 05 06 06 10 10 ]
    compare the first element of each sorted subarray and grab the smaller
      [ 02 03 03 ]  [ 05 07 08 10 10 ]  [ 05 05 06 06 10 10 ]
    ... and so on

  Note:
  * Improvement: Instead of grabbing the "smaller", when the two are
    equal, grab both
  * If we grab the one from the first half first, we get a stable sort!
  * (We won't apply the improvement in what follows.)

This new algorithm is called "merge sort".

One problem: Merge sort requires extra memory for the place into which
you merge the elements.
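First, the binary search sketch promised above, in Java. The class name,
the half-open bounds convention [lb, ub), and the use of -1 for "it's
not there" are illustrative choices, not part of the class code.

    public class BinarySearchSketch {
      // Return an index of goal in the sorted range a[lb..ub), or -1
      // if it's not there.
      public static int search(int[] a, int goal, int lb, int ub) {
        if (lb >= ub) {
          return -1;                    // empty range: "it's not there"
        }
        int mid = lb + (ub - lb) / 2;   // divide the data into two parts
        if (a[mid] == goal) {
          return mid;
        } else if (goal < a[mid]) {
          return search(a, goal, lb, mid);      // recurse on the half
        } else {                                // that could hold goal
          return search(a, goal, mid + 1, ub);
        }
      } // search(int[], int, int, int)

      public static void main(String[] args) {
        int[] a = { 2, 3, 5, 7, 8, 10 };
        System.out.println(search(a, 7, 0, a.length));  // prints 3
      } // main(String[])
    } // class BinarySearchSketch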
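And a sketch of merge sort itself, again in Java with illustrative
names. The <= in merge grabs from the first half on ties, which is
what makes the sort stable; the freshly allocated result arrays are the
extra memory just mentioned.

    import java.util.Arrays;

    public class MergeSortSketch {
      // Merge two sorted arrays into one sorted array.
      public static int[] merge(int[] left, int[] right) {
        int[] result = new int[left.length + right.length]; // extra memory
        int l = 0, r = 0;
        for (int i = 0; i < result.length; i++) {
          if (r >= right.length
              || (l < left.length && left[l] <= right[r])) {
            result[i] = left[l++];   // grab the smaller; on ties, grab
          } else {                   // from the first half (stability!)
            result[i] = right[r++];
          }
        }
        return result;
      } // merge(int[], int[])

      // Top-down merge sort: split, sort each half, merge.
      public static int[] mergeSort(int[] a) {
        if (a.length <= 1) {
          return a;                  // base case: t(0) = t(1) = 1
        }
        int mid = a.length / 2;
        return merge(mergeSort(Arrays.copyOfRange(a, 0, mid)),
                     mergeSort(Arrays.copyOfRange(a, mid, a.length)));
      } // mergeSort(int[])
    } // class MergeSortSketch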
/Another way to think of merge sort/

* Merge each pair of neighboring elements to make sorted sublists of
  size 2 (ignore the left-over singleton, if there is one)

    Original
      [ 02 10 ] [ 07 03 ] [ 10 05 ] [ 08 05 ] [ 05 06 ] [ 03 10 ] [ 10 06 ]
    "Merged"
      [ 02 10 ] [ 03 07 ] [ 05 10 ] [ 05 08 ] [ 05 06 ] [ 03 10 ] [ 06 10 ]

* Merge each pair of pairs to make sorted sublists of size 4
      [ 02 03 07 10 ] [ 05 05 08 10 ] [ 03 05 06 10 ] [ 06 10 ]
* Merge each pair of quartets to make sorted sublists of size 8
      [ 02 03 05 05 07 08 10 10 ] [ 03 05 06 06 10 10 ]
* And so on and so forth
      [ 02 03 03 05 05 05 06 06 07 08 10 10 10 10 ]

Same general idea as the top-down merge sort, but no recursion (whoo!).
No recursion means:
* Easier for non-mathematicians to analyze
* Potentially faster
(A sketch of this bottom-up variant appears after these notes.)

Running time: Let t(n) be the running time of the original merge sort
on a list of n elements.

    t(n) = c*n        // for splitting
         + 2*t(n/2)   // for sorting each half
         + d*n        // for merging
    t(1) = 1
    t(0) = 1

    t(n) is in O(n*log_2(n))

Easier analysis:
* Consider the bottom-up merge sort
* Each stage (merging sublists of size k into sublists of size 2k)
  takes O(n)
* At each stage, we double the size of the sorted sublists
* So the total is O(n) * (the number of times you have to double 1
  until you get n)
* That ugly thing in parentheses is defined as log_2(n)
* Merge sort is O(n*log_2(n))

Comparing merge sort and heap sort:
* Both are O(n*log(n))
* What are the advantages of one over the other?
  * Heap sort probably requires less memory
  * A well-implemented merge sort is stable; heap sort is not

Observation: The biggest cost in merge sort is the merge.
* Are there ways we can "split" the list/vector so that merging becomes
  faster?
* Great idea from C.A.R. Hoare: Split the list or vector into "small
  values" and "large values" (every small value <= every large value)

Quicksort
* Divide the data into two parts (small and large)
* Solve the problem for each part (recursively quicksort)
* Combine the solutions (lists: glue the parts together; arrays:
  already done)
(A sketch appears after these notes.)

Problem: How do you separate into small and large?
* Step through the elements asking "is this small or is this large?"
  and swap whenever an element is on the wrong side
  * O(n) steps (and swaps)
* How do you know whether an element is "small" or "large"? Compare it
  to a dividing value; ideally the median, so the two parts have equal
  size.
  * Solution 1: Find the median
  * Hoare's great solution: Pick a random element and cross your fingers
  * The mean is not necessarily a good choice; consider
      1 10 100 1000 10000 100000 10000000
  * The midpoint of the range is similarly not so good

Worst case: Always choose the smallest or largest value as the
"median": O(n^2)
Best case: Always choose the median correctly: O(n*log_2(n))
Average case: Works pretty well
In practice: Quicksort is faster than any of the other sorts we've
studied

Notes:
* For all of the good sorting algorithms (except quicksort), the name
  describes the key idea:
  * Insertion sort: Repeated insertion
  * Selection sort: Repeated selection
  * Merge sort: Repeated merging
  * Heap sort: Use a heap
  * Quicksort: Divide and conquer with randomness!
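The bottom-up sketch promised above, in Java, assuming an int array;
names are illustrative. Note how it mirrors the easier analysis: the
outer loop doubles width, so it runs log_2(n) times, and each pass over
the array is O(n).

    public class BottomUpMergeSortSketch {
      // Merge runs of width 1, 2, 4, ... into buf, then copy back.
      public static void sort(int[] a) {
        int[] buf = new int[a.length];
        for (int width = 1; width < a.length; width *= 2) {
          for (int lo = 0; lo < a.length; lo += 2 * width) {
            int mid = Math.min(lo + width, a.length);
            int hi = Math.min(lo + 2 * width, a.length);
            // Merge sorted runs a[lo..mid) and a[mid..hi) into buf;
            // a left-over singleton simply gets copied through.
            int l = lo, r = mid;
            for (int i = lo; i < hi; i++) {
              buf[i] = (r >= hi || (l < mid && a[l] <= a[r]))
                       ? a[l++] : a[r++];
            }
          }
          System.arraycopy(buf, 0, a, 0, a.length);
        }
      } // sort(int[])
    } // class BottomUpMergeSortSketch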
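And a quicksort sketch with the random-pivot strategy. One caveat: the
partition below is the simple "Lomuto" scheme rather than Hoare's
original two-pointer partition, but either way, separating small from
large is a single O(n) pass with swaps.

    import java.util.Random;

    public class QuicksortSketch {
      private static final Random rand = new Random();

      // Sort a[lb..ub] (inclusive bounds, matching the notes' lb/ub).
      public static void quicksort(int[] a, int lb, int ub) {
        if (lb >= ub) {
          return;                        // 0 or 1 elements: already sorted
        }
        int p = partition(a, lb, ub);    // divide into small and large
        quicksort(a, lb, p - 1);         // solve the problem for each part;
        quicksort(a, p + 1, ub);         // for arrays, no combine step
      } // quicksort(int[], int, int)

      // Move a randomly chosen pivot to its final position and return
      // that position: left of it is <= pivot, right of it is > pivot.
      private static int partition(int[] a, int lb, int ub) {
        swap(a, lb + rand.nextInt(ub - lb + 1), ub);  // random pivot at ub
        int pivot = a[ub];
        int small = lb;                  // a[lb..small) holds small values
        for (int i = lb; i < ub; i++) {
          if (a[i] <= pivot) {
            swap(a, i, small++);         // wrong side: swap into small part
          }
        }
        swap(a, small, ub);              // pivot goes between the parts
        return small;
      } // partition(int[], int, int)

      private static void swap(int[] a, int i, int j) {
        int t = a[i]; a[i] = a[j]; a[j] = t;
      } // swap(int[], int, int)
    } // class QuicksortSketch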
Other sorting algorithms:
* Bucket sort: Given a restricted range of values
  * Create an array of "buckets", one per value
  * Scan your list, identifying which bucket each value belongs in
  * Scan your buckets, pulling the values back out
  * O(n + m), where m is the number of buckets
* Radix sort: Given a binary representation
  * Put the things with a zero in the rightmost bit before those with a
    one in the rightmost bit
  * Put the things with a zero in the next-to-rightmost bit before
    those with a one in the next-to-rightmost bit
  * Continue as before (each pass must preserve the order established
    by the previous passes, i.e., be stable)
  * O(n * b), where b is the number of bits in the representation
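A bucket sort sketch, under the assumption that the values are ints in
0..m-1 with nothing else attached; in that special case each "bucket"
reduces to a count.

    public class BucketSortSketch {
      // Sort an array of ints known to lie in 0..m-1.
      public static void sort(int[] a, int m) {
        int[] bucket = new int[m];       // one bucket per possible value
        for (int v : a) {
          bucket[v]++;                   // scan the list: O(n)
        }
        int i = 0;
        for (int v = 0; v < m; v++) {    // scan the buckets: O(m)
          for (int k = 0; k < bucket[v]; k++) {
            a[i++] = v;                  // total: O(n + m)
          }
        }
      } // sort(int[], int)
    } // class BucketSortSketch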
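And a radix sort sketch for non-negative ints, one bit per pass. The
zeros-then-ones copying within each pass keeps the relative order from
earlier passes, which is the stability the algorithm depends on.

    public class RadixSortSketch {
      // Sort non-negative ints, rightmost bit first.
      public static void sort(int[] a) {
        int[] buf = new int[a.length];
        for (int bit = 0; bit < 31; bit++) {   // one pass per bit
          int next = 0;
          for (int v : a) {
            if (((v >> bit) & 1) == 0) {
              buf[next++] = v;                 // zeros first...
            }
          }
          for (int v : a) {
            if (((v >> bit) & 1) == 1) {
              buf[next++] = v;                 // ...then ones
            }
          }
          System.arraycopy(buf, 0, a, 0, a.length);
        }
      } // sort(int[])
    } // class RadixSortSketch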