BigSort

Goal: Sorting a list of N values when they all fit in RAM is easy. Here we study BigSort for a list of N values, when only m (<< N) values fit in RAM.The values are stored in a big file of 'M' disk pages. Each page has upto `k' values.

Why does it matter? In big data problems, we often want to sort 100s of GBs to Terabytes and Petabytes of data, with only Gigabytes of RAM. BigSort is a fundamental building block for efficiently working with big data.

Example Input: Table with 400 values (data rows). These are stored in big file of 58 pages with 7 rows per page. Constraint: RAM can fit B=4 pages (out of 58 pages) at a time

Output: Sorted table of 58 pages with 7 rows per page

Algorithm

  1. Split big file into many small files of B pages each
  2. Sort each small file (in RAM). Write sorted output below.
  3. Merge B sorted files into bigger files. Repeat until done.

Notes:

  1. We use light blue for small values and dark blues for big values (along with the values in text), to make it easy to follow along. For example, notice the input table has a mix of light and dark blues in random spots. The final sorted output has the values organized from light to dark.
  2. Only B*7 = 28 rows fit in RAM, and written to output at a time. (Minor detail for the keen eyed: we need a little bit of extra space to write.)