Saturday, November 04, 2006

External Sorting

C++ External Sorting

What is external sorting is the first thing that comes into mind. As the name signifies, it must be related to sorting and in fact it is.

External signifies that the storage for the data is to be external to the memory where it could be loaded. For example, in files.

It becomes a requirement if the data to be sorted is huge enough to not load at once in memory and hence restricts us from applying the common sorting algorithms (like quicksort) to them.

The idea behind this technique is to divide and rule. The whole set of data, say in a file, is put into several different files that can individually be loaded fully into memory. Since, they can be read completely into memory we can sort them. But this doesn’t immediately bring us to the completely sorted data as we are not sure if the largest in the first file is smaller than the smallest of the next file.


Still that is the first step to the solution – sorting the individual files.

Let us take a simpler case and build up on that assuming the restriction that we have is not on memory but count of values that can be read at once.

The algorithm that’s used can be something like this:

(Pseudo code)

Step 1:
Read limitValue number of values from input file and
use std::sort to sort them. Write the sorted result
into an intermediate file. Repeat this process for
(int) (TotalObjects / limitValue) + 1 times.
There would get that many number of intermediate files created.

Step 2:
Read first
((limitValue / ((int) (TotalObjects / limitValue) + 1)) - 1 )
numbers from each intermediate file.

Step 3:
Pick the first
(limitValue - the above count of objects in distinct batches)
smallest (or largest) values from those.
This logic stops if you got that many objects or exhaust
the objects in any batch.

If (objects in a particular batch exhausted)
If (no more objects in the intermediate file left)
Re-adjust the batch count etc and
proceed ignoring that intermediate file)
Else
Fill it up with the next batch count
of values from intermediate file it belonged to
End if
End if

If (output buffer i.e. limitValue -
the above count of objects in distinct batches gets full)
Write to the output file
(final sorted objects output file)
and repeat step 3 iteratively
End if

Step 4: The output file should contain the sorted objects.

In the above Pseudo code:

limitValue = the upper limit of the count of values that can be read at once

TotalObjects = total number of values in the input file (> limitValue else sorting can be done in one go!)

Sometime later we shall try out a sample implementation. Till then have fun!

No comments: