C for Yourself - part 33

Sorting through your data - Steve Mumford starts to explain how

Sooner or later, and it's probably going to be sooner, you'll find yourself writing a program that has to organise its data in some specific way. This could be arranging names and addresses into alphabetical order, or a list of numerical results into ascending order.

Usually this will involve sorting groups of data - records - according to one field. For instance, a simple address book program would sort its records, containing such elements as telephone numbers and street names according to the alphabetic position of the surname. In my parachute database, the individual records would be arranged according to their dates.

In BASIC, things are fairly straightforward. Numbers and strings could be compared with the same operators and because arrays were the only easy method of storage, if a whole record needed moving into a new location all the array elements with the same index would just require copying. It is a little easier if an array of structures in C are being used as a whole record can be copied in one go - the main point to note is that strings can't be directly compared using the normal numeric operators, there's a separate function to provide the necessary tools.

It's defined in the string.h standard library, so remember to #include it before starting. strcmp() takes pointers to two strings and returns an integer value depending upon the results of the comparison - if the strings are identical, the function returns zero. Otherwise, the value returned is either positive or negative to indicate which of the two strings comes 'first'.

int result = 0;
char a[] = "Hello";
char b[] = "World";

result = strcmp(a, b);
printf("The result was: %d", result);

In this case, a negative value is returned, indicating that string a, "Hello", lies before string b, "World".

I mentioned above that it was easy to copy blocks of data around in C when they're in the form of structures. This is true, but when using linked lists to hold the data, the situation gets a little more complicated. When dealing with an array of structures, predefined slots are used that have their own number, irrespective of whether that slot actually contains an information record.

Linked lists are somewhat more amorphous. And in order to remove a structure from the list it's necessary to do a little work on either side of the record in question so that the list isn't broken. The simplest way to alter the sequence of a list is to change where the links point to for each of its members - think of the list as a large telephone switchboard, the connectivity can be altered by pulling out a wire from one socket and pushing it into another. Now you might be able to see one of the reasons why I decided to give each of the list members two links, one pointing to the previous member in the chain and one to the next - in this way, when swapping the links around, it's easier to work your way up and down the list.

The dreaded bubblesort

The easiest - and normally the slowest - method of sorting is to step through the data gathered and look at each of the consecutive data pairs. Starting with elements one and two, decide whether they need to be swapped to approach the desired list order and perform the change if necessary. Moving on to look at elements two and three, the same comparison is carried out - and so on, until th end of the list is reached. In this way, the 'heaviest' element will sink to the bottom of the list in the first pass.

We've not finished yet - this technique needs to be repeated over the list several times until all the members are in position, but the range of the sort can be truncated, each time knowing that the last element in the previous sort is in the right position. To avoid unnecessary repetition, a flag can be initialised at the start of each cycle and set to TRUE if a swap has been made. Once the routine completes a cycle where the flag remains unset, we know that none of the data pairs have been swapped and the list is now in the correct order.

This method is horrendously time-consuming for large datasets, and other more complex methods exist to minimuse the time taken - more on those next time.


Source: Acorn User - 177 - January 1997
Publication: Acorn User
Contributor: Steve Mumford