C for Yourself - part 32

Steve Mumford discusses the decisions behind building a file format

Over the last month or so, we've built up the graphical routines necessary for informing the WIMP that we want to save a file - we've created save boxes and draggable icons and have a method of finding out where the user wants our program to dump the output. As a side effect, the loading of files can be accommodated fairly easily; the appropriate WIMP messages can be handled in the same routine that deals with the end-of-drag event.

Now comes the next big step - we have to decide what format we use for storing the data on disc. This is influenced heavily by the data formation used in the memory of the machine - some of the arguments used there will crop up again here.

You'll recall that we used a linked list of structures to hold the bulk of the program's data - this gave us flexibility in many ways, including the fact that we were effectively only limited by the amount of memory in the machine. We've already seen that with this feature comes a price - a lot more housekeeping is required when compated to using an array of predetermined size. I made some compromises by making the fields within the structure a fixed length; although in some ways prohibitive, this made the creation of the structures and their memory allocation much easier to handle.

One of the first things we have to build into our file is a simple data header - at the very least there should be a string of bytes in there identifying the nature of the file. Of course, you can use Acorn's filetype system to go some way towards automating this, but it makes sense to put in a safety check.

It's also a good idea to include the version number of the program that created the file, so if you update your masterpiece to such an extent that file formats are no longer compativle you'll need to be able to sieve out and deal with older files. That might be all the header contains, but in our situation we'll be including some extra information such as the number of records in the file and a vague indication of where they appear.

Several factors work against each other in this decision - the first is ease of programming, the second is speed of operation and the third is efficiency. Now that memory's pretty cheap and hard discs are two a penny, the third member of the list isn't so important, but it's still worth bearing in mind. You will be able to appreciate how much easier it is to dump a structure out to disc as a block of memory, rather than wading through each of the structures and saving out every field by hand.

Ease of programming is one thin, but if you ever have to write a database application that must deal with particularly large datasets, it's vital to consider the impact those might have on the algorithms you use. Plenty of situations can require the manipulation of gigabytes of data; logging the demands placed upon a Web server can easily produce this amount. When the volume of information on disc can exceed the physical RAM capacities of the machine, you have to start planning your file format very carefully, building it in a way so that it's easy to read and write to on the fly.

Searching is an important aspect as you don't really want to have to wade through two gigabytes of data every time you want to find a particular record. There are well-documented techniques of coping with these situations - for instance, hashing tables can be used to store a record within a file at a position that can be determined by the contents of that record.

Fortunately, the subject matter of our database doesn't really require that sort of extreme treatment. Although I aim to mention more of those techniques later, I don't intend to implement them here and now. We'll deal with the manipulation of data entirely in memory and save the file as a number of blocks of raw data, this figure being entered in the file header to simplify reloading.

It's been a long haul up to the program's present state, so for a change, next month, I'll look at a few of the more common functions that programmers find themselves having to develop, such as sorting and indexing routines. I'll try to describe the principles behind some of them and provide implementations to add to the Acorn User library.


Source: Acorn User - 176 - Christmas 1996
Publication: Acorn User
Contributor: Steve Mumford