Assembly Language - Wildcards (part 4)

Last month, we went over the changes that would be necessary in order to change our basic routine to search for unterminated strings. In the final article (at last!) in this series, we shall look at the calculations needed in order to print out the found string 'in context'.

Displaying the results

Once a string has been located, the aim is to display a neat 80-character extract from the text at the relevant position. Ideally, this will be centred around the found string, but if it occurs close to the beginning or end of a line this may not be possible.

Referring back to last month's article, you will see that we branch to .splitbyeighty directly from the main loop, having loaded a byte from file%, a byte from target%, and established that the entire length of the target string has successfully been matched.

.splitbyeighty
SUB    temp%,target_start%,target%
ADD    temp%,temp%,#1
SUB    length%,temp%,length%

The difference between the registers target_start% and target%, pointing at the start and end of the target string respectively, will give us the number of characters matched (after a small adjustment due to the effects of post-indexed addressing). We then take into account the additional text matched by the wildcards (if any) already stored in register length%. This register is now used to hold the total result in the form of a negative number. In the case of our example 'wildcard and the end of', this would be -23.

ADD    target%,file%,length%
SUB    file%,file%,#2

We only branch to splitbyeighty after we detect that target% is pointing to the string terminator character - in other words, the character after the last matching letter - and since the pointers are incremented after loading a byte, both pointers are now pointing two characters beyond the end of the string.

The register target% is no longer needed (it will be reset from target_start% before we continue the search), so it can here be employed for a different temporary purpose - searching backwards to locate the start of the paragraph in which the string has been found. By adding the negative offset in length% to file%, we point target% to an address just after the first matching character, before decrementing file% to point directly at the last character matched.

.findparastart
LDRB   char%,[target%,#-1]!
CMP    char%,#10
Bne    findparastart

Note the use here of pre-indexed addressing to locate the linefeed character - the offset (negative in this case) is applied to the pointer before the byte is loaded. The ! at the end of the instruction enforces 'writeback', ensuring that the pointer register is updated each time round the loop, in contrast to post-indexed addressing, where writeback is automatic. We use pre-indexed addressing in this search, to ensure that, when we finish, the register holds the address of the actual byte just loaded, rather than pointing on to the next potential candidate.

How to centre the string in the display

R9, or store%, is currently spare (it is only used during wildcard searches), so throughout this part of the program, we shall be using it to hold various temporary values. First of all, we need to calculate how many characters need to be added on to the end of the current 'found string'.

ADD    R9,length%,#82
MOV    R9,R9,ASR #1

If a string of length x characters is to be displayed in the centre of a line 80 characters long, elementary algebra tells us that we need to allow (80-x)÷2 extra characters at either end. Since we already have the length calculated as a negative offset, the necessary value can be obtained by a simple addition followed by a shift to the right. In practice, we use 82 rather than 80, in order to counteract the general rounding-down effect that takes place throughout the rest of our calculations.

SUB    char%,file%,target%
ADD    temp%,char%,length%
CMP    temp%,R9
RSBls  R9,char%,#80

Using both temp% and char% as temporary registers, we check how much space there is between the start of the found string and the start of the paragraph. If it is less than the default number of characters that would be needed to centre the string (R9), we compensate for this by recalculating R9 to allow extra space at the end of the extract instead, in order to make up the total length to 80 characters again.

In the diagram below, the dashed line represents a paragraph of text held in memory, with the found string shown near the start of the paragraph by 'xxxxx'. The register target% is currently pointing at the start of the paragraph; file% points at the end of the found string, char% holds the total distance between these two points, and temp% now holds the distance between the start of the paragraph and the start of the found string. As the diagram attempts to make clear, the default value calculated for R9 is such that, after adding this value to both ends of the found string, the length of the extract will be 80 characters.

The diagram illustrates the case when the found string is so close to the start of the paragraph (the dashed line) that there is no room to add R9 characters at the beginning, i.e. when temp% < R9. At the bottom is shown the new value of R9 necessary to be added at the end if a length of 80 characters starting from target% is still to be obtained; it should be evident from the diagram that this value is given by (80-char%).

Note that, in ARM assembler, we have to represent this as RSB R9,char%,#80, because of the rule that constant values can only appear on the right-hand side of the expression. SUB R9,#80,char% is not a legal statement!

Clipping to the nearest word boundary

.findparaend
LDRB   char%,[file%,#1]!
CMP    char%,#10
Beq    addfirsthalf
SUBS   R9,R9,#1
Bne    findparaend

We now proceed to count on by R9 characters, or until the end of the paragraph, whichever comes sooner. If the end of the paragraph is reached, we can jump over the next section in confidence that the selection has not ended in the middle of a word; otherwise we attempt to 'clip back' our extract to the nearest whole word for cosmetic reasons.

.clipback
LDRB    char%,[file%,#-1]!
CMP     char%,#ASC("z")
CMPmi   char%,#ASC("A")
Bpl     clipback

The technique used is very primitive. We simply move the pointer back until we come to a character which does not lie in the ASCII range 'A'-'z'. Note the use of the # character to inform the assembler that the result of the Basic ASC function is to be used as a constant.

.addfirsthalf
SUB     temp%,file%,target%
CMP     temp%,#80
SUBpl   target%,file%,#80
Bmi     writedata

Having established the end of our selection, it is relatively simple to move target% to point to the beginning. We left it pointing to the start of the current paragraph; all we need to do is check to see whether that was less than 80 characters away (in which case, we are obviously in the middle of a short paragraph and may as well give up and jump straight to .writedata at the end of the program).

.clipforward
LDRB    char%,[target%,#1]!
CMP     char%,#ASC("z")  ;char 
                         ;greater than "z"
CMPmi   char%,#ASC("A")  ;or char
                         ;less than "A"
Bpl     clipforward

Otherwise we set target% to (file%-80) and clip to the nearest non-alphabetic character as before.

Printing out the results

.writedata
ADD    target%,target%,#1
SUB    temp%,file%,target%
SWI    "OS_WriteN"
SWI    "OS_NewLine"
B      restart

At this point, target% is either pointing to the ASCII 10 character at the start of the paragraph or to the non-alphabetic character (in practice, probably a space) immediately preceding the extract we plan to display. Either way, we need to move it on by one byte to point to the first letter of the string to be printed.

With target% (R0) now pointing to the start of the string, all that remains to be done is to calculate the final length and store the result in temp% (R1) before calling "OS_WriteN". This handy SWI enables unterminated strings of a given length - sections out of the middle of a text file, for instance! - to be printed.

Finally, we output a newline sequence and branch back to .restart, just before the main program loop. The registers length% and target% will be reinitialised and return to their old functions, and the search will continue from the current position of file% at the end of the text just displayed. One line will be printed out for every matching string found.

Known bugs

In addition to the limitations mentioned in Archive 14.7 p18, this version of the program introduces a new and more serious bug. Due to the simplistic way the search has been coded, strings starting with a single-character wildcard '?' have a 50% chance of failing to be found (the initial wildcard will always match while the second letter of the target string will normally fail; thus, in effect, the routine will only be checking for the start of the string at every other letter). Meanwhile, strings starting with the multiple-character wildcard '*' stand a fair chance of crashing the program by setting up a constant loop between .wildcardM and .repeat which bypasses the normal end-of-file check! Given that, by its nature, the search now always looks for substrings, there is no longer any need to start the target string with '*'; however, it is obviously a good idea to make sure any leading wildcards supplied by the user are stripped off before such a string is passed to the machine code!

Real-life use

On the monthly disc, as well as a sample text file and a working version of the assembler listing, is the latest version of !Textseek, a fast text search utility based around the principles we have just covered - although somewhat more sophisticated. (It has a WIMP front-end, uses throwback to display the results and allows you to jump directly to the right point in any file, provides 'match whole words only' and 'case insensitive' options, takes a good shot at circumventing the bug mentioned above, permits you to select a list of up to ten filetypes in order to narrow down the recursive search, and automatically detects and copes with non-RISC OS line endings and Basic line numbering.)

To be honest, the wildcard support is probably the least useful feature... however, full source code for both the Basic and machine code sections is included as always for those who want to see how it works. The program can also be downloaded from my site along with my other software.


Source: Archive Magazine 14.10
Publication: Archive Magazine
Contributor: Harriet Bazley