Assembly Language - Wildcards (part 3)

Last month, I gave an example of how to carry out a wildcarded search of a block of strings held in memory. In the final two parts of this series, I shall demonstrate how to adapt our basic ARM code routine to perform a task with slightly different requirements - to scan an entire text file for a given (wildcarded) string and print out the results in context.

The most important difference to be borne in mind is that the data to be checked will not, in this case, consist of relatively short CR-terminated strings, but of long sections of text separated by linefeeds (CHR$ 10). For the purposes of this program, the multiple wildcard character '*' will match any character except a linefeed; in other words, strings split across more than one paragraph cannot be matched. This seems a reasonable restriction.

The output will be in the form of single lines of text not more than 80 characters wide (scrolls nicely in a task window), centred around the found string. This implies that this string itself cannot be more than 80 characters in length, which might be a problem if we ever actually want to find a very long string; however, the most likely scenario for this to occur would be if a wildcard matched a much larger block of text than expected, when it would almost certainly not be the string the user was looking for in any case.

Setting up in Basic

In order to search a file using this routine, the whole file has to be loaded into a block of memory. In a real-life situation, a programmer would measure the file size and then extend the program's Wimpslot or create a dynamic area of the required size; however, for simplicity's sake, we shall simply DIM a block from Basic into which a test file may be loaded. I'll provide a suitable file, but you can use any text file you like.

searchfor$="wild*ing"
DIM find% 40:$find%=searchfor$
REM enter search string here
name$="ADFS::4.$.testfile"
REM alter filename as required
SYS"OS_File",23,name$ TO found%,,,, len%
IF found%=0 THEN ERROR 255,"File not found"
REM get file length
DIM block% len%+2
SYS "OS_File",16,name$,1+block%,0
REM load file into block
?block%=10
?(block%+len%+2)=10

To keep the machine code simple, I have defined paragraphs as sections of text between two LF (CHR$ 10) characters. For the first and last paragraphs, this may not be the case so we take the precaution of making the block two bytes longer than necessary, loading the file into block%+1 and poking an extra LF in at the beginning and end of the file. This isn't necessary for the actual search, but it prevents random data from neighbouring sections of memory being accidentally printed out when we are scanning for the 'context' of a string found in the first or last lines.

PROCassemble
F%=block%+1
REM R5 points to start of file
H%=block%+len%+2
REM R7 points to end of file
G%=find%
REM R6 points to search string
PRINT name$:REM display name of file being searched
CALL ourcode%
END

Since the machine code itself will be handling the formatting and the output of the results (covered in next month's article), almost all we have to do here is assemble and call it. There are only three parameters - the address of the start of the file in memory, the address of its end and the address of the search string - and, as before, we use the special variables F%, G% and H% to place the values directly into the relevant registers from Basic.

Setting up the registers

In our PROCassemble, we shall 'name' registers as usual by using Basic variables.

file%=5:char%=4
temp%=1:end%=7
target%=0
target_start%=6
file_backup%=2
old%=10
f_ptr%=12:t_ptr%=3
store%=9:length%=8

Eagle-eyed readers may notice that, while most of the register names (and functions) are similar to those of previous articles, many of them have ended up being assigned to different registers! The only vital assignments are target% and temp%, which need to be R0 and R1 respectively when the SWI "OS_WriteN" is called at the end of the program.

The main comparison loop

The basic principles of the search remain much the same as in the original routine described in Archive 14.7 p15: load a byte from each pointer in turn, check to see if the end of the target string has been successfully reached - if not, first check for the presence of wildcard characters, then compare the two bytes to see if they are the same. However, there is one major difference. The search no longer ends as soon as two characters fail to match; we simply reset the target pointer and continue searching for the target string until the end of the file is reached.

.restart
MOV    target%,target_start%
MOV    file_backup%,file%
MOV    length%,#0

Here we point target% to the start of the target string, initialise file_backup%, and zero the length counter. Later on, we branch back to .restart in order to restart the search.

.search
LDRB   char%,[file%],#1
LDRB   temp%,[target%],#1
CMP    temp%,#13        ;check for end of
                        ; target string
Beq    splitbyeighty    ;if found, we
                        ; have a match
CMP    temp%,#ASC("*")  ;if "*" jump
                        ; to wildcard routine
Beq    wildcardM    

Note that we are jumping directly to the labels in question rather than using a BL instruction. At the end of these routines, we will then jump straight back to the start of the loop - in other words, if either of these conditions are met the remainder of the loop will not be executed.

CMP      temp%,#ASC("?")    ;if "?"
              ; skip this comparison
CMPne   char%,temp%
MOVne   target%,target_start%
; restore pointer to start of target
; string if chars don't match
ADDne   file_backup%,file_backup%,#1
MOVne   file%,file_backup%

If the match fails, we reset the target pointer as before - but since, when searching a whole file we can no longer be certain exactly where the start of the string we want is, we also now need to reset the file pointer to the character following the last unsuccessful match. For example, if we are looking for the sequence of characters 'muring' and we come across the word 'murmuring', it is not until the second 'm' of the initial 'murm' has been loaded that we will be able to tell that the start of the word will not match; but unless we then reset the pointer nearer to the start of the word, we would subsequently check only the letters 'uring' instead of 'urmuring', thus missing the fact that a successful match should have followed the failed one. The role of file_backup% is to keep track of the start of the last unsuccessful match.

CMP    file%,end%  ;check for end of file
Blt    search
MOV    R15, R14    ;exit routine

Multiple wildcard handling

.wildcardM
MOV     old%,target%
.skipfixed
LDRB    temp%,[target%],#1
CMP     temp%,#ASC("?")
CMPne   temp%,#ASC("*")
CMPne   temp%,#13
Bne     skipfixed
            ; scan ahead for next
            ; wildcard/end of string

The basic code remains much as it was described in Archive 14.7 p16; we identify the 'fixed' section in the target string (up to the next wildcard or to the end of the string), and scan ahead in the data held in memory to locate this sequence of characters if possible. Note that we can no longer use the shortcut that checks to see whether an 'end stub' - the section of the target string between the last wildcard and the end of the string - matches the end of the search data, since it is perfectly valid for this string to be found in the middle of a paragraph. For example, in the sentence above, the target string 'wild*of' would match the phrase reading 'wildcard and the end of'; but jumping to the end of the paragraph and checking to see if the last two characters were 'o' and 'f' wouldn't be very helpful!

SUB    f_ptr%,file%,#1
;allow wildcard to match zero characters
.reset_target
ADD    store%,f_ptr%,#1
MOV    t_ptr%,old%
  .scanloop
  LDRB    temp%,[t_ptr%],#1
  LDRB    char%,[f_ptr%],#1
  CMP    temp%,char%
  Beq    scanloop
CMP    t_ptr%,target%
Beq    matched
;whole of fixed section matched - success!
CMP    char%,#10   ;LF
CMPne  temp%,#13   ;CR
Beq    jumpback    ;hit end of one
                   ;string only -
                   ;search failed
MOV    f_ptr%,store%
; move field pointer to character
; after first match
B      reset_target

Just as before, we set up temporary pointers, t_ptr% and f_ptr%, and use two nested loops to scan ahead character by character until we either locate the desired 'fixed section', namely 'of' in our example (success - jump out of loop) or hit either the end of the target string or the end of the line/paragraph (failure - jump to end of section). Note that the target string is CR-terminated but that lines in RISC OS text files are LF-terminated!

How many characters matched?

We need to keep track of how many characters the "*" wildcard actually matched, in order to work out how long the final string to be displayed will be. Yet another register, length%, is thus needed in order to keep track of this number - not only will we need to access it later in a different part of the program, but it is also possible that a given target string may contain more than one wildcard and the value will thus need to be accumulated over several calls to .wildcardM.

.matched
SUB    temp%,store%,file%
SUB    temp%,temp%,#1
ADD    length%,length%,temp%

Thanks to the use of temporary pointers, file% still has the same value that it did before we began to scan ahead; it is pointing to the start of the section we skipped, which reads 'card and the end'. Meanwhile store%, which has been incremented every time around the outer loop, is now pointing to the start of the 'fixed section' we have successfully located, the 'o' of 'of', and f_ptr%, which is incremented every time around the inner loop, and is pointing to the first character after the end of the 'fixed section'.

(In reality, due to the use of post-indexed addressing, which increments the pointer after loading a byte from that address, all three registers are actually pointing one character further along; but since only their relative values, i.e. the number of characters between the pointers, are of interest here, I have chosen to ignore this for the purposes of simplifying the explanation.)

It should now be apparent that by subtracting file% from store%, we can discover the number of characters matched by the wildcard. What we ultimately want to know is how many characters longer than the target string the overall result will be; thus we subtract 1 from this value in order to allow for the length of the wildcard itself.

RSBS     temp%,length%,#80-LEN(searchfor$)
SUBpl    target%,t_ptr%,#1
SUBpl    file%,f_ptr%,#1

Here, we check that the total length matched so far is still small enough to fit in 80 characters, taking into account the additional length of the target string, and being careful to use the S suffix this time to set the status flags. Note that this is a somewhat lazy way of doing it, using the Basic assembler and assuming that the code is to be assembled 'live' immediately before being run - otherwise, it would be necessary to pass in the length of the string currently being searched for as a separate parameter.

A positive result indicates that the found string is short enough to be a reasonable match. We set the main pointers to point to the last characters successfully matched (bear in mind the post-indexed addressing!) in preparation for jumping back to the main comparison loop.

.jumpback
Ble    restart
B      search

There are three possible routes by which this point may be reached during the execution of the program, two of which indicate that the search failed and the other that a successful wildcard match was made. In true assembler fashion, we can work it out by the condition of the status flags. Either we branched here directly when the search for the 'fixed section' failed (zero flag set), or we have just executed the instructions at label matched, in which case either the match failed because the wildcarded section was too long (negative flag set) or else it has succeeded (negative flag clear). We use the condition Less than or equal to in order to detect either of the first two possibilities and branch back to restart.

Otherwise, we are clear to continue the search. If we have reached the end of the target string, this will be detected as soon as the next two characters are loaded, causing an immediate branch to the label splitbyeighty. This final section of the program is entirely concerned with the cosmetic considerations of producing useful output; next month's article will cover it in detail.


Source: Archive Magazine 14.9
Publication: Archive Magazine
Contributor: Harriet Bazley