Assembly Language - Wildcards (part 1)

In my final two articles, I shall look at a simple example of a string comparison routine in assembler. We have already seen how it's possible to scan the contents of a Basic string array from inside an ARM code routine; however, if you don't require the full flexibility of Basic's string handling elsewhere in your program, it's not only faster but also much easier to search CR-terminated strings which have been stored in a block at fixed intervals using the $block="data" construction. In this article, I shall demonstrate a routine which will compare a string stored in this way against a wildcarded target and return a true/false value to indicate a match.

The wildcard values ('?' to match any single character and '*' to match zero or more characters) are currently hard wired in. It would be possible to supply them as Basic variables at assembly-time, i.e. CMP char%,single%, although the ARM code would, of course, have to be re-assembled if they subsequently changed.

Simple comparison

The first thing we do is to set up the register names in Basic.

target%=1:         REM points to wildcarded string we are trying to find
field%=6:          REM points to search field, i.e. area of memory we 
                       are searching
char%=0:           REM holds characters loaded from target%
temp%=3:           REM holds characters loaded from field%
match%=2:          REM holds TRUE or FALSE flag
t_ptr%=5:f_ptr%=4: REM alternative pointers
old%=10:store%=7:  REM temporary stores for values

The main loop of the program will look very similar to the character-by-character comparison we used for searching entries in the string array. However, if the characters don't match, we then check for a wildcard in our target string. A single-character wildcard counts as an automatic match whatever the contents of the other register, so we branch back. If we encounter a multiple-character wildcard we jump to the wildcardM routine in order to deal with it. (Note that assembler programming is not structured programming - those who find GOTOs abhorrent should probably cover their eyes now and read no further...)

.scantarget
   LDRB   char%,[target%], #1
   LDRB   temp%,[field%], #1

We are using post-indexed addressing here, so target% and field% are automatically incremented each time we load a character.

CMP char%,#13
BEQ testfieldend
       ;if end of target string
       ;check for end of data also
CMP char%,temp%
BEQ scantarget  ;else if chars same,
                ;compare next chars

CMP char%,#63   ;is it a query?
BEQ scantarget  ;if so, pretend it matched
CMP char%,#42   ;is it an asterisk?
BEQ wildcardM   ;branch to special routine
                ;otherwise, comparison failed
.return
MOV R0,match%   ;routine returns
                 (match%)
MOV R15, R14    ;exit routine

The label .return is our universal return point. We jump to here from any point in the code when we wish to abort the search, and we can put any 'clean-up code' in here that needs to be carried out before we drop back into Basic. In this particular case, no branch is necessary; we simply drop through the scan loop and return the value of the register match% (which is used to flag success) in R0. We initialise this register to zero before calling the machine code, and returning this default value signifies that the match has failed.

Returning results to Basic

There are various ways of returning useful values to the parent Basic program, most of which involve a shared block of data. This demonstration routine has been written to make use of Basic's USR function, which calls a block of machine code and returns with the value in R0 on exit. All we have to do is make sure that R0 contains the value we wish to pass back to Basic - in this case TRUE or FALSE.

Checking for the string terminator

.testfieldend
   FNmessage("Testing end of field") ;debugging data
   CMP    temp%,#13
   MVNEQ  match%,match%   ;match%=NOT match% i.e. TRUE
   B      return

We branch here when we detect the end of the target string. If we have also reached the end of the field, then the strings match; otherwise this is only a substring (i.e. we have found 'car' in 'cart') and the match has failed. To signal a successful match, we use the MVN instruction (MoVe Negative) to perform a logical NOT on the contents of the register, transforming zero (FALSE) to -1 (TRUE). Either way, we can then branch to .return to exit and supply the value of match% as a return parameter to Basic.

Note that we are assuming that both strings are CR-terminated.

Multiple-character wildcards

The most complicated part of the routine is the part that has to deal with wildcards matching any number of characters or none. What we have to do is to determine the length of the next 'fixed' portion of the target string - that is, the section immediately following the wildcard, up until the next wildcard or the end of the string - scan ahead in our memory data until we find it, and continue our normal search from there. If we can't find such a string, we know the match has failed.

For example, if the target string is DE*RATION and we were checking against a string DETERIORATION, we would need to skip the letters T, E, R, I and O until we found the fixed string RATION. If, however, we were checking against DETERMINATION we would check every remaining letter in the word without success and then report failure. In a further twist, if we checked against DELIBERATIONS, the wildcard would match successfully but the overall search would later fail.

Determining the fixed string

.wildcardM
   FNmessage("scan for end of fixed string")
   MOV   old%,target%

The first thing we do is to store the position of our target pointer in a spare register before we start scanning ahead.

.skipfixed
   LDRB    char%,[target%],#1
   CMP     char%,#63
   CMPNE   char%,#42
   CMPNE   char%,#13
   BNE     skipfixed ; scan ahead for
                     ; next wildcard
           ;or end of string

We then search the rest of our target string for the end of the 'fixed' section.

   CMP     char%,#13   
   BEQ     checkendstub

If we find we have reached the end of the target string without encountering another wildcard, we branch to a shortcut routine dealing with this case. This will be by far the most common situation so it is worth having a special check for it - when you call a search routine repeatedly from within a loop in another machine-code program, the speed-increase more than justifies the extra size.

Otherwise, we use a slower, more general routine that scans every character in the search field until it identifies the 'fixed string' between two wildcards.

Scanning ahead for a match

In this section of the program, we will use the registers t_ptr% and f_ptr% as temporary alternative data pointers in lieu of target% and field%, which are currently pointing respectively to the end of the 'fixed' portion of the target string and to the next unchecked character in the other string.

   FNmessage("Scanning for section between wildcards")
   SUB   f_ptr%,field%,#1

First, we move our temporary pointer back one place. This allows the wildcard to match zero characters if necessary, e.g. DR*AW??? can match DRAWING. We will proceed to move one character at a time along the search field, scanning ahead to see if each is the start of the string we are looking for, using two nested loops.

.reset_target
   ADD      store%,f_ptr%,#1
   MOV      t_ptr%,old%

At the start of the outer loop, we set up various registers. Each time around this loop, we want to restore t_ptr% to point to the start of the fixed section we are trying to match, and move the start of the search one position further along the other string. If we were looking for the RAT of DE*RAT* in DELIBERATION, we would move f_ptr% first to the L, then to the I, then to the B, while restoring t_ptr% to the R of DE*RAT* after every failed comparison.

.contloop
   LDRB     char%,[t_ptr%],#1
   LDRB     temp%,[f_ptr%],#1
   CMP      char%,temp%
   BEQ      contloop

In the inner loop we use a straight string comparison to try to match the fixed section.

   CMP      t_ptr%,target%
   SUBEQ    target%,t_ptr%,#1
   SUBEQ    field%,f_ptr%,#1
   BEQ      scantarget

target% is still pointing to the end of the fixed section in the target string. Once the characters no longer match, the first thing to check is whether t_ptr% has reached the end of the fixed string yet or not. If it has, this indicates a successful wildcard match; we reset target% and field% from the temporary pointers, decrementing them to allow for the fact that we have been using post-indexed addressing, and jump back to the start of the program.

CMP    temp%,#13
CMPNE  char%,#13
BEQ    return ;hit end of one string
       ; only - whole search failed

The next thing to check is whether we have hit the end of one of the strings before matching the fixed section - if so, we branch straight to .return. As match% is still FALSE, this will automatically signal failure.

MOV   f_ptr%,store%
  ;move field pointer to character
  ; after first match
B     reset_target

Otherwise, we branch back to the start of the outer loop. Note that since this isn't structured programming, we have perfect freedom to jump out of the middle of loops!

Special case: checking the end stubs

This routine is called when a wildcard is encountered with a 'fixed end stub' following it - in other words, when there are no more wildcards between this one and the end of the target string, like our old example DE*RATION. In this special case, we can short-cut the 'searching for fixed section' process. We still have no idea how many characters we need to skip before the start of the fixed section, but we do know where it ends; so we can just jump to the end of the string and count backwards, leaving out the whole slow outer loop, potentially giving a massive speed increase, especially since this situation is likely to crop up quite often.

This part of the program is only ever called at the end of a comparison; we never return to the .scantarget loop. Success or failure here will cause the whole search to succeed or fail!

.checkendstub
FNmessage("checking end stubs")
MVN   match%,#0     ;NOT(0) =TRUE
MOV   R5,#1
SUB   store%,target%,old%

The first thing we do is to set match% to TRUE instead of FALSE - the aim of this is to allow us to branch straight to .return on a successful match, saving an instruction inside a loop later on. R5 is to be used as a counter and so we initialise it. (To avoid confusion we write R5 and not its alias t_ptr% here, because we are just using it as a spare register and not as a pointer.)

old% was set up right at the start of the wildcard handling and holds the old position of target%. target% was then used to scan along the target string to find the end of the fixed section. The difference between the two gives us the length of the end stub we are looking for.

.findend
LDRB   temp%,[field%],#1
CMP    temp%,#13
ADD    R5,R5,#1  ;find out how many
                 ;characters
BNE    findend   ;were left

CMP    R5,store%
BLO    nomatch   ;if R5 is less, end
                 ;stub cannot match 
         ;(no room for it!)

We now quickly scan ahead, using R5 to count how many characters we have found, until we get to the end of the search field. This allows us to make a quick check to eliminate strings that obviously don't match - if we are searching for DE*RATION, we can discard DEFT without even knowing what the last two letters are, on the simple grounds that RATION won't fit into two letters.

.stubloop   
LDRB   char%,[target%,-store%]
LDRB   temp%,[field%,-store%]   
SUBS   store%,store%,#1
BEQ    return  ;entire length of
               ;stubs matched - jump
               ;straight to exit!

CMP    char%,temp%
BEQ    stubloop  ;check that all 
         ;chars are the same

We now count backwards from the end, comparing characters as we go, until either we have managed to match the entire length of the fixed section (store%=0) or we find characters that don't match. Note that here we are using pre-indexed addressing (thus the values of target% and field% remain the same each time around the loop), with -store% as an offset. It's not only possible to use negative numbers as offsets e.g. [field%,#-4], but also negative registers!

.nomatch
MOV  match%,#0  ;else match%=FALSE
FNmessage("stubs do not match")
B    return

Finally, if the match failed, we reset match% appropriately and jump straight to the exit.

How to call the routine from Basic

The address of the block holding the target (wildcarded) string must be placed in target% (R1) and the address of the block holding the search data in field% (R6). We also need to initialise match% (R2) to zero, or FALSE. We can then call our machine code at whatever address it has been assembled.

B%=block1%:G%=block2%:C%=0
answer%=USR(code%)

The Basic variable answer% will now hold either TRUE or FALSE depending on the result of the match.

Known bugs

The wildcard logic isn't terribly sophisticated and it is possible to fool it with complex queries. For example, ?? or ** work as expected, as does ?* which matches at least one letter or more, but *? does not. Can you work out why? (Hint: it's to do with the length of the 'fixed section'.)

Also, T*AN?ATION fails to match TRANSPLANT-ATION, because the program takes the first AN it can find after the wildcard and treats the subsequent failure to match as definitive, whereas properly speaking it should then go back to see if there is a correct AN later on.


Source: Archive Magazine 14.7
Publication: Archive Magazine
Contributor: Harriet Bazley