In my final two articles, I shall look at a simple example of a string comparison routine in assembler. We have already seen how it's possible to scan the contents of a Basic string array from inside an ARM code routine; however, if you don't require the full flexibility of Basic's string handling elsewhere in your program, it's not only faster but also much easier to search CR-terminated strings which have been stored in a block at fixed intervals using the $block="data" construction. In this article, I shall demonstrate a routine which will compare a string stored in this way against a wildcarded target and return a true/false value to indicate a match.
The wildcard values ('?' to match any single character and '*' to match zero or more characters) are currently hard wired in. It would be possible to supply them as Basic variables at assembly-time, i.e. CMP char%,single%, although the ARM code would, of course, have to be re-assembled if they subsequently changed.
target%=1: REM points to wildcarded string we are trying to find field%=6: REM points to search field, i.e. area of memory we are searching char%=0: REM holds characters loaded from target% temp%=3: REM holds characters loaded from field% match%=2: REM holds TRUE or FALSE flag t_ptr%=5:f_ptr%=4: REM alternative pointers old%=10:store%=7: REM temporary stores for values
The main loop of the program will look very similar to the character-by-character comparison we used for searching entries in the string array. However, if the characters don't match, we then check for a wildcard in our target string. A single-character wildcard counts as an automatic match whatever the contents of the other register, so we branch back. If we encounter a multiple-character wildcard we jump to the wildcardM routine in order to deal with it. (Note that assembler programming is not structured programming - those who find GOTOs abhorrent should probably cover their eyes now and read no further...)
.scantarget LDRB char%,[target%], #1 LDRB temp%,[field%], #1
We are using post-indexed addressing here, so target% and field% are automatically incremented each time we load a character.
CMP char%,#13 BEQ testfieldend ;if end of target string ;check for end of data also CMP char%,temp% BEQ scantarget ;else if chars same, ;compare next chars CMP char%,#63 ;is it a query? BEQ scantarget ;if so, pretend it matched CMP char%,#42 ;is it an asterisk? BEQ wildcardM ;branch to special routine ;otherwise, comparison failed .return MOV R0,match% ;routine returns (match%) MOV R15, R14 ;exit routine
The label .return is our universal return point. We jump to here from any point in the code when we wish to abort the search, and we can put any 'clean-up code' in here that needs to be carried out before we drop back into Basic. In this particular case, no branch is necessary; we simply drop through the scan loop and return the value of the register match% (which is used to flag success) in R0. We initialise this register to zero before calling the machine code, and returning this default value signifies that the match has failed.
.testfieldend FNmessage("Testing end of field") ;debugging data CMP temp%,#13 MVNEQ match%,match% ;match%=NOT match% i.e. TRUE B return
We branch here when we detect the end of the target string. If we have also reached the end of the field, then the strings match; otherwise this is only a substring (i.e. we have found 'car' in 'cart') and the match has failed. To signal a successful match, we use the MVN instruction (MoVe Negative) to perform a logical NOT on the contents of the register, transforming zero (FALSE) to -1 (TRUE). Either way, we can then branch to .return to exit and supply the value of match% as a return parameter to Basic.
Note that we are assuming that both strings are CR-terminated.
For example, if the target string is DE*RATION and we were checking against a string DETERIORATION, we would need to skip the letters T, E, R, I and O until we found the fixed string RATION. If, however, we were checking against DETERMINATION we would check every remaining letter in the word without success and then report failure. In a further twist, if we checked against DELIBERATIONS, the wildcard would match successfully but the overall search would later fail.
.wildcardM FNmessage("scan for end of fixed string") MOV old%,target%
The first thing we do is to store the position of our target pointer in a spare register before we start scanning ahead.
.skipfixed LDRB char%,[target%],#1 CMP char%,#63 CMPNE char%,#42 CMPNE char%,#13 BNE skipfixed ; scan ahead for ; next wildcard ;or end of string
We then search the rest of our target string for the end of the 'fixed' section.
CMP char%,#13 BEQ checkendstub
If we find we have reached the end of the target string without encountering another wildcard, we branch to a shortcut routine dealing with this case. This will be by far the most common situation so it is worth having a special check for it - when you call a search routine repeatedly from within a loop in another machine-code program, the speed-increase more than justifies the extra size.
Otherwise, we use a slower, more general routine that scans every character in the search field until it identifies the 'fixed string' between two wildcards.
FNmessage("Scanning for section between wildcards") SUB f_ptr%,field%,#1
First, we move our temporary pointer back one place. This allows the wildcard to match zero characters if necessary, e.g. DR*AW??? can match DRAWING. We will proceed to move one character at a time along the search field, scanning ahead to see if each is the start of the string we are looking for, using two nested loops.
.reset_target ADD store%,f_ptr%,#1 MOV t_ptr%,old%
At the start of the outer loop, we set up various registers. Each time around this loop, we want to restore t_ptr% to point to the start of the fixed section we are trying to match, and move the start of the search one position further along the other string. If we were looking for the RAT of DE*RAT* in DELIBERATION, we would move f_ptr% first to the L, then to the I, then to the B, while restoring t_ptr% to the R of DE*RAT* after every failed comparison.
.contloop LDRB char%,[t_ptr%],#1 LDRB temp%,[f_ptr%],#1 CMP char%,temp% BEQ contloop
In the inner loop we use a straight string comparison to try to match the fixed section.
CMP t_ptr%,target% SUBEQ target%,t_ptr%,#1 SUBEQ field%,f_ptr%,#1 BEQ scantarget
target% is still pointing to the end of the fixed section in the target string. Once the characters no longer match, the first thing to check is whether t_ptr% has reached the end of the fixed string yet or not. If it has, this indicates a successful wildcard match; we reset target% and field% from the temporary pointers, decrementing them to allow for the fact that we have been using post-indexed addressing, and jump back to the start of the program.
CMP temp%,#13 CMPNE char%,#13 BEQ return ;hit end of one string ; only - whole search failed
The next thing to check is whether we have hit the end of one of the strings before matching the fixed section - if so, we branch straight to .return. As match% is still FALSE, this will automatically signal failure.
MOV f_ptr%,store% ;move field pointer to character ; after first match B reset_target
Otherwise, we branch back to the start of the outer loop. Note that since this isn't structured programming, we have perfect freedom to jump out of the middle of loops!
This part of the program is only ever called at the end of a comparison; we never return to the .scantarget loop. Success or failure here will cause the whole search to succeed or fail!
.checkendstub FNmessage("checking end stubs") MVN match%,#0 ;NOT(0) =TRUE MOV R5,#1 SUB store%,target%,old%
The first thing we do is to set match% to TRUE instead of FALSE - the aim of this is to allow us to branch straight to .return on a successful match, saving an instruction inside a loop later on. R5 is to be used as a counter and so we initialise it. (To avoid confusion we write R5 and not its alias t_ptr% here, because we are just using it as a spare register and not as a pointer.)
old% was set up right at the start of the wildcard handling and holds the old position of target%. target% was then used to scan along the target string to find the end of the fixed section. The difference between the two gives us the length of the end stub we are looking for.
.findend LDRB temp%,[field%],#1 CMP temp%,#13 ADD R5,R5,#1 ;find out how many ;characters BNE findend ;were left CMP R5,store% BLO nomatch ;if R5 is less, end ;stub cannot match ;(no room for it!)
We now quickly scan ahead, using R5 to count how many characters we have found, until we get to the end of the search field. This allows us to make a quick check to eliminate strings that obviously don't match - if we are searching for DE*RATION, we can discard DEFT without even knowing what the last two letters are, on the simple grounds that RATION won't fit into two letters.
.stubloop LDRB char%,[target%,-store%] LDRB temp%,[field%,-store%] SUBS store%,store%,#1 BEQ return ;entire length of ;stubs matched - jump ;straight to exit! CMP char%,temp% BEQ stubloop ;check that all ;chars are the same
We now count backwards from the end, comparing characters as we go, until either we have managed to match the entire length of the fixed section (store%=0) or we find characters that don't match. Note that here we are using pre-indexed addressing (thus the values of target% and field% remain the same each time around the loop), with -store% as an offset. It's not only possible to use negative numbers as offsets e.g. [field%,#-4], but also negative registers!
.nomatch MOV match%,#0 ;else match%=FALSE FNmessage("stubs do not match") B return
Finally, if the match failed, we reset match% appropriately and jump straight to the exit.
B%=block1%:G%=block2%:C%=0 answer%=USR(code%)
The Basic variable answer% will now hold either TRUE or FALSE depending on the result of the match.
Also, T*AN?ATION fails to match TRANSPLANT-ATION, because the program takes the first AN it can find after the wildcard and treats the subsequent failure to match as definitive, whereas properly speaking it should then go back to see if there is a correct AN later on.
Source: | Archive Magazine 14.7 |
Publication: | Archive Magazine |
Contributor: | Harriet Bazley |