Last month, I gave an example of how to carry out a wildcarded search of a block of strings held in memory. In the final two parts of this series, I shall demonstrate how to adapt our basic ARM code routine to perform a task with slightly different requirements - to scan an entire text file for a given (wildcarded) string and print out the results in context.
The most important difference to be borne in mind is that the data to be checked will not, in this case, consist of relatively short CR-terminated strings, but of long sections of text separated by linefeeds (CHR$ 10). For the purposes of this program, the multiple wildcard character '*' will match any character except a linefeed; in other words, strings split across more than one paragraph cannot be matched. This seems a reasonable restriction.
The output will be in the form of single lines of text not more than 80 characters wide (scrolls nicely in a task window), centred around the found string. This implies that this string itself cannot be more than 80 characters in length, which might be a problem if we ever actually want to find a very long string; however, the most likely scenario for this to occur would be if a wildcard matched a much larger block of text than expected, when it would almost certainly not be the string the user was looking for in any case.
searchfor$="wild*ing" DIM find% 40:$find%=searchfor$ REM enter search string here name$="ADFS::4.$.testfile" REM alter filename as required SYS"OS_File",23,name$ TO found%,,,, len% IF found%=0 THEN ERROR 255,"File not found" REM get file length DIM block% len%+2 SYS "OS_File",16,name$,1+block%,0 REM load file into block ?block%=10 ?(block%+len%+2)=10
To keep the machine code simple, I have defined paragraphs as sections of text between two LF (CHR$ 10) characters. For the first and last paragraphs, this may not be the case so we take the precaution of making the block two bytes longer than necessary, loading the file into block%+1 and poking an extra LF in at the beginning and end of the file. This isn't necessary for the actual search, but it prevents random data from neighbouring sections of memory being accidentally printed out when we are scanning for the 'context' of a string found in the first or last lines.
PROCassemble F%=block%+1 REM R5 points to start of file H%=block%+len%+2 REM R7 points to end of file G%=find% REM R6 points to search string PRINT name$:REM display name of file being searched CALL ourcode% END
Since the machine code itself will be handling the formatting and the output of the results (covered in next month's article), almost all we have to do here is assemble and call it. There are only three parameters - the address of the start of the file in memory, the address of its end and the address of the search string - and, as before, we use the special variables F%, G% and H% to place the values directly into the relevant registers from Basic.
file%=5:char%=4 temp%=1:end%=7 target%=0 target_start%=6 file_backup%=2 old%=10 f_ptr%=12:t_ptr%=3 store%=9:length%=8
Eagle-eyed readers may notice that, while most of the register names (and functions) are similar to those of previous articles, many of them have ended up being assigned to different registers! The only vital assignments are target% and temp%, which need to be R0 and R1 respectively when the SWI "OS_WriteN" is called at the end of the program.
.restart MOV target%,target_start% MOV file_backup%,file% MOV length%,#0
Here we point target% to the start of the target string, initialise file_backup%, and zero the length counter. Later on, we branch back to .restart in order to restart the search.
.search LDRB char%,[file%],#1 LDRB temp%,[target%],#1 CMP temp%,#13 ;check for end of ; target string Beq splitbyeighty ;if found, we ; have a match CMP temp%,#ASC("*") ;if "*" jump ; to wildcard routine Beq wildcardM
Note that we are jumping directly to the labels in question rather than using a BL instruction. At the end of these routines, we will then jump straight back to the start of the loop - in other words, if either of these conditions are met the remainder of the loop will not be executed.
CMP temp%,#ASC("?") ;if "?" ; skip this comparison CMPne char%,temp% MOVne target%,target_start% ; restore pointer to start of target ; string if chars don't match ADDne file_backup%,file_backup%,#1 MOVne file%,file_backup%
If the match fails, we reset the target pointer as before - but since, when searching a whole file we can no longer be certain exactly where the start of the string we want is, we also now need to reset the file pointer to the character following the last unsuccessful match. For example, if we are looking for the sequence of characters 'muring' and we come across the word 'murmuring', it is not until the second 'm' of the initial 'murm' has been loaded that we will be able to tell that the start of the word will not match; but unless we then reset the pointer nearer to the start of the word, we would subsequently check only the letters 'uring' instead of 'urmuring', thus missing the fact that a successful match should have followed the failed one. The role of file_backup% is to keep track of the start of the last unsuccessful match.
CMP file%,end% ;check for end of file Blt search MOV R15, R14 ;exit routine
.wildcardM MOV old%,target% .skipfixed LDRB temp%,[target%],#1 CMP temp%,#ASC("?") CMPne temp%,#ASC("*") CMPne temp%,#13 Bne skipfixed ; scan ahead for next ; wildcard/end of string
The basic code remains much as it was described in Archive 14.7 p16; we identify the 'fixed' section in the target string (up to the next wildcard or to the end of the string), and scan ahead in the data held in memory to locate this sequence of characters if possible. Note that we can no longer use the shortcut that checks to see whether an 'end stub' - the section of the target string between the last wildcard and the end of the string - matches the end of the search data, since it is perfectly valid for this string to be found in the middle of a paragraph. For example, in the sentence above, the target string 'wild*of' would match the phrase reading 'wildcard and the end of'; but jumping to the end of the paragraph and checking to see if the last two characters were 'o' and 'f' wouldn't be very helpful!
SUB f_ptr%,file%,#1 ;allow wildcard to match zero characters .reset_target ADD store%,f_ptr%,#1 MOV t_ptr%,old% .scanloop LDRB temp%,[t_ptr%],#1 LDRB char%,[f_ptr%],#1 CMP temp%,char% Beq scanloop CMP t_ptr%,target% Beq matched ;whole of fixed section matched - success! CMP char%,#10 ;LF CMPne temp%,#13 ;CR Beq jumpback ;hit end of one ;string only - ;search failed MOV f_ptr%,store% ; move field pointer to character ; after first match B reset_target
Just as before, we set up temporary pointers, t_ptr% and f_ptr%, and use two nested loops to scan ahead character by character until we either locate the desired 'fixed section', namely 'of' in our example (success - jump out of loop) or hit either the end of the target string or the end of the line/paragraph (failure - jump to end of section). Note that the target string is CR-terminated but that lines in RISC OS text files are LF-terminated!
.matched SUB temp%,store%,file% SUB temp%,temp%,#1 ADD length%,length%,temp%
Thanks to the use of temporary pointers, file% still has the same value that it did before we began to scan ahead; it is pointing to the start of the section we skipped, which reads 'card and the end'. Meanwhile store%, which has been incremented every time around the outer loop, is now pointing to the start of the 'fixed section' we have successfully located, the 'o' of 'of', and f_ptr%, which is incremented every time around the inner loop, and is pointing to the first character after the end of the 'fixed section'.
(In reality, due to the use of post-indexed addressing, which increments the pointer after loading a byte from that address, all three registers are actually pointing one character further along; but since only their relative values, i.e. the number of characters between the pointers, are of interest here, I have chosen to ignore this for the purposes of simplifying the explanation.)
It should now be apparent that by subtracting file% from store%, we can discover the number of characters matched by the wildcard. What we ultimately want to know is how many characters longer than the target string the overall result will be; thus we subtract 1 from this value in order to allow for the length of the wildcard itself.
RSBS temp%,length%,#80-LEN(searchfor$) SUBpl target%,t_ptr%,#1 SUBpl file%,f_ptr%,#1
Here, we check that the total length matched so far is still small enough to fit in 80 characters, taking into account the additional length of the target string, and being careful to use the S suffix this time to set the status flags. Note that this is a somewhat lazy way of doing it, using the Basic assembler and assuming that the code is to be assembled 'live' immediately before being run - otherwise, it would be necessary to pass in the length of the string currently being searched for as a separate parameter.
A positive result indicates that the found string is short enough to be a reasonable match. We set the main pointers to point to the last characters successfully matched (bear in mind the post-indexed addressing!) in preparation for jumping back to the main comparison loop.
.jumpback Ble restart B search
There are three possible routes by which this point may be reached during the execution of the program, two of which indicate that the search failed and the other that a successful wildcard match was made. In true assembler fashion, we can work it out by the condition of the status flags. Either we branched here directly when the search for the 'fixed section' failed (zero flag set), or we have just executed the instructions at label matched, in which case either the match failed because the wildcarded section was too long (negative flag set) or else it has succeeded (negative flag clear). We use the condition Less than or equal to in order to detect either of the first two possibilities and branch back to restart.
Otherwise, we are clear to continue the search. If we have reached the end of the target string, this will be detected as soon as the next two characters are loaded, causing an immediate branch to the label splitbyeighty. This final section of the program is entirely concerned with the cosmetic considerations of producing useful output; next month's article will cover it in detail.
Source: | Archive Magazine 14.9 |
Publication: | Archive Magazine |
Contributor: | Harriet Bazley |