MULCH - a filter to do MULtiple CHanges
Freeware by David Mitchell - dave@zenonic.demon.co.uk
Function
This filter acts just like an editor given a succession of global change commands. It is designed to be a fast way of making several changes to a file in one go.
MULCH allows changes to be specified as character strings, but provides a number of metacharacters to make it easy to restrict changes to the start or ends of lines, change carriage returns or escapes into printable characters etc.
Usage
MULCH expects a file of change requests to tell it what changes to make. You invoke it like this:
MULCH <infile >outfile changefile
Note that MULCH is a filter and you have to use redirection to get it to read from or write to files.
The changefile can contain up to 50 lines, each containing a change request in the form:
/from/to/ comments
where 'from' and 'to' are character strings and / is any character (except *) which does not appear in either 'from' or 'to'.
Note that the first character on the line is taken as the delimiter. The 'to' string can be null, but the 'from' string may not be. The maximum length of a line is 255 characters, and each line must be terminated with a carriage return/line feed.
Lines beginning with an asterisk '*' are treated as comments, and are merely displayed by MULCH.
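The rules above for a change request line can be sketched in a few lines of Python. This is only an illustration of the format as described here, not MULCH's own code: the function name parse_change_line is hypothetical, and rejecting an empty line is an assumption, since the text does not say what MULCH does with one.

```python
def parse_change_line(line):
    """Split one changefile line into (from, to, comment).

    Returns None for a comment line (one starting with '*').
    Hypothetical helper; error texts are borrowed from the messages
    listed under Error Messages.
    """
    line = line.rstrip("\r\n")
    if line.startswith("*"):
        return None  # comment line
    if not line:
        # Assumption: an empty line counts as a malformed request.
        raise ValueError("Error in change command")
    delim = line[0]  # the first character on the line is the delimiter
    parts = line[1:].split(delim, 2)
    if len(parts) < 3:
        # Fewer than three delimiters were found on the line.
        raise ValueError("Error in change command")
    src, dst, comment = parts
    if src == "":
        raise ValueError("Source string must not be null")
    return src, dst, comment.strip()


print(parse_change_line("/this/that/ a comment"))
print(parse_change_line("?this?that?"))
print(parse_change_line("* just a comment"))
```

Because the delimiter is simply whatever character comes first, a line that starts with a blank uses the blank as its delimiter.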
The following special 'metacharacters' can appear in either string:
@c = carriage return (X'0D')
@l = line feed (X'0A')
@t = tab (X'09')
@e = escape (X'1B')
@z = control-z (EOF) (X'1A')
@@ = @
@hnn = character X'nn'
Metacharacters can be in upper or lower case.
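As a rough illustration of the table above, here is a small Python sketch that expands metacharacters into raw bytes. The names SIMPLE and decode_meta are hypothetical, and treating a trailing '@' with nothing after it as a metacharacter error is an assumption.

```python
# Single-letter metacharacters from the table above.
SIMPLE = {
    "c": b"\r",    # carriage return, X'0D'
    "l": b"\n",    # line feed, X'0A'
    "t": b"\t",    # tab, X'09'
    "e": b"\x1b",  # escape, X'1B'
    "z": b"\x1a",  # control-z, X'1A'
    "@": b"@",     # literal @
}
HEX_DIGITS = "0123456789abcdefABCDEF"


def decode_meta(text):
    """Expand @-metacharacters in a 'from' or 'to' string into bytes."""
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != "@":
            out += ch.encode("latin-1")  # ordinary character, copied as-is
            i += 1
            continue
        code = text[i + 1:i + 2].lower()  # upper or lower case is accepted
        if code == "h":
            digits = text[i + 2:i + 4]
            if len(digits) != 2 or any(d not in HEX_DIGITS for d in digits):
                raise ValueError("Error in hex string")
            out.append(int(digits, 16))
            i += 4
        elif code in SIMPLE:
            out += SIMPLE[code]
            i += 2
        else:
            raise ValueError("Error in metacharacter")
    return bytes(out)


print(decode_meta("th@h61t"))  # b'that'
print(decode_meta("@C@L"))     # b'\r\n'
```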
Just typing MULCH will produce some on-line help.
Examples
Here are some sample changefile lines.
/this/that/ - changes each occurrence of 'this' into 'that'
?this?that? comment - as above, the comment is ignored
 this  that - deletes 'this': the line starts with a blank, so the
delimiter is a blank, and there are two blanks between 'this' and
'that'. The second of these is thus the third
delimiter and ends the change request. The
word 'that' is treated as a comment
/th@h61t/this/ - changes 'that' back into 'this', since X'61'
is 'a'
/ @c/@c/ - removes a single trailing blank from all lines
that have one
/@c@l@c@l/@c@l/ - deletes all null lines, that is situations
where one carriage return/line feed sequence is
immediately followed by another
/@l@c// - also deletes null lines, more simply
/@l@t/@l / - turns leading tab character into single blank
(except on the very first line, which won't
have a preceding line feed)
/:ol@c/:ol.@c/ - 'legalises' a GML ordered list tag, turning an
':ol' on a line by itself into ':ol.'
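The whitespace examples above can be tried out with Python's bytes.replace, which makes one left-to-right pass and does not rescan text it has already replaced. Whether MULCH matches in exactly this way is an assumption; the helper name apply_changes is hypothetical.

```python
def apply_changes(data, changes):
    """Apply each (from, to) pair in turn, one global change after another."""
    for src, dst in changes:
        data = data.replace(src, dst)
    return data


# A line with a trailing blank, two null lines, then a line with a leading tab.
text = b"one \r\n\r\n\r\n\ttwo\r\n"

print(apply_changes(text, [(b" \r", b"\r")]))         # / @c/@c/
print(apply_changes(text, [(b"\r\n\r\n", b"\r\n")]))  # /@c@l@c@l/@c@l/
print(apply_changes(text, [(b"\n\r", b"")]))          # /@l@c//
print(apply_changes(text, [(b"\n\t", b"\n ")]))       # /@l@t/@l /
```

Under this model the two null-line requests differ when several null lines occur in a row: the /@c@l@c@l/@c@l/ form removes only one of the two null lines in the sample, while /@l@c// removes both.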
The last example was the reason I wrote MULCH. I had to make sure that all the GML list tags in a set of large files were strictly legal. Thus an ':ol' on a line by itself had to be changed to ':ol.', while an ':ol.' was to be left untouched. It was taking too long to do the work using a text editor (I had over 600K of text in lots of files to process) and none of the other tools I tried would do the job without a lot of fiddling about. With MULCH and a simple file of change requests the job was almost trivial.
Note that, with one exception described below, MULCH tries to work just like an editor would if you entered the change commands one after the other. If you wanted to change the word 'the' to 'THE' without also changing 'then' to 'THEn' or 'rather' to 'raTHEr' etc, here's the change file you'd need:
/ the / THE / deal with all occurrences within a line
/@lthe /@lTHE / deal with all occurrences at the start of a line
/ the@c/ THE@c/ deal with all occurrences at the end of a line
This will almost do - it won't change 'the' if it occurs at the very beginning of the file, and it won't deal with 'the' preceded or followed by punctuation (commas etc) or tab characters.
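To see the three requests at work, they can be modelled in Python as successive bytes.replace calls, with @l and @c written as "\n" and "\r". This is a sketch under the assumption that each request behaves like one global, non-overlapping replace; the sample text is made up.

```python
# The three change requests, with metacharacters already expanded.
changes = [
    (b" the ", b" THE "),    # within a line
    (b"\nthe ", b"\nTHE "),  # at the start of a line
    (b" the\r", b" THE\r"),  # at the end of a line
]

text = b"the cat and the dog, then the\r\nthe end of the\r\n"

result = text
for src, dst in changes:
    result = result.replace(src, dst)

print(result)
# b'the cat and THE dog, then THE\r\nTHE end of THE\r\n'
```

The 'the' at the very start of the sample is left alone, as is 'then', which matches the limitations just described.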
The MULCH Algorithm
Any program that tries to handle multiple global changes has to handle some awkward issues. The MULCH algorithm is a compromise between speed, simplicity and avoidance of infinite loops. Basically MULCH:
- reads a chunk of up to 20K of the input file into a buffer.
- makes one pass through this buffer for each change command in turn, writing the changed data to a second buffer. Obviously the size of the chunk may increase or decrease during this process. MULCH will complain if the data in the second buffer expands to more than 28K during a pass.
- at the end of each pass MULCH switches the buffers over, treating what was the target buffer as the source and vice versa. No data is moved, it just switches some pointers.
- when all passes are complete the chunk is written out and the next 20K's worth of data is processed.
- if a match is in progress when the end of a chunk is reached, MULCH reads a byte at a time until the matching either fails or succeeds. Thus matches that span from chunk to chunk are dealt with properly (with one exception, described under Warning below).
The effect of this algorithm is that not much time is spent shuffling data about, I/O is generally fast, spill files and infinite loops are avoided and most situations are properly handled.
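The chunk-and-swap scheme described above can be sketched in Python as follows. This is a simplified model, not MULCH's actual code: the names CHUNK, LIMIT, process_chunk and run_filter are hypothetical, and the byte-at-a-time read for matches that span two chunks is deliberately left out to keep the sketch short.

```python
import io

CHUNK = 20 * 1024  # bytes read from the input at a time
LIMIT = 28 * 1024  # capacity of each of the two data buffers


def process_chunk(chunk, changes):
    """Run one pass per change request, swapping two buffers between passes."""
    source, target = bytearray(chunk), bytearray()
    for src, dst in changes:
        target.clear()
        i = 0
        while i < len(source):
            if source.startswith(src, i):
                target += dst           # write the replacement
                i += len(src)
            else:
                target.append(source[i])  # copy the byte unchanged
                i += 1
            if len(target) > LIMIT:
                raise OverflowError(
                    "Changes have increased text beyond buffer size")
        # Swap roles: the target of this pass is the source of the next.
        source, target = target, source
    return bytes(source)


def run_filter(infile, outfile, changes):
    """Read, change and write one chunk at a time."""
    while True:
        chunk = infile.read(CHUNK)
        if not chunk:
            break
        outfile.write(process_chunk(chunk, changes))


src = io.BytesIO(b"this and this\r\n")
dst = io.BytesIO()
run_filter(src, dst, [(b"this", b"that"), (b"that", b"those")])
print(dst.getvalue())  # b'those and those\r\n'
```

Each request sees the output of the one before it, so the second request here changes the 'that' produced by the first, just as it would in an editor.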
Warning
There is one situation in which MULCH doesn't do quite what it should. Imagine that just at the end of a chunk the following string occurs:
....ABCDEFGHI....
where the E is the last character in a chunk. Now if the change request file looks like this:
/FGH/XYZ/
/EF/12/
Then you'd expect the result to be:
....ABCDEXYZI....
since the first change should be applied first. But this won't happen. On the first pass, MULCH will stop at the E, since no match is in effect at that point. Thus it won't see the 'F' yet. On the second pass through the same data a match will be in effect at the end, and the next byte, 'F', will be read and 'EF' will be changed to '12'. So the result will actually be:
....ABCD12GHI....
I can't see any simple way round this that won't seriously impact MULCH's performance. As you can see, the problem is confined to certain types of overlapping changes, so the best course is to avoid these - doing them as entirely separate runs. I can see a complicated way of solving the problem but I'd be grateful for suggestions.
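The behaviour described in this warning can be reproduced with a small Python model that uses a tiny chunk size. It is a sketch built only from the description above: the name mulch_sim is hypothetical, and the way extra bytes are pulled in while a match is in progress is an assumption about how MULCH works.

```python
def mulch_sim(data, changes, chunk_size):
    """Model of chunked processing with byte-at-a-time lookahead."""
    out = bytearray()
    pos = 0
    while pos < len(data):
        buf = bytearray(data[pos:pos + chunk_size])
        pos += len(buf)
        for src, dst in changes:
            target = bytearray()
            i = 0
            while i < len(buf):
                # A match is in progress at the end of the chunk:
                # read one more byte at a time until it fails or succeeds.
                while (len(buf) - i < len(src)
                       and src.startswith(bytes(buf[i:]))
                       and pos < len(data)):
                    buf.append(data[pos])
                    pos += 1
                if buf.startswith(src, i):
                    target += dst
                    i += len(src)
                else:
                    target.append(buf[i])
                    i += 1
            buf = target  # the next pass reads what this pass wrote
        out += buf
    return bytes(out)


data = b"ABCDEFGHI"
changes = [(b"FGH", b"XYZ"), (b"EF", b"12")]

print(mulch_sim(data, changes, chunk_size=100))  # b'ABCDEXYZI'
print(mulch_sim(data, changes, chunk_size=5))    # b'ABCD12GHI'

# Workaround: apply the overlapping changes as two separate runs.
first = mulch_sim(data, [(b"FGH", b"XYZ")], chunk_size=5)
print(mulch_sim(first, [(b"EF", b"12")], chunk_size=5))  # b'ABCDEXYZI'
```

With a chunk size of 5 the E is the last character of the first chunk, so the second request pulls in the F before the first request has had a chance to see it. Running the two requests separately gives the expected result in this model.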
Error Messages
The following error messages may appear:
>> Missing Changefile name
This occurs if the command line has nothing but blanks
>> Error reading input file
>> Error opening changefile
>> Error reading changefile
>> Error writing output file
These occur if DOS reports I/O errors
>> Source string must not be null
A request of the form //this/ does not make any sense to MULCH.
>> Error in hex string
If @h is not followed by a pair of valid hex digits this message
will appear. Note that upper or lower case digits are allowed.
>> Error in change command
This appears if fewer than three delimiters are found on a line.
>> Error in metacharacter
This appears if @ is followed by a character other than c, l, t, e, z, @ or h.
>> Too many change commands
MULCH is limited to 50 change requests - and the change buffer is
limited to 4k in size.
>> Changes have increased text beyond buffer size
MULCH cannot continue
As explained above, MULCH reads a file in 20K chunks. If a chunk
expands beyond 28K in size during the application of the set of
change requests, processing is terminated.
>> Insufficient memory - MULCH needs 64K
MULCH uses a full 64K of memory - it has two 28K buffers for data
and a 4K buffer for change requests. The code is about 2K in size.