Jim, Bart, Cary,
Thanks for the responses. I had to be away for a couple of days to take care of other stuff, which had the benefit of giving me some distance from the problem, too.
Yes, it appears to be related to the match operator and greedy/stingy behavior.
Here is a solution I've come up with which *SEEMS* to work:
:0 fhw
* ^content-type:(.*\<)?multipart.*\<(boundary=\/[^"; ]+|boundary="\/[^"; ]+)
{
TESTVAR = $MATCH
}
Since I have two mutually exclusive cases, quoted and unquoted, only one of the two branches will match at any time:
boundary=\/[^"; ]+ or boundary="\/[^"; ]+
For a quoted string, the first branch fails at the first character after the \/ (the opening double-quote, which the bracket expression excludes), but the second branch should match until the trailing double-quote is reached.
For an unquoted string, the first branch matches the complete boundary string, while the second branch will not match to the left of the \/.
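
To sanity-check the branch logic outside of procmail, here is a rough sketch in Python. It is only an analogue, not procmail's own engine: Python has no \/ operator, so a capture group in each branch stands in for $MATCH, and \b is only an approximation of procmail's \<. The names BOUNDARY_RE and extract_boundary are made up for illustration.

```python
import re

# Approximate translation of the recipe's condition. Each branch captures
# what would follow the \/ in procmail; only one branch can match.
BOUNDARY_RE = re.compile(
    r'^content-type:(?:.*\b)?multipart.*\b'
    r'(?:boundary=([^"; ]+)|boundary="([^"; ]+))',
    re.IGNORECASE | re.MULTILINE,
)

def extract_boundary(header):
    """Return the boundary string from a Content-Type header, or None."""
    m = BOUNDARY_RE.search(header)
    if m is None:
        return None
    # group(1) is the unquoted branch, group(2) the quoted branch.
    return m.group(1) or m.group(2)

quoted = 'Content-Type: multipart/mixed; boundary="----=_Part_123"'
unquoted = 'Content-Type: multipart/alternative; boundary=abc123; charset=us-ascii'

print(extract_boundary(quoted))
print(extract_boundary(unquoted))
```

In this sketch the quoted header is handled by the second branch and the unquoted one by the first, which mirrors the reasoning above. It says nothing about procmail's stingy/greedy handling around \/, which Python does not share.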
It works on well-behaved samples with both quoted and unquoted boundary strings, but it doesn't handle all possible whitespace characters after the string.
I could try to create a [...] expression with all possible characters (except ; and " and <space>), but it would break if I miss even one oddball character that might turn up in a boundary string. The following looks to be adequate after analyzing a few thousand recent list mails and testing with several clients and webmail systems...
[-+=/_.:A-Za-z0-9]
Substituting this positive bracket expression in the above recipe (twice!) seems to work decently well as long as the strings behave themselves.
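
The difference between the two bracket expressions can be illustrated with another small Python sketch. Again this is an analogue only: the optional quote ("?) is a simplification of the two-branch recipe, and NEGATED, POSITIVE and boundary_with are hypothetical names. The sample header has a tab, rather than a semicolon or space, right after the boundary.

```python
import re

# Negated class from the original recipe: stops only at '"', ';' or space.
NEGATED = re.compile(r'boundary="?([^"; ]+)')

# Positive class: stops at anything outside the listed characters,
# including tabs and other whitespace.
POSITIVE = re.compile(r'boundary="?([-+=/_.:A-Za-z0-9]+)')

def boundary_with(pattern, header):
    """Return the text captured after boundary= by the given pattern."""
    m = pattern.search(header)
    return m.group(1) if m else None

header = 'Content-Type: multipart/mixed; boundary=abc_123\tcharset=utf-8'

print(repr(boundary_with(NEGATED, header)))
print(repr(boundary_with(POSITIVE, header)))
```

Here the negated class runs past the tab and swallows the rest of the line, while the positive class stops at the end of the boundary. One caveat on the positive class: RFC 2046 also permits ' ( ) , and ? in boundary strings, so a boundary containing one of those would be cut short.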
It would make more sense to match anything except a double-quote, semicolon, or whitespace, but we don't seem to be allowed to use control characters, POSIX character classes, or shortcuts (e.g. [:space:] or \s) inside bracket expressions in procmail. This seems to be impossible to do:
[^";<something that stands for all whitespace characters>]
I wish I understood why.
Cary, you say that procmail doesn't allow escaping of control characters. Does that mean things like \f \n \r \t \v will not be understood in any context? Or just inside bracket expressions? Can you or anyone point me to where that's documented, please?
Thanks,
Mike D.