Jim, Bart, Cary,
Thanks for the responses. I had to be away for a couple of days to take
care of other stuff, which had the benefit of giving me some distance
from the problem, too.
Yes, it appears to be related to the match operator and greedy/stingy
behavior.
Here is a solution I've come up with which *SEEMS* to work:
:0 fhw
* ^content-type:(.*\<)?multipart.*\<(boundary=\/[^"; ]+|\
boundary="\/[^"; ]+)
{
TESTVAR = $MATCH
}
Since I have two mutually exclusive cases, quoted and unquoted, only one
of the two branches will match at any time:
boundary=\/[^"; ]+ or boundary="\/[^"; ]+
For a quoted string, the first branch fails right after the \/, because
the next character is the opening double-quote, but the second branch
should match until the trailing double-quote is reached.
For an unquoted string, the first branch matches the complete boundary
string, while the second branch fails before it reaches the \/, since
there is no opening double-quote.
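As a rough sanity check outside procmail, the two branches can be tried
separately in Python. This is only a sketch under assumptions: Python's re
module stands in for procmail's regex engine, a capture group stands in
for \/ and $MATCH, and the sample headers and the names UNQUOTED, QUOTED
and branch_results are made up for illustration.

```python
import re

# Python stand-ins for the two procmail branches (assumption: procmail's
# engine treats these simple bracket expressions the same way).
# The capture group plays the role of procmail's \/ ... $MATCH.
UNQUOTED = re.compile(r'boundary=([^"; ]+)')
QUOTED = re.compile(r'boundary="([^"; ]+)')


def branch_results(header):
    """Return (unquoted_capture, quoted_capture); None where a branch fails."""
    u = UNQUOTED.search(header)
    q = QUOTED.search(header)
    return (u.group(1) if u else None, q.group(1) if q else None)


# Hypothetical sample headers, not taken from real mail.
print(branch_results('Content-Type: multipart/mixed; boundary="abc123"'))
# -> (None, 'abc123')
print(branch_results('Content-Type: multipart/mixed; boundary=abc123'))
# -> ('abc123', None)
```

In each case exactly one branch yields a capture, which is the mutually
exclusive behavior described above.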
It works on well-behaved samples with both quoted and unquoted boundary
strings but doesn't handle all possible whitespace characters after the
string.
I could try to create a [...] expression with all possible characters
(except ; and " and <space>), but it would break if I miss even one
oddball character that might turn up in a boundary string. The following
looks to be adequate after analyzing a few thousand recent list mails
and testing with several clients and webmail systems...
[-+=/_.:A-Za-z0-9]
Substituting this positive bracket expression in the above recipe
(twice!) seems to work decently well as long as the strings behave
themselves.
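To illustrate the difference between the negated and the positive bracket
expression, here is a small Python sketch. Again this is an approximation
under assumptions: Python's re module is not procmail's engine, the capture
group stands in for \/ and $MATCH, and the boundary value, the trailing
tab, and the names NEGATED, POSITIVE and extract are hypothetical.

```python
import re

# [^"; ] excludes only double-quote, semicolon and the space character,
# so any other whitespace (here a tab) is still consumed.
NEGATED = re.compile(r'boundary=([^"; ]+)')
# The positive bracket expression stops at the first character not listed.
POSITIVE = re.compile(r'boundary=([-+=/_.:A-Za-z0-9]+)')


def extract(pattern, header):
    """Return the captured boundary string, or None if nothing matches."""
    m = pattern.search(header)
    return m.group(1) if m else None


# Hypothetical unquoted boundary followed by a tab.
sample = 'boundary=----=_NextPart_000_0001\t'

print(repr(extract(NEGATED, sample)))   # -> '----=_NextPart_000_0001\t'
print(repr(extract(POSITIVE, sample)))  # -> '----=_NextPart_000_0001'
```

The positive version drops the trailing tab, but it would also cut a
boundary short at any legitimate character missing from the list.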
Instead of this, it would make the most sense to match anything except
double-quote, semi-colon, or whitespace, except that we don't seem to be
allowed to use control characters, POSIX character classes or shortcuts
( e.g.: [:space:] or \s ) inside bracket expressions in procmail? This
seems to be impossible to do:
[^";<something that stands for all whitespace characters>]
I wish I could understand why.
Cary, you say that procmail doesn't allow escaping of control
characters. Does that mean things like \f \n \r \t \v will not be
understood in any context? Or just inside bracket expressions? Can you
or anyone point me to where that's documented, please?
Thanks,
Mike D.