Dear Jim,

You write:
As I recall, you wanted to lose the quote marks in the match; why not simply use a regex like:
* ... boundary="*\/[^"; ]+
In the simple tests I just ran, this extracts the boundary string, minus double quote chars, if present. Do you have examples of boundary strings for which this regex fails?
Okay, Jim, I just ran a couple of tests with the following recipe, and it worked perfectly, stripping the quotes when present, and grepping the boundary string just fine in either case... (<space> and <tab> represent the literal characters.)

:0 fhw
* ^content-type:\s*\<multipart.*\<boundary="*\/[^";<space><tab>]+
myvariable = $MATCH

Unless something breaks it in further testing, that's exactly the kind of compact, relatively readable recipe I've been trying to come up with... but something damn subtle is going on here!

In my own defense, I would like to note that I tried the following regex on March 2:

* ^content-type:.*multipart.*boundary="*\/.+

...and it included the leading quote on a quoted boundary string in the $MATCH. I just tested it again, and that behavior is repeatable. So there's apparently an important difference between:

boundary="*\/.+

and...

boundary="*\/[^"]+

Looking for an explanation of this, I remembered a discussion in one of the web pages I've been studying, about the processing of the \/ extraction operator. It can be found at:

http://pm-doc.sourceforge.net/pm-tips.html

... Section 6.13, Understanding procmail's minimal matching (stingy vs. greedy).

This seems to explain what's going on, but it sure took a lot of thinking to see what's making Jim's version work where the one I tried a week ago doesn't...

According to the above page, procmail makes two passes over the string: the first to determine the stingiest match to the *entire* expression; then the second to get the greediest match to the right half of the expression, starting with the first character left after the first pass.

If the string we're grepping is this:

boundary="abcdefgh"

What is the shortest possible (stingy) match to our two regexes?
boundary="*\/.+       # should match: boundary=   # because "* can be null
boundary="*\/[^"]+    # should also match: boundary=

In both cases, the remaining unmatched part of the string is:

"abcdefgh"

And the right halves of our two regexes (greedy) evaluate as:

.+       # matches: "abcdefgh"
[^"]+    # matches: abcdefgh

This last bit is what really took me some time to realize... It's the operation of the + that forces it to match at least one non-" character... meaning that it will, in fact, **skip over** double-quote characters in the remaining string until it finds at least one non-" character, and then match the rest of the string until it hits another double quote! <sigh>

IS THAT RIGHT??? Please!? <LOL!>

And, going back to my very first attempts at this regex, I tried the ? modifier instead of * and it works too!

* ^content-type:\s*\<multipart.*\<boundary="?\/[^";<space><tab>]+

So my core problem was not understanding the subtleties of how extraction works... <sigh>

My next question, Jim... Did you understand all that when you made the suggestion? If so, my hat's off to you! <salute>

Thank you! That solves my immediate problem -- and teaches me a whole lot in the process.

I still have a couple more questions, just to complete my learning from this fine experience, but that's enough for one e-mail, however. See my next, please...

Mike D.
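
For anyone who wants to poke at the stingy/greedy split outside procmail, here is a rough Python sketch of the cases discussed above. It is only an emulation under a stated assumption: a lazy quantifier on the left of the split plus a greedy capture group on the right stands in for procmail's \/ operator. The helper name extract() is made up for illustration, and <space> and <tab> are written out as a literal space and \t.

```python
import re

def extract(left, right, text):
    """Emulate a procmail-style  left\\/right  condition (assumption, not procmail itself).

    `left` should use lazy quantifiers (stingy side); `right` is captured
    greedily and returned, roughly what would end up in $MATCH.
    """
    m = re.search(f"{left}({right})", text, re.IGNORECASE)
    return m.group(1) if m else None

quoted = 'boundary="abcdefgh"'
unquoted = 'boundary=abcdefgh; charset=us-ascii'

# boundary="*\/.+   : the lazy "* takes nothing, so .+ keeps the leading quote
print(extract(r'boundary="*?', r'.+', quoted))            # "abcdefgh"

# boundary="*\/[^"]+ : [^"]+ cannot start on a quote, so the left side takes it
print(extract(r'boundary="*?', r'[^"]+', quoted))         # abcdefgh

# boundary="?\/[^"; \t]+ : the ? variant behaves the same way here
print(extract(r'boundary="??', r'[^"; \t]+', quoted))     # abcdefgh

# Unquoted boundary: extraction stops at the semicolon
print(extract(r'boundary="*?', r'[^"; \t]+', unquoted))   # abcdefgh
```

In this emulation the quoted case works because the lazy "* is forced to take the quote once [^"]+ cannot begin on it. Whether procmail's own engine arrives at the same result by the same route is not something this sketch can confirm; it only reproduces the results reported in the tests above.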