BIZARRO! Progress, but oh my aching head!!???

11 Mar 2010

      Dear Jim,

You write:
...
As I recall, you wanted to lose the quote marks in the match; 
why not simply use a regex like:
* ... boundary="*\/[^"; ]+
In the simple tests I just ran, this extracts the boundary string,
minus double quote chars, if present.  Do you have examples of
boundary strings for which this regex fails?

Okay, Jim, I just ran a couple of tests with the following recipe, and 
it worked perfectly, stripping the quotes when present, and grepping 
the boundary string just fine in either case... 

(<space> and <tab> represent the literal characters.)

:0 fhw
* ^content-type:\s*\<multipart.*\<boundary="*\/[^";<space><tab>]+
   myvariable = $MATCH

Unless something breaks it in further testing, that's exactly the kind 
of compact, relatively readable recipe I've been trying to come up 
with... but something damn subtle is going on here!

In my own defense, I would like to note that I tried the following 
regex on March 2:

* ^content-type:.*multipart.*boundary="*\/.+

...and it included the leading quote on a quoted boundary string in the 
$MATCH. I just tested it again, and that behavior is repeatable.

So there's apparently an important difference between:

   boundary="*\/.+    and...
   boundary="*\/[^"]+

Looking for an explanation of this, I remember a discussion in one of 
the web pages I've been studying, about the processing of the \/ 
extraction operator. It can be found at:

   http://pm-doc.sourceforge.net/pm-tips.html

... Section 6.13, Understanding procmail's minimal matching (stingy vs. 
greedy).

This seems to explain what's going on, but it sure took a lot of 
thinking to see what's making Jim's version work where the one I tried 
a week ago doesn't...

According to the above page, procmail makes two passes over the string, 
the first to determine the stingiest match to the *entire* expression; 
then the second to get the greediest match to the right half of the 
expression starting with the first character left after the first pass.

If the string we're grepping is this:

    boundary="abcdefgh"

What is the shortest possible (stingy) match to each of our two regexes?

boundary="*\/.+		# should match: boundary=
						# because "* can be null

boundary="*\/[^"]+	# should also match: boundary=

In both cases, the remaining unmatched part of the string is: 

   "abcdefgh"

And the right halves of our two regexes (greedy) evaluate as:

.+						# matches: "abcdefgh"

[^"]+					# matches: abcdefgh

This last bit is what really took me some time to realize... It's the 
operation of the + that forces it to match at least one non-" 
character... meaning that it will, in fact, **skip over** double-quote 
characters in the remaining string until it finds at least one non-" 
character, and then match the rest of the string until it hits another 
double quote! <sigh>

IS THAT RIGHT??? Please!?
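
One way to probe that question outside procmail is a rough emulation in Python. This is only a sketch under loud assumptions: Python's re module is not procmail's regex engine, the "stingy" left side of \/ is imitated with a non-greedy quantifier ("*? in place of "*), and split_match is a hypothetical helper, not anything procmail provides. It prints what lands on each side of the \/ split for the test string used above.

```python
import re

# Hypothetical helper: `left` plays the part before \/ (written with
# non-greedy quantifiers, i.e. as short as possible) and `right` plays the
# part after \/ (ordinary greedy quantifiers, i.e. as long as possible).
def split_match(left, right, text):
    m = re.search("(" + left + ")(" + right + ")", text)
    return (m.group(1), m.group(2)) if m else None

text = 'boundary="abcdefgh"'

# "*? is Python's non-greedy spelling of "* (an assumption standing in
# for procmail's minimal matching of the left side).
dot_plus = split_match(r'boundary="*?', r'.+', text)
not_quote_plus = split_match(r'boundary="*?', r'[^"]+', text)

print(dot_plus)        # ('boundary=', '"abcdefgh"')
print(not_quote_plus)  # ('boundary="', 'abcdefgh')
```

In this emulation, at least, nothing is skipped: a regex match has to be contiguous, so when [^"]+ cannot start on the quote, the left side grows by one character and takes the quote itself. With .+ the left side can stop at boundary= because .+ is happy to start on the quote. Whether procmail's engine gets to the same result by the same route is not something this sketch can confirm.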

<LOL!>

And, going back to my very first attempts at this regex, I tried the ? 
modifier instead of * and it works too!

* ^content-type:\s*\<multipart.*\<boundary="?\/[^";<space><tab>]+
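
The * and ? variants can be compared side by side with the same kind of Python emulation. Again this is a sketch, not procmail: \b stands in for \<, <space> and <tab> become a literal space and \t, the left side is made non-greedy by hand, case-insensitive matching is assumed, and the two sample header lines are made up for illustration (only the abcdefgh boundary comes from the example above). boundary_pattern is a hypothetical helper name.

```python
import re

# Hypothetical helper: builds the recipe's condition with either "* or "?
# in front of the capture group. The group plays the part after \/.
def boundary_pattern(quote_quantifier):
    return re.compile(
        r'^content-type:\s*\bmultipart.*\bboundary="'
        + quote_quantifier
        + r'([^"; \t]+)',
        re.IGNORECASE,
    )

star = boundary_pattern("*?")      # non-greedy form of "*
optional = boundary_pattern("??")  # non-greedy form of "?

# Made-up sample headers, one quoted and one unquoted.
headers = [
    'Content-Type: multipart/mixed; boundary="abcdefgh"',
    'Content-Type: multipart/mixed; boundary=abcdefgh',
]

results = [(star.search(h).group(1), optional.search(h).group(1))
           for h in headers]
print(results)  # [('abcdefgh', 'abcdefgh'), ('abcdefgh', 'abcdefgh')]
```

Both quantifiers give the same extraction here because at most one quote precedes the boundary string, so "zero or more" and "zero or one" end up consuming the same thing.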

So my core problem was not understanding the subtleties of how 
extraction works... <sigh>

My next question, Jim... Did you understand all that when you made the 
suggestion? If so, my hat's off to you! <salute>

Thank you! That solves my immediate problem -- and teaches me a whole 
lot in the process. I still have a couple more questions, just to 
complete my learning from this fine experience, but that's enough for 
one e-mail. See my next, please...

Mike D.

M. G. Devour