Followup on egrep regex problem...

M.G. Devour

8 Mar 2010

5:50 a.m.

Jim, Bart, Cary,

Thanks for the responses. I had to be away for a couple of days to take care of other stuff, which had the benefit of giving me some distance from the problem, too.

Yes, it appears to be related to the match operator and greedy/stingy behavior. Here is a solution I've come up with which *SEEMS* to work:

    :0 fhw
    * ^content-type:(.*\<)?multipart.*\<(boundary=\/[^"; ]+|boundary="\/[^"; ]+)
    { TESTVAR = $MATCH }

Since I have two mutually exclusive cases, quoted and unquoted, only one of the two branches will match at any time:

    boundary=\/[^"; ]+
or
    boundary="\/[^"; ]+

For a quoted string, the first branch will stop matching on the first character after the \/, but the second branch should match until the trailing double-quote is reached. For an unquoted string, the first branch matches the complete boundary string, while the second branch will not match to the left of the \/.

It works on well-behaved samples with both quoted and unquoted boundary strings but doesn't handle all possible whitespace characters after the string.

I could try to create a [...] expression with all possible characters (except ; and " and <space>), but it would break if I miss even one oddball character that might turn up in a boundary string. The following looks to be adequate after analyzing a few thousand recent list mails and testing with several clients and webmail systems...

    [-+=/_.:A-Za-z0-9]

Substituting this positive bracket expression in the above recipe (twice!) seems to work decently well as long as the strings behave themselves.

Instead of this, it would make the most sense to match anything except double-quote, semi-colon, or whitespace, except that we don't seem to be allowed to use control characters, POSIX character classes or shortcuts (e.g.: [:space:] or \s) inside bracket expressions in procmail? This seems to be impossible to do:

    [^";<something that stands for all whitespace characters>]

I wish I could understand why?

Cary, you say that procmail doesn't allow escaping of control characters. Does that mean things like \f \n \r \n \t \v will not be understood in any context? Or just inside bracket expressions? Can you or anyone point me to where that's documented, please?

Thanks,
Mike D.
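For readers who want to try the positive bracket expression outside procmail, here is a minimal Python sketch. It is only an approximation: a capture group stands in for procmail's \/ operator, \b stands in for \<, and an optional quote replaces the two-branch alternation. The function name and the sample headers are made up for illustration.

```python
import re

# Mike's positive bracket expression, copied from the message above.
BOUNDARY_CHARS = r'[-+=/_.:A-Za-z0-9]'

# Assumptions: the capture group plays the role of procmail's \/ operator,
# \b plays the role of \<, and "? replaces the quoted/unquoted alternation.
PATTERN = re.compile(
    r'^content-type:.*\bmultipart.*\bboundary="?(' + BOUNDARY_CHARS + r'+)',
    re.IGNORECASE,
)

def extract_boundary(header_line):
    """Return the boundary string from one unfolded Content-Type line, or None."""
    match = PATTERN.search(header_line)
    return match.group(1) if match else None

# Hypothetical sample headers, one quoted and one unquoted.
quoted = 'Content-Type: multipart/mixed; boundary="----=_Part_42.abc"'
unquoted = 'Content-Type: multipart/alternative; boundary=simple-boundary_1'

print(extract_boundary(quoted))
print(extract_boundary(unquoted))
```

Because the class lists allowed characters rather than excluded ones, the match stops at the closing quote, a semicolon, or any whitespace without having to name them.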


Jim Osborn

8 Mar

3:02 p.m.

New subject: Followup on egrep regex problem...

On Sun, Mar 07, 2010 at 11:50:19PM -0500, Mike Devour wrote:

...

Here is a solution I've come up with which *SEEMS* to work:

    :0 fhw
    * ^content-type:(.*\<)?multipart.*\<(boundary=\/[^"; ]+|boundary="\/[^"; ]+)
    { TESTVAR = $MATCH }

As I recall, you wanted to lose the quote marks in the match; why not simply use a regex like:

    * ... boundary="*\/[^"; ]+

In the simple tests I just ran, this extracts the boundary string, minus double quote chars, if present. Do you have examples of boundary strings for which this regex fails?

...

... it would make the most sense to match anything except double-quote, semi-colon, or whitespace, except that we don't seem to be allowed to use control characters, POSIX character classes or shortcuts ( e.g.: [:space:] or \s ) inside bracket expressions in procmail? This seems to be impossible to do:

[^";<something that stands for all whitespace characters>]

I wish I could understand why?

Cary, you say that procmail doesn't allow escaping of control characters. Does that mean things like \f \n \r \n \t \v will not be understood in any context? Or just inside bracket expressions? Can you or anyone point me to where that's documented, please?

man procmailrc, in the MISCELLANEOUS section says:

    The regular expression engine built into procmail does not support named character classes.

I don't see any explicit mention of the \f... syntax but I don't think it's allowed.

What's wrong with matching "anything except double-quote, semi-colon, or whitespace" with [^";XXX] where "XXX" is your favorite set of space, tab, ^M, etc.?

Jim

M. G. Devour

11 Mar

1:06 a.m.

New subject: BIZARRO! Progress, but oh my aching head!!???

Dear Jim, You write:

...

As I recall, you wanted to lose the quote marks in the match; why not simply use a regex like:

* ... boundary="*\/[^"; ]+

In the simple tests I just ran, this extracts the boundary string, minus double quote chars, if present. Do you have examples of boundary strings for which this regex fails?

Okay, Jim, I just ran a couple of tests with the following recipe, and it worked perfectly, stripping the quotes when present, and grepping the boundary string just fine in either case... (<space> and <tab> represent the literal characters.)

    :0 fhw
    * ^content-type:\s*\<multipart.*\<boundary="*\/[^";<space><tab>]+
    myvariable = $MATCH

Unless something breaks it in further testing, that's exactly the kind of compact, relatively readable recipe I've been trying to come up with... but something damn subtle is going on here!

In my own defense, I would like to note that I tried the following regex on March 2:

    * ^content-type:.*multipart.*boundary="*\/.+

...and it included the leading quote on a quoted boundary string in the $MATCH. I just tested it again, and that behavior is repeatable. So there's apparently an important difference between:

    boundary="*\/.+
and...
    boundary="*\/[^"]+

Looking for an explanation of this, I remember a discussion in one of the web pages I've been studying, about the processing of the \/ extraction operator. It can be found at:

    http://pm-doc.sourceforge.net/pm-tips.html

... Section 6.13, Understanding procmail's minimal matching (stingy vs. greedy).

This seems to explain what's going on, but it sure took a lot of thinking to see what's making Jim's version work where the one I tried a week ago doesn't...

According to the above page, procmail makes two passes over the string, the first to determine the stingiest match to the *entire* expression; then the second to get the greediest match to the right half of the expression starting with the first character left after the first pass.

If the string we're grepping is this:

    boundary="abcdefgh"

What is the shortest possible (stingy) match to our two regex's?

    boundary="*\/.+       # should match: boundary=   # because "* can be null
    boundary="*\/[^"]+    # should also match: boundary=

In both cases, the remaining unmatched part of the string is:

    "abcdefgh"

And the right half of our two regex's (greedy) evaluate as:

    .+       # matches: "abcdefgh"
    [^"]+    # matches: abcdefgh

This last bit is what really took me some time to realize... It's the operation of the + that forces it to match at least one non-" character... meaning that it will, in fact, **skip over** double-quote characters in the remaining string until it finds at least one non-" character, and then match the rest of the string until it hits another double quote! <sigh>

IS THAT RIGHT??? Please!? <LOL!>

And, going back to my very first attempts at this regex, I tried the ? modifier instead of * and it works too!

    * ^content-type:\s*\<multipart.*\<boundary="?\/[^";<space><tab>]+

So my core problem was not understanding the subtleties of how extraction works... <sigh>

My next question, Jim... Did you understand all that when you made the suggestion? If so, my hat's off to you! <salute> Thank you! That solves my immediate problem -- and teaches me a whole lot in the process.

I still have a couple more questions, just to complete my learning from this fine experience, but that's enough for one e-mail. See my next, please...

Mike D.

Bart Schaefer

4:25 p.m.

New subject: BIZARRO! Progress, but oh my aching head!!???

On Wed, Mar 10, 2010 at 4:06 PM, M. G. Devour <mdevour@eskimo.com> wrote:

...

Dear Jim,

So there's apparently an important difference between:

boundary="*\/.+ and... boundary="*\/[^"]+

Yes, I believe I explained this on the original thread several days ago.

...

According to the above page, procmail makes two passes over the string, the first to determine the stingiest match to the *entire* expression; then the second to get the greediest match to the right half of the expression starting with the first character left after the first pass.

Re-read what you just wrote there, particularly the emphasis on "entire".

...

If the string we're grepping is this:

boundary="abcdefgh"

What is the shortest possible (stingy) match to our two regex's?

You're asking the wrong question. The correct question is "What is the shortest possible match that also results in the entire expression matching?"

...

boundary="*\/.+ # should match: boundary= # because "* can be null

boundary="*\/[^"]+ # should also match: boundary=

Wrong. Are you familiar with Perl? The equivalent expression is this:

    boundary="*?[^"]+

In other words, you need to consider what would be matched with the \/ operator removed from the expression, and then insert the break between matched portions as far to the left as possible.

...

In both cases, the remaining unmatched part of the string is:

"abcdefgh"

No, there is no "remaining unmatched portion". Emphasis on "entire", above.

...

And the right half of our two regex's (greedy) evaluate as:

.+ # matches: "abcdefgh"

[^"]+ # matches: abcdefgh

This last bit is what really took me some time to realize... It's the operation of the + that forces it to match at least one non-" character... meaning that it will, in fact, **skip over** double-quote characters in the remaining string until it finds at least one non-" character, and then match the rest of the string until it hits another double quote! <sigh>

IS THAT RIGHT??? Please!?

No. The + forces [^"] to match at least one non-double-quote, which (when matching the entire string on the first pass) forces "* to consume the double-quote as part of the left portion.

...

And, going back to my very first attempts at this regex, I tried the ? modifier instead of * and it works too!

Same explanation.
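Bart's Perl analogy can be tried directly in Python, whose lazy quantifier behaves the same way as Perl's. This is a sketch of the analogy only, not of procmail's actual engine: the capture group marks where \/ would sit, and the helper name is hypothetical.

```python
import re

HEADER = 'boundary="abcdefgh"'

def split_at_extraction(pattern, text):
    """Return (left, right): the part matched before the capture group and
    the captured part, mimicking the break that \\/ inserts in procmail."""
    match = re.search(pattern, text)
    if match is None:
        return None
    return text[match.start():match.start(1)], match.group(1)

# Assumption: "*? (lazy) plus a greedy capture group approximates procmail's
# "shortest left side that still lets the whole expression match".
dot_version = split_at_extraction(r'boundary="*?(.+)', HEADER)
class_version = split_at_extraction(r'boundary="*?([^"]+)', HEADER)

print(dot_version)    # left side stays 'boundary=' because .+ accepts the quote
print(class_version)  # left side grows to 'boundary="' because [^"]+ cannot start on a quote
```

With .+ on the right, the left side can stay as short as `boundary=` and the quote ends up in the extracted text. With [^"]+ on the right, the overall match fails unless "* consumes the quote, so the break moves one character to the right.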

M. G. Devour

1:11 a.m.

New subject: Followup on egrep regex problem...

Dear Jim, You wrote:

...

What's wrong with matching "anything except double-quote, semi-colon, or whitespace" with [^";XXX] where "XXX" is your favorite set of space, tab, ^M, etc.?

I again want to ask how one goes about getting "your favorite set of space, tab, ^M, etc." into the bracket expression?

Using nano on the host machine via an ssh session, I seem to be able to type in a tab character inside the [^"; <tab>] and at least it doesn't break anything badly enough to prevent the recipe from working. (I would have to create a test message where the string was delimited by a tab character to prove that it actually *works*!)

But since escaped control characters (\t, \n, etc.) don't seem to be allowed inside the [...] of a character class in procmail, just what sort of syntax am I allowed to use? What would it look like? Would I have to use variables, perhaps?

Thank you,
Mike D.

Bart Schaefer

8 Mar

8:17 p.m.

New subject: Followup on egrep regex problem...

On Sun, Mar 7, 2010 at 8:50 PM, M.G. Devour <mdevour@eskimo.com> wrote:

...

Here is a solution I've come up with which *SEEMS* to work:

    :0 fhw
    * ^content-type:(.*\<)?multipart.*\<(boundary=\/[^"; ]+|boundary="\/[^"; ]+)

I don't believe there's any guarantee that using \/ twice in the same RE will do the right thing. If it seems to be working, though ...

You might find http://www.well.com/user/barts/email/mimepart.txt to be useful.

...

Instead of this, it would make the most sense to match anything except double-quote, semi-colon, or whitespace, except that we don't seem to be allowed to use control characters, POSIX character classes or shortcuts ( e.g.: [:space:] or \s ) inside bracket expressions in procmail? This seems to be impossible to do:

[^";<something that stands for all whitespace characters>]

I wish I could understand why?

Because procmail's regex code is roughly 25 years old and complex character class shortcuts didn't exist when it was written.

M. G. Devour

11 Mar

1:10 a.m.

New subject: Followup on egrep regex problem...

Bart,

...

...
    :0 fhw
    * ^content-type:(.*\<)?multipart.*\<(boundary=\/[^"; ]+|boundary="\/[^"; ]+)

I don't believe there's any guarantee that using \/ twice in the same RE will do the right thing. If it seems to be working, though ...

It seemed to, but I'm not sure my brain is up to trying to test that hypothesis against my newfound understanding of extraction! Maybe tomorrow... <shudder>

...

You might find http://www.well.com/user/barts/email/mimepart.txt to be useful.

Oh, boy... I have to look at it some more, but it looks like you've just *handed* me between 95-100% of what I'm trying to do! THANK YOU!

...

Because procmail's regex code is roughly 25 years old and complex character class shortcuts didn't exist when it was written.

Fair enough. I'm left with my question to Jim, however, of how you manage to represent control characters inside the brackets of a character class expression? Or is that something you always have to work around?

Thank you very much, sir.

Mike D.

Jim Osborn

9:21 a.m.

New subject: Followup on egrep regex problem...

On Thu, Mar 11, 2010 at 12:05:47AM -0005, Mike Devour wrote:

...

Fair enough. I'm left with my question to Jim, however, of how you manage to represent control characters inside the brackets of a character class expression? Or is that something you always have to work-around to avoid?

I just type them in literally, using whatever special-char-quoting mechanism is available in the text editor I'm using at the time.

...

From time to time I've resorted to something like:

    echo -e "section: \247" >XXX

to get the special character into a file. Then I can copy and paste from that file using a variety of means, whatever works for the particular job at hand. But I've never had to do any of that in procmail.

Cheers, Jim
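As an alternative to an editor's special-character quoting, the bracket expression can be built by a small script so that the literal space and tab are guaranteed to be in the text. The following Python sketch is one way to do that; the variable and function names are made up for illustration, and it only checks the class against Python's own regex engine, not procmail's.

```python
import re

SPACE = " "
TAB = "\t"  # Python expands this escape, so the string holds one real tab character

# The text that would be pasted into the rc file. No backslash escapes remain,
# only literal characters, which is what Jim describes typing in by hand.
bracket_expression = '[^";' + SPACE + TAB + ']+'

def first_token(text):
    """Return the leading run of characters not excluded by the bracket expression."""
    match = re.match(bracket_expression, text)
    return match.group(0) if match else None

print(repr(bracket_expression))  # repr() shows the tab as \t so it is visible
print(first_token("abc123\tcharset=us-ascii"))
```

Writing `bracket_expression` to a file and pasting from there is the same trick as Jim's echo -e, and a tab-delimited test string like the one above is the kind of sample Mike says he would need to prove the tab is really being honored.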

Bart Schaefer

4:38 p.m.

New subject: Followup on egrep regex problem...

On Wed, Mar 10, 2010 at 4:10 PM, M. G. Devour <mdevour@eskimo.com> wrote:

...

...
You might find http://www.well.com/user/barts/email/mimepart.txt to be useful.

Oh, boy... I have to look at it some more, but it looks like you've just *handed* me between 95-100% of what I'm trying to do!

THANK YOU!

You're welcome. Hope it works out.

I just noticed that there's a potential additional bug with mimepart.txt and very new MIME messages -- there has been an extension to the MIME boundary-string specification to allow very long boundary strings to be split across multiple lines in the message header, which the MIME decoder is then supposed to paste back together before examining the message body.

I've only seen this once in actual email, and many common email readers fail to deal with it, so you probably won't encounter it any time soon, but be advised.
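Bart does not name the extension. Assuming he means RFC 2231 parameter continuations (boundary*0=, boundary*1=, ...), the following Python sketch shows what such a header looks like and how a parser that supports continuations reassembles it. The sample message is invented for illustration; a single-line regex like the recipes in this thread would not see a plain `boundary=` in it at all.

```python
from email import message_from_string

# Hypothetical message whose boundary parameter is split into two
# continuation segments across folded header lines (assumed RFC 2231 style).
RAW = (
    'Content-Type: multipart/mixed;\n'
    ' boundary*0="first_half_";\n'
    ' boundary*1="second_half"\n'
    '\n'
    'body is irrelevant here\n'
)

msg = message_from_string(RAW)

# get_boundary() joins the numbered segments and strips the quotes.
boundary = msg.get_boundary()
print(boundary)
```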
