Procmail egrep regex problem...
I'm doing this on the system I have this e-mail account with, eskimo.com. It's an old-fashioned geek's ISP, where I have a shell account on a shared server, rather than a slice, or fancy web-thingies to handle everything. They're running up-to-date procmail, smartlist and sendmail. I've asked support if there are any local configuration considerations that might be affecting things. I should have a reply in a day or so.

The issue I have is that things that ought to work don't, and I suspect something about the context prevents it, or else I've missed something I'm supposed to be doing, like escaping characters or putting ticks or quotes around the regex... though I've not seen any examples that tell me I should.

I'm trying to extract the boundary string from multipart MIME formatted messages into a variable so I can do further processing. This is happening in rc.local.s00. The header data will look like a variation on these examples:

    Content-Type: multipart/alternative boundary="a=bunch-of/stuff:and_random=junk"
    Content-Type: multipart/alternative; boundary=a=bunch-of/stuff:and_random=junk

What's reliable is: Content-Type: multipart/* occurs on one line; the boundary string will not contain quotes, backslashes, semicolons or whitespace; the boundary string is either at the end of the first line or on a line of its own; whether multi-line or concatenated, the various pieces are separated by semicolons or whitespace.

As I understand it, the egrep for procmail recipe condition lines assumes the c flag, and treats multi-line headers as single lines. Thus, the following recipe works up to a point, even though the '.' token should not ever match a newline.

I know there are recipes out there for doing this, but I'm stumped as to why the following approach doesn't work (yet) and am trying to learn the reason before I just give up and do something else.
Here's the basic strategy:

    TESTVAR = 'some initial value'

    :0 fhw
    * ^content-type:.*multipart.*boundary=\/.*
    { TESTVAR = $MATCH }

In the two examples above, it yields the following in $TESTVAR:

    Example 1) "a=bunch-of/stuff:and_random=junk"
    Example 2) a=bunch-of/stuff:and_random=junk

One initial oversight of this recipe is that it won't deal with a multi-line header where there's another sub-header after the boundary= line. It would just tack on the rest of that header, right to the end of the last line. It ought to be easy to stop at the end of the boundary string, but it's not, as I explain below.

At this point, I start to hit a brick wall. Nothing that's supposed to work seems to work in this context...

I'd like to get rid of the quote marks. This recipe will strip the leading quote, at least, but will not match anything if the boundary string is unquoted, as it is on many messages:

    :0 fhw
    * ^content-type:.*multipart.*boundary="\/.*

The following ought to solve that problem by matching either 0 or 1 occurrences of the double quote after the equals sign:

    :0 fhw
    * ^content-type:.*multipart.*boundary="?\/.*

It doesn't. The string still shows up in $MATCH and $TESTVAR with both quote marks intact. WHY?

Every other syntax I've tried (dozens of them!) refuses to function in this context, either not changing the value of TESTVAR, setting it to a null string, or blithely ignoring the quote mark and letting it go through into $MATCH.

After the \/ token, it would seem to be TRIVIAL to match any character that wasn't a quote, semicolon or whitespace, and that would capture the entire boundary string... AND solve the multi-line problem -- but, again, absolutely none of the applicable syntax wants to work here.

So, what is it about the context that refuses to let me do an elective match on the leading quote mark and prevents almost anything except .* from working after the \/ ? There must be something pretty basic I'm missing here, but, damn!
I've been studying this for a week and a half steady, and I have *not* seen an explanation for this behavior.

Am I missing something else, obvious or not? Does it seem like something in my environment is screwing it up?

I have not turned on logging yet, but will shortly. Am I right to assume procmail logging is more important than smartlist's for this?

Any help or insight would be appreciated.

Thanks,

Mike D.

[Mike Devour, Citizen, Patriot, Libertarian]
[mdevour@eskimo.com ]
[Speaking only for myself... ]
On Thu, Mar 04, 2010 at 04:16:44AM -0005, M. G. Devour wrote:
Content-Type: multipart/alternative boundary="a=bunch-of/stuff:and_random=junk"
This version is an invalid email header. Something has broken it before it has reached you.
Am I missing something else, obvious or not? Does it seem like something in my environment is screwing it up?
Nothing springs to mind immediately.
I have not turned on logging yet, but will shortly. Am I right to assume procmail logging is more important than smartlist's for this?
Yes.

I'll give this more thought...

R
On Thu, Mar 04, 2010 at 04:16:44AM -0005, M. G. Devour wrote:
Content-Type: multipart/alternative boundary="a=bunch-of/stuff:and_random=junk"
This version is an invalid email header. Something has broken it before it has reached you.
I typed that in by hand from memory, Roger, and I think it wrapped in my mailer, besides. Don't take it too seriously! <grin> What is invalid about it? I know that some major e-mail vendors quote the boundary string, while others don't, and that if boundary= is on its own line it must be preceded by whitespace. I have thousands of archived messages to analyze, and at least 3 different mail systems to test with, so I'll sure find out if I'm missing anything from that end.
Am I missing something else, obvious or not? Does it seem like something in my environment is screwing it up?
Nothing springs to mind immediately.
Darn! <grin> If I can even find out why "? doesn't reliably match an optional double quote, I'd be much farther along. That's the first annoying failure.
I have not turned on logging yet, but will shortly. Am I right to assume procmail logging is more important than smartlist's for this?
Yes.
Good. Procmail it is.
I'll give this more thought...
Thank you.

Mike D.

[Mike Devour, Citizen, Patriot, Libertarian]
[mdevour@eskimo.com ]
[Speaking only for myself... ]
If I can even find out why "? doesn't reliably match an optional double quote, I'd be much farther along. That's the first annoying failure.
That is puzzling, but it's probably not the only bug in procmail's RE handling. If a ? (or * or +) immediately precedes \/, it seems to screw up the RE processing.

You could instead grab the whole boundary parameter in one step, then strip the quotes after that:

    :0 H
    # NOTE: The ^M in the RE needs to be a literal carriage
    # return, not the two separate characters ^ and M.
    # Procmail does not support control-character escapes.
    # This is necessary because some mail comes in with CR-LF
    # line endings, and you don't want the CR included in the match.
    * ^Content-Type:\W*\<multipart.*\<boundary=\/[^; ^M]*
    {
      TESTVAR = $MATCH

      :0
      * TESTVAR ?? "\/[^"]*
      { TESTVAR = $MATCH }
    }

In limited testing, this seems to work fine.

-cary
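For anyone who wants to check the two-step idea outside procmail, here is a minimal Python sketch of the same approach: grab the raw parameter value first, then strip quotes in a second pass. It is only an approximation of the recipe above, not a translation of procmail's matcher. The names `unfold` and `boundary` are made up for illustration, and it assumes (as stated earlier in the thread) that the boundary string contains no quotes, semicolons or whitespace.

```python
import re


def unfold(header: str) -> str:
    """Join folded header lines (a line break followed by space or tab)."""
    return re.sub(r'\r?\n[ \t]+', ' ', header)


# Step 1: capture the raw value after "boundary=", stopping at a
# semicolon or any whitespace (which also covers a stray carriage return).
RAW = re.compile(
    r'^Content-Type:.*?\bmultipart\b.*?\bboundary=([^;\s]*)',
    re.IGNORECASE | re.MULTILINE,
)


def boundary(header: str):
    """Return the boundary string without quotes, or None if absent."""
    m = RAW.search(unfold(header))
    if m is None:
        return None
    raw = m.group(1)
    # Step 2: if there is a quote, keep what follows it up to the next quote.
    q = re.search(r'"([^"]*)', raw)
    return q.group(1) if q else raw
```

Calling `boundary()` on either of the two example headers from the original post should give the bare string in both cases, quoted or not.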
On Fri, Mar 5, 2010 at 4:15 PM, Cary Coutant <cary@bayarea.net> wrote:
If I can even find out why "? doesn't reliably match an optional double quote, I'd be much farther along. That's the first annoying failure.
That is puzzling, but it's probably not the only bug in procmail's RE handling. If a ? (or * or +) immediately precedes \/, it seems to screw up the RE processing.
It's not a bug, it's intentional. It's like using the *? operator in perl, if you're familiar with that.
You could instead grab the whole boundary parameter in one step, then strip the quotes after that:
Cary, thanks for the input... How do I get the literal ctrl-M into the bracket expression? Type ctrl-M into nano? Wouldn't that make the regexp seem to break across a line? The second recipe looks like a simple way to get rid of the quotes if they're there.
    :0 H
    # NOTE: The ^M in the RE needs to be a literal carriage
    # return, not the two separate characters ^ and M.
    # Procmail does not support control-character escapes.
    # This is necessary because some mail comes in with CR-LF
    # line endings, and you don't want the CR included in the match.
    * ^Content-Type:\W*\<multipart.*\<boundary=\/[^; ^M]*
    {
      TESTVAR = $MATCH

      :0
      * TESTVAR ?? "\/[^"]*
      { TESTVAR = $MATCH }
    }
On Sun, Mar 7, 2010 at 3:25 PM, M. G. Devour <mdevour@eskimo.com> wrote:
How do I get the literal ctrl-M into the bracket expression?
Literal ctrl-M is probably not what you want anyway, because during the SMTP transaction lines are delimited by "\r\n" and by the time procmail gets the message it probably has been rewritten to have only "\n" (which is ctrl-J). Instead you should make use of another procmail oddity, which is that the $ [end-of-line] marker actually matches a newline character rather than matching a zero-width boundary as it does in other regex engines like perl's. So if you use ([^;]|($))+ you'll match one or more non-semicolon or newline characters. The extra parens around ($) are an idiom to distinguish it from introducing a variable expansion.
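To make the difference concrete, here is a small Python sketch contrasting a zero-width `$` with a `$` that consumes the newline. It takes the description above at face value and only emulates it: Python's `$` is zero-width, so the procmail-style behaviour is written out with an explicit `\n`. The sample strings are invented for illustration.

```python
import re

text = "boundary=abc\n\tcharset=utf-8"

# Python (like perl): $ is zero-width, so the newline is not in the match.
python_style = re.search(r"abc$", text, re.MULTILINE).group()

# Emulating a $ that matches the newline character itself.
procmail_style = re.search(r"abc\n", text).group()

# The ([^;]|($))+ idiom, emulated: "a non-semicolon, non-newline
# character, or a newline", one or more times. The match runs across
# the line break and stops at the first semicolon.
span = re.match(r"(?:[^;\n]|\n)+", "boundary=abc\n\tcharset=utf-8; x=y").group()
```

Here `python_style` is `abc` while `procmail_style` is `abc` plus the newline, and `span` covers both lines up to (but not including) the semicolon.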
On Wed, Mar 3, 2010 at 8:21 PM, M. G. Devour <mdevour@eskimo.com> wrote:
The following ought to solve that problem by matching either 0 or 1 occurrences of the double quote after the equals sign:
    :0 fhw
    * ^content-type:.*multipart.*boundary="?\/.*
It doesn't. The string still shows up in $MATCH and $TESTVAR with both quote marks intact. WHY?
The \/ operator has the side-effect of making regular expressions to the LEFT of the \/ match the MINIMUM possible string, and expressions to the RIGHT of the \/ match the MAXIMUM possible string. So any RE on the left that ends with a wildcard will always prefer to treat the wildcard as a non-match UNLESS that would cause the RE on the right to fail.
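The effect can be reproduced in an engine with explicit lazy quantifiers. The Python sketch below is an emulation under the assumption that "minimum on the left, maximum on the right" behaves like lazy quantifiers before a greedy capture group; it is not procmail's actual matcher.

```python
import re

header = ('Content-Type: multipart/alternative; '
          'boundary="a=bunch-of/stuff:and_random=junk"')

# Procmail-like: everything left of the capture is lazy (minimal),
# and the capture itself is greedy (maximal).
lazy_left = re.search(
    r'^content-type:.*?multipart.*?boundary="??(.*)', header, re.IGNORECASE)

# Ordinary greedy matching on the left, for comparison.
greedy_left = re.search(
    r'^content-type:.*multipart.*boundary="?(.*)', header, re.IGNORECASE)

with_quotes = lazy_left.group(1)       # leading quote ends up in the capture
without_leading = greedy_left.group(1)  # leading quote consumed on the left
```

In the lazy version the optional quote matches nothing, because `.*` on the right is happy to take the quote itself, so `with_quotes` keeps both quote marks. That is the same result reported for `$MATCH` in the original post.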
On Fri, Mar 05, 2010 at 4:32:38PM -0800, Bart Schaefer wrote:
The \/ operator has the side-effect of making regular expressions to the LEFT of the \/ match the MINIMUM possible string, and expressions to the RIGHT of the \/ match the MAXIMUM possible string...
Which explains why <blah>\/<blah>+ works where <blah>\/<blah>* may not.
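A quick way to see the `+` versus `*` difference is to emulate it with lazy quantifiers in Python. This is a sketch under the assumption that the left side of `\/` behaves like a lazy match; `STAR` and `PLUS` are illustrative names, and the character class is the "not a quote, semicolon or whitespace" set from the original question.

```python
import re

# Left side lazy, right side greedy, differing only in * versus +.
STAR = re.compile(r'boundary="??([^";\s]*)')
PLUS = re.compile(r'boundary="??([^";\s]+)')

quoted = 'Content-Type: multipart/alternative; boundary="abc123"'
unquoted = 'Content-Type: multipart/alternative; boundary=abc123'

# With *, the right side may match nothing, so the lazy "?? never has
# to consume the quote and the capture comes back empty.
star_quoted = STAR.search(quoted).group(1)

# With +, the right side needs at least one non-quote character, which
# forces the left side to consume the quote first.
plus_quoted = PLUS.search(quoted).group(1)

# Unquoted input works with either form.
star_unquoted = STAR.search(unquoted).group(1)
plus_unquoted = PLUS.search(unquoted).group(1)
```

The empty `star_quoted` corresponds to the "setting it to a null string" symptom described at the start of the thread, while `plus_quoted` yields the bare boundary.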
participants (5)
- Bart Schaefer
- Cary Coutant
- Jim Osborn
- M. G. Devour
- Roger Burton West