Apr 12 10

If your application sends SMS, you want to make sure that the characters to be sent comply with GSM 03.38 (i.e. the 7-bit default alphabet).

If you need to validate (user) input for invalid characters, you can do that pretty easily with regular expressions. For Java, the following regex is ready to use. For other languages, you can base your implementation on the “pure” unescaped regex shown in the first line of the code comment.

/*-
 * ^[A-Za-z0-9 \r\n@£$¥èéùìòÇØøÅå\u0394_\u03A6\u0393\u039B\u03A9\u03A0\u03A8\u03A3\u0398\u039EÆæßÉ!"#¤%&'()*+,\-./:;<=>?¡ÄÖÑÜ§¿äöñüà^{}\\\[~\]|\u20AC]*$
 *
 * Assert position at the beginning of the string «^»
 * Match a single character present in the list below «[A-Za-z0-9 \r\n@£$¥èéùìòÇØøÅå\u0394_\u03A6\u0393\u039B\u03A9\u03A0\u03A8\u03A3\u0398\u039EÆæßÉ!"#¤%&'()*+,\-./:;<=>?¡ÄÖÑÜ§¿äöñüà^{}\\\[~\]|\u20AC]*»
 *    Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
 *    A character in the range between "A" and "Z" «A-Z»
 *    A character in the range between "a" and "z" «a-z»
 *    A character in the range between "0" and "9" «0-9»
 *    The character " " « »
 *    A carriage return character «\r»
 *    A line feed character «\n»
 *    One of the characters "@£$¥èéùìòÇØøÅå" «@£$¥èéùìòÇØøÅå»
 *    Unicode character U+0394 «\u0394», Greek capital Delta
 *    The character "_" «_»
 *    Unicode character U+03A6 «\u03A6», Greek capital Phi
 *    Unicode character U+0393 «\u0393», Greek capital Gamma
 *    Unicode character U+039B «\u039B», Greek capital Lambda
 *    Unicode character U+03A9 «\u03A9», Greek capital Omega
 *    Unicode character U+03A0 «\u03A0», Greek capital Pi
 *    Unicode character U+03A8 «\u03A8», Greek capital Psi
 *    Unicode character U+03A3 «\u03A3», Greek capital Sigma
 *    Unicode character U+0398 «\u0398», Greek capital Theta
 *    Unicode character U+039E «\u039E», Greek capital Xi
 *    One of the characters "ÆæßÉ!"#¤%&'()*+," «ÆæßÉ!"#¤%&'()*+,»
 *    A - character «\-»
 *    One of the characters "./:;<=>?¡ÄÖÑÜ§¿äöñüà^{}" «./:;<=>?¡ÄÖÑÜ§¿äöñüà^{}»
 *    A \ character «\\»
 *    A [ character «\[»
 *    The character "~" «~»
 *    A ] character «\]»
 *    The character "|" «|»
 *    Unicode character U+20AC «\u20AC», Euro sign
 * Assert position at the end of the string (or before the line break at the end of the string, if any) «$»
 */
public static final String GSM_CHARACTERS_REGEX = "^[A-Za-z0-9 \\r\\n@£$¥èéùìòÇØøÅå\u0394_\u03A6\u0393\u039B\u03A9\u03A0\u03A8\u03A3\u0398\u039EÆæßÉ!\"#¤%&'()*+,\\-./:;<=>?¡ÄÖÑÜ§¿äöñüà^{}\\\\\\[~\\]|\u20AC]*$";

Gee, who allowed the Greeks to smuggle part of their alphabet into GSM 03.38? Since those characters don’t fit into Latin-1 (ISO-8859-1), they are written as Unicode escapes (\uXXXX) in the regex rather than as literal characters. More on that in this excellent regex tutorial: http://www.regular-expressions.info/unicode.html. Oh yes, and I do recommend using RegexBuddy – it really is my regex life-saver.
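
To show how such a pattern might be put to work, here is a minimal sketch of a validator. The class name GsmCharsetValidator and the method isGsmCompliant are made up for this illustration. The character class follows the regex above, with two assumptions of mine: every non-ASCII character is written as a Unicode escape so the file compiles regardless of source encoding, and the currency sign ¤ (U+00A4), which GSM 03.38 defines at position 0x24, is included.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library.
final class GsmCharsetValidator {

    // Character class following the article's regex. All non-ASCII characters
    // are written as Unicode escapes (an assumption made for this sketch).
    static final String GSM_CHARACTERS_REGEX = "^["
            + "A-Za-z0-9 \\r\\n"
            // at sign, pound, dollar, yen and accented Latin letters
            + "@\u00A3$\u00A5\u00E8\u00E9\u00F9\u00EC\u00F2\u00C7\u00D8\u00F8\u00C5\u00E5"
            // Greek capitals Delta, Phi, Gamma, Lambda, Omega, Pi, Psi, Sigma, Theta, Xi
            + "\u0394_\u03A6\u0393\u039B\u03A9\u03A0\u03A8\u03A3\u0398\u039E"
            // AE ligatures, sharp s, E acute
            + "\u00C6\u00E6\u00DF\u00C9"
            // punctuation; includes the currency sign U+00A4
            + "!\"#\u00A4%&'()*+,\\-./:;<=>?"
            // inverted marks, umlauts, N tilde, section sign, a grave
            + "\u00A1\u00C4\u00D6\u00D1\u00DC\u00A7\u00BF\u00E4\u00F6\u00F1\u00FC\u00E0"
            // extension table characters and the Euro sign
            + "^{}\\\\\\[~\\]|\u20AC"
            + "]*$";

    // Compile once; Pattern instances are immutable and thread-safe.
    private static final Pattern GSM_PATTERN = Pattern.compile(GSM_CHARACTERS_REGEX);

    /** Returns true if every character of text is accepted by the pattern. */
    static boolean isGsmCompliant(String text) {
        return GSM_PATTERN.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(isGsmCompliant("Hello World 123"));
        System.out.println(isGsmCompliant("\u4F60\u597D")); // two CJK characters
    }
}
```

Note that the pattern also accepts the empty string because of the trailing *; if empty messages should be rejected, that check has to be made separately.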

Sep 09 22

In most programming languages the regular expression pattern to find the digit ‘1’ surrounded by ‘;’ and other digits would be something like

[;\d]*1[;\d]*

So, the pseudo character class “; or digit” is matched zero or more times, then the digit 1 is matched followed by zero or more “; or digit”s. A few examples:

<property id="foo" value=";1;;;"/>
yet another regexp test with 1;;;;3xxyyzz...
well I think you get the picture with this ;;;;1 shizzle even if it's ;1;2;3; or 123
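
For reference, this is roughly how the pattern could be tried out in Java against the sample lines. The class name DigitOneFinder and the method findAll are made up for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library.
final class DigitOneFinder {

    // The backslash of the digit class has to be doubled inside a Java string literal.
    private static final Pattern ONE_PATTERN = Pattern.compile("[;\\d]*1[;\\d]*");

    /** Collects every non-overlapping match of the pattern in text. */
    static List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        Matcher matcher = ONE_PATTERN.matcher(text);
        while (matcher.find()) {
            hits.add(matcher.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(findAll("<property id=\"foo\" value=\";1;;;\"/>"));
        System.out.println(findAll("yet another regexp test with 1;;;;3xxyyzz..."));
        System.out.println(findAll("well I think you get the picture with this ;;;;1 shizzle even if it's ;1;2;3; or 123"));
    }
}
```

The three lines yield the matches ";1;;;", then "1;;;;3", and finally ";;;;1", ";1;2;3;" and "123".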

With Oracle SQL, however, it’s a slightly different story. \d is not supported, i.e. it is not recognized as the character class for digits. However, the range 0-9, which generally is the equivalent of \d, seems to be supported. In Oracle you could therefore use

[;0-9]*1[;0-9]*

There is nothing exotic about this: 0-9 is an ordinary range inside a bracket expression, which is part of POSIX regular expressions. The official Oracle regexp documentation also mentions the regular POSIX character class [:digit:]. Watch out, inside a bracket expression the equivalent to \d is the whole expression [:digit:] and not just :digit: (so on its own it has to be written [[:digit:]]). I was first fooled by the extra [] around the character class designator… So, using the POSIX class you’d write

[;[:digit:]]*1[;[:digit:]]*

Jun 07 08

Today, I found myself looking for a regular expression that matches only the last occurrence of a given expression. As I’m still not a regex mastermind I couldn’t come up with it just like that.

The key to the solution is a so-called “negative lookahead”. A lookahead doesn’t consume characters in the string, but only asserts whether a match is possible or not. So if you wanted to extract the last “foo” in the text “foo bar foo bar foo”, your regex would look like this:

foo(?!.*foo)

If you used the DOTALL option the above expression would even work correctly on a multi-line text such as

foo
bar
foo
bar
foo

Of course the example is not taken from a real-life scenario, as it doesn’t matter which “foo” is matched since they’re all the same anyway. A real expression would no doubt be more complicated, but I hope you get the point.
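
Here is a small Java sketch of the idea; the class name LastOccurrence and the method lastFooIndex are invented for this illustration. It returns the start index of the match so that you can actually tell which “foo” was found, and it takes a flag so the effect of DOTALL becomes visible.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library.
final class LastOccurrence {

    /**
     * Returns the start index of the first "foo" that is not followed by
     * another "foo", or -1 if the text contains no "foo" at all.
     * With dotAll set, the dot in the lookahead also crosses line breaks.
     */
    static int lastFooIndex(String text, boolean dotAll) {
        int flags = dotAll ? Pattern.DOTALL : 0;
        Matcher matcher = Pattern.compile("foo(?!.*foo)", flags).matcher(text);
        return matcher.find() ? matcher.start() : -1;
    }

    public static void main(String[] args) {
        // Single line: the last "foo" starts at index 16.
        System.out.println(lastFooIndex("foo bar foo bar foo", false));
        // Multi-line with DOTALL: again index 16.
        System.out.println(lastFooIndex("foo\nbar\nfoo\nbar\nfoo", true));
        // Multi-line without DOTALL: index 0, because the lookahead cannot
        // see past the first line break and so the first "foo" already matches.
        System.out.println(lastFooIndex("foo\nbar\nfoo\nbar\nfoo", false));
    }
}
```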

Update:

Someone asked for an explanation… Here’s what RegexBuddy, my indispensable regex tool, produces automatically:

# foo(?!.*foo)
#
# Match the characters “foo” literally «foo»
# Assert that it is impossible to match the regex below starting at this position (negative lookahead) «(?!.*foo)»
#    Match any single character that is not a line break character «.*»
#       Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
#    Match the characters “foo” literally «foo»