Tuesday, August 28, 2012

Regex Cheat Sheet

Regexes (regular expressions) are an extremely useful tool, but I find myself getting tripped up by differences in their APIs and capabilities in different languages. Here's a regex Rosetta Stone, covering how to use regexes in various programming language.

Supported syntax

Different languages and libraries support different syntaxes (supported special characters and so on).

Perl
Basic syntax, extended patterns
Python
Library reference, HOWTO

Raw strings are useful to cut down on the number of backslashes you need.

C++
Boost.Regex syntax (Perl-compatible by default, including support for Perl extended patterns)

Raw string literals are useful to cut down on the number of backslashes you need, if your compiler supports them.

C#
Regular Expression Language Elements

@-quoted string literals are useful to cut down on the number of backslashes you need.

JavaScript
“Writing a Regular Expression Pattern” on the Mozilla Developer Network

Supported modifiers

Regexes generally support modifiers to control case sensitivity, etc.

Perl
Modifiers
Python
re module constants
C++
boost::regex_constants::syntax_option_type
C#
System.Text.RegularExpressions.RegexOptions
JavaScript
See the “flags” section of the parameters to the RegExp object.

At the top of the file

To make regex functionality available in your module or source file:

Perl
Nothing necessary
Python
import re
C++
#include <boost/regex.hpp>

// Optionally:
using namespace boost::regex;
C#
using System.Text.RegularExpressions;
JavaScript
Nothing necessary

Matching an entire string

Perl
if ($text =~ /^hello \d+$/) { ... }

Perl regexes must be explicitly anchored using ^ and $ to match the entire string.

Python
if re.match(r'hello \d+$', text):
    ...

re.match starts at the beginning of the string but requires $ to anchor the match to the end of the string.

C++
if (regex_match(text, boost::regex("hello \\d+"))) { ... }

Use boost::regex_match if you want to require that the entire string match.

C#
if (Regex.isMatch(text, @"^Hello \d+$")) { ... }

C# regexes must be explicitly anchored using ^ and $ to match the entire string.

JavaScript
if (/^Hello \d+$/.test(text)) { ... }

JavaScript regexes must be explicitly anchored using ^ and $ to match the entire string.

Matching a substring

Perl
if ($text =~ /Hello \d+/) { ... }
Python
if re.search(r'Hello \d+', text):
    ...

Use re.search instead of re.match to search for a substring anywhere within the string.

C++
if (regex_search(text, boost::regex("Hello \\d+"))) { ... }

Use boost::regex_match if you want to search for a substring anywhere within the string.

C#
if (Regex.isMatch(text, @"Hello \d+")) { ... }
JavaScript
if (/Hello \d+/.test(text)) { ... }

Performing a case-insensitive match

Perl
if ($text =~ /hello \d+/i) { ... }
Python
if re.search(r'hello \d+', text, re.I):
    ...
C++
if (regex_search(text, boost::regex("hello \\d+",
    boost::regex_constants::icase))) { ... }
C#
if (Regex.isMatch(text, @"hello \d+", RegexOptions.IgnoreCase)) { ... }
JavaScript
if (/hello \d+/i.test(s)) { ... }

Storing a regex for later use

Perl
$r = qr/hello \d+/i;
if ($text =~ $r) { ... }
Python
r = re.compile(r'hello \d+', re.I)
if r.search(text):
    ...

Note that Python automatically caches the most recently used patterns (see here and here), so you won't necessarily see a performance gain by compiling a regex.

C++
const boost::regex r("hello \\d+", boost::regex_constants::icase);
if (regex_search(text, r)) { ... }
C#
Regex r = new Regex(@"hello \d", RegexOptions.IgnoreCase);
if (r.IsMatch(s)) { ... }

Note that NET automatically caches the most recently used patterns, so you won't necessarily see a performance gain by storing a regex for later use. Also note that, while most other languages define “compiling” a regex as interpreting it, .NET supports compiling regexes to actual IL, as described here and here.

JavaScript
var r = /hello \d+/i;
// or var r = new RegExp("hello \\d+", "i");
if (r.test(s)) { ... }

Replacing part of a string

Perl
# Replace all occurrences:
$test =~ s/Hello/Goodbye/g;
# Replace the first occurrence only:
$text =~ s/Hello/Goodbye/;
Python
# Replace all occurrences:
text = re.sub('Hello', 'Goodbye', text)
# Replace the first occurrence only:
text = re.sub('Hello', 'Goodbye', text, count=1)
C++
// Replace all occurrences:
text = regex_replace(text, boost::regex("Hello"), "Goodbye");
// Replace the first occurrence only:
text = regex_replace(text, boost::regex("Hello"), "Goodbye",
    boost::regex_constants::format_first_only);
C#
// Replace all occurrences:
text = Regex.Replace(text, "Hello", "Goodbye");
// Replace the first occurrence only:
Regex r = new Regex("Hello");
text = r.Replace(text, "Goodbye", 1);
JavaScript
// Replace all occurrences:
text = text.replace(/Hello/g, 'Goodbye');
// Replace the first occurrence only:
text = text.replace(/Hello/, 'Goodbye');

Extracting parts of a string

Perl
if (($title, $name) = $text =~ /(Mr\.|Mrs\.|Dr\.) (\w+)/) { ... }
Python
m = re.search(r'(Mr\.|Mrs\.|Dr\.) (\w+)', text)
if m:
    title, name = m.groups()
    ...
C++
boost::smatch m;
if (regex_search(text, m, boost::regex("(Mr\\.|Mrs\\.|Dr\\.) (\\w+)"))) {
    const std::string& title = m[1].str();
    const std::string& name = m[2].str();
    ...
}
C#
Match m = Regex.Match(text, @"(Mr\.|Mrs\.|Dr\.) (\w+)");
if (m.Success) {
    string title = m.Groups[1].Value;
    string name = m.Groups[2].Value;
    ...
    ...
}
JavaScript
var match = /(Mr\.|Mrs\.|Dr\.) (\w+)/.exec(text);
if (match !== null) {
    var title = match[1];
    var name = match[2];
    ...
}

Thursday, January 28, 2010

Automated Image Editing for the Artistically Unskilled

GIMP is perhaps the poster child for bad open source usability. This may not be entirely fair; the UI is better than it used to be, and GIMP now has excellent docs, so a minute with Google will teach you how to do anything that the UI fails to make clear.

As a software developer, I firmly believe that anything worth doing more than once is worth automating, but scripting was one part of GIMP that remained opaque to me. GIMP's built-in scripting language, Scheme, is one that's unfamiliar to many programmers. And although GIMP extensions offer scripting support for Perl and Python, I've always had inordinate trouble getting those to work. (The Linux distros I've used often omit Perl support, for example, and getting GIMP's Python support to work on Windows requires manually installing several supporting Python libraries, of the correct versions, without any diagnostics if you get any of it wrong, and ideally without messing up existing installations of your Python libraries. YMMV.) I finally decided that all of this futzing with scripting extensions would be harder than just learning Scheme and that I should bite the bullet and learn to script GIMP natively. (Besides, I'm told that learning Scheme or Common Lisp will induce a "profound enlightenment experience", so it must be worth doing, right?)

Learning to Script

  • GIMP has extensive online documentation and tutorials. Its Batch Mode tutorial includes some scripting examples. There's even a Script-Fu Tutorial that provides a brief introduction to Scheme, for those of us who haven't yet worked our way through SICP.
  • GIMP also has a built-in Procedure Browser (under the Help menu) for looking up the GIMP-specific functions that you'll use in your Script-Fu scripts. This will be your primary reference after the above tutorials take care of the basics.

Unfortunately, none of the above references have much information on how to do non-image-processing tasks (such as making up filenames for your images) in Script-Fu. For that, you'll need one of the following sources:

  • GIMP uses TinyScheme as its scripting language. (Older versions of GIMP used SIOD, and there are still a few references to this on the web. Interestingly, GIMP doesn't use GUILE, even though GIMP and Guile are both GNU projects and Guile is the GNU Scheme interpreter.)
  • TinyScheme follows R5RS (the almost-current Scheme standard). Standard and builtin functions are documented in R5RS. Script-Fu extensions to TinyScheme are documented on the Tiny-Fu Wiki.

Running Your Script

The official GIMP documentation would have you place your script in a GIMP configuration directory, register it, and invoke it from the command line. This is unnecessarily complicated. Redirection from the command line is much simpler:

"c:\Program Files\GIMP-2.0\bin\gimp-console-2.6.exe" -c -b - < gimp-create-disabled-icons.scm

Ignore any wire errors that GIMP outputs at this point; those are harmless issues between GIMP and its various plugins and have nothing to do with your script.

Debugging Your Script

Error reporting from console batch mode is apparently nonexistent. This, combined with GIMP's slow startup times, makes debugging a frustrating experience. One approach is the tried-and-true debugging-via-printf approach using the gimp-message function ((gimp-message "Okay, I made it to line 5\n")).

A better approach is to use the Script-Fu console. To do so, go under the Filters menu, under Script-Fu, and choose Console. Then start typing in commands from your candidate script and seeing how they work out.

If you've poked around in the Procedure Browser at all, you've noticed a lot of procedures that take parameters of type IMAGE, LAYER, DRAWABLE, and so on. These are simply integers (presumably indexes into GIMP's data structures), so trying out these procedures is relatively easy. Consider the following Script-Fu Console session. (My commands are marked with > and comments are marked with > ;.)

> (gimp-image-list)
(1 #( 2 ))
> ; This tells me I have one image loaded, and its number is two.
> (gimp-image-get layers 2)
(1 #( 4 ))
> ; This tells me Image 2 has one layer, and its number is 4.
> ; Each LAYER is also a DRAWABLE, so I can test out whatever commands
> ; I want to now.
> (gimp-drawable-fill 4 WHITE-FILL)
(#t)
> ; (#t) means this command has no particular return value.

Most procedures executed in the Script-Fu console are entered into undo history just like GUI operations, so it's easy to clean up after mistakes.

Additional Resources

  • The GIMP Plugin Registry has plenty of example code. (Keep in mind that GIMP scripts are probably not the best source for learning good Scheme programming style.)
  • Samuel A. Rebelsky and Ian Bone-Rundle have put together the "Glimmer Utilities for Gimp". Their float-array procedure is useful for Script-Fu procedures that take a FLOATARRAY parameter.

Putting It into Practice

Automating the creation of the various icon sizes required by Windows Vista and newer is a good idea. The iconify.scm script is an example of this.

Creating proper disabled icons for your desktop application can enhance its look. (The OpenOffice web site has good examples of what disabled icons should look like.) Automating the creation of disabled icons in GIMP involves converting the icon to monochrome then adjusting its shading and/or opacity to your liking.

Tuesday, December 15, 2009

Open Source License Compliance

Complying with license agreements for closed source software libraries is pretty straightforward. You pay your money, and sometimes it's a lot of money, and sometimes you have to go through the odious "talk with the salesperson before you even find out how much money it is." Eventually, in spite of these obstacles, you get your requested number of licenses, and as long as you don't exceed that count, and as long as you don't do something stupid like post the licensed source code to your company web site, you're fine. The vendor doesn't usually care about how you attribute their work or whether you statically or dynamically link their code or how you patch it to meet your needs (if they provide source) or what other libraries you're using.

Complying with license agreements for open source software libraries can be significantly more complicated. While a lot has been written about the pros and cons of various open source licenses, their compatibility, their proliferation, and so on, there's relatively little guidance on how to comply, in practical terms, with the licenses of libraries that you use in software development. (Kudos to the Create Commons group for their efforts to make licensing so easy to understand; however, most of their effort focuses on cultural works and not software.) Much of what does cover license compliance focuses heavily on the corporate legal / risk management perspective, such as this article by Bruce Perens:

The law team... has to operate differently than they're used to. They have to be conscious that engineers are scored on meeting a timeline, and can't stop development while they're waiting for an answer from legal. Thus, legal has to be able to take on-impulse calls from engineering, and answer them in a timely fashion.

That's too expensive!, say the lawyers and managers. But it's part of the cost of doing business today. When I learned electronics, engineers built products by soldering together resistors and transistors. But today, the job of engineers is to build derivative works by combining units of intellectual property owned by third parties. That's not what they're trained for, and it's a mine-field of potential litigation for every company that puts software in its products - whether or not that software includes Open Source. Without an effective partnership between legal and engineering, your company walks that mine-field every day.

And while Bruce Perens obviously has a lot of experience in this area, I'm not convinced that it has to be this complicated for the average ISV or in-house developer who's using open source components in his day-to-day work. In a small attempt to reduce the complexity, here's a brief list of the most popular open source licenses, along with what it takes, in practical and oversimplified terms, to comply with each license in developing a closed source application. Of course, since I am not a lawyer and do not speak for my employer, this posting should in no way be construed as legal advice and is in fact quite possibly not worth the metaphorical paper it's printed on. (And metaphorical paper isn't worth much these days.) With that disclaimer out of the way, here's what you need to do:

  • Regardless of the license, avoid claiming that the entire software application is © You.
  • If you're shipping code under the Boost Software License: You have no additional obligations (as long as you're only shipping binaries).
  • If you're shipping code under a BSD-ish, MIT, or X11 license: Include a copy of the code's copyright notice and license.
  • If you're shipping GPL'ed code (version 2 and version 3): Don't, unless it's a standalone executable and you're merely bundling your app with it.
  • If you're shipping LGPL'ed code (version 2.1 and version 3): Make sure that you link dynamically rather than statically to the LGPL'ed code. (So in Windows, make sure that you use it as a DLL.) A few LPGL'ed libraries have specific exceptions for static linking under Windows, and if the dynamic linking requirement is particularly burdensome, there are possible workarounds. Also, the LGPL may require that you permit reverse engineering of your application for purposes of compatibility with future versions of the LGPL'ed code; see the full text of the LGPL or consult a lawyer for details.
  • Extra requirements if you're shipping GPL'ed or LGPL'ed code:
    1. Include attribution of the code's origin and its license, and include a copy of the license.
    2. Your physical media must either include a copy of the GPL'ed / LGPL'ed source or must be accompanied by a written offer to make the source available.
    3. If you distribute your application online, then you must make the source code available from the same download site to whoever can download your application.
    4. If you distribute your application online, and if either the GPL'ed or LGPL'ed code is released under version 3 of the GPL or it's released under an earlier version of those licenses with permission to use "(at your option) any later version" of those licenses, then instead of making the source code available yourself, you may provide a link to an upstream server hosting the code. (I find it easier to do #3 than to worry if #4's requirements are met.)
  • If you're shipping MPL'ed code: Include attribution of the source code and its license. Include information on where to find the source code.
  • If you're shipping code under the Apache Software License: Include attribution and a copy of the license.
  • If you're shipping code under the Eclipse Public License: Include attribution and a copy of the license. Include information on where to find the source code. Additionally, you are obligated to release your source code if it constitutes a derivative work of the EPL'ed software. There are no cut-and-dried criteria for what constitutes a derivative work by the EPL; see the EPL FAQ items 25, 26, and 27 for details.
  • If you're shipping GPL'ed, LGPL'ed, or MPL'ed code that you have modified: You must distribute a prominent description of your modifications (as part of the GPL'ed or LGPL'ed source distribution, and even if you wouldn't otherwise have to distribute MPL'ed source). Check the license for exact details regarding what form this description should take.
  • If you're using GPL'ed, LGPL'ed, MPL'ed, BSD-ish, MIT, X11, Apache, EPL, or Boost code for internal applications only: All of this is vastly simplified, since licenses like the GPL explicitly do not place any restrictions on internal use, and since many of these licenses' clauses apply only to those to whom the software is distribute. For example, you could likely use a GPL'ed unit test library with closed source software, since the unit tests are only distributed to developers who already have the source.
  • If you're using GPL'ed, LGPL'ed, MPL'ed, BSD-ish, MIT, X11, Apache, EPL, or Boost code to develop a web site: Again, all of this becomes vastly simplified, since the web site's code is only "distributed" to the web server. Only one or two licenses (like the Affero GPL) have special requirements for web sites.
  • If software patents are involved: Reread the licenses and ask a lawyer.

Quite a few of these license have some variation on attributing and including a copy of the license. Figuring out the exact attribution format required by each license can be tricky; a handy solution is to see how companies like Google, Sun, and VMware handle it, since their legal departments are much larger than most organizations'.

  • OpenOffice.org includes the following clause in its EULA: "Third Party Code. Additional copyright notices and license terms applicable to portions of the Software are set forth in the THIRDPARTYLICENSEREADME.html file." Within the third part license readme file, each library gets its own section with wording similar to the following:
    The following software may be included in this product:William's Wonderful Widgets; Use of any of this software is governed by the terms of the license below:
    Copies of the LGPL and GPL are included at the end.
  • Google Chrome includes a notice in its Help > About screen: "Google Chrome is made possible by the Chromium open source project and other open source software." "Open source software" here is a link bringing up a credits page that lists each component, its homepage, and its license.
  • VMware devotes a section of their web site to collecting all of their open source licenses and source code and includes a link to this site in their EULA. Installation directories also include license lists similar to OpenOffice.org's.

Update (12/17/2009):

A few extra resources regarding open source license compliance:

The timing of this post occurring shortly after news of a major GPL-related lawsuit was completely accidental.

Sunday, June 7, 2009

Standard C++

I have vague recollections of trying to track down a copy of the C++ standard several years ago. I went to the ISO's web site, went into sticker shock at the several hundred dollars they wanted for a CD or PDF, and soon forgot about it.

Until I came across this discussion on Stack Overflow, I didn't realize that the C++ standard was so easy to acquire. PDF versions are available for $30 from ANSI's web site and from Techstreet. If even $30 is too much, or if you want to read about cutting-edge C++ stuff, then you can download the C++0x draft standard from the ISO committee's web site for free, with the caveat that it is "incomplet and incorrekt." Copies of the C++98 and/or C++03 draft standards are also archived in various places around the web.

With the wealth of C++ information available on the web and in print, there are often more convenient resources than the standard, but topics such as the order in which integral types are promoted seem difficult to come by on the web, and for particularly arcane issues of C++ style, it's nice to have a more disciplined approach than "munging syntax until it works" and a more authoritative source than "whatever my compiler accepts." The C++ standard acts as a grammar textbook for the C++ language, and although the standard isn't exactly thrilling reading, its dialect of standardese isn't nearly as difficult to read as you might think.

Linguists will tell you that a natural language's grammar is descriptive, not prescriptive. In other words, English grammar isn't the list of rules that you find in a high school textbook, it's a description of how people actually talk and write, complete with colloquialisms, various "ungrammatical" speech, and, in the 21st century, emoticons and online acronyms. There's some merit to taking a similar approach with programming languages. Using a programming language effectively requires more than just knowing the language's formal constructs; it involves using the language's community's idioms (using and combining constructs to get things done, as in C++'s RAII) and taboos (aspects of the language that are too obscure or problematic to rely upon, like C++'s exception specifications or the relative precedence of || and &&). Moreover, what ultimately matters is how a language can be used in the field rather than how the standard says a language can be used; it's a small comfort to read that your template metaprogramming technique or C99 construct is permitted by the standard if your target platform rejects it.

The rules of grammar (in the prescriptive high school sense rather than the descriptive linguistic sense) are generally worth following, to promote clarity of communication, but skilled writers know that it's sometimes effective to break these rules, whether to create a particular effect (such as a conversational, informal tone) or because the rules are outdated or never made much sense in the first place (such as the prohibition against split infinitives). Similarly, there are times where it may be beneficial to break the C++ standard. Imperfect C++ gives two examples of this (with caveats as to their use): the "friendly template" technique (declaring a template parameter to be a friend of the template class; technically illegal but very widely supported) and nested or local functors (potentially very useful but rejected by almost all compilers). C99 features such as variable length arrays (supported by GCC) and variadic macros also fall in this category. Obviously, breaking the standard should only be done with full awareness of the consequences (primarily damage to future portability) and is usually a bad idea, but not always. For example, code using variadic macros completely violates the C++ standard, but it's likely much more portable than some of the more advanced uses of templates.

C++ might deserve mention for even having a (well-discussed, reasonably well-implemented) standard. Perl, for example, lacks any official standard; instead, its implementation and test suite serve as the de facto standard, and the need to remedy this may be part of the reason for Perl 6's delays. JavaScript has an international standard and yet continues to suffer from browser compatibility issues, not just in the DOM and event models, but in the language core:

var cmd, valve;
// Perl- or Python-style assignment to an array: legal in Firefox, but 
// rejected by Chrome
[ , cmd, valve ] = /(\w+)_valve_(\d+)/.exec(this.id);

// A C-style trailing comma in an initializer list: creates an array 
// of length 3 in Firefox, but IE creates an array of length 4 with 
// undefined as its last element
var fields = [ 'street', 'city', 'zip', ]

Fred Brooks has more to say about the risks of using an implementation as a standard in chapter 6 of The Mythical Man-Month, but by this point we're pretty far afield of the C++ standard. So, if you're a C++ developer, go download or buy yourself a copy and give it a look.

Sunday, May 17, 2009

Comparing Version Numbers

Comparing version numbers is one of those operations that seems like it ought to be simple but can have a fair bit of complexity under the surface. How would you write an algorithm that gives the correct result for all of the following?

  • 2.10 comes after 2.9
  • 2.11-beta comes after 2.11-alpha
  • 2.11 comes after 2.11-beta
  • 7 comes after 2003
  • the revision that was checked in on 1/30/2009 and never given a version number comes after the revision that was checked in on 8/1/2008 and never given a version number

The Debian Project deals with more version numbers than anyone - over 30,000 packages over more than fifteen years, each of which must be separately versioned and tracked for upgrade and depedency purposes - so it's no surprise that they've developed a policy and algorithm that can handle all of the above cases. Debian's algorithm neatly handles cases like 2.9 versus 2.10, and it has two main additions over other version comparison algorithms: it states that ~ is sorted before anything, even the empty string, so 2.11~beta comes before 2.11, and it permits the addition of an epoch to "reset" version numbers, so 2003, which has the default epoch of 0, comes before 1:7, which has an epoch of 1. All of this makes Debian's algorithm a good approach for comparing version numbers even if you're not a Debian package maintainer.

I recently needed to compare some Debian version numbers, but the only C or C++ implementations of this algorithm that I could find were GPL'ed, and my target environment wouldn't permit me to simply shell out to dpkg --compare-versions, so I wrote my own implementation. Here it is, under the public domain. Note that, although it's written as a C++ class, the method bodies are in C style to make it easier to reuse them in a non-C++ project.

DebVersion.h:

#ifndef DEBVERSIONH
#define DEBVERSIONH

#include <stdlib.h>
#include <iostream>
#include <boost/operators.hpp>

class DebVersion : public boost::totally_ordered<DebVersion> {
public:
  DebVersion() : mEpoch(0), mUpstream(NULL), mRevision(NULL) {}
  DebVersion(const DebVersion& other) { *this = other; }
  explicit DebVersion(const char *version);
  ~DebVersion();

  int Epoch() const { return mEpoch; }
  const char *Upstream() const { return mUpstream; }
  const char *Revision() const { return mRevision; }

  void Epoch(int new_epoch) { mEpoch = new_epoch; }
  void Upstream(const char *new_upstream) { free(mUpstream); mUpstream = strdup(new_upstream); }
  void Revision(const char *new_revision) { free(mRevision); mRevision = strdup(new_revision); }
  void ClearRevision() { Revision("0"); }

  DebVersion& operator=(const DebVersion& other);
  bool operator<(const DebVersion& other) const;
  bool operator==(const DebVersion& other) const;

protected:
  int mEpoch;
  char *mUpstream;
  char *mRevision;

  static int CompareComponent(const char *a, const char *b);
};

inline DebVersion::~DebVersion()
{
  free(mUpstream);
  free(mRevision);
}

inline bool DebVersion::operator==(const DebVersion& other) const
{
  return Epoch() == other.Epoch()
    && strcmp(Upstream(), other.Upstream()) == 0
    && strcmp(Revision(), other.Revision()) == 0;
}

inline DebVersion& DebVersion::operator=(const DebVersion& other)
{
  mEpoch = other.Epoch();
  mUpstream = strdup(other.Upstream());
  mRevision = strdup(other.Revision());
  return *this;
}

void operator<<(std::ostream& o, const DebVersion& ver);

#endif

DebVersion.cpp:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <ctype.h>

#include "DebVersion.h"

void operator<<(std::ostream& o, const DebVersion& ver)
{
  if (ver.Epoch() != 0) {
    o << ver.Epoch() << ":";
  }
  o << ver.Upstream();
  if (0 != strcmp(ver.Revision(), "0")) {
    o << "-" << ver.Revision();
  }
}

DebVersion::DebVersion(const char *version)
{
  // Extract epoch.  The Debian Policy Manual says that epoch must be
  // numeric, so atoi is safe.
  const char *epoch_end = strchr(version, ':');
  if (epoch_end != NULL) {
    mEpoch = atoi(version);
  } else {
    mEpoch = 0;
  }

  const char *upstream_start;
  if (epoch_end) {
    upstream_start = epoch_end + 1;
  } else {
    upstream_start = version;
  }

  const char *upstream_end = strrchr(upstream_start, '-');
  size_t upstream_size;
  if (upstream_end == NULL) {
    upstream_size = strlen(upstream_start);
  } else {
    upstream_size = upstream_end - upstream_start;
  }

  mUpstream = (char*) malloc(upstream_size + 1);
  strncpy(mUpstream, upstream_start, upstream_size);
  mUpstream[upstream_size] = '\0';

  if (upstream_end == NULL) {
    mRevision = strdup("0");
  } else {
    mRevision = strdup(upstream_end + 1);
  }
}

/**Compare two components of a Debian-style version.  Return -1, 0, or 1
 * if a is less than, equal to, or greater than b, respectively.
 */
int DebVersion::CompareComponent(const char *a, const char *b)
{
  while (*a && *b) {

    while (*a && *b && !isdigit(*a) && !isdigit(*b)) {
      if (*a != *b) {
        if (*a == '~') return -1;
        if (*b == '~') return 1;
        return *a < *b ? -1 : 1;
      }
      a++;
      b++;
    }
    if (*a && *b && (!isdigit(*a) || !isdigit(*b))) {
      if (*a == '~') return -1;
      if (*b == '~') return 1;
      return isdigit(*a) ? -1 : 1;
    }

    char *next_a, *next_b;
    long int num_a = strtol(a, &next_a, 10);
    long int num_b = strtol(b, &next_b, 10);
    if (num_a != num_b) {
      return num_a < num_b ? -1 : 1;
    }
    a = next_a;
    b = next_b;

  }
  if (!*a && !*b) {
    return 0;
  } else if (*a) {
    return *a == '~' ? -1 : 1;
  } else {
    return *b == '~' ? 1 : -1;
  }
}

bool DebVersion::operator<(const DebVersion& other) const
{
  if (Epoch() != other.Epoch()) {
    return Epoch() < other.Epoch();
  }

  int result = CompareComponent(Upstream(), other.Upstream());
  if (result) {
    return -1 == result;
  }

  return -1 == CompareComponent(Revision(), other.Revision());
}

And, because no code is complete without some good unit tests, here's DebVersionTest.cpp, using the excellent Google Test framework:

#include <gtest/gtest.h>

#include "DebVersion.h"

TEST(DebVersionTest, ParsesUpstreamOnly)
{
  DebVersion v("1");
  EXPECT_EQ(0, v.Epoch());
  EXPECT_STREQ("1", v.Upstream());
  EXPECT_STREQ("0", v.Revision());
}

TEST(DebVersionTest, ParsesEpochs)
{
  DebVersion v("1:2.0.1");
  EXPECT_EQ(1, v.Epoch());
  EXPECT_STREQ("2.0.1", v.Upstream());
  EXPECT_STREQ("0", v.Revision());
}

TEST(DebVersionTest, ParsesRevisions)
{
  DebVersion v("10:4.0.1~alpha-4-5");
  EXPECT_EQ(10, v.Epoch());
  EXPECT_STREQ("4.0.1~alpha-4", v.Upstream());
  EXPECT_STREQ("5", v.Revision());
}

TEST(DebVersionTest, ComparesNumbers)
{
  EXPECT_LT(DebVersion("1"), DebVersion("2"));
  EXPECT_EQ(DebVersion("10"), DebVersion("10"));
  EXPECT_LT(DebVersion("9"), DebVersion("10"));
  EXPECT_GT(DebVersion("10"), DebVersion("9"));
}

TEST(DebVersionTest, ComparesEpochs)
{
  EXPECT_GT(DebVersion("2:1"), DebVersion("1:2"));
  EXPECT_LT(DebVersion("10"), DebVersion("1:2"));
}

TEST(DebVersionTest, ComparesAlphas)
{
  EXPECT_LT(DebVersion("alpha"), DebVersion("beta"));
  EXPECT_LT(DebVersion("alpha1"), DebVersion("alpha2"));
  EXPECT_GT(DebVersion("alpha10"), DebVersion("alpha2"));
}

TEST(DebVersionTest, ComparesTildes)
{
  EXPECT_LT(DebVersion("3.0~beta1"), DebVersion("3.0"));
  EXPECT_GT(DebVersion("3.0~beta"), DebVersion("3.0~~prebeta"));
  EXPECT_LT(DebVersion("3.0~beta4"), DebVersion("3.0~rc1"));
}

TEST(DebVersionTest, ComparesRevisions)
{
  EXPECT_LT(DebVersion("3.0-2"), DebVersion("3.0-10"));
}

Monday, May 11, 2009

Minding Your Manners

I recently went to a book sale where shoppers could browse countless bins of completely disorganized books and buy as many as could fit in a box for $30. I quickly learned that the only way to navigate a sale like this was to disregard my traditional short list of topics and authors that I follow, grab anything remotely interesting, and sort it out later. One of my finds was New Rules @ Work, by Barbara Pachter and Ellen Coleman.

book cover

The book's subtitle sums up its contents: "79 Etiquette Tips, Tools, and Techniques to Get Ahead and Stay Ahead." It covers topics such as greeting people, navigating social gatherings, corporate attire, and communications skills (whether face-to-face or by phone or email). For me, at least, the tips presented are an odd mix of the very obvious (stand up when shaking a guest's hand), the rather irrelevant (how to sample wine at restaurants fancier than my humble software developer's expense account would cover), and the desperately needed (how to mingle at a social gathering). Even though many of the suggestions fell in the first category for me, the book is a quick read and is interspersed with plenty of amusing anecdotes about various etiquette blunders to keep it interesting.

All of this has little to do with coding, but there's a lot more to software development than just coding:

  • gathering requirements
  • managing the infrastructure (version control, build system, continuous integration)
  • documentation (developer docs if not end-user docs)
  • testing
  • release management and deployment
  • customer support (possibly indirectly, depending on the size of your organization)

And there's a lot more to a career in software development than just software development:

And, of course, workplace etiquette. New Rules @ Work emphasizes the effects that etiquette (or the lack thereof) can have on your career, and it seems to present etiquette primarily as a tool for advancing your career. Considering the book's focus, this is probably okay, but it seems a bit utilitarian to me. I've always preferred Brendan Frasier's character's explanation in the movie Blast from the Past:

Good manners are just a way of showing other people we have respect for them... A lady or a gentleman is someone who always tries to make sure the people around him or her are as comfortable as possible.

Etiquette is a way to serve others.

As software developers, it can be our tendency to focus too much on the technology. We want to try out the latest development tools or techniques even if they're not best for the job; we pursue certain standards or ideals even if they don't make business sense; we see a technical challenge and try to solve it even if it's not the right problem to solve. As a remedy to this, we're told to focus on delivering business value. This is good advice, but it raises the question, why pursue business value? The standard answer is "profits," but as Max De Pree says, "Profits are like breathing. Breathing is not the goal of life, but it is pretty good evidence of whether or not you are alive." The goal of business - the reason that business can exist, the way that business makes profits - is service: meeting people's needs by offering goods or services.

There's a story that, during the initial development of the Apple Macintosh, Steve Jobs complained that it booted too slowly. He argued that shaving ten seconds off of the boot time, multiplied by millions of users, would save several lifetimes. My software may never save a lifetime, but every feature that I add, every bug that I fix, every UI interaction that I streamline means that users can do their jobs a bit faster, a bit easier, with a bit less stress.

Software development is about serving users. Etiquette is just a way to do in the small what our businesses and our careers do in the large.

Sunday, May 3, 2009

Memberwise Visitation

This code was the original impetus for my postings on template template parameters and multi-paradigm programming, but those didactic digressions grew into articles of their own. I don't know how helpful such digressions are - I'm learning that truly effective teaching requires the patience to develop good examples and an awareness of where your target audience is, and I'm not sure that I possess either, at least in this format - but at least they're out of the way...

C++'s object-oriented and generic programming each offer a lot of capability, but they can be combined for even more power. Consider, for example, the following code:

#include <iostream>
#include <boost/ptr_container/ptr_vector.hpp>
#include <boost/shared_ptr.hpp>

template <typename Class>
class ClassVisitor {
public:

  class MemberVisitor {
  public:
    virtual ~MemberVisitor() {}
    virtual void Visit(Class& instance) = 0;
  };

  template <typename T>
  class MemberVisitorImpl : public MemberVisitor {
  public:
    MemberVisitorImpl(T Class::*ptr) : mMember(ptr) {}
    void Visit(Class& instance) {
      // Sample visitor method.  You can define whatever member-specific
      // behavior you want.
      std::cout << instance.*mMember << ' ';
    }
  private:
    T Class::*mMember;
  };

  template <typename T>
  ClassVisitor& Declare(T Class::*ptr);

  void Visit(Class& instance);

private:
  typedef boost::ptr_vector<MemberVisitor> MemberVisitorList;
  MemberVisitorList mMemberVisitors;
};

template <typename Class>
template <typename T>
ClassVisitor<Class>& ClassVisitor<Class>::Declare(T Class::*ptr)
{
  mMemberVisitors.push_back(new MemberVisitorImpl<T>(ptr));
  return *this; // supports a fluent interface
}

template <typename Class>
void ClassVisitor<Class>::Visit(Class& instance)
{
  for (typename MemberVisitorList::iterator i = mMemberVisitors.begin();
       i != mMemberVisitors.end();
       i++) {
    i->Visit(instance);
  }
}

struct Annotation {
  const char *mMessage;
  int mX;
  int mY;
};

int main(int argc, char *argv[])
{
  ClassVisitor<Annotation> def;
  def.Declare(&Annotation::Message)
    .Declare(&Annotation::X)
 .Declare(&Annotation::Y);

  Annotation annotation;
  annotation.Message = "Note this";
  annotation.X = 10;
  annotation.Y = 20;
  def.Visit(annotation);
  return 0;
}

This code offers one approach to visiting each member of a class with a client-specified operation. Generic programming is used MemberVisitor subclasses handle members of any type, while OOP's polymorphism is used to let different types be handled at runtime. (Normally templates can only be applied at compile-time.) The example code simply dumps member variables' contents to screen and ends up being merely an overcomplicated operator<<(std::ostream&, const Annotation&), but MemberVisitor instances can contain additional information about their member variables and can define more than one Visit operation. In my current code's memberwise visitor implementation:

  • MemberVisitor instances contain member variables' descriptions, defaults, and constraints (minimum, maximum, etc.).
  • Copy constructors, operator=, and default constructors can be implemented by iterating over the list of member variables and default values that the MemberVisitor list provides.
  • MemberVisitors' information is shared by visit operations: per-member serialization and unserialization use members' types and formats, constraints are used during unserialization to check for corrupt archives and at runtime to check design by contract-style invariants, member descriptions and defaults can be used to report errors and reset to a known good state if a corrupt archive is encountered during deserialization, and the class can easily be dumped one member at a time to a human-readable format for inspection and debugging.
  • A second set of visitors handles binding members to a form's fields: loading form fields from a class instance, checking whether a form is dirty, validating a form, and saving form fields to a class instance are each implemented as Visit operations, and information such as MemberVisitor constraints can be used when a form is loaded to configure the UI widgets and before a form is saved to check for validity.

The result is that the MemberVisitors offer some introspection/reflection capability for C++ as well as something like a database schema for each class, similar to what would be provided by most frameworks' database support, but not tied in to a database backend. This multiparadigm (OO-and-generic) combination is only one approach to implementing this functionality; other implementations could use more of a pure template metaprogramming approach or could rely on preprocessor macros, code generation via an external tool, or reflection (via vendor extensions or a third-party library). Each implementation offers its own tradeoffs for complexity, portability (template metaprogramming often runs into compiler bugs; reflection is often vendor-specific), and build system support (in the case of code generation and reflection).

I'm still playing around with different approaches trying to find the best balance of tradeoffs and trying to figure out the best way to take advantage of C++'s multiparadigm support without overcomplicating my code. Are there other examples of code that directly leverages C++'s multiparadigm support? Are there preexisting implementations of this kind of member schema functionality for C++? Would there be any interest in a (cleaned up and open-sourced) member schema library, or is that functionality adequately covered by database libraries? Comments welcome.