---
title: "Splitting Pascal/Camel Case with RegEx Enhancements"
date: 2010-09-11T21:03:15.087-05:00
slug: splitting-pascalcamel-case-with-regex-enhancements
published: true
---
In [Jon Galloway's](http://weblogs.asp.net/jgalloway/) [Splitting Camel Case with RegEx](http://weblogs.asp.net/jgalloway/archive/2005/09/27/426087.aspx) blog post, he introduced a simple regular expression replacement which can split “ThisIsInPascalCase” into “This Is In Pascal Case”. Here's the original code:
```
output = System.Text.RegularExpressions.Regex.Replace(
    input,
    "([A-Z])",
    " $1",
    System.Text.RegularExpressions.RegexOptions.Compiled).Trim();
```
Simple and effective: it matches every capital letter and inserts a space before it. But there's room for improvement. First, the call to `String.Trim()` is there to remove the leading space that gets added when the first letter is uppercase. This can be handled instead with a [“Match if prefix is absent” group](http://msdn.microsoft.com/en-us/library/az24scfc.aspx#grouping_constructs) containing the “beginning of line” anchor `^`. This prevents a match on the first character, which eliminates the need for the `String.Trim()` call. The formal name for this grouping construct is “Zero-width negative lookbehind assertion”, but just think of it as “if you see what's in here, don't match the next thing”.
```
(?<!^)([A-Z])
```
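To see the difference the lookbehind makes, here is a minimal sketch that runs both patterns without the `Trim()` call. The class and method names (`FirstLetterDemo`, `SplitOriginal`, `SplitSkippingFirst`) are made up for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FirstLetterDemo
{
    // Original pattern: inserts a space before every capital, including the first one.
    public static string SplitOriginal(string input)
    {
        return Regex.Replace(input, "([A-Z])", " $1");
    }

    // Same pattern guarded by (?<!^): a capital at the very start of the string is skipped.
    public static string SplitSkippingFirst(string input)
    {
        return Regex.Replace(input, "(?<!^)([A-Z])", " $1");
    }

    public static void Main()
    {
        // Brackets make any leading space visible.
        Console.WriteLine("[" + SplitOriginal("ThisIsInPascalCase") + "]");      // [ This Is In Pascal Case]
        Console.WriteLine("[" + SplitSkippingFirst("ThisIsInPascalCase") + "]"); // [This Is In Pascal Case]
    }
}
```

Note that `^` only means “start of the string” here because `RegexOptions.Multiline` is not set.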
Next, there's a potential issue with how acronyms get handled. Given this fictional book title, “WCFForNoobs”, the split will occur on each uppercase letter, resulting in “W C F For Noobs”. The fix is simple, though: require that each matched uppercase letter be followed by a lowercase letter:
```
(?<!^)([A-Z][a-z])
```
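A quick sketch of this pattern on two inputs, one where it is enough and one where it is not. `AcronymDemo` and `Split` are hypothetical names for this example only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AcronymDemo
{
    // Matches a capital only when a lowercase letter follows it,
    // so capitals inside a run such as "WCF" are left alone.
    public static string Split(string input)
    {
        return Regex.Replace(input, "(?<!^)([A-Z][a-z])", " $1");
    }

    public static void Main()
    {
        Console.WriteLine(Split("WCFForNoobs"));             // WCF For Noobs
        Console.WriteLine(Split("LearnWCFInSixEasyMonths")); // LearnWCF In Six Easy Months
    }
}
```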
…Now it'll result in “WCF For Noobs” (aren't we all!). But it no longer adds a space before the acronym in “LearnWCFInSixEasyMonths”: the result is “LearnWCF In Six Easy Months”. No problem: add an alternate match for a lowercase letter coming before the uppercase letter. The replacement pattern makes this more difficult, because we don't want the space to go before the lowercase letter; we want it between the lowercase letter and the first capital letter of the acronym. RegEx can handle this with another lookbehind group, “Match prefix but exclude it”: `(?<=)`. This requires the lowercase-uppercase pair to be present, but only the uppercase letter is included in the match, so when the replacement runs, the space is inserted between the two letters. By itself, that'll look like this:
```
((?<=[a-z])[A-Z])
```
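Run on its own, this pattern splits wherever a lowercase letter is followed by a capital, but it cannot tell where an acronym ends. A small sketch, using the made-up names `LowerUpperDemo` and `Split`:

```csharp
using System;
using System.Text.RegularExpressions;

public static class LowerUpperDemo
{
    // (?<=[a-z]) checks for a preceding lowercase letter without consuming it,
    // so only the capital is captured and the space lands between the two letters.
    public static string Split(string input)
    {
        return Regex.Replace(input, "((?<=[a-z])[A-Z])", " $1");
    }

    public static void Main()
    {
        Console.WriteLine(Split("LearnWCFInSixEasyMonths")); // Learn WCFIn Six Easy Months
    }
}
```

The space before “WCF” is now there, but “WCFIn” stays glued together, which is why the two patterns need to be combined.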
Great! But this needs to be combined with the previous expression. That's easily accomplished with an either/or match using the vertical bar “or” construct:
```
(?<!^)([A-Z][a-z]|(?<=[a-z])[A-Z])
```
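One way to package the combined pattern is as a small helper with a cached, compiled `Regex`. This is a sketch, not the only way to do it; `CaseSplitter` and `Split` are names invented for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CaseSplitter
{
    // Combined pattern from above, built once and reused across calls.
    private static readonly Regex Boundary = new Regex(
        "(?<!^)([A-Z][a-z]|(?<=[a-z])[A-Z])",
        RegexOptions.Compiled);

    public static string Split(string input)
    {
        return Boundary.Replace(input, " $1");
    }

    public static void Main()
    {
        Console.WriteLine(Split("ThisIsInPascalCase"));      // This Is In Pascal Case
        Console.WriteLine(Split("WCFForNoobs"));             // WCF For Noobs
        Console.WriteLine(Split("LearnWCFInSixEasyMonths")); // Learn WCF In Six Easy Months
    }
}
```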
The example “LearnWCFInSixEasyMonths” will now be split into “Learn WCF In Six Easy Months”. These same techniques can be used for additional splits, perhaps on numbers or underscores. More generally, [lookbehind and lookahead are great tools](http://www.regular-expressions.info/lookaround.html) to have in your RegEx toolbelt.
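As one possible take on the numbers and underscores idea, the sketch below adds a third alternative that splits before a digit that follows a letter, and a `(?<!_)` guard so that no extra space is added right after an underscore. Underscores are then swapped for spaces in a separate step. The pattern and the names `ExtendedSplitter` and `Split` are assumptions for illustration, not part of the original post.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ExtendedSplitter
{
    // Third alternative: a digit preceded by a letter starts a new word.
    // (?<!_) skips positions right after an underscore, which becomes the space itself.
    private static readonly Regex Boundary = new Regex(
        "(?<!^)(?<!_)([A-Z][a-z]|(?<=[a-z])[A-Z]|(?<=[A-Za-z])[0-9])",
        RegexOptions.Compiled);

    public static string Split(string input)
    {
        return Boundary.Replace(input, " $1").Replace('_', ' ');
    }

    public static void Main()
    {
        Console.WriteLine(Split("Top10Lists")); // Top 10 Lists
        Console.WriteLine(Split("Snake_Case")); // Snake Case
    }
}
```

Whether a digit should start a new word is a judgment call; an input like “HTML5Rocks” would come out as “HTML 5 Rocks” with this pattern, which may or may not be what you want.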