Have you ever thought how powerful Excel would be if someone could enrich its toolbox with regular expressions? We have not only thought but worked on it :) And now, you can add this wonderful RegEx function to your own workbooks and wipe out substrings matching a pattern in no time!
Last week, we looked at how to use regular expressions to replace strings in Excel. For this, we created a custom Regex Replace function. As it turned out, the function goes beyond its primary use and can not only replace strings but also remove them. How could that be? In terms of Excel, removing a value is nothing else but replacing it with an empty string, something that our Regex function is very good at!
VBA RegExp function to remove substrings in Excel
As we all know, regular expressions are not supported in Excel by default. To enable them, you need to create your own user-defined function. The good news is that such a function is already written, tested, and ready for use. All you have to do is to copy this code, paste it in your VBA editor, and then save your file as a macro-enabled workbook (.xlsm).
The function has the following syntax:
The first three arguments are required, the last two are optional.
Where:
- Text - the text string to search in.
- Pattern - the regular expression to search for.
- Replacement - the text to replace with. To remove substrings matching the pattern, use an empty string ("") for replacement.
- Instance_num (optional) - the instance to replace. If omitted, all found matches are replaced (default).
- Match_case (optional) - a Boolean value indicating whether to match or ignore text case. For case-sensitive matching, use TRUE (default); for case-insensitive - FALSE.
For more information, please see RegExpReplace function.
Tip. In simple cases, you can remove specific characters or words from cells with Excel formulas. But regular expressions provide a lot more options for this.
How to remove strings using regular expressions - examples
As mentioned above, to remove parts of text matching a pattern, you are to replace them with an empty string. So, a generic formula takes this shape:
The below examples show various implementations of this basic concept.
Remove all matches or specific match
The RegExpReplace function is designed to find all substrings matching a given regex. Which occurrences to remove is controlled by the 4th optional argument, named instance_num.
The default is "all matches" - when the instance_num argument is omitted, all found matches are removed. To delete a specific match, define the instance number.
In the below strings, suppose you want to delete the first order number. All such numbers start with the hash sign (#) and contain exactly 5 digits. So, we can identify them using this regex:
Pattern: #\d{5}\b
The word boundary \b specifies that a matching substring cannot be part of a bigger string such as #10000001.
To remove all matches, the instance_num argument is not defined:
=RegExpReplace(A5, "#\d{5}\b", "")
To eradicate only the first occurrence, we set the instance_num argument to 1:
=RegExpReplace(A5, "#\d{5}\b", "", 1)
Regex to remove certain characters
To strip off certain characters from a string, just write down all unwanted characters and separate them with a vertical bar | which acts as an OR operator in regexes.
For instance, to standardize telephone numbers written in various formats, first we get rid of specific characters such as parentheses, hyphens, dots and whitespaces.
Pattern: \(|\)|-|\.|\s
=RegExpReplace(A5, "\(|\)|-|\.|\s", "")
The result of this operation is a 10-digit number like "1234567890".
For convenience, you can enter the regex is a separate cell, and refer to that cell using an absolute reference such as $A$2:
=RegExpReplace(A5, $A$2, "")
And then, you can standardize the formatting the way you want by using the concatenation operator (&) and Text functions such as RIGHT, MID and LEFT.
For example, to write all phone numbers in the (123) 456-7890 format, the formula is:
="("&LEFT(B5, 3)&") "&MID(B5, 4, 3)&"-"&RIGHT(B5, 4)
Where B5 is the output of the RegExpReplace function.
Remove special characters using regex
In one of our tutorials, we looked at how to remove unwanted characters in Excel using inbuilt and custom functions. Regular expressions make things a lot easier! Instead of listing all the characters to delete, just specify the ones you want to keep :)
The pattern is based on negated character classes - a caret is put inside a character class [^ ] to match any single character NOT in brackets. The + quantifier forces it to regard consecutive characters as a single match, so that a replacement is done for a matching substring rather than for each individual character.
Depending on your needs, choose one of the following regexes.
To remove non-alphanumeric characters, i.e. all characters except letters and digits:
Pattern: [^0-9a-zA-Z]+
To purge all characters except letters, digits and spaces:
Pattern: [^0-9a-zA-Z ]+
To delete all characters except letters, digits and underscore, you can use \W that stands for any character that is NOT alphanumeric character or underscore:
Pattern: \W+
If you want to keep some other characters, e.g. punctuation marks, put them inside the brackets.
For instance, to strip off any character other than a letter, digit, period, comma, or space, use the following regex:
Pattern: [^0-9a-zA-Z\., ]+
This successfully eliminates all special characters, but extra whitespace remains.
To fix this, you can nest the above function into another one that replaces multiple spaces with a single space character.
=RegExpReplace(RegExpReplace(A5,$A$2,""), " +", " ")
Or just use the native TRIM function with the same effect:
=TRIM(RegExpReplace(A5, $A$2, ""))
Regex to remove non-numeric characters
To delete all non-numeric characters from a string, you can use either this long formula or one of the very simple regexes listed below.
Match any character that is NOT a digit:
Pattern: \D+
Strip non-numeric characters using negated classes:
Pattern: [^0-9]+
Pattern: [^\d]+
Tip. If your goal is to remove text and spill the remaining numbers into separate cells or place them all in one cell separated with a specified delimiter, then use the RegExpExtract function as explained in How to extract numbers from string using regular expressions.
Regex to remove everything after space
To wipe out everything after a space, use either the space ( ) or whitespace (\s) character to find the first space and .* to match any characters after it.
If you have single-line strings that only contain normal spaces (value 32 in the 7-bit ASCII system), it does not really matter which of the below regexes you use. In case of multi-line strings, it does make a difference.
To remove everything after a space character, use this regex:
Pattern: " .*"
=RegExpReplace(A5, " .*", "")
This formula will strip anything after the first space in each line. For the results to display correctly, be sure to turn Wrap Text on.
To strip off everything after a whitespace (including a space, tab, carriage return and new line), the regex is:
Pattern: \s.*
=RegExpReplace(A5, "\s.*", "")
Because \s matches a few different whitespace types including a new line (\n), this formula deletes everything after the first space in a cell, no matter how many lines there are in it.
Regex to remove text after specific character
Using the methods from the previous example, you can eradicate text after any character that you specify.
To handle each line separately:
Generic pattern: char.*
In single-line strings, this will remove everything after char. In multi-line strings, each line will be processed individually because in the VBA Regex flavor, a period (.) matches any character except a new line.
To process all lines as a single string:
Generic pattern: char(.|\n)*
To delete anything after a given character, including new lines, \n is added to the pattern.
For example, to remove text after the first comma in a string, try these regular expressions:
Pattern: ,.*
Pattern: ,(.|\n)*
In the screenshot below, you can examine how the outcomes differ.
Regex to remove everything before space
When working with long strings of text, you may sometimes want to make them shorter by removing the same part of information in all cells. Below we'll discuss two such cases.
Remove everything before the last space
As with the previous example, a regular expression depends on your understanding of a "space".
To match anything up to the last space, this regex will do (quotation marks are added to make a space after an asterisk noticeable).
Pattern: ".* "
To match anything before the last whitespace (including a space, tab, carriage return, and new line), use this regular expression.
Pattern: .*\s
The difference is especially noticeable on multi-line strings.
Strip everything before the first space
To match anything up to the first space in a string, you can use this regular expression:
Pattern: ^[^ ]* +
From the start of a string ^, we match zero or more non-space characters [^ ]* that are immediately followed by one or more spaces " +". The last part is added to prevent potential leading spaces in the results.
To remove text before first space in each line, the formula is written in the default "all matches" mode (instance_num omitted):
=RegExpReplace(A5, "^[^ ]* +", "")
To delete text before the first space in the first line, and leave all other lines intact, the instance_num argument is set to 1:
=RegExpReplace(A5, "^[^ ]* +", "", 1)
Regex to strip off everything before character
The easiest way to remove all text before a specific character is by using a regex like this:
Generic pattern: ^[^char]*char
Translated into a human language, it says: "from the start of a string anchored by ^, match 0 or more characters except char [^char]* up to the first occurrence of char.
For example, to delete all text before the first colon, use this regular expression:
Pattern: ^[^:]*:
To avoid leading spaces in the results, add a whitespace character \s* to the end. This will remove everything before the first colon and trim any spaces right after it:
Pattern: ^[^:]*:\s*
=RegExpReplace(A5, "^[^:]*:\s*", "")
Tip. Besides regular expressions, Excel has its own means to remove text by position or match. To learn how to accomplish the task with native formulas, please see How to remove text before or after a character in Excel.
Regex to remove everything except
To eradicate all characters from a string except the ones you want to keep, use negated character classes.
For instance, to remove all characters except lowercase letters and dots, the regex is:
Pattern: [^a-z\.]+
In fact, we could do without the + quantifier here as our function replaces all found matches. The quantifier just makes it a little faster - instead of handling each individual character, you replace a substring.
=RegExpReplace(A5, "[^a-z\.]+", "")
Regex to remove html tags in Excel
First off, it should be noted that HTML is not a regular language, so parsing it using regular expressions is not the best way. That said, regexes can definitely help strip tags out of your cells to make your dataset cleaner.
Given that html tags are always placed within angle brackets <>, you can find them using one of the following regexes.
Negated class:
Pattern: <[^>]*>
Here, we match an opening angle bracket, followed by zero or more occurrences of any character except the closing angle bracket [^>]* up to the nearest closing angle bracket.
Lazy search:
Pattern: <.*?>
Here, we match anything from the first opening bracket to the first closing bracket. The question mark forces .* to match as few characters as possible until it finds a closing bracket.
Whichever pattern you choose, the result will be absolutely the same.
For example, to remove all html tags from a string in A5 and leave text, the formula is:
=RegExpReplace(A5, "<[^>]*>", "")
Or you can use the lazy quantifier as shown in the screenshot:
This solution works perfectly for single text (rows 5 - 9). For multiple texts (rows 10 - 12), the results are questionable - texts from different tags are merged into one. Is this correct or not? I'm afraid, it's not something that can be easily decided - all depends on your understanding of the desired outcome. For example, in B11, the result "A1" is expected; while in B10, you might want "data1" and "data2" to be separated with a space.
To remove html tags and separate the remaining texts with spaces, you can proceed in this way:
- Replace tags with spaces " ", not empty strings:
=RegExpReplace(A5, "<[^>]*>", " ")
- Reduce multiple spaces to a single space character:
=RegExpReplace(RegExpReplace(A5, "<[^>]*>", " "), " +", " ")
- Trim leading and trailing spaces:
=TRIM(RegExpReplace(RegExpReplace(A5, "<[^>]*>", " "), " +", " "))
The result will look something like this:
Ablebits Regex Remove Tool
If you've had a chance to use our Ultimate Suite for Excel, you probably already discovered the new Regex Tools introduced with the recent release. The beauty of these .NET based Regex functions is that they, firstly, support full-featured regular expression syntax free of VBA RegExp limitations, and secondly, do not require inserting any VBA code in your workbooks as all code integration is done by us at the backend.
Your part of the job is to construct a regular expression and serve it to the function :) Let me show you how to do that on a practical example.
How to remove text in brackets and parentheses using regex
In long text strings, less important information is often enclosed in [brackets] and (parentheses). How do you remove those irrelevant details keeping all other data?
In fact, we have already built a similar regex for deleting html tags, i.e. text within angle brackets. Obviously, the same methods will work for square and round brackets too.
Pattern: (\(.*?\))|(\[.*?\])
The trick is using a lazy quantifier (*?) to match the shortest possible substring. The first group (\(.*?\)) matches anything from an opening parenthesis to the first closing parenthesis. The second group (\[.*?\]) matches anything from an opening bracket to the first closing bracket. A vertical bar | acts as the OR operator.
With the pattern determined, let's "feed" it to our Regex Remove function. Here's how:
- On the Ablebits Data tab, in the Text group, click Regex Tools.
- On the Regex Tools pane, select your source strings, enter your regex, choose the Remove option, and hit Remove.
To get the results as formulas, not values, select the Insert as a formula check box.
To remove text within brackets from strings in A2:A5, we configure the settings as follows:
As the result, the AblebitsRegexRemove function is inserted in a new column next to your original data.
The function can also be entered directly in a cell via the standard Insert Function dialog box, where it is categorized under AblebitsUDFs.
As AblebitsRegexRemove is designed to remove text, it requires only two arguments - the source string and regex. Both parameters can be defined directly in a formula or supplied in the form of cell references. If needed, this custom function can be used together with any native ones.
For example, to trim extra spaces in the resulting strings, you can utilize the TRIM function as a wrapper:
=TRIM(AblebitsRegexRemove(A5, $A$2))
That's how to remove strings in Excel using regular expressions. I thank you for reading and look forward to seeing you on our blog next week!
Available downloads
Remove strings using regex - examples (.xlsm file)
Ultimate Suite - trial version (.exe file)