Parsing text in Excel with functions like FIND, MID, and SUBSTITUTE can be cumbersome and inefficient, especially for complex tasks. These limitations make advanced text manipulation time-consuming and error-prone, not to mention frustrating.
The new Excel REGEX functions* revolutionize text manipulation. In this blog post, we'll delve into these new functions, providing practical examples to showcase their power and versatility.
*Currently available in the Microsoft 365 Beta channel.
Table of Contents
Watch the Video
Download Workbook & Cheat Sheet
Enter your email address below to download the workbook and cheat sheet.
Understanding Regular Expressions and How to Write Them
Regular expressions, commonly known as regex, are powerful tools used for pattern matching within strings.
They allow you to search, extract, and manipulate text based on specific patterns, making them incredibly useful for data validation, text parsing, and string manipulation.
How Regex Works
Regex uses a sequence of characters to define a search pattern. This pattern can match simple text sequences or more complex string structures. Here's a breakdown of how regex works:
Literal Characters
These are the simplest forms of patterns, matching exact sequences of characters.
The pattern cat will match the string "cat" in the text "The cat sat on the mat".
Metacharacters
These are special characters with specific meanings that allow for more complex patterns.
The dot . metacharacter matches any single character except a newline.
The pattern c.t will match "cat", "cut", "cot", etc.
Character Classes
Defined using square brackets [ ] they match any one character within the brackets.
The pattern [aeiou] will match any vowel.
The pattern c[aeiou]t will match "cat", "cot", "cut", etc.
Quantifiers
These specify how many times a character or group of characters should be matched.
The asterisk * matches zero or more occurrences of the preceding element.
The pattern ca*t will match "ct", "cat", "caaat", etc.
Anchors
These are used to match positions within a string rather than actual characters.
The caret ^ matches the start of a string, and the dollar sign $ matches the end of a string.
The pattern ^cat will match "cat" only if it is at the beginning of the string.
Groups and Alternation
Parentheses ( ) create groups, and the pipe | acts as an OR operator.
The pattern (cat|dog) will match either "cat" or "dog".
The pattern (c|d)og will match "cog" or "dog".
Regex Tokens
Symbols used to write regex patterns are called tokens. These tokens match a variety of characters. Here are some useful tokens to get you started:
Source: regex101.com Check them out for more tokens and support.
Writing Regular Expressions
Writing effective regular expressions involves understanding the specific patterns you need to match. Here's a step-by-step guide to writing your own regex:
- Identify the Text Pattern: Determine the specific text or structure you need to find. For example, an email address or a date.
- Start Simple: Begin with a basic pattern and gradually add complexity.
Example: To match a basic email address, start with the pattern for a sequence of word characters [\w]+
- Use Metacharacters and Classes: Incorporate special characters and classes to match more complex patterns.
Example: An email address pattern can be [\w.-]+@[\w.-]+\.\w+
This pattern matches an email address format, consisting of one or more word characters, dots, or hyphens, followed by an @ symbol, then one or more word characters, dots, or hyphens, a dot, and finally one or more word characters.
- Test and Refine: Use regex testing tools like regex101 to test your patterns against various strings and refine them as needed.
Excel REGEX Functions
Now that you're familiar with how regex works, let's try out the new Excel REGEX functions with some practical examples.
Excel REGEXTEST Function
The REGEXTEST function allows you to determine if a text string matches a specified pattern. This function returns TRUE if the pattern is found in the text, and FALSE otherwise.
Syntax
=REGEXTEST(text, pattern, [ignore_case])
text: The input string to test.
pattern: The regular expression pattern to match.
[ignore_case]: Optional. If TRUE (default), the match is case-insensitive.
Example 1: Test if a text string is present:
=REGEXTEST("Hello World", "world")
This will return TRUE because the function ignores case by default.
Example 2: Validate whether a string is a correctly formatted email address with the pattern ^[\w.-]+@[\w.-]+\.\w+$
Explanation: The pattern is designed to match a valid email address format with the following structure:
- Start of the string ^ : Ensures the pattern matches from the beginning of the string.
- Username part [\w.-]+ : Matches one or more characters that are either word characters (letters, digits, underscores), dots, or hyphens.
- @ character: Matches the literal "@" symbol separating the username and domain parts.
- Domain name part [\w.-]+ : Matches one or more characters that are either word characters, dots, or hyphens.
- Dot character \. : Matches the literal dot before the top-level domain.
- Top-level domain \w+ : Matches one or more word characters, representing the domain suffix.
- End of the string $ : Ensures the pattern matches up to the end of the string.
Formula
=REGEXTEST("user@example.com", "^[\w.-]+@[\w.-]+\.\w+$")
Returns: TRUE
Excel REGEXEXTRACT Function
The REGEXEXTRACT function extracts substrings that match a specified pattern from the input text.
Syntax
=REGEXEXTRACT(text, pattern, [return_mode], [ignore_case])
text: The input string.
pattern: The regular expression pattern to extract.
[return_mode]: Optional. Determines the return format:
0: First match (default)
1: Multiple matches as an array
2: Groupings for the first match as an array
[ignore_case]: Optional. If TRUE (default), the match is case-insensitive.
Example 1: Extract an email address with the pattern [\w.-]+@[\w.-]+\.\w+
Explanation: The pattern matches an email address format, where:
- It starts with one or more word characters, dots, or hyphens.
- Followed by the @ symbol.
- Then has one or more word characters, dots, or hyphens.
- Followed by a dot.
- Finally ending with one or more word characters.
Formula
=REGEXEXTRACT("Contact us at support@example.com", "[\w.-]+@[\w.-]+\.\w+")
This will return "support@example.com".
Example 2: Extract the domain name from a URL with the pattern (?<=//)(?:www\.)?([^/]+)
Explanation:
- (?<=//) : This is a positive lookbehind assertion. It ensures that the match is preceded by // without including // in the result.
- (?:www\.)? : Non-capturing group to optionally match "www.".
- ([^/]+) : Capturing group to match the domain name. It matches one or more characters that are not a slash.
Formula
=REGEXEXTRACT("https://www.MyOnlineTrainingHub.com/blog", "(?<=//)(?:www\.)?([^/]+)")
Returns: www.MyOnlineTrainingHub.com
Excel REGEXREPLACE Function
The REGEXREPLACE function replaces substrings matching a pattern with a replacement string.
Syntax
=REGEXREPLACE(text, pattern, replacement, [occurrence], [ignore_case])
text: The input string.
pattern: The regular expression pattern to match.
replacement: The replacement text.
[occurrence]: Optional. Determines which occurrences are replaced:
0: All occurrences (default)
n: nth occurrence from the start
-n: nth occurrence from the end
[ignore_case]: Optional. If TRUE (default), the match is case-insensitive.
Example 1: Redact the first 6 digits of a phone number using the pattern \d{3}-\d{3}
Explanation: The pattern matches a string of exactly three digits followed by a hyphen and then exactly three digits.
- \d : Matches any digit (0-9). It is a shorthand character class for numeric digits.
- {3} : This quantifier specifies that the preceding element (a digit, in this case) must occur exactly three times. So, \d{3} matches exactly three digits in a row.
- - : Matches a literal hyphen (dash) character. It is not a special character in this context but is used to match the hyphen itself in the string.
- \d : Again, matches any digit (0-9).
- {3} : This quantifier specifies that the preceding element (a digit) must occur exactly three times. So, \d{3} matches exactly three digits in a row.
Formula
=REGEXREPLACE("My phone number is 123-456-7890", "\d{3}-\d{3}", "XXX-XXX")
This will return "My phone number is XXX-XXX-7890".
Example 2: Replace the first 5 digits of a Social Security Number (SSN) with asterisks using the pattern \d{3}-\d{2}
Explanation: The pattern \d{3}-\d{2} matches a string of exactly three digits followed by a hyphen and then exactly two digits.
Formula
=REGEXREPLACE("My SSN is 123-45-6789", "\d{3}-\d{2}", "*-")
Returns: "My SSN is *--6789"
Conclusion
As you can see, understanding and mastering regex can significantly enhance your ability to manipulate and analyse text in Excel. With Excel's new REGEX functions, you can seamlessly integrate these powerful patterns into your workflows, making text processing tasks more efficient and effective.
However, if all this makes your head hurt, then you can do similar text manipulation and much more with Power Query, which is a lot more point and click. Get started with Power Query here.
David N
I’ve submitted my ideas to Microsoft’s Feedback Portal where anyone can up-vote them if you agree with me.
Add an option to REGEXEXTRACT to return Groupings for all matches
https://feedbackportal.microsoft.com/feedback/idea/504e2bd3-0325-ef11-8ee8-6045bdafcd41
Update the Microsoft VBScript Regular Expressions library to include more modern RegEx features
https://feedbackportal.microsoft.com/feedback/idea/f2c6995f-0725-ef11-8ee8-6045bdafcd41
Mynda Treacy
Thanks for sharing, David. I’ve forwarded your feedback to the Microsoft Excel team responsible for these new functions.
David N
I think it would be nice to have an option for REGEXEXTRACT that could return all groupings for all matches as a 2-D array. If you needed something like the third and fourth sub-matches from all matches, then being able to grab (CHOOSECOLS) those two columns from an array would be easier than having to reprocess all the individual matches against what might have to be a different secondary pattern. You wouldn’t always be able to just reuse the same initial pattern again because that initial pattern might have used something like a positive lookbehind that would no longer be applicable and this wouldn’t work if reapplied to the initial matches. This would also make it much easier to take control over how the matches are handled and returned.
Mynda Treacy
Hi David,
I think you can use the ‘return_mode’ argument to return all of the matches and then use the array shaping functions like VSTACK or HSTACK to place them in a 2-D array. Can’t test it on the PC I’m currently on though.
Mynda
David N
Since I haven’t seen anyone demonstrate it yet, I’m assuming using option 2 for return_mode will return an array of the sub-match groups, but only for the first match. Say for example you had a string of emails formatted in the typical Outlook style.
Treacy, Mynda <mtreacy@something.com>; Doe, Jane <jane.doe@imagine.org>; Sam Different <no.comma@ugh.net>; Public, John Q <johnqpublic@school.edu>
One of many patterns such as this…
\b([^,<]*)(?:, )?([^<]+)? ]+)>
…could parse that string as well as the substrings. If everything were returned as a 2-D array, then it would look like this (my apology for having to use spaces and pipes to create the imperfect illusion of columns)…
Treacy | Mynda | mtreacy@something.com
Doe | Jane | jane.doe@imagine.org
Sam Different | | no.comma@ugh.net
Public | John Q | johnqpublic@school.edu
…whereas I assume option 2 would only return the first row of that array. But if the full array were returned, then other Excel functions could be used to easily isolate, rearrange, and/or recombine its columns and/or rows so that the final output of a single dynamic array formula might be…
Mynda Treacy – mtreacy@something.com
Jane Doe – jane.doe@imagine.org
Sam Different – no.comma@ugh.net
John Q Public – johnqpublic@school.edu
As an aside, I’m particularly excited to see that the version of RegEx Excel will apparently use is more updated than what is available from the Microsoft VBScript Regular Expressions 5.5 reference library since it will recognize lookbehinds. But of course that also makes me wonder if/when Microsoft will ever release a newer version of the library that includes lookbehinds and other more modern RegEx features?
Mynda Treacy
Hi David,
I believe option 2 results in a spilled array of the results, so you could rearrange the values as you want. The new functions use the PCRE2 library.
Mynda
David N
Interesting. Do you know if it’s possible to enable a reference to a PCRE2 library in VBA, and if so, how? (I don’t readily see anything in the Tools > References list.) About 10 years ago I created my own RegEx UDF that uses the standard Microsoft VBScript Regular Expressions 5.5 library but mimics some of the more modern/advanced RegEx features (particularly on the Replace side of things) by incorporating additional array and string manipulations. So I’d be very intrigued to explore direct library access to more modern features.
Mynda Treacy
Not that I know of (but that doesn’t mean a definite no). AFAIK VBScript uses a subset of ECMA, but this is getting outside my area of expertise 😉