Applications of Regular Expressions in Real Life

Introduction:

Regular expressions are useful for many of the practical, day-to-day tasks that a data scientist encounters. They show up everywhere from data pre-processing to natural language processing, pattern matching, web scraping and data extraction. In formal terms, the languages accepted by finite automata can be described by simple expressions called regular expressions, and these are a compact and effective way to represent such languages. The languages described by some regular expression are referred to as regular languages. More practically, a regular expression is a sequence of characters that defines a search pattern, and it is used to match character combinations in strings. String-searching algorithms use these patterns for "find" or "find and replace" operations on strings.
Regular expressions are often used in Natural Language Processing for named entity extraction, e.g. “Dr [A-Z] [A-Z][a-z]+” to match the titular form for a doctor. Hand-written patterns of that sort are of limited use on their own, though. Things get much more interesting if you can learn regexes from a set of known names, such that the learned regex matches those names and other names that have a similar structure. You might learn the previous example from Dr N Walton, Dr A Einstein, etc. The trick, of course, is to cover what you have not seen, without necessarily having examples of what should not be matched, while at the same time not over-generalizing.
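As a minimal sketch of the title pattern in action, the snippet below runs it over a made-up sentence. The pattern is written with a trailing "+" so that it consumes the whole surname; that is an assumption about the intent, and the sample sentence is hypothetical.

```python
import re

# Title "Dr", one capital initial, then a capitalised surname.
# The trailing "+" is an assumption about what the pattern is meant to do.
TITLE_PATTERN = re.compile(r"Dr [A-Z] [A-Z][a-z]+")

# Hypothetical sample sentence.
sample = "We met Dr N Walton and Dr A Einstein, but not dr b smith."

names = TITLE_PATTERN.findall(sample)
print(names)  # ['Dr N Walton', 'Dr A Einstein']
```

The lower-case "dr b smith" is skipped because the character classes are case-sensitive.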
Regular expressions are often feared by new developers: they see the strange syntax and opt to avoid it, adding extra logic to solve their needs instead of trying to understand the logic behind the patterns. The basic building blocks are simple, though. In a regular expression, x* means zero or more occurrences of x. It can generate {ε, x, xx, xxx, xxxx, ...}, where ε is the empty string.
In a regular expression, x+ means one or more occurrences of x. It can generate {x, xx, xxx, xxxx, ...}
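To see the difference between the two quantifiers, here is a small sketch using Python's re module. The list of candidate strings is made up for illustration.

```python
import re

# Candidate strings, starting with the empty string.
candidates = ["", "x", "xx", "xxx", "y"]

# re.fullmatch succeeds only when the whole string fits the pattern.
star = [s for s in candidates if re.fullmatch(r"x*", s)]
plus = [s for s in candidates if re.fullmatch(r"x+", s)]

print(star)  # ['', 'x', 'xx', 'xxx']
print(plus)  # ['x', 'xx', 'xxx']
```

The only difference is the empty string: x* accepts it, x+ does not.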

Extracting emails from a Text Document
Smart Character Replacement

Extracting emails from a Text Document

Sales and marketing teams frequently need to locate and extract emails and other contact information from huge text documents.
Doing this manually can be very time-consuming, and it is precisely the type of situation in which regex excels. Here’s how you can code a basic email extractor:
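The snippet that belonged here is not available, so the following is only a minimal sketch of such an extractor in Python. The sample text and the exact pattern are assumptions; the pattern deliberately handles only simple lower-case addresses and is not a full email validator.

```python
import re

# Hypothetical sample document; replace it with your own text.
text = "Please contact sales_team@example.com or support@mail.example.org for details."

# Lower-case letters, digits or "_" before the "@",
# then dot-separated labels after it.
EMAIL_PATTERN = r"[a-z0-9_]+@[a-z0-9]+(?:\.[a-z0-9]+)+"

emails = re.findall(EMAIL_PATTERN, text)
print(emails)  # ['sales_team@example.com', 'support@mail.example.org']
```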

Simply replace "text" with your document's text and you're set to go. Here's a sample of what we got:

Isn't it incredible? If you wish to read and process a file directly, simply add the file reading code to the Regex code:
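A possible way to combine file reading with the regex is sketched below. The function name extract_emails_from_file, the file name contacts.txt and the pattern are all assumptions made for this illustration.

```python
import re
import tempfile
from pathlib import Path

# Simple illustrative pattern; it is not a full email validator.
EMAIL_PATTERN = re.compile(r"[a-z0-9_]+@[a-z0-9]+(?:\.[a-z0-9]+)+")

def extract_emails_from_file(path):
    """Read a text file and return every email-like string found in it."""
    text = Path(path).read_text(encoding="utf-8")
    return EMAIL_PATTERN.findall(text)

# Demo with a temporary file so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "contacts.txt"
    sample.write_text("Write to info@example.com today.\n", encoding="utf-8")
    print(extract_emails_from_file(sample))  # ['info@example.com']
```

For very large files you might prefer to read line by line instead of loading the whole document at once.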

The code may appear intimidating, but it is actually rather simple to comprehend. Let me explain it in more detail. To extract all the matching strings from the document, we use re.findall(), which has the following format: re.findall(pattern, string). It returns a list of every non-overlapping match of the pattern in the string.

The pattern itself reads as follows: any character a-z, any digit 0-9 or the symbol '_', followed by an '@' symbol, after which we can again have any character or any digit and, importantly, a dot.
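One way to make that reading concrete is to write the pattern in verbose mode, with each piece commented. This is a sketch of a pattern that fits the description, not necessarily the exact one used originally; the sample string is hypothetical.

```python
import re

EMAIL_PATTERN = re.compile(
    r"""
    [a-z0-9_]+          # before the "@": letters a-z, digits 0-9 or "_"
    @                   # the "@" symbol itself
    [a-z0-9]+           # first label after the "@"
    (?:\.[a-z0-9]+)+    # one or more ".label" parts, so a dot is required
    """,
    re.VERBOSE,
)

found = EMAIL_PATTERN.findall("mail jane_doe42@example.com now")
print(found)  # ['jane_doe42@example.com']
```

With re.VERBOSE, whitespace and "#" comments inside the pattern are ignored, which keeps longer expressions readable.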

Wasn't that a piece of cake? That's the beauty of Regex: it allows you to execute extremely complicated operations with only a few simple phrases!

Smart Character Replacement

Let's wrap up pattern matching and move on to some string alterations, shall we?
Regular expressions flourish in this area as well, allowing you to perform some quite complex character replacements. I'm going to show you how to convert camel case notation (you know, the one where youWriteEverythingLikeThis) to normal, space-separated text in this example. It's a simple example, but it should suffice to demonstrate what you can accomplish with groups.
Before you look at the code, consider how you would accomplish this without the use of a Regular Expression. You'd probably need a list of capitalized letters and a replace routine for each of them. There are probably other ways, but that one’s the easiest I can think of.
Here is the Regular Expression alternative:
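The snippet that belonged here is not available, so here is a minimal Python sketch of the same idea. The surrounding text talks about "$1" and a g flag, which suggests the original was written in JavaScript. In Python, re.sub replaces every match by default and the group reference is written "\1". The function name split_camel_case is an assumption.

```python
import re

def split_camel_case(text):
    """Insert a space before every uppercase ASCII letter."""
    # ([A-Z]) captures one uppercase letter; \1 puts it back after a space.
    return re.sub(r"([A-Z])", r" \1", text)

print(split_camel_case("thisIsCamelCase"))  # this Is Camel Case
```

If the input starts with a capital letter, the result starts with a space, so you may want to call .strip() on it.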

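That's all there is to it! The matched section is saved by the capturing group (the parentheses and everything inside them), and you can refer to it with "$1" in the replacement string (that is JavaScript's notation; Python's re.sub writes the same reference as "\1").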

If there were more than one group, the number would increase ($2, $3, and so on). The point is that the expression matches every single uppercase character anywhere in the string (due to the trailing g flag), and each one is replaced with itself prefixed by a blank space (thanks to the replace method call).
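To illustrate several numbered groups at once, here is a small Python sketch that reorders the parts of a date. The date format and the function name to_day_first are hypothetical, chosen only for this illustration. Python writes the references as \1, \2 and \3.

```python
import re

# Three capturing groups: year, month, day.
DATE_PATTERN = r"(\d{4})-(\d{2})-(\d{2})"

def to_day_first(text):
    """Rewrite YYYY-MM-DD dates as DD/MM/YYYY using numbered groups."""
    return re.sub(DATE_PATTERN, r"\3/\2/\1", text)

print(to_day_first("Released on 2023-04-09."))  # Released on 09/04/2023.
```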

Conclusion

Hopefully, the previous examples have demonstrated the power of regular expressions and shown that, while they aren't pleasant to look at, they aren't difficult to grasp.

So, if you haven't already, give them a try and see if you can add this new tool to your development arsenal.

If you're not new to Regular Expressions, leave a comment below and tell us how you're utilising them!

Look forward to seeing you on the next one!

Blog By:

Vedant Parvekar

Niharika Hande

Praharsh Churi

Shrutika Nandurkar

If you enjoyed this blog, share it with your friends and colleagues and let us know your thoughts in the comments section!

Thank you so much!
