Regular Expression Basics in C++

Consider the following sentence in quotes,

“Here is my man.”

This string may be inside the computer, and the user may want to know whether it contains the word “man”. If it does, the user may then want to change “man” to “woman”, so that the string reads:

“Here is my woman.”

Computer users have many other needs like these; some are complex. Regular expressions, abbreviated regex, are the tool for handling such tasks. The C++ standard library includes a regex library. So, a C++ program that handles regexes should begin with:

#include <iostream>

#include <regex>

using namespace std;

This article explains Regular Expression Basics in C++.

Regular Expression Fundamentals

Regex

A string like “Here is my man.” above is the target sequence or target string or simply, target. “man”, which was searched for, is the regular expression, or simply, regex.

Matching

Matching is said to occur when the word or phrase that is being searched for is located. After matching, a replacement can take place. For example, after “man” is located above, it can be replaced by “woman”.

Simple Matching

The following program shows how the word “man” is matched.

#include <iostream>

#include <regex>

using namespace std;

int main()
{

    regex reg("man");
    if (regex_search("Here is my man.", reg))
        cout << "matched" << endl;
    else
        cout << "not matched" << endl;

    return 0;
}

The function regex_search() returns true if there is a match and false if there is none. Here, the function takes two arguments: the first is the target string, and the second is the regex object. The regex itself is "man", in double-quotes. The first statement in the main() function forms the regex object: regex is a type, and reg is the regex object. The above program's output is "matched", as "man" is seen in the target string. If "man" were not seen in the target, regex_search() would have returned false, and the output would have been "not matched".

The output of the following code is “not matched”:

    regex reg("man");
    if (regex_search("Here is my making.", reg))
        cout << "matched" << endl;
    else
        cout << "not matched" << endl;

There is no match because the regex "man" cannot be found anywhere in the target string, "Here is my making."

Pattern

The regular expression, “man” above, is very simple. Regexes are usually not that simple. Regular expressions have metacharacters. Metacharacters are characters with special meanings. A metacharacter is a character about characters. C++ regex metacharacters are:

^ $ \ . * + ? ( ) [ ] { } |

A regex, with or without metacharacters, is a pattern.

Character Classes

Square Brackets

A pattern can have characters within square brackets. With this, a particular position in the target string would match any of the square brackets’ characters. Consider the following targets:

"The cat is in the room."

"The bat is in the room."

"The rat is in the room."

The regex, [cbr]at would match cat in the first target. It would match bat in the second target. It would match rat in the third target. This is because, “cat” or “bat” or “rat” begins with ‘c’ or ‘b’ or ‘r’. The following code segment illustrates this:

    regex reg("[cbr]at");
    if (regex_search("The cat is in the room.", reg))
        cout << "matched" << endl;
    if (regex_search("The bat is in the room.", reg))
        cout << "matched" << endl;
    if (regex_search("The rat is in the room.", reg))
        cout << "matched" << endl;

The output is:

matched

matched

matched

Range of Characters

The class, [cbr] in the pattern [cbr]at, would match several possible characters in the target. It would match ‘c’ or ‘b’ or ‘r’ in the target. If the target does not have any of ‘c’ or ‘b’ or ‘r’, followed by “at”, there would be no match.

Some possibilities like ‘c’ or ‘b’ or ‘r’ exist in a range. The range of digits, 0 to 9, has 10 possibilities, and the pattern for that is [0-9]. The range of lowercase letters, a to z, has 26 possibilities, and the pattern for that is [a-z]. The range of uppercase letters, A to Z, has 26 possibilities, and the pattern for that is [A-Z]. The hyphen (-) is not officially a metacharacter, but within square brackets, it indicates a range. So, the following produces a match:

if (regex_search("ID6id", regex("[0-9]")))

  cout << "matched" << endl;

Note how the regex has been constructed as the second argument. The match occurs between the digit, 6 in the range, 0 to 9, and the 6 in the target, “ID6id”. The above code is equivalent to:

if (regex_search("ID6id", regex("[0123456789]")))

  cout << "matched" << endl;

The following code produces a match:

char str[] = "ID6iE";

if (regex_search(str, regex("[a-z]")))

  cout << "matched" << endl;

Note that the first argument here is a variable (a character array) and not a string literal. The match is between ‘i’ in [a-z] and ‘i’ in “ID6iE”.

Do not forget that a range is a class. There can be text to the right of the range or to the left of the range in the pattern. The following code produces a match:

if (regex_search("ID2id is an ID", regex("ID[0-9]id")))

 cout << "matched" << endl;

The match is between “ID[0-9]id” and “ID2id”. The rest of the target string, “ is an ID,” is not matched in this situation.

As used in the regular expression subject (regexes), the word class actually means a set. That is, one of the characters in the set is to match.

Note: The hyphen (-) acts as a metacharacter only within square brackets, where it indicates a range. Outside square brackets, it is an ordinary character in the regex.

Negation

A class, including a range, can be negated. That is, none of the characters in the set (class) should match. This is indicated with the ^ metacharacter at the beginning of the class pattern, just after the opening square bracket. So, [^0-9] matches a character at the appropriate position in the target that is not any character in the range 0 to 9 inclusive. So the following code will not produce a match:

if (regex_search("0123456789101112", regex("[^0-9]")))

  cout << "matched" << endl;

else

  cout << "not matched" << endl;

Every position in the target string, “0123456789101112”, holds a digit in the range 0 to 9, so no character satisfies the negated class and there is no match.

The following code produces a match:

if (regex_search("ABCDEFGHIJ", regex("[^0-9]")))

  cout << "matched" << endl;

The target, “ABCDEFGHIJ”, contains characters that are not digits (in fact, it contains no digit at all), so there is a match.

[^a-z] is the negation of [a-z]: it matches any character that is not a lowercase letter.

[^A-Z] is the negation of [A-Z]: it matches any character that is not an uppercase letter.

Other negations exist.

Matching Whitespaces

‘ ’ or \t or \r or \n or \f is a whitespace character. In the following code, the regex, “\n” matches ‘\n’ in the target:

if (regex_search("Of line one.\r\nOf line two.", regex("\n")))

  cout << "matched" << endl;

Matching any Whitespace Character

The pattern or class to match any white space character is, [ \t\r\n\f]. In the following code, ‘ ’ is matched:

if (regex_search("one two", regex("[ \t\r\n\f]")))

  cout << "matched" << endl;

Matching any Non-whitespace Character

The pattern or class to match any non-whitespace character is [^ \t\r\n\f]. The following code produces a match because the target contains at least one non-whitespace character (here, it has no whitespace at all):

if (regex_search("1234abcd", regex("[^ \t\r\n\f]")))

  cout << "matched" << endl;

The period (.) in the Pattern

The period (.) in the pattern matches any character including itself, except \n, in the target. A match is produced in the following code:

if (regex_search("1234abcd", regex(".")))

  cout << "matched" << endl;

No matching results in the following code because the target is “\n”.

if (regex_search("\n", regex(".")))

  cout << "matched" << endl;

else

  cout << "not matched" << endl;

Note: Inside a character class with square brackets, the period has no special meaning.

Matching Repetitions

A character or a group of characters can occur more than once within the target string. A pattern can match this repetition. The metacharacters, ?, *, +, and {} are used to match the repetition in the target. If x is a character of interest in the target string, then the metacharacters have the following meanings:

x*: match 'x' 0 or more times, i.e., any number of times

x+: match 'x' 1 or more times, i.e., at least once

x?: match 'x' 0 or 1 time

x{n,}: match 'x' at least n times. Note the comma.

x{n}: match 'x' exactly n times

x{n,m}: match 'x' at least n times, but not more than m times

These metacharacters are called quantifiers.

Illustrations

*

The * matches the preceding character or preceding group, zero or more times. “o*” matches ‘o’ in “dog” of the target string. It also matches “oo” in “book” and “looking”. The regex, “o*” matches “oooo” in “The animal booooed.”. Note: “o*” also matches “dig”, where ‘o’ occurs zero times; the match is then an empty string. Because an empty match is allowed, regex_search() with “o*” succeeds on any target.

+

The + matches the preceding character or preceding group, 1 or more times. Contrast it with zero or more times for *. So the regex, “e+” matches ‘e’ in “eat”, where ‘e’ occurs one time. “e+” also matches “ee” in “sheep”, where ‘e’ occurs more than one time. Note: “e+” will not match “dig” because in “dig”, ‘e’ does not occur at least once.

?

The ? matches the preceding character or preceding group, 0 or 1 time (and not more). So, “e?” matches “dig” because ‘e’ occurs in “dig”, zero time. “e?” matches “set” because ‘e’ occurs in “set”, one time. Note: “e?” still matches “sheep”; though there are two ‘e’s in “sheep”. There is a nuance here – see later.

{n,}

This matches at least n consecutive repetitions of a preceding character or preceding group. So the regex, “e{2,}” matches the two ‘e’s in the target, “sheep”, and the three ‘e’s in the target “sheeep”. “e{2,}” does not match “set”, because “set” has only one ‘e’.

{n}

This matches exactly n consecutive repetitions of a preceding character or preceding group. So the regex, “e{2}” matches the two ‘e’s in the target, “sheep”. “e{2}” does not match “set” because “set” has only one ‘e’. Well, “e{2}” matches two ‘e’s in the target, “sheeep”. There is a nuance here – see later.

{n,m}

This matches several consecutive repetitions of a preceding character or preceding group, anywhere from n to m, inclusive. So, “e{1,3}” matches nothing in “dig”, which has no ‘e’. It matches the one ‘e’ in “set”, the two ‘e’s in “sheep”, the three ‘e’s in “sheeep”, and three ‘e’s in “sheeeep”. There is a nuance at the last match – see later.

Matching Alternation

Consider the following target string in the computer.

“The farm has pigs of different sizes.”

The programmer may want to know if this target has “goat” or “rabbit” or “pig”. The code would be as follows:

char str[] = "The farm has pigs of different sizes.";

if (regex_search(str, regex("goat|rabbit|pig")))

  cout << "matched" << endl;

else

  cout << "not matched" << endl;

The code produces a match. Note the use of the alternation character, |. There can be two, three, four, or more alternatives. At each character position in the target string, C++ first tries to match the first alternative, “goat”. If it does not succeed with “goat”, it tries the next alternative, “rabbit”. If it does not succeed with “rabbit”, it tries the next alternative, “pig”. If “pig” also fails, C++ moves on to the next position in the target and starts with the first alternative again.

In the above code, “pig” is matched.

Matching Beginning or End

Beginning


If ^ is at the beginning of the regex, then the beginning text of the target string can be matched by the regex. In the following code, the start of the target is “abc”, which is matched:

if (regex_search("abc and def", regex("^abc")))

  cout << "matched" << endl;

No matching takes place in the following code:

if (regex_search("Yes, abc and def", regex("^abc")))

  cout << "matched" << endl;

else

  cout << "not matched" << endl;

Here, “abc” is not at the beginning of the target.

Note: The circumflex character, ‘^’, is a metacharacter at the start of the regex, matching the start of the target string. It is still a metacharacter at the start of the character class, where it negates the class.

End

If $ is at the end of the regex, then the ending text of the target string can be matched by the regex. In the following code, the end of the target is “xyz”, which is matched:

if (regex_search("uvw and xyz", regex("xyz$")))

  cout << "matched" << endl;

No matching takes place in the following code:

if (regex_search("uvw and xyz final", regex("xyz$")))

  cout << "matched" << endl;

else

  cout << "not matched" << endl;

Here, “xyz” is not at the end of the target.

Grouping

Parentheses can be used to group characters in a pattern. Consider the following regex:

"a concert (pianist)"

The group here is “pianist” surrounded by the metacharacters ( and ). It is actually a sub-group, while “a concert (pianist)” is the whole group. Consider the following:

"The (pianist is good)"

Here, the sub-group or sub-string is, “pianist is good”.

Sub-strings with Common Parts

A bookkeeper is a person who takes care of books. Imagine a library with a bookkeeper and bookshelf. Assume that one of the following target strings is in the computer:

"The library has a bookshelf that is admired.";

"Here is the bookkeeper.";

"The bookkeeper works with the bookshelf.";

Assume that the programmer’s interest is not to know which of these sentences is in the computer. Instead, the interest is to know whether “bookshelf” or “bookkeeper” is present in whatever target string is in the computer. In this case, the regex can use alternation:

"bookshelf|bookkeeper"

Notice that “book”, which is common to both words, has been typed twice, in the two words in the pattern. To avoid typing “book” twice, the regex would be better written as:

"book(shelf|keeper)"

Here, the group is “shelf|keeper”. The alternation metacharacter has still been used, but not for two long words; it has been used for the two ending parts of the two long words. C++ treats a group as an entity. So, C++ will look for “shelf” or “keeper” that comes immediately after “book”. The output of the following code is “matched”:

char str[] = "The library has a bookshelf that is admired.";

if (regex_search(str, regex("book(shelf|keeper)")))

  cout << "matched" << endl;

“bookshelf”, and not “bookkeeper”, has been matched.

The icase and multiline regex_constants

icase

Matching is case sensitive by default. However, it can be made case insensitive. To achieve this, use the regex::icase constant, as in the following code:

if (regex_search("Feedback", regex("feed", regex::icase)))

  cout << "matched" << endl;

The output is “matched”. So “Feedback” with uppercase ‘F’ has been matched by “feed” with lowercase ‘f’. “regex::icase” has been made the second argument of the regex() constructor. Without that, the statement would not produce a match.

Multiline

Consider the following code:

char str[] = "line 1\nline 2\nline 3";

if (regex_search(str, regex("^.*$")))

  cout << "matched" << endl;

else

  cout << "not matched" << endl;

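The output is “not matched” on standard libraries that follow the C++17 rules. “.*” means any character except \n, zero or more times, and by default ^ matches only at the beginning of the target while $ matches only at its end. Because of the newline characters (\n) in the target, “.*” cannot stretch from the beginning to the end, so there is no match.

The target is a multiline string. The constant “regex::multiline”, added in C++17, does not make ‘.’ match the newline character; it makes ^ match at the beginning of each line and $ match at the end of each line. It is passed as the second argument of the regex() constructor. With it, “^.*$” matches the first line, “line 1”, so the following code outputs “matched” (older standard libraries may not provide this constant):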

char str[] = "line 1\nline 2\nline 3";

if (regex_search(str, regex("^.*$", regex::multiline)))

  cout << "matched" << endl;

else

  cout << "not matched" << endl;

Matching the Whole Target String

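The regex_search() function succeeds when any part of the target matches. To require that the regex match the whole target string, the regex_match() function can be used instead. The target below does not have the newline character (\n), so “.*” can cover the text on either side of “second”. The following code illustrates this: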

char str[] = "first second third";

if (regex_match(str, regex(".*second.*")))

  cout << "matched" << endl;

There is a match here. However, note that the regex has to match the whole target string, and that “.*” can do so only because the target string does not have any ‘\n’.

The match_results Object

The regex_search() function can take an argument between the target and the regex object. This argument is the match_results object. With it, the whole matched part of the string and the sub-strings matched can be known. This object is a special array with member functions. The match_results type is cmatch for character arrays and string literals, and smatch for string objects.

Obtaining Matches

Consider the following code:

char str[] = "The woman you were looking for!";

cmatch m;

if (regex_search(str, m, regex("w.m.n")))

  cout << m[0] << endl;

The target string has the word “woman”. The output is “woman”, which corresponds to the regex, “w.m.n”. At index zero, the special array holds the only match, which is “woman”.

With class options, only the first sub-string found in the target, is sent to the special array. The following code illustrates this:

cmatch m;

if (regex_search("The rat, the cat, the bat!", m, regex("[bcr]at")))
{
  cout << m[0] << endl;
  cout << m[1] << endl;
  cout << m[2] << endl;
}

The output is “rat” from index zero. m[1] and m[2] are empty.

With alternatives, only the first sub-string found in the target, is sent to the special array. The following code illustrates this:

if (regex_search("The rabbit, the goat, the pig!", m, regex("goat|rabbit|pig")))
{
  cout << m[0] << endl;
  cout << m[1] << endl;
  cout << m[2] << endl;
}

The output is “rabbit” from index zero. m[1] and m[2] are empty.

Groupings

When groups are involved, the complete pattern matched, goes into cell zero of the special array. The next sub-string found goes into cell 1; the sub-string following, goes into cell 2; and so on. The following code illustrates this:

if (regex_search("Best bookseller today!", m, regex("book((sel)(ler))")))
{
  cout << m[0] << endl;
  cout << m[1] << endl;
  cout << m[2] << endl;
  cout << m[3] << endl;
}

The output is:

bookseller

seller

sel

ler

Groups are numbered by the position of their opening parenthesis. So, note that the outer group (seller) comes before the inner group (sel).

Position of Match

The position of match for each sub-string in the cmatch array can be known. Counting begins from the first character of the target string, at position zero. The following code illustrates this:

cmatch m;

if (regex_search("Best bookseller today!", m, regex("book((sel)(ler))")))
{
  cout << m[0] << "->" << m.position(0) << endl;
  cout << m[1] << "->" << m.position(1) << endl;
  cout << m[2] << "->" << m.position(2) << endl;
  cout << m[3] << "->" << m.position(3) << endl;
}

Note the use of the position() member function, with the cell index as an argument. The output is:

bookseller->5

seller->9

sel->9

ler->12

Search and Replace

A new word or phrase can replace the match. The regex_replace() function is used for this. However, this time, the string where the replacement occurs is the string object, not the string literal. So, the string library has to be included in the program. Illustration:

#include <iostream>

#include <regex>

#include <string>

using namespace std;

int main()
{
    string str = "Here, comes my man. There goes your man.";
    string newStr = regex_replace(str, regex("man"), "woman");
    cout << newStr << endl;  

    return 0;
}

The regex_replace() function, as coded here, replaces all the matches. The first argument of the function is the target, the second is the regex object, and the third is the replacement string. The function returns a new string, which is the target but having the replacement. The output is:

Here, comes my woman. There goes your woman.

Conclusion

The regular expression uses patterns to match substrings in the target sequence string. Patterns have metacharacters. Commonly used functions for C++ regular expressions, are: regex_search(), regex_match() and regex_replace(). A regex is a pattern in double-quotes. However, these functions take the regex object as an argument and not just the regex. The regex must be made into a regex object before these functions can use it.



from Linux Hint https://ift.tt/3lPPevR
