More Rules to the Regular Expression Library


There is more to write about the usage of regular expressions than I covered in my last post, The Regular Expression Library. Let's continue.

 


 

The text determines the regular expression, the result, and the capture groups

First of all, the type of the text determines the character type of the regular expression, the type of the search result, and the type of the capture group. Of course, my argument also holds if you apply other parts of the regex machinery to the text. Okay, that sounds worse than it is. A capture group is a subexpression of your search result, which you define with parentheses in the regular expression. I already wrote about it in my last post, The Regular Expression Library.

The table gives all the types depending on the text type.

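Text type        Regular expression   Search result   Capture group
const char*      std::regex           std::cmatch     std::csub_match
std::string      std::regex           std::smatch     std::ssub_match
const wchar_t*   std::wregex          std::wcmatch    std::wcsub_match
std::wstring     std::wregex          std::wsmatch    std::wssub_match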

 

Here is an example of all the variations of std::regex_search, depending on the text type.

 
// search.cpp

#include <iostream>
#include <regex>
#include <string>

int main(){

  std::cout << std::endl;

  // regular expression for time
  std::regex crgx("([01]?[0-9]|2[0-3]):[0-5][0-9]");

  // const char*
  std::cout << "const char*" << std::endl;
  std::cmatch cmatch;

  const char* ctime{"Now it is 23:10."};

  if (std::regex_search(ctime, cmatch, crgx)){

    std::cout << ctime << std::endl;
    std::cout << "Time: " << cmatch[0] << std::endl;

  }

  std::cout << std::endl;

  // std::string
  std::cout << "std::string" << std::endl;
  std::smatch smatch;

  std::string stime{"Now it is 23:25."};
  if (std::regex_search(stime, smatch, crgx)){

    std::cout << stime << std::endl;
    std::cout << "Time: " << smatch[0] << std::endl;

  }

  std::cout << std::endl;

  // regular expression holder for time
  std::wregex wrgx(L"([01]?[0-9]|2[0-3]):[0-5][0-9]");

  // const wchar_t*
  std::cout << "const wchar_t* " << std::endl;
  std::wcmatch wcmatch;

  const wchar_t* wctime{L"Now it is 23:47."};

  if (std::regex_search(wctime, wcmatch, wrgx)){

    std::wcout << wctime << std::endl;
    std::wcout << "Time: " << wcmatch[0] << std::endl;

  }

  std::cout << std::endl;

  // std::wstring
  std::cout << "std::wstring" << std::endl;
  std::wsmatch wsmatch;

  std::wstring  wstime{L"Now it is 00:03."};

  if (std::regex_search(wstime, wsmatch, wrgx)){

    std::wcout << wstime << std::endl;
    std::wcout << "Time: " << wsmatch[0] << std::endl;

  }

  std::cout << std::endl;

}

First, I used a const char*, a std::string, a const wchar_t*, and finally a std::wstring as text. Because the code is almost the same in the four variations, from now on and for the rest of this post, I will only refer to std::string.

The text contains a substring that stands for a time expression. Thanks to the regular expression "([01]?[0-9]|2[0-3]):[0-5][0-9]", I can search for it. The regular expression defines a time format consisting of an hour and a minute, separated by a colon. Here are the hour and minute parts:

  • hour: [01]?[0-9]|2[0-3]
    • [01]?: 0 or 1 (optional)
    • [0-9]: a digit from 0 to 9
    • |: stands for or
    • 2[0-3]: 2 followed by a digit from 0 to 3
  • minute: [0-5][0-9]: a digit from 0 to 5 followed by a digit from 0 to 9

Finally, the output of the program.

[Figure: output of search.cpp]

Don't use repeated std::regex_search calls, because you can easily lose word boundaries or get empty hits. Instead, use std::regex_iterator or std::regex_token_iterator for a repeated search. std::regex_token_iterator allows you to address the components of each capture group or the text between the matches.

The "Hello World" of repeated search with regex is to count how often each word appears in a text. Here is the corresponding program.

// wordCount.cpp

#include <algorithm>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <iterator>
#include <map>
#include <regex>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using str2Int = std::unordered_map<std::string, std::size_t>;          // (1)
using intAndWords = std::pair<std::size_t, std::vector<std::string>>;
using int2Words= std::map<std::size_t,std::vector<std::string>>; 


// count the frequency of each word
str2Int wordCount(const std::string &text) {
  std::regex wordReg(R"(\w+)");                                        // (2)
  std::sregex_iterator wordItBegin(text.begin(), text.end(), wordReg); // (3)
  const std::sregex_iterator wordItEnd;
  str2Int allWords;
  for (; wordItBegin != wordItEnd; ++wordItBegin) {
    ++allWords[wordItBegin->str()];
  }
  return allWords;
}

// map each frequency to the words that occur that often
int2Words frequencyOfWords(const str2Int &wordCount) {
  int2Words freq2Words;
  for (const auto &wordIt : wordCount) {
    auto freqWord = wordIt.second;
    if (freq2Words.find(freqWord) == freq2Words.end()) {
      freq2Words.insert(intAndWords(freqWord, {wordIt.first}));
    } else {
      freq2Words[freqWord].push_back(wordIt.first);
    }
  }
  return freq2Words;
}

int main(int argc, char *argv[]) {

  std::cout << std::endl;

  // get the filename
  std::string myFile;
  if (argc == 2) {
    myFile = {argv[1]};
  } else {
    std::cerr << "Filename missing !" << std::endl;
    exit(EXIT_FAILURE);
  }

  // open the file
  std::ifstream file(myFile, std::ios::in);
  if (!file) {
    std::cerr << "Can't open file " + myFile + "!" << std::endl;
    exit(EXIT_FAILURE);
  }

  // read the file
  std::stringstream buffer;
  buffer << file.rdbuf();
  std::string text(buffer.str());

  // get the frequency of each word
  auto allWords = wordCount(text);                                     

  std::cout << "The first 20 (key, value)-pairs: " << std::endl;
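  auto end = allWords.begin();
  // don't advance past the end if the text has fewer than 20 unique words
  std::advance(end, std::min<std::size_t>(20, allWords.size()));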
  for (auto pair = allWords.begin(); pair != end; ++pair) {            // (4)
    std::cout << "(" << pair->first << ": " << pair->second << ")";
  }
  std::cout << "\n\n";

  std::cout << "allWords[Web]: " << allWords["Web"] << std::endl;      // (5)
  std::cout << "allWords[The]: " << allWords["The"] << "\n\n";

  std::cout << "Number of unique words: ";
  std::cout << allWords.size() << "\n\n";                              // (6)

  size_t sumWords = 0;
  for (auto wordIt : allWords)
    sumWords += wordIt.second;
  std::cout << "Total number of words: " << sumWords << "\n\n";

  auto allFreq = frequencyOfWords(allWords);                           

                                                                       // (7)
  std::cout << "Number of different frequencies: " << allFreq.size() << "\n\n";

  std::cout << "All frequencies: ";                                    // (8)
  for (auto freqIt : allFreq)
    std::cout << freqIt.first << " ";
  std::cout << "\n\n";

  std::cout << "The most frequently used word(s): " << std::endl;      // (9)
  auto atTheEnd = allFreq.rbegin();
  std::cout << atTheEnd->first << " :";
  for (auto word : atTheEnd->second)
    std::cout << word << " ";
  std::cout << "\n\n";

                                                                       // (10)
  std::cout << "All words which appear more than 1000 times:" << std::endl;
  auto biggerIt =
      std::find_if(allFreq.begin(), allFreq.end(),
                   [](intAndWords iAndW) { return iAndW.first > 1000; });
  if (biggerIt == allFreq.end()) {
    std::cerr << "No word appears more than 1000 times !" << std::endl;
    exit(EXIT_FAILURE);
  } else {
    for (auto allFreqIt = biggerIt; allFreqIt != allFreq.end(); ++allFreqIt) {
      std::cout << allFreqIt->first << " :";
      for (auto word : allFreqIt->second)
        std::cout << word << " ";
      std::cout << std::endl;
    }
  }
  std::cout << std::endl;
}

 

To better understand the program, I added a few comments to it.

The using declarations in line 1 save me some typing. The function wordCount determines the frequency of each word, and the function frequencyOfWords returns, for each frequency, all words that occur that often. What is a word? Line 2 defines it with the regular expression \w+, and line 3 uses it in a std::sregex_iterator. Let's see which questions I can answer with the two functions:

  • Line 4: first 20 (key, value)-pairs
  • Line 5: frequency of the words "Web" and "The"
  • Line 6: number of unique words
  • Line 7: number of different frequencies
  • Line 8: all appearing frequencies
  • Line 9: the most frequently used word
  • Line 10: words that appear more than 1000 times

Now, I need a lengthy text. Of course, I will use Grimm's Fairy Tales from Project Gutenberg. Here is the output:

[Figure: output of wordCount.cpp]

What's next?

I'm almost done with the regex functionality in C++, but I have one more guideline in mind that often makes repeated search easier: don't search for the text patterns, but for the delimiters of the text patterns. I call this negative search.

 

 
