Splitting a string in C++

Ben Key:

June 11, 2013; Updated November 25, 2018

Introduction

A common task in programming is to split a delimited string into an array of tokens. For example, it may be necessary to split a string containing spaces into an array of words. This is one area where programming languages like Java and Python surpass C++ since both of these programming languages include support for this in their standard libraries while C++ does not. However, this task can be accomplished in C++ in various ways.

In Java this task could be accomplished using the String.split method as follows.


String Str = "The quick brown fox jumped over the lazy dog.";
String[] Results = Str.split(" ");

In Python this task could be accomplished using the str.split method as follows.


Str = "The quick brown fox jumped over the lazy dog."
Results = Str.split()

In C++ this task is not quite so simple, but it can still be accomplished in a variety of ways.

Using the C runtime library

One option is to use the C runtime library. The strtok, strtok_r, strtok_s, wcstok, and wcstok_s functions can be used to split a string.

Unfortunately, there are various differences between platforms. This necessitates the use of C preprocessor directives to determine what is appropriate for the current platform.

The following code demonstrates the technique.


#include <cstring>
#include <cwchar>
#include <memory>
#include <string>
#include <vector>

char* FindToken(
    char* token, const char* delim, char** saveptr)
{
#if (_SVID_SOURCE || _BSD_SOURCE || _POSIX_C_SOURCE >= 1 \
  || _XOPEN_SOURCE || _POSIX_SOURCE)
    return ::strtok_r(token, delim, saveptr);
#elif defined(_MSC_VER) && (_MSC_VER >= 1800)
    return strtok_s(token, delim, saveptr);
#else
    // Plain strtok keeps its own state, so saveptr is unused here.
    return std::strtok(token, delim);
#endif
}

wchar_t* FindToken(
    wchar_t* token, const wchar_t* delim, wchar_t** saveptr)
{
#if ( (defined(__STDC_VERSION__) && (__STDC_VERSION__ >= 199901L)) \
  || (defined(__cplusplus) && (__cplusplus >= 201103L)) )
    return std::wcstok(token, delim, saveptr);
#elif defined(_MSC_VER) && (_MSC_VER >= 1800)
    return wcstok_s(token, delim, saveptr);
#else
    return std::wcstok(token, delim);
#endif
}

char* CopyString(char* destination, const char* source)
{
    return std::strcpy(destination, source);
}

wchar_t* CopyString(wchar_t* destination, const wchar_t* source)
{
    return std::wcscpy(destination, source);
}

template <class charType>
size_t splitWithFindToken(
    const std::basic_string<charType>& str,
    const std::basic_string<charType>& delim,
    std::vector< std::basic_string<charType> >& tokens)
{
    std::unique_ptr<charType[]> ptr = std::make_unique<charType[]>(str.length() + 1);
    memset(ptr.get(), 0, (str.length() + 1) * sizeof(charType));
    CopyString(ptr.get(), str.c_str());
    charType* saveptr;
    charType* token = FindToken(ptr.get(), delim.c_str(), &saveptr);
    while (token != nullptr)
    {
        tokens.push_back(token);
        token = FindToken(nullptr, delim.c_str(), &saveptr);
    }
    return tokens.size();
}
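
As a rough illustration of the same technique for the narrow-character case only, the sketch below tokenizes a copy of the string with plain std::strtok, which is available on every platform but is not reentrant. The name splitWithStrtok is hypothetical and is not part of the code above.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: splits str on any character in delims using
// std::strtok. std::strtok keeps hidden global state, so this sketch is
// neither reentrant nor thread safe.
std::vector<std::string> splitWithStrtok(
    const std::string& str, const std::string& delims)
{
    std::vector<std::string> tokens;
    // std::strtok modifies its input, so work on a null-terminated copy.
    std::vector<char> buffer(str.begin(), str.end());
    buffer.push_back('\0');
    char* token = std::strtok(buffer.data(), delims.c_str());
    while (token != nullptr)
    {
        tokens.push_back(token);
        token = std::strtok(nullptr, delims.c_str());
    }
    return tokens;
}
```

Calling splitWithStrtok("The quick  brown fox", " ") yields four tokens, because strtok treats a run of delimiters as a single separator and never returns an empty token.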

Using the basic_istringstream class

Solution 1: std::istream_iterator

Perhaps the simplest method of accomplishing this task is using the basic_istringstream class as follows.


template <class charType>
size_t splitWithStringStream(
    const std::basic_string<charType>& str,
    std::vector< std::basic_string<charType> >& tokens)
{
    typedef std::basic_string<charType> my_string;
    typedef std::vector< std::basic_string<charType> > my_vector;
    typedef std::basic_istringstream<
        charType, std::char_traits<charType> >
        my_istringstream;
    typedef std::istream_iterator<
        std::basic_string<charType>, charType,
        std::char_traits<charType> >
        my_istream_iterator;
    tokens.clear();
    if (str.empty())
    {
        return 0;
    }
    my_istringstream iss(str);
    std::copy(
        my_istream_iterator{iss}, my_istream_iterator(),
        std::back_inserter<my_vector>(tokens));
    return tokens.size();
}

The splitWithStringStream function can be used as follows.


std::string str("The quick brown fox jumped over the lazy dog.");
std::vector<std::string> tokens;
size_t s = splitWithStringStream(str, tokens);

The splitWithStringStream function has the advantage of using nothing beyond functions that are part of the C++ standard library. To use it you just need to include the following C++ standard library headers: algorithm, iterator, sstream, string, and vector.

An alternate version of the function is as follows.


template <class charType>
size_t splitWithStringStream1(
    const std::basic_string<charType>& str,
    std::vector< std::basic_string<charType> >& tokens)
{
    typedef std::basic_string<charType> my_string;
    typedef std::vector< std::basic_string<charType> > my_vector;
    typedef std::basic_istringstream<
        charType, std::char_traits<charType> >
        my_istringstream;
    typedef std::istream_iterator<
        std::basic_string<charType>, charType,
        std::char_traits<charType> >
        my_istream_iterator;
    tokens.clear();
    if (str.empty())
    {
        return 0;
    }
    my_istringstream iss(str);
    std::vector<my_string> results(
        my_istream_iterator{iss}, my_istream_iterator());
    tokens.swap(results);
    return tokens.size();
}

The splitWithStringStream and the splitWithStringStream1 functions do have two drawbacks. First, the functions are potentially inefficient and slow since the entire string is copied into the stream, which takes up just as much memory as the string. Second, they only support whitespace-delimited strings, because the extraction operator used by istream_iterator treats any run of whitespace (spaces, tabs, and newlines) as a separator.
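
The whitespace-only limitation can be seen with a condensed, char-only sketch of the same approach. The name splitOnWhitespace is hypothetical and is used here purely for illustration.

```cpp
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Illustrative only: splits on whitespace exactly as operator>> does.
std::vector<std::string> splitOnWhitespace(const std::string& str)
{
    std::istringstream iss(str);
    return std::vector<std::string>(
        std::istream_iterator<std::string>{iss},
        std::istream_iterator<std::string>{});
}
```

With the input " The  quick\tbrown\nfox " this returns the four tokens The, quick, brown, and fox, since leading, trailing, and repeated whitespace is skipped. With the input "a,b,c" it returns a single token, because a comma is not whitespace.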

Solution 2: std::getline

The following function makes it possible to use a character other than space as the delimiter.


template<typename charType>
size_t splitWithGetLine(
    const std::basic_string<charType>& str,
    const charType delim,
    std::vector< std::basic_string<charType> >& tokens)
{
    typedef std::basic_string<charType> my_string;
    typedef std::basic_istringstream<
        charType, std::char_traits<charType> >
        my_istringstream;
    tokens.clear();
    if (str.empty())
    {
        return 0;
    }
    my_istringstream iss(str);
    my_string token;
    while (std::getline(iss, token, delim))
    {
        tokens.push_back(token);
    }
    return tokens.size();
}

This function can be used as follows.


std::wstring str(L"This is a test.||This is only a test.|This concludes this test.");
std::vector<std::wstring> tokens;
size_t s = splitWithGetLine(str, L'|', tokens);

Note that this solution does not skip empty tokens, so the above example will result in tokens containing four items, one of which would be an empty string.

This function, like the splitWithStringStream and splitWithStringStream1 functions, is potentially inefficient and slow. It does allow the delimiter character to be specified. However, it only supports a single delimiting character. This function does not support strings in which several delimiting characters may be used.

To use the splitWithGetLine function you just need to include the following C++ standard library headers: sstream, string, and vector.
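
Because the std::getline approach keeps empty tokens, a caller who does not want them has to filter them out. One possible way to do so is sketched below for the char case. The name splitSkippingEmpty is hypothetical.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical variant of the getline approach that discards empty tokens.
std::vector<std::string> splitSkippingEmpty(
    const std::string& str, char delim)
{
    std::vector<std::string> tokens;
    std::istringstream iss(str);
    std::string token;
    while (std::getline(iss, token, delim))
    {
        // Two adjacent delimiters produce an empty token; skip it.
        if (!token.empty())
        {
            tokens.push_back(token);
        }
    }
    return tokens;
}
```

For the string "This is a test.||This is only a test.|This concludes this test." and the delimiter '|', this sketch returns three tokens rather than four.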

Using only members of the basic_string class

Solution 1

It is possible to accomplish this task using only member functions of the basic_string class. The following function allows you to specify the delimiting character and uses only the find_first_not_of, find, and substr members of the basic_string class. The function also has optional parameters that allow you to specify that empty tokens should be ignored and to specify a maximum number of segments that the string should be split into.


template<typename charType>
size_t splitWithBasicString(
    const std::basic_string<charType>& str,
    const charType delim,
    std::vector< std::basic_string<charType> > &tokens,
    const bool trimEmpty = false,
    const size_t maxTokens = (size_t)(-1))
{
    typedef std::basic_string<charType> my_string;
    typedef typename my_string::size_type my_size_type;
    tokens.clear();
    if (str.empty())
    {
        return 0;
    }
    my_size_type len = str.length();
    // Skip delimiters at beginning.
    my_size_type left = str.find_first_not_of(delim, 0);
    size_t i = 1;
    if (!trimEmpty && left != 0)
    {
        tokens.push_back(my_string());
        ++i;
    }
    while (i < maxTokens)
    {
        my_size_type right = str.find(delim, left);
        if (right == my_string::npos)
        {
            break;
        }
        if (!trimEmpty || right - left > 0)
        {
            tokens.push_back(str.substr(left, right - left));
            ++i;
        }
        left = right + 1;
    }
    if (left < len)
    {
        tokens.push_back(str.substr(left));
    }
    return tokens.size();
}

This function does not suffer from the same potential performance issues as the stream based functions and it allows you to specify the delimiting character. However, it only supports a single delimiting character. This function does not support strings in which several delimiting characters may be used.

To use the splitWithBasicString function you just need to include the following C++ standard library headers: string and vector.
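
The core of the find and substr technique can be seen more easily in a stripped-down form. The sketch below is not the function above: it is char-only, has no trimEmpty or maxTokens parameters, and always keeps empty tokens. The name splitKeepEmpty is hypothetical.

```cpp
#include <string>
#include <vector>

// Simplified sketch: split on one delimiter character, keeping empty tokens.
std::vector<std::string> splitKeepEmpty(const std::string& str, char delim)
{
    std::vector<std::string> tokens;
    std::string::size_type left = 0;
    std::string::size_type right = str.find(delim, left);
    while (right != std::string::npos)
    {
        tokens.push_back(str.substr(left, right - left));
        left = right + 1;
        right = str.find(delim, left);
    }
    // Whatever follows the last delimiter (possibly nothing) is the last token.
    tokens.push_back(str.substr(left));
    return tokens;
}
```

With this sketch, "a,,b" split on ',' gives three tokens with an empty string in the middle, and "a," gives two tokens, the second of which is empty.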

Solution 2

Sometimes the string that is to be split uses several different delimiting characters. At other times it may be impossible to know in advance which delimiting character is used, only that it is one of several possibilities. In these cases the function must be able to accept a string containing each possible delimiting character. This too can be accomplished using only member functions of the basic_string class.

The following function allows you to specify the delimiting characters and uses only the find_first_not_of, find_first_of, and substr members of the basic_string class. The function also has optional parameters that allow you to specify that empty tokens should be ignored and to specify a maximum number of segments that the string should be split into.


template<typename charType>
size_t splitWithBasicString(
    const std::basic_string<charType>& str,
    const std::basic_string<charType>& delim,
    std::vector< std::basic_string<charType> >& tokens,
    const bool trimEmpty = false,
    const size_t maxTokens = (size_t)(-1))
{
    typedef std::basic_string<charType> my_string;
    typedef typename my_string::size_type my_size_type;
    tokens.clear();
    if (str.empty())
    {
        return 0;
    }
    my_size_type len = str.length();
    // Skip delimiters at beginning.
    my_size_type left = str.find_first_not_of(delim, 0);
    size_t i = 1;
    if (!trimEmpty && left != 0)
    {
        tokens.push_back(my_string());
        ++i;
    }
    while (i < maxTokens)
    {
        my_size_type right = str.find_first_of(delim, left);
        if (right == my_string::npos)
        {
            break;
        }
        if (!trimEmpty || right - left > 0)
        {
            tokens.push_back(str.substr(left, right - left));
            ++i;
        }
        left = right + 1;
    }
    if (left < len)
    {
        tokens.push_back(str.substr(left));
    }
    return tokens.size();
}
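
To show how find_first_not_of and find_first_of work together, here is a reduced sketch that always skips empty tokens. It is not the function above: it is char-only and has no trimEmpty or maxTokens parameters. The name splitOnAnyOf is hypothetical.

```cpp
#include <string>
#include <vector>

// Simplified sketch: split on any character in delims, skipping empty tokens.
std::vector<std::string> splitOnAnyOf(
    const std::string& str, const std::string& delims)
{
    std::vector<std::string> tokens;
    // Find the start of the first token, skipping leading delimiters.
    std::string::size_type left = str.find_first_not_of(delims, 0);
    while (left != std::string::npos)
    {
        std::string::size_type right = str.find_first_of(delims, left);
        // substr clamps the count, so right == npos takes the rest.
        tokens.push_back(str.substr(left, right - left));
        if (right == std::string::npos)
        {
            break;
        }
        // Skip the run of delimiters that follows this token.
        left = str.find_first_not_of(delims, right);
    }
    return tokens;
}
```

For example, splitOnAnyOf("a, b;c", ",; ") returns the three tokens a, b, and c, and a string that consists only of delimiters returns no tokens at all.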

Using Boost

Boost is a collection of peer-reviewed, cross-platform, open source C++ libraries that are designed to complement and extend the C++ standard library. Boost provides at least two methods for splitting a string.

Solution 1

One option is to use the boost::algorithm::split function in the Boost String Algorithms Library.

In order to use the split function simply include boost/algorithm/string.hpp and then call the function as follows.


std::string str(" The  quick brown fox\tjumped over the lazy dog.");
std::vector<std::string> strs;
boost::split(strs, str, boost::is_any_of("\t "));

Solution 2

Another option is to use the Boost Tokenizer Library. In order to use the Boost Tokenizer Library simply include boost/tokenizer.hpp. Then you can use the Boost Tokenizer as follows.


typedef boost::char_separator<char> my_separator;
typedef boost::tokenizer<my_separator> my_tokenizer;
std::string str(" The  quick brown fox\tjumped over the lazy dog.");
my_separator sep(" \t");
my_tokenizer tokens(str, sep);
my_tokenizer::iterator itEnd = tokens.end();
for (my_tokenizer::iterator it = tokens.begin(); it != itEnd; ++it)
{
    std::cout << *it << std::endl;
}

Using the C++ String Toolkit Library

Another option is to use the C++ String Toolkit Library. The following example shows how the strtk::parse function can be used to split a string.


std::string str("The quick brown fox jumped over the lazy dog.");
std::vector<std::string> tokens;
strtk::parse(str, " ", tokens);

Other Options

Of course, there are many other options. Feel free to refer to the web pages listed in the references section below for more of them.
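
One such option, which is not covered elsewhere in this article, is std::sregex_token_iterator from the C++11 regex header. The sketch below is only an illustration. The name splitWithRegex is hypothetical, and constructing a std::regex can be noticeably slower than the other approaches.

```cpp
#include <regex>
#include <string>
#include <vector>

// Sketch: split on a regular expression using std::sregex_token_iterator.
// Passing -1 selects the text between matches instead of the matches.
std::vector<std::string> splitWithRegex(
    const std::string& str, const std::string& pattern)
{
    const std::regex re(pattern);
    return std::vector<std::string>(
        std::sregex_token_iterator(str.begin(), str.end(), re, -1),
        std::sregex_token_iterator());
}
```

For example, splitWithRegex("The quick   brown fox", "\\s+") returns four tokens, since the pattern matches each run of whitespace as one separator.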

Summary

In this article I discussed several options for splitting strings in C++.

The code for the basic_istringstream, basic_string, and Boost solutions, along with a complete sample demonstrating the use of the functions, can be found on Ideone.

References
