Using Direct Annotation in C++

Sometimes you will encounter a situation in which the existing accessibility support for a user interface element is almost good enough and you only need to make small changes to it. For example, perhaps you just need to override the Name or Role properties. In these cases, fully implementing the IAccessible interface as described in Implementing a Microsoft Active Accessibility (MSAA) Server Using the Windows API and C++ may seem like a great deal of effort to achieve these improvements.

Fortunately, starting with Windows 7 there is an easier alternative, the Direct Annotation feature of the Dynamic Annotation API. The Dynamic Annotation API is an extension to Microsoft Active Accessibility that makes it possible to customize existing IAccessible support without using subclassing or wrapping techniques. Direct Annotation is the simplest of the several features provided by the Dynamic Annotation API.

Direct Annotation makes it possible to customize the following properties.

Since Name, Role, Value, and State are included in the list of properties supported by the Direct Annotation feature, this feature is sufficient to meet WCAG Success Criterion 4.1.2: Name, Role, Value.

Implementation Details

The first step to using the Direct Annotation feature of the Dynamic Annotation API is to create an IAccPropServices object using either the CoCreateInstance function or the CoCreateInstanceEx function. Then you can use either the IAccPropServices::SetHwndProp method or the IAccPropServices::SetHwndPropStr method.

The following sample from Using Direct Annotation demonstrates the technique.



More complete examples can be found in the following locations.

Note that using the IAccPropServices::SetHwndProp method or the IAccPropServices::SetHwndPropStr method will not send WinEvents. It is your responsibility to send appropriate events by calling the NotifyWinEvent function after the IAccPropServices::SetHwndProp method or the IAccPropServices::SetHwndPropStr method has been called.

For example, if you use the IAccPropServices::SetHwndPropStr method to set the Name property of an element, send an EVENT_OBJECT_NAMECHANGE event for that object after SetHwndPropStr returns.

The CWndWithCustomizedAccessibleInfo class and DirectAnnotation class both make the necessary calls to the NotifyWinEvent function for you.


Implementing a Microsoft Active Accessibility (MSAA) Server Using the Windows API and C++

The oldest and simplest of the Accessibility APIs used in Microsoft Windows is Microsoft Active Accessibility or MSAA. This article provides a brief introduction on how to implement an MSAA server using the Windows API and the C++ programming language.

The IAccessible interface

The heart of MSAA is the IAccessible interface. This interface exposes methods and properties that allow an accessibility server to make a user interface element and its children accessible to accessibility client applications. The methods and properties of the IAccessible interface allow a Windows API application to meet WCAG Success Criterion 4.1.2: Name, Role, Value, which requires that an accessibility client be able to programmatically determine the Name, Role, Value, and States of all user interface components (including but not limited to: form elements, links and components generated by scripts).

The following table lists the methods of the IAccessible interface that are used to expose each of these properties.

Property    Method
Name        IAccessible::get_accName
Role        IAccessible::get_accRole
State       IAccessible::get_accState
Value       IAccessible::get_accValue

See the IAccessible interface page for a complete list of the members of the interface and a detailed description of each method. The Content of Descriptive Properties page provides information on the expected value of these properties.

The Name, Role, and State properties are required for all user interface components that can gain focus. The Value property may be required for some user interface components but not for others. More information on which members of the IAccessible interface should be implemented for a given user interface component can be found in the Choosing Which Properties to Support page.

How an accessibility client obtains an IAccessible object

An accessibility client uses one of the following functions to obtain an IAccessible object.

See the Getting an Accessible Object Interface Pointer page for more information on these functions.

What happens when an accessibility client requests an IAccessible object?

When an accessibility client requests an IAccessible object using the AccessibleObjectFromEvent, AccessibleObjectFromPoint, or AccessibleObjectFromWindow function, Microsoft Active Accessibility sends the WM_GETOBJECT message to the appropriate window procedure within the appropriate server application with the lParam parameter set to OBJID_CLIENT. Your Microsoft Active Accessibility (MSAA) server needs to handle this message.

Handling the WM_GETOBJECT Message

When your Microsoft Active Accessibility (MSAA) server receives the WM_GETOBJECT message, it should return the value obtained by passing the IAccessible interface of the object to the LresultFromObject function as follows.



The above code is from the Windows API Accessibility Server sample application.

For more information on handling the WM_GETOBJECT message see the following.

Events

Implementing the IAccessible interface is just one part of the puzzle. You also need to fire event notifications. This is done using the NotifyWinEvent function. See Event Constants, System-Level and Object-Level Events, and What Are WinEvents? for more information.

Additional Resources

Additional information can be found in the following locations.

Using Character or String Constants in a Template function


Ben Key

October 1, 2013; November 10, 2018

 


This article provides several techniques that may be used when creating template functions in which the character type is a template parameter and you must use character or string constants. It also provides the source code for several template functions and macros that may be used to implement one of the proposed solutions.

Problem Description

There are various data types that may be used to represent characters in C++. The most common of these are char and wchar_t. It often is necessary to write code that is capable of handling either type of character.

One alternative is to simply implement the function once for each character type that needs to be supported. There are obvious problems with this approach. The first problem is that this approach causes unnecessary code duplication. It also opens up the possibility that the different implementations of the function will diverge over time as changes are made in one implementation, for example in order to fix bugs, but not in the other implementations of the function.

Consider the following example.

The CommandLineToArgVector function parses a command line string such as that which might be returned by the GetCommandLine function. This function depends on a number of character constants: specifically NULCHAR ('\0'), SPACECHAR (' '), TABCHAR ('\t'), DQUOTECHAR ('"'), and SLASHCHAR ('\\').

In order to support both char and wchar_t characters it is necessary to implement the function twice as follows.


inline size_t CommandLineToArgVector(
    const char* commandLine,
    std::vector<std::string>& arg_vector)
{
    arg_vector.clear();
    /* Code omitted. */
    return static_cast<size_t>(arg_vector.size());
}

inline size_t CommandLineToArgVector(
    const wchar_t* commandLine,
    std::vector<std::wstring>& arg_vector)
{
    arg_vector.clear();
    /* Code omitted. */
    return static_cast<size_t>(arg_vector.size());
}

The obvious solution is to implement this as a template function that accepts the character type as a template parameter as follows.


template<typename CharType>
size_t CommandLineToArgVector(
    const CharType* commandLine,
    std::vector< std::basic_string<CharType> >& arg_vector)
{
    arg_vector.clear();
    /* Code omitted. */
    return static_cast<size_t>(arg_vector.size());
}

The only thing that prevents us from doing that is the existence of the character constants. How do you represent the constants so that they can be of either data type?

Possible Solutions

As is often the case, there are many possible solutions to this problem. This article discusses several of them.

Widen

One possible solution is to make use of std::ctype::widen as follows.


template <typename CharType>
CharType widenChar(
    const char ch, const std::locale& loc = std::locale())
{
    const auto& cType = std::use_facet<std::ctype<CharType>>(loc);
    return cType.widen(ch);
}

This same technique can be extended for use with string literals as follows.


template<typename CharType>
std::basic_string<CharType> widenString(
    const char* str, const std::locale& loc = std::locale())
{
    std::basic_string<CharType> ret;
    if (str == nullptr || str[0] == 0)
    {
        return ret;
    }
    const auto& cType = std::use_facet<std::ctype<CharType>>(loc);
    auto srcLen = std::strlen(str);
    auto bufferSize = srcLen + 32;
    auto tmpPtr = yekneb::make_unique<CharType[]>(bufferSize);
    auto tmp = tmpPtr.get();
    cType.widen(str, str + srcLen, tmp);
    ret = tmp;
    return ret;
}

Then in the template function you can use the character constants in a character type neutral way as follows.


const CharType NULCHAR = widenChar<CharType>('\0');
const CharType SPACECHAR = widenChar<CharType>(' ');
const CharType TABCHAR =  widenChar<CharType>('\t');
const CharType DQUOTECHAR = widenChar<CharType>('\"');
const CharType SLASHCHAR =  widenChar<CharType>('\\');

The problem is that this requires a multibyte-to-wide-character conversion at run time for the wchar_t case. This may not be much of an issue for functions that are not used very often and are not performance critical. However, the cost of the conversion would not be acceptable in performance-critical situations.
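To make the cost concrete, here is a minimal, self-contained sketch of a template function built on widenChar. The CountBlanks function is hypothetical and exists only for illustration; it is not part of the article's source code. Note that each constant is produced by a facet lookup and a call to widen every time the function runs.

```cpp
#include <cstddef>
#include <locale>
#include <string>

// Same helper as shown above, repeated so that this sketch compiles on its own.
template <typename CharType>
CharType widenChar(const char ch, const std::locale& loc = std::locale())
{
    const auto& cType = std::use_facet<std::ctype<CharType>>(loc);
    return cType.widen(ch);
}

// Hypothetical example: counts the spaces and tabs in a string of either
// character type.
template <typename CharType>
std::size_t CountBlanks(const std::basic_string<CharType>& text)
{
    // Each constant costs a facet lookup and a call to widen at run time.
    const CharType SPACECHAR = widenChar<CharType>(' ');
    const CharType TABCHAR = widenChar<CharType>('\t');
    std::size_t count = 0;
    for (const CharType ch : text)
    {
        if (ch == SPACECHAR || ch == TABCHAR)
        {
            ++count;
        }
    }
    return count;
}
```

Calling CountBlanks(std::string("a b\tc")) and CountBlanks(std::wstring(L"a b\tc")) both go through the same code path; only the facet that is looked up differs.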

The question is how to solve this problem without these performance issues.

Algorithm Traits

Another option is to create an algorithm traits class that includes functions that return each of the constants as follows.


template <typename CharType>
struct CommandLineToArgVector_Traits
{
    CharType NULCHAR();
    CharType SPACECHAR();
    CharType TABCHAR();
    CharType DQUOTECHAR();
    CharType SLASHCHAR();
    const CharType* STRING();
};

template<>
struct CommandLineToArgVector_Traits<char>
{
    char NULCHAR()
    {
        return '\0';
    }
    char SPACECHAR()
    {
        return ' ';
    }
    char TABCHAR()
    {
        return '\t';
    }
    char DQUOTECHAR()
    {
        return '\"';
    }
    char SLASHCHAR()
    {
        return '\\';
    }
    const char* STRING()
    {
        return "String";
    }
};

template<>
struct CommandLineToArgVector_Traits<wchar_t>
{
    wchar_t NULCHAR()
    {
        return L'\0';
    }
    wchar_t SPACECHAR()
    {
        return L' ';
    }
    wchar_t TABCHAR()
    {
        return L'\t';
    }
    wchar_t DQUOTECHAR()
    {
        return L'\"';
    }
    wchar_t SLASHCHAR()
    {
        return L'\\';
    }
    const wchar_t* STRING()
    {
        return L"String";
    }
};

As you can see, it is a simple matter to incorporate string constants in the algorithm traits structure.

The algorithm traits structure can then be used as follows in your template function.


CommandLineToArgVector_Traits<CharType> traits;
const CharType NULCHAR = traits.NULCHAR();
const CharType SPACECHAR = traits.SPACECHAR();
const CharType TABCHAR =  traits.TABCHAR();
const CharType DQUOTECHAR = traits.DQUOTECHAR();
const CharType SLASHCHAR =  traits.SLASHCHAR();
const CharType* STRING = traits.STRING();

This solution does not have the same performance issues that the widenChar solution does but these performance gains come at the cost of a significant increase in the complexity of the code. In addition, there is now a need to maintain two implementations of the Algorithm Traits structure, which means that we are right back where we started from, though admittedly maintaining two implementations of the Algorithm Traits structure is a lot less work than maintaining two implementations of the algorithm.
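Some of that complexity can be trimmed if a C++11 compiler is available. The following is a sketch, not the article's code: the BlankTraits structure and the CountBlanksWithTraits function are hypothetical names. The members are static constexpr functions, so no traits object has to be created and the constants can be resolved at compile time.

```cpp
#include <cstddef>
#include <string>

// Hypothetical traits structure. The primary template is intentionally left
// undefined so that unsupported character types fail to compile.
template <typename CharType>
struct BlankTraits;

template <>
struct BlankTraits<char>
{
    static constexpr char SPACECHAR() { return ' '; }
    static constexpr char TABCHAR() { return '\t'; }
};

template <>
struct BlankTraits<wchar_t>
{
    static constexpr wchar_t SPACECHAR() { return L' '; }
    static constexpr wchar_t TABCHAR() { return L'\t'; }
};

// Hypothetical example: counts the spaces and tabs in a string of either
// character type using the traits structure.
template <typename CharType>
std::size_t CountBlanksWithTraits(const std::basic_string<CharType>& text)
{
    typedef BlankTraits<CharType> traits;
    std::size_t count = 0;
    for (const CharType ch : text)
    {
        if (ch == traits::SPACECHAR() || ch == traits::TABCHAR())
        {
            ++count;
        }
    }
    return count;
}
```

The duplication between the two specializations remains, which is the drawback described above.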

It would be ideal to develop a solution that has the convenience and simplicity of the widenChar solution without the performance issues. The question is how.

Preprocessor and Template Magic

Fortunately it is possible to solve this problem using some preprocessor and template magic. I found this solution in the Stack Overflow article How to express a string literal within a template parameterized by the type of the characters used to represent the literal.

The solution is as follows.


template<typename CharType>
CharType CharConstantOfType(const char c, const wchar_t w);

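// Full specializations are not implicitly inline. Marking them inline allows
// this code to live in a header that is included by several source files.
template<>
inline char CharConstantOfType<char>(const char c, const wchar_t /*w*/)
{
    return c;
}
template<>
inline wchar_t CharConstantOfType<wchar_t>(const char /*c*/, const wchar_t w)
{
    return w;
}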

template<typename CharType>
const CharType* StringConstantOfType(const char* c, const wchar_t* w);

template<>
inline const char* StringConstantOfType<char>(const char* c, const wchar_t* /*w*/)
{
    return c;
}
template<>
inline const wchar_t* StringConstantOfType<wchar_t>(const char* /*c*/, const wchar_t* w)
{
    return w;
}

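// Names that begin with an underscore followed by an uppercase letter are
// reserved for the implementation, so the helper macro avoids that form.
#define TOWSTRING_IMPL(x) L##x
#define TOWSTRING(x) TOWSTRING_IMPL(x)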
#define CHAR_CONSTANT(TYPE, STRING) CharConstantOfType<TYPE>(STRING, TOWSTRING(STRING))
#define STRING_CONSTANT(TYPE, STRING) StringConstantOfType<TYPE>(STRING, TOWSTRING(STRING))

Then in the template function you can use the character constants in a character type neutral way as follows.


const CharType NULCHAR = CHAR_CONSTANT(CharType, '\0');
const CharType SPACECHAR = CHAR_CONSTANT(CharType, ' ');
const CharType TABCHAR =  CHAR_CONSTANT(CharType, '\t');
const CharType DQUOTECHAR = CHAR_CONSTANT(CharType, '\"');
const CharType SLASHCHAR =  CHAR_CONSTANT(CharType, '\\');
const CharType* STRING = STRING_CONSTANT(CharType, "String");

Abracadabra. The problem is solved without any code duplication or performance issues.
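The following self-contained sketch puts the pieces together in a small template function. SplitOnBlanks is a hypothetical, much simplified stand-in for CommandLineToArgVector (it does no quote or backslash handling), and the TOWSTRING_IMPL helper name is this sketch's own choice.

```cpp
#include <string>
#include <vector>

template <typename CharType>
CharType CharConstantOfType(const char c, const wchar_t w);

template <>
inline char CharConstantOfType<char>(const char c, const wchar_t /*w*/)
{
    return c;
}

template <>
inline wchar_t CharConstantOfType<wchar_t>(const char /*c*/, const wchar_t w)
{
    return w;
}

#define TOWSTRING_IMPL(x) L##x
#define TOWSTRING(x) TOWSTRING_IMPL(x)
#define CHAR_CONSTANT(TYPE, STRING) CharConstantOfType<TYPE>(STRING, TOWSTRING(STRING))

// Hypothetical example: splits text on spaces and tabs for either character type.
template <typename CharType>
std::vector<std::basic_string<CharType>> SplitOnBlanks(const CharType* text)
{
    const CharType NULCHAR = CHAR_CONSTANT(CharType, '\0');
    const CharType SPACECHAR = CHAR_CONSTANT(CharType, ' ');
    const CharType TABCHAR = CHAR_CONSTANT(CharType, '\t');
    std::vector<std::basic_string<CharType>> parts;
    std::basic_string<CharType> current;
    for (; *text != NULCHAR; ++text)
    {
        if (*text == SPACECHAR || *text == TABCHAR)
        {
            // A blank ends the current token, if there is one.
            if (!current.empty())
            {
                parts.push_back(current);
                current.clear();
            }
        }
        else
        {
            current.push_back(*text);
        }
    }
    if (!current.empty())
    {
        parts.push_back(current);
    }
    return parts;
}
```

Both SplitOnBlanks("one two") and SplitOnBlanks(L"one two") compile from the same source, and the macro supplies the matching literal for each character type without any run-time conversion.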

Article Source Code

The source code for this article can be found on SullivanAndKey.com. The relevant files may be found at the following locations.

  • widen.h: Contains the source of widenChar and widenString.
  • ConstantOfType.h: Contains the source of CharConstantOfType and StringConstantOfType.
  • CmdLineToArgv.h: The header file for the CommandLineToArgVector function.
  • CmdLineToArgv.cpp: The source code for the CommandLineToArgVector function.

 

Cross Platform Conversion Between string and wstring


Ben Key

October 31, 2013; November 09, 2018

 


This article describes a cross platform method of converting between a STL string and a STL wstring. The technique described in this article does not make use of any external libraries. Nor does it make use of any operating system specific APIs. It uses only features that are part of the standard template library.

Problem Description

The Standard Template Library provides the basic_string template class to represent a sequence of characters. It supports all the usual operations of a sequence and standard string operations such as search and concatenation.

There are two common specializations of the basic_string template class. They are string, which is a typedef for basic_string<char>, and wstring, which is a typedef for basic_string<wchar_t>. In addition there are two specializations of the basic_string template class that are new to C++11. The new specializations of the basic_string template class are u16string, which is a typedef for basic_string<char16_t>, and u32string, which is a typedef for basic_string<char32_t>.

A question that is often asked on the Internet is how to convert between string and wstring. This may be necessary if calling an API function that is designed for one type of string from code that uses a different type of string. For example, your code might use wstring objects and you may need to call an API function that uses string objects. In this case you will need to convert from a wstring to a string before calling the API function.

Unfortunately the Standard Template Library does not provide a simple means of doing this conversion. As a result, this is a commonly asked question on the Internet. A Google search for convert between string and wstring returns about 76,300 results. Unfortunately many of the answers that are provided are incorrect.

The Incorrect Answer

The most common incorrect answer found on the Internet can be found in the article Convert std::string to std::wstring on Mijalko.com. The solution provided in that article is as follows.


// std::string -> std::wstring
std::string s("string");
std::wstring ws;
ws.assign(s.begin(), s.end());

// std::wstring -> std::string
std::wstring ws(L"wstring");
std::string s;
s.assign(ws.begin(), ws.end());

The problem is that this solution compiles, runs, and, for the simple string constants used in this example code, appears to yield the correct results. Many substandard computer programmers will assume that since it works for these very simple strings they have found the correct solution. Then they translate their application into languages other than English and are extremely surprised to discover that their application is riddled with bugs that only affect their non-English-speaking customers!

The problem is that this solution only works correctly for ASCII characters in the range of 0 to 127. If your strings contain even a single character with a numerical value greater than 127, this simple solution will yield incorrect results. In other words, this simple solution will yield incorrect results if your strings contain any characters other than a through z, A through Z, the numerals 0 through 9, and a few punctuation marks.

This means that any application that uses this technique will not support Chinese. It will not even properly support Spanish, since Spanish uses several characters that are outside of the ASCII character set, such as ñ (eñe) and the accented vowels á, é, í, ó, and ú.
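The failure is easy to reproduce. The sketch below wraps the incorrect technique in a hypothetical NaiveWiden function and assumes that the narrow string is encoded as UTF-8, in which ñ is stored as the two bytes 0xC3 0xB1.

```cpp
#include <string>

// Mirrors the incorrect technique: each char is copied into a wchar_t
// unchanged, with no regard for the encoding of the narrow string.
inline std::wstring NaiveWiden(const std::string& s)
{
    std::wstring ws;
    ws.assign(s.begin(), s.end());
    return ws;
}
```

NaiveWiden("abc") yields L"abc" as expected, but NaiveWiden("\xC3\xB1") yields a wide string of length 2 rather than the single character L'\xF1' (ñ), because each byte of the UTF-8 sequence is widened separately.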

The Correct Solution

This article describes a cross platform solution for this problem that has the following characteristics.

The Functions


namespace yekneb
{

template<typename Target, typename Source>
inline Target string_cast(const Source& source)
{
    return source;
}

template<>
inline std::wstring string_cast(const std::string& source)
{
    std::locale loc;
    return ::yekneb::detail::string_cast::s2w(source, loc);
}

template<>
inline std::string string_cast(const std::wstring& source)
{
    std::locale loc;
    return ::yekneb::detail::string_cast::w2s(source, loc);
}

}

Three additional functions are provided that allow the user to specify the locale parameter.

The functions are used as follows.


std::wstring wIn(L"Hello World! It is truly a wonderful day to be alive.");
std::string sIn("Hello World! It is truly a wonderful day to be alive.");
std::string sOut = yekneb::string_cast<std::string>(wIn);
std::wstring wOut = yekneb::string_cast<std::wstring>(sIn);

The details

The functions s2w and w2s are implemented using the following features of the STL.

The implementation of these functions is as follows.


namespace yekneb
{

namespace detail
{
namespace string_cast
{

inline std::string w2s(const std::wstring& ws, const std::locale& loc)
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> converter_type;
    typedef std::ctype<wchar_t> wchar_facet;
    std::string return_value;
    if (ws.empty())
    {
        return "";
    }
    const wchar_t* from = ws.c_str();
    size_t len = ws.length();
    size_t converterMaxLength = 6;
    size_t vectorSize = ((len + 6) * converterMaxLength);
    if (std::has_facet<converter_type>(loc))
    {
        const converter_type& converter = std::use_facet<converter_type>(loc);
        // Only attempt the codecvt conversion when the facet actually converts.
        if (!converter.always_noconv())
        {
            converterMaxLength = converter.max_length();
            if (converterMaxLength != 6)
            {
                vectorSize = ((len + 6) * converterMaxLength);
            }
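            // The conversion state must start in the initial state; reading an
            // uninitialized std::mbstate_t is undefined behavior.
            std::mbstate_t state = std::mbstate_t();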
            const wchar_t* from_next = nullptr;
            std::vector<char> to(vectorSize, 0);
            std::vector<char>::pointer toPtr = to.data();
            std::vector<char>::pointer to_next = nullptr;
            const converter_type::result result = converter.out(
                state, from, from + len, from_next,
                toPtr, toPtr + vectorSize, to_next);
            if (
              (converter_type::ok == result || converter_type::noconv == result)
              && 0 != toPtr[0]
              )
            {
              return_value.assign(toPtr, to_next);
            }
        }
    }
    if (return_value.empty() && std::has_facet<wchar_facet>(loc))
    {
        std::vector<char> to(vectorSize, 0);
        std::vector<char>::pointer toPtr = to.data();
        const wchar_facet& facet = std::use_facet<wchar_facet>(loc);
        if (facet.narrow(from, from + len, '?', toPtr) != 0)
        {
            return_value = toPtr;
        }
    }
    return return_value;
}

inline std::wstring s2w(const std::string& s, const std::locale& loc)
{
    typedef std::ctype<wchar_t> wchar_facet;
    std::wstring return_value;
    if (s.empty())
    {
        return L"";
    }
    if (std::has_facet<wchar_facet>(loc))
    {
        std::vector<wchar_t> to(s.size() + 2, 0);
        std::vector<wchar_t>::pointer toPtr = to.data();
        const wchar_facet& facet = std::use_facet<wchar_facet>(loc);
        if (facet.widen(s.c_str(), s.c_str() + s.size(), toPtr) != 0)
        {
            return_value = to.data();
        }
    }
    return return_value;
}

}
}

}
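For comparison, the same narrow to wide conversion can be sketched with the C library function std::mbsrtowcs from the <cwchar> header. This is a different technique from the facet-based code above, not the author's implementation: it uses the current C locale (as selected with std::setlocale) rather than a std::locale object, and the NarrowToWide name is hypothetical.

```cpp
#include <cstddef>
#include <cwchar>
#include <string>
#include <vector>

// Sketch: converts a narrow multibyte string to a wide string using the
// current C locale. Returns an empty string if the input contains an invalid
// multibyte sequence. Input is treated as a C string, so conversion stops at
// the first embedded null character.
inline std::wstring NarrowToWide(const std::string& s)
{
    std::mbstate_t state = std::mbstate_t();
    const char* src = s.c_str();
    // With a null destination, mbsrtowcs only measures the converted length.
    const std::size_t len = std::mbsrtowcs(nullptr, &src, 0, &state);
    if (len == static_cast<std::size_t>(-1))
    {
        return std::wstring();
    }
    std::vector<wchar_t> buffer(len + 1, L'\0');
    state = std::mbstate_t();
    src = s.c_str();
    std::mbsrtowcs(buffer.data(), &src, buffer.size(), &state);
    return std::wstring(buffer.data(), len);
}
```

The trade-off is that the result depends on global state: the program must select a suitable locale with std::setlocale before calling the function, whereas s2w accepts the locale as a parameter.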

A GNU/Linux Specific Bug

The above functions work well on Microsoft Windows and Mac OS X. However, they do not work as well on GNU/Linux. Specifically, the w2s function will fail with one of two errors, depending on whether debugging is enabled.

If debugging is enabled the error will be a double free or corruption error that occurs when the output buffer is being deallocated. The error occurs because the call to converter.out corrupts the output buffer: far more bytes are written to the output buffer than were allocated for it, even if you over-allocate the output buffer by a power of 2.

If debugging is not enabled the error will be as follows.

../iconv/loop.c:448: internal_utf8_loop_single: Assertion `inptr - bytebuf > (state->__count & 7)' failed.

These bugs only occur when the active locale uses UTF-8.

I have been unable to resolve this issue using just the STL. However, the Boost C++ Libraries found at www.boost.org provide an acceptable solution, the boost::locale::conv::utf_to_utf function. If using Boost is an option for you this problem can be resolved as follows.

First, add the IsUTF8 function to the ::yekneb::detail::string_cast namespace. This function may be used to determine whether or not a given locale uses UTF-8 and thus whether or not the w2s function should use the boost::locale::conv::utf_to_utf function.

The source code for the IsUTF8 function is as follows.


inline bool IsUTF8(const std::locale &loc)
{
    std::string locName = loc.name();
    if (! locName.empty() && std::string::npos != locName.find("UTF-8"))
    {
        return true;
    }
    return false;
}

Then simply add the following code to the w2s function just after the if (ws.empty()) code block.


if (IsUTF8(loc))
{
    return_value = boost::locale::conv::utf_to_utf<char>(ws);
    if (! return_value.empty())
    {
        return return_value;
    }
}

For consistency a similar code block should be added to the s2w function as well.
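The IsUTF8 function above matches only the exact spelling UTF-8. Locale names are implementation defined, and some systems also accept or report spellings such as utf8, so a more tolerant check may be useful. The following is a sketch under that assumption; the IsUTF8Name and IsUTF8Relaxed names are hypothetical.

```cpp
#include <algorithm>
#include <cctype>
#include <locale>
#include <string>

// Hypothetical variant of IsUTF8 that works on the locale name and matches
// "UTF-8", "utf-8", "UTF8", and "utf8".
inline bool IsUTF8Name(const std::string& locName)
{
    std::string lower(locName);
    std::transform(lower.begin(), lower.end(), lower.begin(),
        [](unsigned char ch) { return static_cast<char>(std::tolower(ch)); });
    return lower.find("utf-8") != std::string::npos
        || lower.find("utf8") != std::string::npos;
}

inline bool IsUTF8Relaxed(const std::locale& loc)
{
    return IsUTF8Name(loc.name());
}
```

With this variant, both en_US.UTF-8 and en_US.utf8 are recognized, while the classic C locale is not.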

Article Source Code

The complete source code for this article can be found on SullivanAndKey.com in the header file StringCast.h. You can also see the code in action on ideone.com.


 

Variable Expansion in Strings


Ben Key

December 6, 2013; November 7, 2018

 


A common task in C and C++ is to build a string out of a template string containing variable placeholders, often called format specifiers, and additional data. This article describes several options that are available for solving this problem and introduces two versions of an ExpandVars function as an alternative to those solutions.

Problem Description

The problem is best described via an expansion of the classic Hello World program that is so often the first program one writes when learning a new programming language.

The simplest implementation of Hello World in C++ is as follows.


#include <iostream>
 
int main()
{
    std::cout << "Hello, world!\n";
    return 0;
}

Source: Variable Expansion in Strings – Example 1


What if you wanted to modify this program to first ask you for your name and then display a more personal greeting? One way to do this is as follows.


#include <iostream>
#include <string>
 
int main()
{
    std::cout << "What is your name?\n";
    std::string name;
    std::getline(std::cin, name);
    std::cout << "Hello, " << name << "!\n";
    return 0;
}

Source: Variable Expansion in Strings – Example 2


The problem with this approach is that it is not very extensible. It can also be very unwieldy when you have multiple variables that you need to print out.

For example, imagine you have the following person structure and you want to display a message containing all of the fields in the person structure.


struct person
{
    std::string firstName;
    std::string middleName;
    std::string lastName;
    std::string streetAddress1;
    std::string streetAddress2;
    std::string city;
    std::string state;
    std::string zip;
};

Source: Variable Expansion in Strings – Example 3


One possible implementation of a function to display a message containing all of the fields in the person structure is as follows.


void PrintPersonWithStream(const person& p)
{
    std::cout
        << "First Name: " << p.firstName << "\n"
        << "Middle Name: " << p.middleName << "\n"
        << "Last Name: " << p.lastName << "\n"
        << "Street Address 1: " << p.streetAddress1 << "\n"
        << "Street Address 2: " << p.streetAddress2 << "\n"
        << "City: " << p.city << "\n"
        << "State: " << p.state << "\n"
        << "Zip: " << p.zip << "\n";
}

Source: Variable Expansion in Strings – Example 3


As you can see, this function is rather unwieldy. It would be far simpler to be able to write something like the following.


void PrintPerson(const person& p)
{
    const char FormatString[] = R"(
First Name: {FormatSpecifier}
Middle Name: {FormatSpecifier}
Last Name: {FormatSpecifier}
Street Address 1: {FormatSpecifier}
Street Address 2: {FormatSpecifier}
City: {FormatSpecifier}
State: {FormatSpecifier}
Zip: {FormatSpecifier}
)";
    std::string message = SomeFormatFunction(
        FormatString,
        p.firstName, p.middleName, p.lastName,
        p.streetAddress1, p.streetAddress2,
        p.city, p.state, p.zip);
    std::cout << message << "\n";
}

In the above example {FormatSpecifier} will be replaced with a bit of text that causes the text of the appropriate variable to be inserted at the appropriate place in the final string. It will vary depending on the solution you use.

The benefit of this type of solution is that it is far less verbose and it is far less work to change the order of variables in the final output and to add a variable.

You might ask why being able to change the order of variables matters. A simple answer is that your program may support several languages and you may need to change the order of items such as dates to account for the standards used by a given language.

Using the Standard C Library Functions printf or sprintf

One option is to simply make use of the Standard C library functions printf, or if you need to store the output in a string, sprintf. The printf function writes formatted data to stdout. The sprintf function writes formatted data to a string. These functions can be used as follows.

First, add the following function to the person structure.


static std::string GetPrintfFormatString()
{
    static const char FormatString[] = R"(
First Name: %s
Middle Name: %s
Last Name: %s
Street Address 1: %s
Street Address 2: %s
City: %s
State: %s
Zip: %s
)";
    return FormatString;
}

Source: Variable Expansion in Strings – Example 4


Then the functions can be defined as follows.


void PrintPersonWithPrintf(const person& p)
{
    ::printf(
        p.GetPrintfFormatString().c_str(),
        p.firstName.c_str(), p.middleName.c_str(), p.lastName.c_str(),
        p.streetAddress1.c_str(), p.streetAddress2.c_str(),
        p.city.c_str(), p.state.c_str(), p.zip.c_str());
}
 
void PrintPersonWithSPrintf(const person& p)
{
    std::string formatString = p.GetPrintfFormatString();
    size_t outputLen = formatString.length() + p.firstName.length()
        + p.middleName.length() + p.lastName.length()
        + p.streetAddress1.length() + p.streetAddress2.length()
        + p.city.length() + p.state.length() + p.zip.length()
        + 20;
    std::vector<char> buffer(outputLen, 0);
    ::sprintf(
        buffer.data(), formatString.c_str(),
        p.firstName.c_str(), p.middleName.c_str(), p.lastName.c_str(),
        p.streetAddress1.c_str(), p.streetAddress2.c_str(),
        p.city.c_str(), p.state.c_str(), p.zip.c_str());
    std::string out = buffer.data();
    std::cout << out;
}

Source: Variable Expansion in Strings – Example 4
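The PrintPersonWithSPrintf function estimates the size of the output buffer by hand. Where a conforming C++11 standard library is available, std::snprintf can compute the required size instead. The following is a minimal sketch of that approach, not the article's code; the FormatField function is hypothetical.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Sketch: formats "<label>: <value>\n" without guessing the buffer size.
inline std::string FormatField(const std::string& label, const std::string& value)
{
    // First pass: with a null buffer and a size of 0, snprintf returns the
    // number of characters required, not counting the terminating null.
    const int needed = std::snprintf(
        nullptr, 0, "%s: %s\n", label.c_str(), value.c_str());
    if (needed < 0)
    {
        return std::string();
    }
    // Second pass: format into a buffer of exactly the right size.
    std::vector<char> buffer(static_cast<std::size_t>(needed) + 1, 0);
    std::snprintf(
        buffer.data(), buffer.size(), "%s: %s\n", label.c_str(), value.c_str());
    return std::string(buffer.data(), static_cast<std::size_t>(needed));
}
```

Because the size passed to std::snprintf bounds the write, a wrong estimate can no longer overflow the buffer, which is the main risk of the plain sprintf version.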


Limitations of sprintf

One limitation of using the sprintf function is that it is not very flexible for international applications. Often the order of words differs from one language to another. One often discussed example is a time and date string.

For example, in the United States date strings are written as {Month}/{Day}/{Year}, while in France date strings are written as {Day}/{Month}/{Year} and in Japan date strings are written as {Year}/{Month}/{Day}. There are many other instances in which word order varies from language to language. For more information refer to the Word order article on Wikipedia, The origin and evolution of word order, and The Typology of the Word Order of Languages.

One problem with the sprintf function is that it is not possible to change the order of words in the final output by simply changing the order of words in the format string. That is due to the fact that the order of parameters in the code would need to be changed as well.

One solution to this problem is the use of positional specifiers for format strings.

Positional Specifiers for Format Strings

POSIX compatible systems implement an extension to the printf family of functions to add support for positional specifiers for format strings. This extension allows the conversion specifier character % to be replaced by the sequence “%n$”, where n is a decimal integer in the range [1, {NL_ARGMAX}], giving the position of the argument in the argument list. For more information see the following resources.

The problem with this solution is that it is not universally supported. For example, on Microsoft Windows, the printf family of functions does not support positional specifiers for format strings. Instead this functionality is supported in the _printf_p family of functions: see printf_p Positional Parameters and _sprintf_p, _sprintf_p_l, _swprintf_p, _swprintf_p_l. This makes writing cross platform code unnecessarily difficult.

The following code demonstrates the use of positional specifiers for format strings to write a function that will properly format a date string for the United States, France, and Japan.


std::string GetDateFormatString(const std::string& langCode)
{
    if (
       0 == langCode.compare(0, 2, "en")
       || 0 == langCode.compare(0, 2, "EN")
       )
    {
        return std::string("%1$i/%2$i/%3$i");
    }
    else if (
       0 == langCode.compare(0, 2, "fr")
       || 0 == langCode.compare(0, 2, "FR")
       )
    {
        return std::string("%2$i/%1$i/%3$i");
    }
    else if (
       0 == langCode.compare(0, 2, "ja")
       || 0 == langCode.compare(0, 2, "JA")
       || 0 == langCode.compare(0, 2, "jp")
       || 0 == langCode.compare(0, 2, "JP")
       )
    {
        // Japan: {Year}/{Month}/{Day}.
        return std::string("%3$i/%1$i/%2$i");
    }
    return std::string("%1$i/%2$i/%3$i");
}
 
std::string GetDateString(
    const std::string& langCode,
    int month, int day, int year)
{
    std::string fmt = GetDateFormatString(langCode);
    std::array<char, 32> buffer;
    buffer.fill(0);
#if defined(_WIN32)
    ::_sprintf_p(buffer.data(), 32, fmt.c_str(), month, day, year);
#else
    ::sprintf(buffer.data(), fmt.c_str(), month, day, year);
#endif
    std::string ret = buffer.data();
    return ret;
}

Source: Variable Expansion in Strings – Example 5


Note that there is one major drawback of the above GetDateString function: the presence of that nasty #if/#else/#endif block. This is far from ideal. Unfortunately, the _sprintf_p function expects an additional sizeOfBuffer parameter that sprintf does not accept, so you cannot simply do the following.


#if !defined(_WIN32)
#  define _sprintf_p sprintf
#endif

std::string GetDateString(
    const std::string &langCode,
    int month, int day, int year)
{
    std::string fmt = GetDateFormatString(langCode);
    std::array<char, 32> buffer;
    buffer.fill(0);
    // This will not compile on non Windows systems due to the extra parameter.
    _sprintf_p(buffer.data(), 32, fmt.c_str(), month, day, year);
    std::string ret = buffer.data();
    return ret;
}

The following will work as an acceptable alternative, however.


#if defined(_WIN32)
#  define sprintfp _sprintf_p
#else
#  define sprintfp snprintf
#endif

std::string GetDateString(
    const std::string &langCode,
    int month, int day, int year)
{
    std::string fmt = GetDateFormatString(langCode);
    std::array<char, 32> buffer;
    buffer.fill(0);
    sprintfp(buffer.data(), 32, fmt.c_str(), month, day, year);
    std::string ret = buffer.data();
    return ret;
}

Source: Variable Expansion in Strings – Example 6


This leaves one problem that all of the solutions I have discussed so far leave unsolved. These functions use C style strings. That is, the first parameter of _sprintf_p is expected to be a pre-allocated char array. It does not natively make use of the C++ basic_string class.
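One way to hide the pre-allocated buffer is to wrap the call in a function that returns a std::string. The following is only a sketch under stated assumptions: the name FormatToString is hypothetical, and it relies on a C99 conforming snprintf that reports the required length when given a zero-sized buffer. It covers the snprintf path only, not _sprintf_p.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper for illustration only. It assumes a C99 conforming
// snprintf, which returns the number of characters that would have been
// written had the buffer been large enough.
template <typename... Args>
std::string FormatToString(const char *fmt, Args... args)
{
    // First pass: ask snprintf how many characters are needed.
    const int needed = std::snprintf(nullptr, 0, fmt, args...);
    if (needed < 0)
    {
        return std::string();
    }
    // Second pass: format into a buffer with room for the terminator.
    std::vector<char> buffer(static_cast<std::size_t>(needed) + 1);
    std::snprintf(buffer.data(), buffer.size(), fmt, args...);
    return std::string(buffer.data(), static_cast<std::size_t>(needed));
}
```

The caller no longer has to guess a buffer size, but the function is still not type-safe: a mismatch between the format string and the arguments is not caught by the wrapper.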

The Boost Format library

The Boost C++ Libraries are a collection of free peer-reviewed portable C++ source libraries that work well with the C++ Standard Library and enhance the capabilities of the C++ Standard Library. In fact, some of the features of the C++ Standard Library were first implemented in the Boost C++ Libraries and the Boost C++ Libraries are designed so that they are suitable for eventual standardization.

One of the components of Boost is The Boost Format library. The Boost home page describes The Boost Format library as follows.

The format library provides a class for formatting arguments according to a format-string, as does printf, but with two major differences:

  • format sends the arguments to an internal stream, and so is entirely type-safe and naturally supports all user-defined types.
  • The ellipsis (…) can not be used correctly in the strongly typed context of format, and thus the function call with arbitrary arguments is replaced by successive calls to an argument feeding operator%

The format specification strings used by the Boost Format library use the Unix98 Open-group printf precise syntax. Further information on the format specification strings used by the Boost Format library can be found in the Boost printf format specifications section of the Boost Format library documentation. Note that these are essentially the same format specification strings that are used by the _sprintf_p function. As a result, the GetDateFormatString function can be used with The Boost Format library.

The following function shows how this can be done.


std::string GetDateStringBoost(
    const std::string &langCode, int month, int day, int year)
{
    std::string fmt = GetDateFormatString(langCode);
    std::string ret = boost::str(boost::format(fmt) % month % day % year);
    return ret;
}

Source: Variable Expansion in Strings – Example 7


Self Documenting Format Specification Strings

One problem with the printf style format specification strings is that they require some form of supporting documentation to indicate which part of the format specification string corresponds to which variable. For example, in order for the GetDateFormatString function to be considered complete, a comment should be added to specify that the %1$ component corresponds to the month, the %2$ component corresponds to the day, and the %3$ component corresponds to the year.

It would be ideal if this documentation were an inherent part of the format specification string. Consider the following syntax for a format string: “$(month)/$(day)/$(year).” In this string there is no need for supporting documentation to indicate the meaning of each component of the format specification string.

This technique is commonly referred to as String interpolation or variable interpolation, variable substitution, or variable expansion. Some programming languages have this functionality built in.

For example, the Python programming language has supported the Literal String Interpolation feature since Python 3.6. This makes the following possible.


apples = 4
print(f"I have {apples} apples")

Another example is in the C# programming language. C# 6 added the interpolated string feature.


string name = "Mark";
var date = DateTime.Now;
Console.WriteLine($"Hello, {name}! Today is {date.DayOfWeek}, it's {date:HH:mm} now.");
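C++ has no comparable feature built into the language, but the idea can be approximated with a small function. The following is a minimal sketch, not the ExpandVars implementation: the name InterpolateBraces is hypothetical, and it handles only the ${name} form with values taken from a map.

```cpp
#include <map>
#include <string>

// Hypothetical helper for illustration only: replaces each ${name} in text
// with the matching value from vars. Unknown names and unterminated
// placeholders are copied through unchanged.
std::string InterpolateBraces(
    const std::string &text,
    const std::map<std::string, std::string> &vars)
{
    std::string result;
    std::string::size_type pos = 0;
    while (pos < text.size())
    {
        const std::string::size_type start = text.find("${", pos);
        if (std::string::npos == start)
        {
            break;
        }
        const std::string::size_type end = text.find('}', start + 2);
        if (std::string::npos == end)
        {
            break;
        }
        // Copy the literal text that precedes the placeholder.
        result.append(text, pos, start - pos);
        const std::string name = text.substr(start + 2, end - start - 2);
        const auto it = vars.find(name);
        if (it != vars.end())
        {
            result.append(it->second);
        }
        else
        {
            // Leave the placeholder exactly as written.
            result.append(text, start, end - start + 1);
        }
        pos = end + 1;
    }
    // Copy whatever remains after the last placeholder.
    result.append(text, pos, std::string::npos);
    return result;
}
```

Unlike the Python and C# features, the names here are looked up at run time in a map rather than resolved by the compiler, so a misspelled name is left in the output instead of being reported as an error.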

The ExpandVars Function

The YekNeb C++ Code snippets library provides two versions of the ExpandVars function, which provides string interpolation functionality for C++. One version of the function uses nothing beyond the STL. Another version of the function uses the Boost Xpressive library. Both versions of the function return a string in which the variables are expanded based on the values specified in either an environment map or the environment variables. The following formats are supported for variable names.

  • %VarName%
  • %(VarName)
  • %[VarName]
  • %{VarName}
  • $(VarName)
  • $[VarName]
  • ${VarName}
  • #(VarName)
  • #[VarName]
  • #{VarName}

Bash style variable names in the form of $VarName are not supported.

The variable names used by the ExpandVars function may contain word characters, space characters, the ( character, and the ) character. Note that if the variable includes either the ( character or the ) character you should not use the %(VarName) or $(VarName) syntax.

The following is a simplified version of the STL only ExpandVars function.


bool FindVariableString(
    const std::string &str,
    const std::string::size_type pos,
    std::string::size_type &beginVarStringPos,
    std::string::size_type &endVarStringPos,
    std::string::size_type &beginVarNamePos,
    std::string::size_type &endVarNamePos)
{
    const char *TestString = "%$#";
    const char PercentSign = '%';
    const char LeftParenthesis = '(';
    const char LeftSquareBracket = '[';
    const char LeftCurlyBracket = '{';
    const char RightParenthesis = ')';
    const char RightSquareBracket = ']';
    const char RightCurlyBracket = '}';
    beginVarStringPos = std::string::npos;
    endVarStringPos = std::string::npos;
    beginVarNamePos = std::string::npos;
    endVarNamePos = std::string::npos;
    if (str.empty())
    {
        return false;
    }
    beginVarStringPos = str.find_first_of(TestString, pos);
    if (std::string::npos == beginVarStringPos)
    {
        return false;
    }
    if (beginVarStringPos >= str.length() - 1)
    {
        return false;
    }
    char ch = str[beginVarStringPos];
    char ch1 = str[beginVarStringPos + 1];
    if (
       PercentSign == ch
       && LeftParenthesis != ch1 && LeftSquareBracket != ch1
       && LeftCurlyBracket != ch1
       )
    {
        beginVarNamePos = beginVarStringPos + 1;
        endVarStringPos = str.find(PercentSign, beginVarNamePos);
        if (std::string::npos == endVarStringPos)
        {
            return false;
        }
    }
    else if (
       LeftParenthesis != ch1 && LeftSquareBracket != ch1
       && LeftCurlyBracket != ch1
       )
    {
        return false;
    }
    else
    {
        beginVarNamePos = beginVarStringPos + 2;
        char closeChar = 0;
        if (LeftParenthesis == ch1)
        {
            closeChar = RightParenthesis;
        }
        else if (LeftSquareBracket == ch1)
        {
            closeChar = RightSquareBracket;
        }
        else if (LeftCurlyBracket == ch1)
        {
            closeChar = RightCurlyBracket;
        }
        endVarStringPos = str.find(closeChar, beginVarNamePos);
        if (std::string::npos == endVarStringPos)
        {
            return false;
        }
    }
    endVarNamePos = endVarStringPos - 1;
    return true;
}
 
bool StringContainsVariableStrings(const std::string &str)
{
    std::string::size_type beginVarStringPos = 0;
    std::string::size_type endVarStringPos = 0;
    std::string::size_type beginVarNamePos = 0;
    std::string::size_type endVarNamePos = 0;
    bool ret = FindVariableString(str, 0, beginVarStringPos, endVarStringPos, beginVarNamePos, endVarNamePos);
    return ret;
}
 
std::string GetVariableValue(
    const std::string &varName,
    const std::map<std::string, std::string> &env,
    bool &fromEnvMap, bool &valueContainsVariableStrings)
{
    typedef std::map<std::string, std::string> my_map;
    fromEnvMap = false;
    valueContainsVariableStrings = false;
    std::string ret;
    my_map::const_iterator itFind = env.find(varName);
    if (itFind != env.end())
    {
        ret = (*itFind).second;
        if (!ret.empty())
        {
            fromEnvMap = true;
            valueContainsVariableStrings = StringContainsVariableStrings(ret);
        }
    }
    if (ret.empty())
    {
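        // getenv returns a null pointer when the variable is not set, and
        // a null pointer must not be assigned to a std::string.
        const char *envValue = ::getenv(varName.c_str());
        if (nullptr != envValue)
        {
            ret = envValue;
        }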
    }
    return ret;
}
 
std::string ExpandVars(
    const std::string &original,
    const std::map<std::string, std::string> &env)
{
    std::string ret = original;
    if (original.empty())
    {
        return ret;
    }
    bool foundVar = false;
    std::string::size_type pos = 0;
    do
    {
        std::string::size_type beginVarStringPos = 0;
        std::string::size_type endVarStringPos = 0;
        std::string::size_type beginVarNamePos = 0;
        std::string::size_type endVarNamePos = 0;
        foundVar = FindVariableString(ret, pos, beginVarStringPos, endVarStringPos, beginVarNamePos, endVarNamePos);
        if (foundVar)
        {
            std::string::size_type varStringLen = endVarStringPos - beginVarStringPos + 1;
            std::string varString = ret.substr(beginVarStringPos, varStringLen);
            std::string::size_type varNameLen = endVarNamePos - beginVarNamePos + 1;
            std::string varName = ret.substr(beginVarNamePos, varNameLen);
            bool fromEnvMap;
            bool valueContainsVariableStrings;
            std::string value = GetVariableValue(varName, env, fromEnvMap, valueContainsVariableStrings);
            if (!value.empty())
            {
                ret = ret.replace(beginVarStringPos, varStringLen, value);
                pos = beginVarStringPos;
            }
            else
            {
                pos = endVarStringPos + 1;
            }
        }
    } while (foundVar);
    return ret;
}

Source: Variable Expansion in Strings – Example 8
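The GetVariableValue function falls back to the process environment when a name is not found in the map. Any such fallback has to account for the fact that getenv returns a null pointer for a variable that is not set. A minimal sketch of that guard follows; the helper name GetEnvOrEmpty is hypothetical and is not part of the YekNeb library.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper for illustration only: returns the value of the
// named environment variable, or an empty string if it is not set.
std::string GetEnvOrEmpty(const std::string &name)
{
    const char *value = std::getenv(name.c_str());
    // std::getenv returns a null pointer for unset variables, and a
    // std::string must not be constructed from a null pointer.
    return (nullptr != value) ? std::string(value) : std::string();
}
```

Returning an empty string keeps the "value is empty, so leave the variable unexpanded" logic in ExpandVars unchanged.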

The following code demonstrates the use of the ExpandVars function.

std::string GetDateFormatStringExpandVars(const std::string& langCode)
{
    if (
       0 == langCode.compare(0, 2, "en")
       || 0 == langCode.compare(0, 2, "EN")
       )
    {
        return std::string("${month}/${day}/${year}");
    }
    else if (
       0 == langCode.compare(0, 2, "fr")
       || 0 == langCode.compare(0, 2, "FR")
       )
    {
        return std::string("${day}/${month}/${year}");
    }
    else if (
       0 == langCode.compare(0, 2, "ja")
       || 0 == langCode.compare(0, 2, "JA")
       || 0 == langCode.compare(0, 2, "jp")
       || 0 == langCode.compare(0, 2, "JP")
       )
    {
        // Japan: {Year}/{Month}/{Day}.
        return std::string("${year}/${month}/${day}");
    }
    return std::string("${month}/${day}/${year}");
}
 
std::string GetDateStringExpandVars(
    const std::string &langCode, int month, int day, int year)
{
    std::string fmt = GetDateFormatStringExpandVars(langCode);
    std::map<std::string,std::string> env{
        {"month", std::to_string(month)},
        {"day", std::to_string(day)},
        {"year", std::to_string(year)}
    };
    std::string ret = ExpandVars(fmt, env);
    return ret;
}

Source: Variable Expansion in Strings – Example 8


The following is a simplified version of the ExpandVars function that uses the Boost Xpressive regex_replace function.


::boost::xpressive::sregex GetRegex()
{
    namespace xpr = ::boost::xpressive;
    xpr::sregex ret =
        "%" >> (xpr::s1 = +(xpr::_w | xpr::_s | "(" | ")")) >> '%'
        | "%(" >> (xpr::s1 = +(xpr::_w | xpr::_s)) >> ')'
        | "%[" >> (xpr::s1 = +(xpr::_w | xpr::_s | "(" | ")")) >> ']'
        | "%{" >> (xpr::s1 = +(xpr::_w | xpr::_s | "(" | ")")) >> '}'
        | "$(" >> (xpr::s1 = +(xpr::_w | xpr::_s)) >> ')'
        | "$[" >> (xpr::s1 = +(xpr::_w | xpr::_s | "(" | ")")) >> ']'
        | "${" >> (xpr::s1 = +(xpr::_w | xpr::_s | "(" | ")")) >> '}'
        | "#(" >> (xpr::s1 = +(xpr::_w | xpr::_s)) >> ')'
        | "#[" >> (xpr::s1 = +(xpr::_w | xpr::_s | "(" | ")")) >> ']'
        | "#{" >> (xpr::s1 = +(xpr::_w | xpr::_s | "(" | ")")) >> '}';
    return ret;
}

struct string_formatter
{
    typedef std::map<std::string, std::string> env_map;
    env_map env;
    mutable bool valueContainsVariables;
    string_formatter()
    {
        valueContainsVariables = false;
    }
    template<typename Out>
    Out operator()(::boost::xpressive::smatch const& what, Out out) const
    {
        bool fromEnvMap;
        bool valueContainsVariableStrings;
        std::string value = GetVariableValue(
            what.str(1), env, fromEnvMap, valueContainsVariableStrings);
        if (fromEnvMap && !value.empty() && valueContainsVariableStrings)
        {
            valueContainsVariables = true;
        }
        if (value.empty())
        {
            value = what[0];
        }
        if (!value.empty())
        {
            out = std::copy(value.begin(), value.end(), out);
        }
        return out;
    }
};

std::string ExpandVarsR(
    const std::string &original,
    const std::map<std::string, std::string> &env)
{
    std::string ret = original;
    if (original.empty())
    {
        return ret;
    }
    string_formatter fmt;
    fmt.env = env;
    fmt.valueContainsVariables = false;
    ::boost::xpressive::sregex envar = GetRegex();
    ret = ::boost::xpressive::regex_replace(original, envar, fmt);
    if (fmt.valueContainsVariables)
    {
        std::string newValue;
        std::string prevValue = ret;
        do
        {
            fmt.valueContainsVariables = false;
            newValue = ::boost::xpressive::regex_replace(prevValue, envar, fmt);
            if (0 == prevValue.compare(newValue))
            {
                break;
            }
            prevValue.erase();
            prevValue = newValue;
        }
        while (fmt.valueContainsVariables);
        if (0 != ret.compare(newValue))
        {
            ret = newValue;
        }
    }
    return ret;
}

Source: Variable Expansion in Strings – Example 9


The source code of the full version of the STL only implementation of the ExpandVars function can be found in ExpandVars.h and ExpandVars.cpp. The source code of the Boost implementation of the ExpandVars function can be found in boost/ExpandVars.h and boost/ExpandVars.cpp.

 

Musings on the Formatting of C and C++ Code

Ben Key: Ben.Key@YekNeb.com

October 4, 2013; October 26, 2018

 

I was watching the video The Care and Feeding of C++’s Dragons. I found it to be very interesting. I especially found the Clang-based program that reformats code to be promising. However, there were some things about the tool that I found disturbing. It seems to use some formatting patterns that I think are huge mistakes.

To illustrate, some of the sample code looks like the following.

int functionName(int a, int b, int c,
                 int d, int e, int f,
                 int g, int h, int i) const {
  /* Code omitted. */
  if (test()) {
    /* Code omitted. */
  }
}

There are two problems that I see with formatting code in this way. The first is that on the line where the list of variables is continued, there is a great deal of what I see as entirely unnecessary white space. In my opinion it should be as follows.

int functionName(int a, int b, int c,
  int d, int e, int f,
  int g, int h, int i) const {

Or, my personal preference would be the following.

int functionName(
  int a, int b, int c,
  int d, int e, int f,
  int g, int h, int i) const {

My reasoning is quite simple.

Examine what happens when the name of the function is changed to something longer and the code is then reformatted. The code becomes as follows.

int aMuchLongerFunctionName(int a, int b, int c,
                            int d, int e, int f,
                            int g, int h, int i) const {

If you do a diff of the original version of the code and the new code, it will show that three lines have changed instead of showing that only one line has changed. I am aware of the fact that many diff tools have an ignore white space option and that if this option is set it will only show one line as having changed. However, not all diff tools have that option. In addition, some companies and organizations have a strict policy that every line of code that changes must be changed for a purpose associated with a given task. And these companies do not accept changes that are only related to reformatting as being associated with a given task. In essence, if the changed line does not affect the functionality of the code, it is not an acceptable change. And these organizations will deliberately not turn on the ignore white space option and will turn a deaf ear to the argument that they should just enable that option (can you tell that I am speaking from experience?).

If you are in such a situation and you change the name of a function that initially is formatted with the parameter list aligned with the end of the function name and you adhere to a strict “every changed line must have a functional purpose” rule you will inevitably end up with the following.

int aMuchLongerFunctionName(int a, int b, int c,
                 int d, int e, int f,
                 int g, int h, int i) const {

This just looks wrong!

There is also another reason for not aligning the parameters with the end of the function name. Consider the following.

int aVeryLongFunctionNameThatGoesBeyondTheEdgeOfTheScreen(int a, int b, int c,
                                                          int d, int e, int f,
                                                          int g, int h, int i) const {

In this case, you cannot see the parameters at all without wasting your time scrolling across the screen.

If you always begin the parameter list on its own line that is indented one level deep as follows, you would not ever have to scroll the screen just to see the parameter list.

int aVeryLongFunctionNameThatGoesBeyondTheEdgeOfTheScreen(
  int a, int b, int c,
  int d, int e, int f,
  int g, int h, int i) const {

The second issue I have is with putting braces at the end of the line. In C and C++ braces are optional for some statements such as if. And let's face the facts, C and C++ code is often inconsistently indented. Putting braces at the end leads to more work in the following scenario.

if (aVeryLongTestThatGoesPastTheEdgeOfTheScreen()) {
  /*
       Thousands
 of
         inconsistently indented
    lines of
 code */
}

Putting the brace at the end of the if line forces someone who is reading the code to hit the end key to determine how much code will only be called if the condition is true, one line or thousands of lines, when they might not give a damn about seeing the end of the test because the beginning of it is enough to tell them whether or not the condition can be true in the scenario they are working on. What if the person knows that in the case they are working on, the function “aVeryLongTestThatGoesPastTheEdgeOfTheScreen” will return false? They really do not need to see the end of the test in this case except to find out how many lines they need to skip past in order to get to code that is relevant to their task. Why not just put the brace on a line by itself and make everyone’s life so much easier? Why force someone to hit the end key just so they can answer the following question: how many lines do I need to skip to get to code that is relevant to my task?

Until C and C++ do as they did in the Go language and make the braces mandatory, I believe braces should never be at the end of the line.

In Go, where the braces are mandatory, it does not matter as much to me because I know that if the code compiles the brace is there and I do not care if I cannot see it. But in C and C++, I do not want you to force me to find and hit the end key just so I can tell where your if statement ends. Of course, that does not mean that I think that the decision to put the brace at the end was a good one for Go. I often use a brace matching feature to skip past the irrelevant code in the scenario I have described. That requires that the caret be on the opening brace. In Go I need to hit the end key anyway just to get the caret on the brace so I can use the brace matching feature to skip past code I do not care about. Why? If the brace were on a line by itself, I would not need to locate and hit the end key. I could just arrow down to the brace line and use the brace matching feature.

I know that these arguments are only relevant to the placement of braces for conditional statements and that they are not relevant to the placement of braces at the beginning of functions. However, I still feel that the opening brace of a function should be on a line by itself for the sake of consistency.

I cannot believe other people have adopted code formatting patterns that to me are so obviously mistakes. Is there something I am missing that makes my arguments invalid?

And before you say, “just hit the end key, it is not that hard,” consider the fact that some people are hunt and peck typists. For some people, any extra key they need to hunt for unnecessarily is an aggravation that interrupts their work flow. I am certain that for some people who are touch typists, hitting one additional key is no big deal, but for hunt and peck typists, it can be.

I for one am a hunt and peck typist, despite the fact that I began using computers in 1985. For me, finding the end key just to find out how many lines of code will only get called in the “condition is true” case is enough of a disruption that I find it extremely annoying.

When I first wrote this article back in 2013, I was not aware of the many options for customizing the behavior of clang-format.

Fortunately, you can easily customize the behavior of clang-format. There are numerous Clang-Format Style Options available. For example, you can instruct clang-format to “always break after an open bracket, if the parameters don’t fit on a single line” by setting AlignAfterOpenBracket to AlwaysBreak.

When you use clang-format to format a file it will search for a “.clang-format file located in one of the parent directories of the source file” and load the various formatting options from there. Clang-format also has a number of predefined coding styles to choose from: LLVM, Google, Chromium, Mozilla, and WebKit. You can use the -fallback-style and -style command line arguments to specify the coding style you wish to use. For more information see the ClangFormat manual.
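As a rough illustration, a .clang-format file that addresses the two complaints raised earlier in this article might look like the following. This is a hypothetical example, not the SnKOpen file; the option names come from the Clang-Format Style Options documentation, and the values are illustrative choices.

```yaml
# Hypothetical .clang-format file; the values are illustrative choices,
# not the settings of any particular project.
BasedOnStyle: LLVM
# Break after the opening parenthesis when the parameters do not fit on
# one line, instead of aligning them under the first parameter.
AlignAfterOpenBracket: AlwaysBreak
# Put opening braces on their own line.
BreakBeforeBraces: Allman
```

With a file like this in a parent directory of the source file, running clang-format with -style=file picks the options up; adding -i rewrites the file in place.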

I have begun using clang-format for my own open source projects, and I am pleased with the results. If you are interested, you can take a look at my SnKOpen .clang-format file.

There are various websites that will help you to generate the perfect .clang-format file for your project. One of the best is the clang-format configurator. The Unformat project, which generates a .clang-format file from an example codebase, may also be worth investigating.

To Brace Or Not To Brace

Summary

In this article I discuss my opinions on when braces should be used to delineate blocks of code in C and C++. In addition I discuss my views on where in the code the braces should be placed. I use examples from thirteen years of experience in C and C++ programming to back up my opinions.

Discussion

One commonly debated topic in C and C++ programming is whether or not braces should be used with if, while, and for statements. The debate stems from the fact that if the if, while, or for statement requires exactly one statement after the test, braces are not required. For example, the following is allowed in the C and C++ language specifications:

if ({test})
    {statement}

while ({test})
    {statement}

for ({start}; {test}; {next})
    {statement}

According to the C and C++ language specifications braces are only considered mandatory if more than one statement is to be executed when the {test} evaluates to true. In fact, the C and C++ language specifications allow {statement} to be on the same line as the {test}.

However, it is my professional opinion that {statement} should never be placed on the same line as the {test}. In addition, braces should be considered mandatory.

First I will discuss the basis for my opinion that {statement} should never be placed on the same line as the {test} in if, for, and while statements.

Consider the following code snippet:

if (foo()) bar();
    baz();

When tracing through this code snippet in a debugger the debugger will stop on the

if (foo()) bar();

line. When the user uses the “step over” command, the debugger stops on the

baz();

line. The debugger gives no indication of whether or not the function bar was ever called.

If this code snippet were written as follows,

if (foo())
    bar();
baz();

the following will happen as the user steps through the code. First, the debugger will stop on the

if (foo())

line. When the user uses the “step over” command, the debugger will stop on the bar line if foo returned true. Otherwise the debugger will stop on the baz line next. By simply changing the formatting so that the call to bar is on its own line the code becomes much easier to debug and the user no longer has any doubt about whether or not the function bar was called. For this reason, the {statement} should never be placed on the same line as the {test} of an if, for, or while statement.

Some will argue that the user can use the step in command to determine if bar is called in the original version of the if statement. However, it is not practical to do so. This is because the first time the step in command is used on the

if (foo()) bar();

line, the debugger will step into foo. The user will then have to use the step out command to return to the function containing the if statement and use the step in command again to determine whether or not bar is called.

Matters are worse if the {test} of the if statement is more complicated such as the following:

if ((foo1() || foo2() || foo3()) && foo4()) bar();

In this case the user will need to use the step in, step out, step in sequence as many as four times just to find out if bar is called. Expecting someone to go to this much trouble to determine if a single function is called is simply unreasonable.
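The number of step in, step out round trips depends on short-circuit evaluation: the || and && operators stop evaluating as soon as the result is known. The following sketch counts how many of the test functions actually run for a given set of results. All of the names are hypothetical stand-ins for the functions in the example above.

```cpp
// Hypothetical stand-ins for illustration only: each test function counts
// its call and returns a value that is set before the condition runs.
static int testCalls = 0;
static bool results[4] = {false, false, false, false};
static bool barCalled = false;

static bool foo1() { ++testCalls; return results[0]; }
static bool foo2() { ++testCalls; return results[1]; }
static bool foo3() { ++testCalls; return results[2]; }
static bool foo4() { ++testCalls; return results[3]; }
static void bar() { barCalled = true; }

// Evaluates the condition from the example and returns how many of the
// test functions were called along the way.
int CountTestCalls(bool r1, bool r2, bool r3, bool r4)
{
    results[0] = r1;
    results[1] = r2;
    results[2] = r3;
    results[3] = r4;
    testCalls = 0;
    barCalled = false;
    if ((foo1() || foo2() || foo3()) && foo4())
    {
        bar();
    }
    return testCalls;
}
```

The worst case of four calls occurs when foo1 and foo2 return false and foo3 returns true, because foo4 must then be evaluated as well. If foo1 returns true, only foo1 and foo4 are called.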

Next I will discuss the basis for my opinion that braces should be considered mandatory.

First, changes over time are easier to track if braces are considered to be mandatory. Consider the following function in which the if statement is written without braces:

/* revision 1 */
void fun()
{
    if (foo())
        bar();
    baz();
}

Every application changes over time. Let's say that the function changes so that the function bar1 needs to be called in addition to the function bar if foo returns true. The function fun becomes as follows:

/* revision 2 */
void fun()
{
    if (foo())
    {
        bar();
        bar1();
    }
    baz();
}

If you use a tool such as diff to determine what changed between revisions 1 and 2 of this function, it will indicate that three lines of code changed. The first change is the addition of the open brace. The second change is the addition of the call to bar1 after the call to bar. The third change is the addition of the closing brace. However, there was only one line of code that changed the actual functionality of the function fun.

If revision 1 of fun were written as follows:

/* revision 1 */
void fun()
{
    if (foo())
    {
        bar();
    }
    baz();
}

then diff would indicate that only one line had changed.

Next, considering braces to be mandatory protects you from possible mistakes by developers making changes to your code when they are in a hurry and under a lot of pressure. Consider the original version of the function fun listed above. Let's assume that a developer wishes to modify the function so that it writes a message to a log file when fun is about to call bar, but they are in a hurry, or perhaps have just finished a task in Python, which uses indentation and not braces to delineate code blocks, and they forget to add the braces. Then the function becomes as follows:

/* revision 2 */
void fun()
{
    if (foo())
        log("fun calling bar because foo returned TRUE.");
        bar();
    baz();
}

This code will compile without warnings. However, it will change the behavior of fun in an obviously unwanted way in that fun is now calling bar even if foo does not return TRUE. Fortunately it is easy to tell that this change in behavior was not intended in this case.
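The change in behavior can be confirmed with a small experiment. The following is only a sketch: the counters and the logMessage function are hypothetical stand-ins (logMessage is used instead of log to avoid a clash with the math library function of that name), and foo is hard-wired to return false. Note that GCC and recent versions of Clang can diagnose this pattern when the -Wmisleading-indentation warning is enabled; the pragma is there only so that the sketch compiles quietly in that case.

```cpp
// Silence the misleading indentation diagnostic for this demonstration.
#if defined(__GNUC__)
#  pragma GCC diagnostic ignored "-Wmisleading-indentation"
#endif

// Hypothetical stand-ins for the functions named in the example.
static int barCalls = 0;
static int logCalls = 0;
static bool foo() { return false; }
static void bar() { ++barCalls; }
static void logMessage(const char *) { ++logCalls; }

void fun()
{
    if (foo())
        logMessage("fun calling bar because foo returned TRUE.");
        bar(); // Not part of the if statement, despite the indentation.
}
```

After one call to fun, logCalls is still zero but barCalls is one, which shows that bar runs even though foo returned false.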

The problem becomes more complicated in situations in which, instead of adding code to log the function call, the task is to have a function be called before bar if foo returns TRUE. Again let's assume that the developer is in a hurry or still has Python on their mind and so forgets to add the braces. Then the function fun becomes as follows:

/* revision 2 */
void fun()
{
    if (foo())
        fun1();
        bar();
    baz();
}

This code will also compile without any warnings. However, determining if this change is in error is not as easy as in the first case in which the change was the addition of a line of code intended for logging. By just looking at the code can you tell with 100% certainty that the developer who made this change did not intend to change the function fun so that bar is called all the time without asking the developer who made the change? If you are using a source control tool such as Subversion to track changes to your software over time and the developer provides detailed change descriptions it is possible that you could. However, under most circumstances, you could not be 100% certain that the change in behavior was not intentional without talking to the developer who made the change. Then what will you do if the developer has died or is unavailable for some other reason?

If braces are considered mandatory, this problem will never come up in your project.

The final reason that braces should be considered mandatory is that they ease code navigation in modern text editors. Most modern text editors have a brace matching capability that allows you to jump to the matching brace. In if, for, and while statements this lets you jump to the end of the statement with a single command. For simple if, for, and while statements this makes no difference. However, there are cases in which the rule that braces are optional for a single statement is misused and code is written like this.

if ({test})
    if ({test1})
        if ({test2})
            for ({start}; {test3}; {next})
            {
                /*
                several thousand lines of code
                */
            }

In the case that you are reading through this code and you know that {test} does not return TRUE, you do not care about what happens if {test} returns TRUE. You want to move past the for loop to find out what happens if {test} returns FALSE. If braces were present for the “if ( {test} )” statement, you could simply press down arrow once and then use the move to matching brace command to move on to that section of code. However, there are no braces so you have to arrow down four times before using the move to matching brace command. If this same code were written as follows, the extra three keystrokes would not be necessary.

if ({test})
{
    if ({test1})
    {
        if ({test2})
        {
            for ({start}; {test3}; {next})
            {
                /*
                several thousand lines of code
                */
            }
        }
    }
}

There is also a debate about where the braces should be placed in code. In all my examples the opening brace is located on its own line. However, many programmers prefer to place the opening brace at the end of the if, for, or while line as follows:

if ({test}) {
    {statement}
}

This is perfectly legal according to the C and C++ language specifications. However, it is my opinion that this should never be done and that the opening brace should always be placed on its own line. Consider the following:

if ({AVeryLongAndComplicatedTestThatGoesOffTheRightEdgeOfTheScreen}) {
    {statement}
    {statement1}
    /*
    several thousand more lines of code
    */
}

In this case, assuming that the test actually does go off the right edge of the screen, can you tell with absolute certainty that {statement1} and the several thousand additional lines of code will only get called if the test returns TRUE without going to the trouble of using the End key to determine whether or not the if line ends in a brace? Simply depending on indentation is not reliable, because the C and C++ language specifications allow different levels of indentation to be used in the same block of code. For example, the following is legal in C and C++.

if ({test})
{
    {statement}
        {statement1}
{statement3}
    {statement4}
}

The fact is that in code where braces are placed at the end of the if, for, or while line, someone reading the code must go to the trouble of using the End key every time an if, for, or while line that goes off the right edge of the screen is encountered, in order to determine whether multiple lines of code or a single line of code gets called when the test returns TRUE. This simply makes the job of reviewing the code much more difficult.

To summarize, braces should be considered mandatory in if, for, and while statements in order to make tracking changes over time easier, to protect you from the harried programmer phenomenon, and to make navigating through your code easier. In addition, the {statement} should never be placed on the same line as the {test} in if, for, or while statements in order to make it easier to debug your code. Finally, braces should never be placed at the end of the if, for, or while line, so that it is easy to determine whether one statement or many statements get called if the {test} returns TRUE when the test is long enough that it goes off the right edge of the screen.

 

Splitting a string in C++

Ben Key:

June 11, 2013; Updated November 25, 2018

Introduction

A common task in programming is to split a delimited string into an array of tokens. For example, it may be necessary to split a string containing spaces into an array of words. This is one area where programming languages like Java and Python surpass C++ since both of these programming languages include support for this in their standard libraries while C++ does not. However, this task can be accomplished in C++ in various ways.

In Java this task could be accomplished using the String.split method as follows.


String Str = "The quick brown fox jumped over the lazy dog.";
String[] Results = Str.split(" ");

In Python this task could be accomplished using the str.split method as follows.


Str = "The quick brown fox jumped over the lazy dog."
Results = Str.split()

In C++ this task is not quite so simple. It can still be accomplished in a variety of different ways.

Using the C runtime library

One option is to use the C runtime library. The strtok family of C runtime library functions (strtok, strtok_r, strtok_s, wcstok, and wcstok_s) can be used to split a string.

Unfortunately, there are various differences between platforms. This necessitates the use of C preprocessor directives to determine what is appropriate for the current platform.

The following code demonstrates the technique.


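// Narrow-character overload. Uses a reentrant tokenizer where one is available.
char* FindToken(
    char* token, const char* delim, char** saveptr)
{
#if (_SVID_SOURCE || _BSD_SOURCE || _POSIX_C_SOURCE >= 1 \
  || _XOPEN_SOURCE || _POSIX_SOURCE)
    return ::strtok_r(token, delim, saveptr);
#elif defined(_MSC_VER) && (_MSC_VER >= 1800)
    return strtok_s(token, delim, saveptr);
#else
    // Fallback: std::strtok keeps its own internal state and ignores saveptr.
    (void)saveptr;
    return std::strtok(token, delim);
#endif
}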

wchar_t* FindToken(
    wchar_t* token, const wchar_t* delim, wchar_t** saveptr)
{
#if ( (defined(__STDC_VERSION__) && (__STDC_VERSION__ >= 199901L)) \
  || (defined(__cplusplus) && (__cplusplus >= 201103L)) )
    return std::wcstok(token, delim, saveptr);
#elif defined(_MSC_VER) && (_MSC_VER >= 1800)
    return wcstok_s(token, delim, saveptr);
#else
    return std::wcstok(token, delim);
#endif
}

char* CopyString(char* destination, const char* source)
{
    return std::strcpy(destination, source);
}

wchar_t* CopyString(wchar_t* destination, const wchar_t* source)
{
    return std::wcscpy(destination, source);
}

template <class charType>
size_t splitWithFindToken(
    const std::basic_string<charType>& str,
    const std::basic_string<charType>& delim,
    std::vector< std::basic_string<charType> >& tokens)
{
    std::unique_ptr<charType[]> ptr = std::make_unique<charType[]>(str.length() + 1);
    memset(ptr.get(), 0, (str.length() + 1) * sizeof(charType));
    CopyString(ptr.get(), str.c_str());
    charType* saveptr;
    charType* token = FindToken(ptr.get(), delim.c_str(), &saveptr);
    while (token != nullptr)
    {
        tokens.push_back(token);
        token = FindToken(nullptr, delim.c_str(), &saveptr);
    }
    return tokens.size();
}
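
For comparison, the same idea can be sketched without any preprocessor directives by relying only on std::strtok, which is available wherever the C++ standard library is. The function name splitWithStrtok is hypothetical and is not part of the code above; the sketch handles narrow strings only and trades away reentrancy for portability.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Sketch: split a narrow string with std::strtok.
// std::strtok overwrites each delimiter it finds with '\0', so it is
// given a writable, null-terminated copy instead of the caller's string.
// Note: std::strtok keeps hidden static state and is not reentrant.
std::vector<std::string> splitWithStrtok(
    const std::string& str, const std::string& delim)
{
    std::vector<std::string> tokens;
    std::vector<char> buffer(str.begin(), str.end());
    buffer.push_back('\0');
    char* token = std::strtok(buffer.data(), delim.c_str());
    while (token != nullptr)
    {
        tokens.push_back(token);
        token = std::strtok(nullptr, delim.c_str());
    }
    return tokens;
}
```

Like the other members of the strtok family, this treats a run of consecutive delimiters as a single separator, so empty tokens are never produced.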

Using the basic_istringstream class

Solution 1: std::istream_iterator

Perhaps the simplest method of accomplishing this task is using the basic_istringstream class as follows.


template <class charType>
size_t splitWithStringStream(
    const std::basic_string<charType>& str,
    std::vector< std::basic_string<charType> >& tokens)
{
    typedef std::basic_string<charType> my_string;
    typedef std::vector< std::basic_string<charType> > my_vector;
    typedef std::basic_istringstream<
        charType, std::char_traits<charType> >
        my_istringstream;
    typedef std::istream_iterator<
        std::basic_string<charType>, charType,
        std::char_traits<charType> >
        my_istream_iterator;
    tokens.clear();
    if (str.empty())
    {
        return 0;
    }
    my_istringstream iss(str);
    std::copy(
        my_istream_iterator{iss}, my_istream_iterator(),
        std::back_inserter<my_vector>(tokens));
    return tokens.size();
}

The splitWithStringStream function can be used as follows.


std::string str("The quick brown fox jumped over the lazy dog.");
std::vector<std::string> tokens;
size_t s = splitWithStringStream(str, tokens);

The splitWithStringStream function has the advantage of using nothing beyond functions that are part of the C++ standard library. To use it you just need to include the following C++ standard library headers: algorithm, iterator, sstream, string, and vector.

An alternate version of the function is as follows.


template <class charType>
size_t splitWithStringStream1(
    const std::basic_string<charType>& str,
    std::vector< std::basic_string<charType> >& tokens)
{
    typedef std::basic_string<charType> my_string;
    typedef std::vector< std::basic_string<charType> > my_vector;
    typedef std::basic_istringstream<
        charType, std::char_traits<charType> >
        my_istringstream;
    typedef std::istream_iterator<
        std::basic_string<charType>, charType,
        std::char_traits<charType> >
        my_istream_iterator;
    tokens.clear();
    if (str.empty())
    {
        return 0;
    }
    my_istringstream iss(str);
    std::vector<my_string> results(
        my_istream_iterator{iss}, my_istream_iterator());
    tokens.swap(results);
    return tokens.size();
}

The splitWithStringStream and splitWithStringStream1 functions do have two drawbacks. First, the functions are potentially inefficient and slow since the entire string is copied into the stream, which takes up just as much memory as the string. Second, they only support whitespace-delimited strings: the stream extraction operator splits on any whitespace (spaces, tabs, and newlines), and no other delimiter can be specified.
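
The second drawback can be seen in a small sketch. The function name splitOnWhitespace is hypothetical; it reads tokens with the stream extraction operator, which is the same operator that std::istream_iterator uses internally.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Sketch: formatted extraction (operator>>) skips any run of whitespace,
// so spaces, tabs, and newlines all end a token, while characters such
// as commas are treated as ordinary token content.
std::vector<std::string> splitOnWhitespace(const std::string& str)
{
    std::vector<std::string> tokens;
    std::istringstream iss(str);
    std::string token;
    while (iss >> token)
    {
        tokens.push_back(token);
    }
    return tokens;
}
```

Given the input "a,b c" this produces the two tokens "a,b" and "c", since the comma is not whitespace.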

Solution 2: std::getline

The following function makes it possible to use a character other than space as the delimiter.


template<typename charType>
size_t splitWithGetLine(
    const std::basic_string<charType>& str,
    const charType delim,
    std::vector< std::basic_string<charType> >& tokens)
{
    typedef std::basic_string<charType> my_string;
    typedef std::basic_istringstream<
        charType, std::char_traits<charType> >
        my_istringstream;
    tokens.clear();
    if (str.empty())
    {
       return 0;
    }
    my_istringstream iss(str);
    my_string token;
    while (std::getline(iss, token, delim))
    {
        tokens.push_back(token);
    }
    return tokens.size();
}

This function can be used as follows.


std::wstring str(L"This is a test.||This is only a test.|This concludes this test.");
std::vector<std::wstring> tokens;
size_t s = splitWithGetLine(str, L'|', tokens);

Note that this solution does not skip empty tokens, so the above example will result in tokens containing four items, one of which would be an empty string.

This function, like the splitWithStringStream and splitWithStringStream1 functions, is potentially inefficient and slow. It does allow the delimiter character to be specified. However, it only supports a single delimiting character. This function does not support strings in which several delimiting characters may be used.

To use the splitWithGetLine function you just need to include the following C++ standard library headers: sstream, string, and vector.
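
If empty tokens are not wanted, one possible adjustment is to test each token before storing it. The following is a sketch only; the name splitWithGetLineSkipEmpty is hypothetical and the function is otherwise modeled on splitWithGetLine.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Sketch: same approach as splitWithGetLine, except that tokens which
// turn out to be empty (for example, between two adjacent delimiters)
// are not stored.
template<typename charType>
std::size_t splitWithGetLineSkipEmpty(
    const std::basic_string<charType>& str,
    const charType delim,
    std::vector< std::basic_string<charType> >& tokens)
{
    tokens.clear();
    std::basic_istringstream<charType> iss(str);
    std::basic_string<charType> token;
    while (std::getline(iss, token, delim))
    {
        if (!token.empty())
        {
            tokens.push_back(token);
        }
    }
    return tokens.size();
}
```

Called with the wide string from the earlier usage example and L'|' as the delimiter, this variant stores three tokens instead of four.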

Using only members of the basic_string class

Solution 1

It is possible to accomplish this task using only member functions of the basic_string class. The following function allows you to specify the delimiting character and uses only the find_first_not_of, find, and substr members of the basic_string class. The function also has optional parameters that allow you to specify that empty tokens should be ignored and to specify a maximum number of segments that the string should be split into.


template<typename charType>
size_t splitWithBasicString(
    const std::basic_string<charType>& str,
    const charType delim,
    std::vector< std::basic_string<charType> > &tokens,
    const bool trimEmpty = false,
    const size_t maxTokens = (size_t)(-1))
{
    typedef std::basic_string<charType> my_string;
    typedef typename my_string::size_type my_size_type;
    tokens.clear();
    if (str.empty())
    {
        return 0;
    }
    my_size_type len = str.length();
    // Skip delimiters at beginning.
    my_size_type left = str.find_first_not_of(delim, 0);
    size_t i = 1;
    if (!trimEmpty && left != 0)
    {
        tokens.push_back(my_string());
        ++i;
    }
    while (i < maxTokens)
    {
        my_size_type right = str.find(delim, left);
        if (right == my_string::npos)
        {
            break;
        }
        if (!trimEmpty || right - left > 0)
        {
            tokens.push_back(str.substr(left, right - left));
            ++i;
        }
        left = right + 1;
    }
    if (left < len)
    {
        tokens.push_back(str.substr(left));
    }
    return tokens.size();
}

This function does not suffer from the same potential performance issues as the stream based functions and it allows you to specify the delimiting character. However, it only supports a single delimiting character. This function does not support strings in which several delimiting characters may be used.

To use the splitWithBasicString function you just need to include the following C++ standard library headers: string and vector.

Solution 2

Sometimes the string that is to be split uses several different delimiting characters. At other times it may simply be impossible to know for certain in advance what delimiting characters are used. In these cases you may know that the delimiting character could be one of several possibilities, so the function must be able to accept a string containing each possible delimiting character. This too can be accomplished using only member functions of the basic_string class.

The following function allows you to specify a set of delimiting characters and uses only the find_first_not_of, find_first_of, and substr members of the basic_string class. The function also has optional parameters that allow you to specify that empty tokens should be ignored and to specify a maximum number of segments that the string should be split into.


template<typename charType>
size_t splitWithBasicString(
    const std::basic_string<charType>& str,
    const std::basic_string<charType>& delim,
    std::vector< std::basic_string<charType> >& tokens,
    const bool trimEmpty = false,
    const size_t maxTokens = (size_t)(-1))
{
    typedef std::basic_string<charType> my_string;
    typedef typename my_string::size_type my_size_type;
    tokens.clear();
    if (str.empty())
    {
       return 0;
    }
    my_size_type len = str.length();
    // Skip delimiters at beginning.
    my_size_type left = str.find_first_not_of(delim, 0);
    size_t i = 1;
    if (!trimEmpty && left != 0)
    {
        tokens.push_back(my_string());
        ++i;
    }
    while (i < maxTokens)
    {
        my_size_type right = str.find_first_of(delim, left);
        if (right == my_string::npos)
        {
           break;
        }
        if (!trimEmpty || right - left > 0)
        {
            tokens.push_back(str.substr(left, right - left));
            ++i;
        }
        left = right + 1;
    }
    if (left < len)
    {
       tokens.push_back(str.substr(left));
    }
    return tokens.size();
}
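
The core of the find_first_not_of and find_first_of loop can be easier to follow in a pared-down form. The following sketch is not the function above: the name splitAnyOf is hypothetical, it works on std::string only, it always discards empty tokens, and it has no maxTokens limit.

```cpp
#include <string>
#include <vector>

// Sketch: split on any character found in delims, discarding empty tokens.
std::vector<std::string> splitAnyOf(
    const std::string& str, const std::string& delims)
{
    std::vector<std::string> tokens;
    // Start of the first token (skips any leading delimiters).
    std::string::size_type left = str.find_first_not_of(delims);
    while (left != std::string::npos)
    {
        // End of the current token, or npos if it runs to the end.
        std::string::size_type right = str.find_first_of(delims, left);
        // substr clamps the count, so an npos result means "to the end".
        tokens.push_back(str.substr(left, right - left));
        // Start of the next token, or npos if there are no more.
        left = str.find_first_not_of(delims, right);
    }
    return tokens;
}
```

It might be used as follows.

```cpp
std::vector<std::string> words =
    splitAnyOf(" The  quick brown fox\tjumped", " \t");
```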

Using Boost

Boost is a collection of peer-reviewed, cross-platform, open source C++ libraries that are designed to complement and extend the C++ standard library. Boost provides at least two methods for splitting a string.

Solution 1

One option is to use the boost::algorithm::split function in the Boost String Algorithms Library.

In order to use the split function simply include boost/algorithm/string.hpp and then call the function as follows.


std::string str(" The  quick brown fox\tjumped over the lazy dog.");
std::vector<std::string> strs;
boost::split(strs, str, boost::is_any_of("\t "));

Solution 2

Another option is to use the Boost Tokenizer Library. In order to use the Boost Tokenizer Library simply include boost/tokenizer.hpp. Then you can use the Boost Tokenizer as follows.


typedef boost::char_separator<char> my_separator;
typedef boost::tokenizer<my_separator> my_tokenizer;
std::string str(" The  quick brown fox\tjumped over the lazy dog.");
my_separator sep(" \t");
my_tokenizer tokens(str, sep);
my_tokenizer::iterator itEnd = tokens.end();
for (my_tokenizer::iterator it = tokens.begin(); it != itEnd; ++it)
{
    std::cout << *it << std::endl;
}

Using the C++ String Toolkit Library

Another option is to use the C++ String Toolkit Library. The following example shows how the strtk::parse function can be used to split a string.


std::string str("The quick brown fox jumped over the lazy dog.");
std::vector<std::string> tokens;
strtk::parse(str, " ", tokens);

Other Options

Of course there are many other options. Feel free to refer to the web pages listed in the references section below for many other options.

Summary

In this article I discussed several options for splitting strings in C++.

The code for the basic_istringstream class, the basic_string class, and Boost solutions, along with a complete sample demonstrating the use of the functions, can be found on Ideone.

References

On the perils of assuming file path manipulation is easy

I recently worked on a bug in which a product developed by my employer was no longer finding user settings files. Everything worked correctly in a prior version of the product but failed in the current version.

The following is a simplified version of the old code.

BOOL GetUserProfilePath(LPWSTR profilePathName)
{
    if (profilePathName == nullptr) return FALSE;
    profilePathName[0] = static_cast<wchar_t>(0);
    std::wstring userPath = GetUserPath();
    if (userPath.empty()) return FALSE;
    TCHAR tempPath[MAX_PATH];
    GetProgramPath(tempPath);
    ::PathAppend(tempPath, userPath.c_str());
    ::PathCanonicalize(profilePathName, tempPath);
    if (profilePathName[0])
    {
        return TRUE;
    }
    return FALSE;
}

The following is a simplified version of the new code.

BOOL GetUserProfilePath(std::wstring& profilePathName)
{
    profilePathName.clear();
    std::wstring userPath = GetUserPath();
    if (userPath.empty()) return FALSE;
    std::wstring tempPath;
    GetProgramPath(tempPath);
    Path::Append(tempPath, userPath);
    wchar_t temp[MAX_PATH];
    PathCanonicalize(temp, tempPath.c_str());
    profilePathName = temp;
    if (!profilePathName.empty())
    {
        return TRUE;
    }
    return FALSE;
}

At first glance these two implementations appear to be equivalent.

Now, here are a few additional details that reveal why they are not in any way equivalent.

First, due to a previously unknown bug, the GetUserPath function returned a complete path, not a relative path. It was, in fact, the program path. Thus, if the program path was “c:\MyApp”, then the value returned by the GetUserPath function was also “c:\MyApp”.

Second, the GetProgramPath function also obtains the program path; thus, in this example its value would also be “c:\MyApp”.

Thus, when the ::PathAppend Win32 API function was called, it was asked to append a full path onto another full path. Microsoft chose, in their infinite wisdom, to protect you from this mistake by trying to *just do the right thing* and generating a valid path anyway. Thus, the bug in the GetUserPath function was harmless.

The person who wrote the Path::Append function was not aware of this. The following was their implementation of the function.

namespace Path
{
    // Various details left out.
    bool Append(std::wstring& dest, const std::wstring& source)
    {
        Path::AddBackslash(dest);
        dest += source;
        return !dest.empty();
    }
}

Therefore, the end result of this change was that the otherwise harmless bug in the GetUserPath function suddenly became a big deal. Before this change, the GetUserProfilePath function returned the string “c:\MyApp” regardless of the bug in the GetUserPath function. After this change, the GetUserProfilePath function returned the string “c:\MyApp\c:\MyApp”, which is an invalid path.

I was asked to fix this with the smallest change possible. The GetUserPath function happens to be part of a deprecated component that we are trying to phase out. Therefore, I was not allowed to touch it. As a result, I chose to fix it by simply modifying the Path::Append function so that it uses the ::PathAppend Win32 API function instead of attempting to do the work itself.
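
For illustration only, the difference between the two behaviors can be sketched in portable C++. This is not the fix described above, which simply calls ::PathAppend, and it is not how ::PathAppend is implemented. The names LooksFullyQualified, NaiveAppend, and GuardedAppend are hypothetical, and the check for a fully qualified path is deliberately simplistic (a drive letter followed by a colon, or a leading double backslash).

```cpp
#include <string>

// Hypothetical, simplified check: "X:..." or "\\server\..." counts as
// fully qualified. Real path rules have many more cases.
bool LooksFullyQualified(const std::wstring& path)
{
    if (path.size() >= 2 && path[1] == L':')
    {
        return true;
    }
    return path.size() >= 2 && path[0] == L'\\' && path[1] == L'\\';
}

// The naive approach: always add a separator and concatenate.
std::wstring NaiveAppend(std::wstring dest, const std::wstring& source)
{
    if (!dest.empty() && dest[dest.size() - 1] != L'\\')
    {
        dest += L'\\';
    }
    return dest + source;
}

// A guarded approach: a fully qualified source replaces dest instead of
// being glued onto the end of it.
std::wstring GuardedAppend(
    const std::wstring& dest, const std::wstring& source)
{
    if (LooksFullyQualified(source))
    {
        return source;
    }
    return NaiveAppend(dest, source);
}
```

With both arguments set to “c:\MyApp”, NaiveAppend produces “c:\MyApp\c:\MyApp” while GuardedAppend produces “c:\MyApp”. Writing even this much correctly is harder than it looks, which is a good argument for calling the platform API instead.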

I am sharing this in the hopes that it could spare you the difficulties I had over the three days it took me to diagnose and fix this bug.