The C++ standard library offers several character conversion facilities. They fall into two groups: C-style conversion functions inherited from C, and C++ facet objects.
Locales and Facets
A locale is a collection of facets that describe local text conventions, including the local character encoding. Each facet describes one text processing scheme.
For example, in the default "C" locale the native narrow multibyte encoding is ASCII, while in the "Chinese (Simplified)_China.936" locale it is GBK. In the "en_US.UTF-8" locale the numeric parsing facets treat . as the decimal point, whereas in the "de_DE.UTF-8" locale they treat , as the decimal point.
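As a rough illustration of how a facet decides what counts as a decimal point, the sketch below installs a hypothetical `comma_decimal` facet into a copy of the classic locale instead of relying on "de_DE.UTF-8", which may not be installed on every system. The names `comma_decimal`, `parse_with` and `comma_locale` are made up for this example.

```cpp
#include <cassert>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet for illustration: the classic numpunct, but with ',' as the decimal point.
class comma_decimal : public std::numpunct<char> {
protected:
    char do_decimal_point() const override { return ','; }
};

// Parses a double from `text` using the numeric facets of `loc`.
double parse_with(const std::locale& loc, const std::string& text) {
    std::istringstream in(text);
    in.imbue(loc);  // only this stream is affected, not the global locale
    double value = 0.0;
    in >> value;
    assert(!in.fail());
    return value;
}

// A locale that differs from the classic "C" locale only in its numpunct facet.
std::locale comma_locale() {
    return std::locale(std::locale::classic(), new comma_decimal);
}
```

With these helpers, `parse_with(std::locale::classic(), "3.5")` and `parse_with(comma_locale(), "3,5")` read the same number from differently punctuated text.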
Note that the global locale of C and the global locale of C++ may be independent of each other.
Character Types
To make the rest easier to follow, we first need to understand the character types in C++.
C++ has five character types: char, wchar_t, char8_t, char16_t and char32_t.
In the latest C++ standard, we can assume that char represents the native narrow multibyte encoding and wchar_t represents the native wide encoding; their concrete encodings depend on the locale. The other three types represent UTF-8, UTF-16 and UTF-32 respectively, and are independent of the locale.
So a function can usually state which encoding it processes through its parameter types or template type parameters.
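A minimal sketch of that idea: two overloads of a hypothetical `count_code_points` function, where the parameter type alone tells the function whether it receives UTF-16 or UTF-32. The function name is invented for illustration, and the UTF-16 overload assumes well-formed input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// UTF-32: every code unit is exactly one code point.
std::size_t count_code_points(const std::u32string& text) {
    return text.size();
}

// UTF-16: a code point outside the Basic Multilingual Plane takes two units
// (a surrogate pair), so count every unit except trailing surrogates 0xDC00..0xDFFF.
std::size_t count_code_points(const std::u16string& text) {
    std::size_t count = 0;
    for (char16_t unit : text) {
        if (unit < 0xDC00 || unit > 0xDFFF) ++count;
    }
    assert(count <= text.size());
    return count;
}
```

Both overloads report two code points for the letter a followed by U+1F600, although the UTF-16 string holds three code units.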
However, for historical reasons, char8_t, the type dedicated to UTF-8, was only added to the standard in C++20, whereas UTF-8 encoded literals were added in C++11 and could be assigned directly to char. For a long time char was therefore allowed to hold UTF-8 encoded characters, and this is where the chaos began.
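The change can be observed at compile time. The sketch below is only an illustration; the names `u8_element`, `u8_literal_is_char` and `e_acute_units` are invented here. It inspects the element type of a u8 literal, which is const char from C++11 to C++17 and const char8_t from C++20 on.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Element type of a u8 string literal: const char before C++20, const char8_t afterwards.
using u8_element = std::remove_reference<decltype(u8"\u00e9"[0])>::type;

// True when the compiler still gives u8 literals the plain char type.
constexpr bool u8_literal_is_char = std::is_same<u8_element, const char>::value;

// U+00E9 occupies two UTF-8 code units whichever element type is in effect
// (the literal's size includes the terminating null, hence the -1).
constexpr std::size_t e_acute_units = sizeof(u8"\u00e9") / sizeof(u8_element) - 1;
static_assert(e_acute_units == 2, "U+00E9 is two bytes in UTF-8");
```

The value of `u8_literal_is_char` depends on the language mode the file is compiled with, which is exactly the ambiguity described above.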
For these reasons, before C++20 some standard library tools treat char as the native narrow multibyte encoding while others treat it as UTF-8. Since C++20 the latter tools are marked as deprecated, but for compatibility they are still retained and keep occupying their names, which leads to behavior that does not match what C++20 would lead you to expect.
C-Style Conversion Functions
In C, the locale is managed as global state: there is only one global locale. Once it has been changed with std::setlocale, every function that depends on the locale, such as std::printf, processes text according to the new locale. The global state also means that this mechanism cannot be used safely in a multi-threaded environment.
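A small sketch of working with this global state: querying the current C locale, and a hypothetical `scoped_c_locale` guard (not a standard class) that restores the previous locale when it goes out of scope. The guard only tidies up after itself; it does not make the global locale thread-safe.

```cpp
#include <cassert>
#include <clocale>
#include <string>

// Returns the name of the current global C locale without changing it.
std::string current_c_locale() {
    const char* name = std::setlocale(LC_ALL, nullptr);  // a null pointer means "query only"
    return name ? name : "";
}

// Hypothetical helper: switches the global C locale for the lifetime of the
// object and restores the previous one afterwards.
class scoped_c_locale {
public:
    explicit scoped_c_locale(const char* name) : saved_(current_c_locale()) {
        switched_ = std::setlocale(LC_ALL, name) != nullptr;
    }
    ~scoped_c_locale() { std::setlocale(LC_ALL, saved_.c_str()); }
    scoped_c_locale(const scoped_c_locale&) = delete;
    scoped_c_locale& operator=(const scoped_c_locale&) = delete;
    bool switched() const { return switched_; }  // false if the locale name was rejected
private:
    std::string saved_;
    bool switched_ = false;
};
```

A program starts in the "C" locale, so `current_c_locale()` returns "C" until something calls std::setlocale with another name.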
Here is a brief introduction to the C conversion functions. The main documentation is here.
First come the conversions between the native encodings of char and wchar_t: the std::mbsrtowcs and std::wcsrtombs functions. In actual testing they treat char as the native narrow multibyte encoding and convert correctly, and they are affected by the global locale.
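The usual two-pass pattern for these functions can be sketched as follows. The wrappers `narrow_to_wide` and `wide_to_narrow` are illustrative names, not standard functions; they convert under whatever global C locale is active and stop at the first embedded null character.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <stdexcept>
#include <string>

// Converts a native narrow multibyte string to a wide string under the current global C locale.
std::wstring narrow_to_wide(const std::string& in) {
    std::mbstate_t state = std::mbstate_t();
    const char* src = in.c_str();
    // With a null destination, mbsrtowcs only reports the required length.
    const std::size_t len = std::mbsrtowcs(nullptr, &src, 0, &state);
    if (len == static_cast<std::size_t>(-1))
        throw std::runtime_error("invalid multibyte sequence for the current locale");
    std::wstring out(len, L'\0');
    src = in.c_str();
    state = std::mbstate_t();
    const std::size_t written = std::mbsrtowcs(&out[0], &src, len, &state);
    assert(written == len);
    (void)written;
    return out;
}

// The reverse direction, using wcsrtombs in the same measure-then-convert pattern.
std::string wide_to_narrow(const std::wstring& in) {
    std::mbstate_t state = std::mbstate_t();
    const wchar_t* src = in.c_str();
    const std::size_t len = std::wcsrtombs(nullptr, &src, 0, &state);
    if (len == static_cast<std::size_t>(-1))
        throw std::runtime_error("character not representable in the current locale");
    std::string out(len, '\0');
    src = in.c_str();
    state = std::mbstate_t();
    std::wcsrtombs(&out[0], &src, len, &state);
    return out;
}
```

For text beyond ASCII, the result depends on the locale selected with std::setlocale before the call.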
For the Unicode encodings, however, std::mbrtoc16, std::c16rtomb, std::mbrtoc32 and std::c32rtomb, added in C++11, treat their char parameters as UTF-8 and not as the native narrow multibyte encoding. They are not affected by the global locale, and this remains true even after C++20.
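The calling convention of these functions is shown in the sketch below, which decodes a char string one character at a time with std::mbrtoc32. The wrapper name `to_utf32` is an assumption made for this example. The usage shown is restricted to ASCII input, for which the UTF-8 and native narrow interpretations of char agree.

```cpp
#include <cassert>
#include <cstddef>
#include <cuchar>
#include <cwchar>
#include <stdexcept>
#include <string>

// Decodes `in` into UTF-32 by calling std::mbrtoc32 once per character.
std::u32string to_utf32(const std::string& in) {
    std::u32string out;
    std::mbstate_t state = std::mbstate_t();
    const char* p = in.data();
    const char* const end = p + in.size();
    while (p < end) {
        char32_t c = 0;
        const std::size_t n =
            std::mbrtoc32(&c, p, static_cast<std::size_t>(end - p), &state);
        if (n == static_cast<std::size_t>(-1) || n == static_cast<std::size_t>(-2))
            throw std::runtime_error("invalid or truncated multibyte sequence");
        out.push_back(c);
        if (n == static_cast<std::size_t>(-3)) continue;  // output-only step, no input consumed
        p += (n == 0 ? 1 : n);  // 0 signals a null character, which is one byte long
    }
    assert(out.size() <= in.size());
    return out;
}
```

For example, `to_utf32("Hi!")` yields the three code points of U"Hi!".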
As for std::mbrtoc8 and std::c8rtomb from C++20, their char parameters clearly represent the native narrow multibyte encoding, but no compiler actually implements them. This means that the C-style conversion functions cannot convert between UTF-8 and the native encoding.
C++ Facet Objects
In C++, a locale is represented by the type std::locale, and users can create several independent locale objects. A tool that depends on the locale can then be handed a locale object directly, without modifying the global locale, so this approach can be used in a multi-threaded environment. C++ also provides std::locale::global, an interface for changing the global locale through an object.
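A minimal sketch of passing a locale object to a locale-dependent tool, here the two-argument std::toupper from the locale header. The helper names `to_upper` and `swap_global` are invented for illustration.

```cpp
#include <cassert>
#include <locale>
#include <string>

// Upper-cases `text` according to the ctype facet of `loc`; the global locale is not touched.
std::string to_upper(const std::string& text, const std::locale& loc) {
    std::string out = text;
    for (char& ch : out) ch = std::toupper(ch, loc);  // two-argument overload from <locale>
    return out;
}

// Installs `loc` as the global C++ locale and returns the one it replaced.
std::locale swap_global(const std::locale& loc) {
    return std::locale::global(loc);
}
```

Because `to_upper` receives its locale as an argument, two threads can call it with different locale objects without interfering with each other, whereas `swap_global` changes state shared by the whole program.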
Each locale object holds a collection of facets, so to perform a given kind of text processing we only need to obtain the corresponding facet from the locale object. std::has_facet queries whether a locale implements the required facet, and std::use_facet obtains a reference to that facet from the locale. We can also define our own facets and add them to a locale, but I won't go into details here. You only need to know that std::use_facet returns a reference to the base class of the facet implementation, and that the actual implementation is reached through virtual function calls.
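The lookup mechanism can be sketched with a made-up facet. Everything named `greeting_facet`, `german_greeting` and `greet` below is hypothetical; only std::locale::facet, std::locale::id, std::has_facet and std::use_facet come from the standard library.

```cpp
#include <cassert>
#include <cstddef>
#include <locale>
#include <string>

// Hypothetical custom facet: a class derived from std::locale::facet that
// carries a static std::locale::id member named `id`.
class greeting_facet : public std::locale::facet {
public:
    static std::locale::id id;
    explicit greeting_facet(std::size_t refs = 0) : std::locale::facet(refs) {}
    virtual std::string greeting() const { return "hello"; }
};
std::locale::id greeting_facet::id;

// A derived implementation; callers reach it through the base-class reference.
class german_greeting : public greeting_facet {
public:
    std::string greeting() const override { return "hallo"; }
};

// Returns the greeting of `loc`, or an empty string if the facet is absent.
std::string greet(const std::locale& loc) {
    if (!std::has_facet<greeting_facet>(loc)) return "";
    return std::use_facet<greeting_facet>(loc).greeting();  // virtual dispatch
}
```

A locale built as `std::locale(std::locale::classic(), new german_greeting)` takes ownership of the facet, and `greet` then returns "hallo" even though it only knows the base class.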
The facet used for encoding conversion is std::codecvt, whose first two template parameters are the character types to convert between. The standard library guarantees that two specializations are implemented: std::codecvt<char, char, std::mbstate_t>, the identity conversion facet, and std::codecvt<wchar_t, char, std::mbstate_t>, the facet that converts between the native narrow multibyte encoding and the native wide encoding.
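These guarantees can be checked directly. The sketch below uses two illustrative helpers, `has_required_codecvt` and `identity_is_noop`, whose names are assumptions of this example.

```cpp
#include <cassert>
#include <cwchar>
#include <locale>

using identity_codecvt = std::codecvt<char, char, std::mbstate_t>;
using wide_codecvt = std::codecvt<wchar_t, char, std::mbstate_t>;

// True if `loc` provides both conversion facets mentioned above.
bool has_required_codecvt(const std::locale& loc) {
    return std::has_facet<identity_codecvt>(loc) && std::has_facet<wide_codecvt>(loc);
}

// True if the char-to-char facet of `loc` reports that it never converts anything.
bool identity_is_noop(const std::locale& loc) {
    return std::use_facet<identity_codecvt>(loc).always_noconv();
}
```

Both helpers return true for std::locale::classic().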
As mentioned earlier, char represents the native narrow multibyte encoding in the two required specializations above. Yet when C++11 added std::codecvt<char16_t, char, std::mbstate_t> and std::codecvt<char32_t, char, std::mbstate_t> as required specializations, it defined char in them as UTF-8. Both interpretations of char appear even among specializations of the same template! The latter two, however, are marked as deprecated in C++20.
Here is an example of converting encodings with std::codecvt<wchar_t, char, std::mbstate_t>.
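#include <cstring>   // std::strlen
#include <cwchar>    // std::mbstate_t
#include <iterator>  // std::size (C++17)
#include <locale>    // std::locale, std::use_facet, std::codecvt

int main() {
    // Define the input and output strings
    char in_str[] = "Hello, world!";
    wchar_t out_str[16];
    // Define the cursors; after the conversion they point just past the consumed input and the written output
    const char* in_ptr = in_str;
    wchar_t* out_ptr = out_str;
    // Construct the user-preferred locale, i.e. the system default locale
    std::locale loc = std::locale("");
    // Fetch the facet that converts between the native narrow multibyte and native wide encodings
    const auto& facet = std::use_facet<std::codecvt<wchar_t, char, std::mbstate_t>>(loc);
    // Define the conversion state, which stores intermediate state; unused here because the conversion is done in one call
    std::mbstate_t state = std::mbstate_t();
    // Perform the conversion; the output limit counts elements, not bytes, and leaves room for the terminator
    facet.in(state, in_str, in_str + std::strlen(in_str), in_ptr,
             out_str, out_str + std::size(out_str) - 1, out_ptr);
    // The facet does not write a terminating null character, so add one
    *out_ptr = L'\0';
}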
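The example above ignores the value returned by the facet. A slightly more defensive sketch wraps the same call in a function and checks the result code; the name `widen` and the choice to throw on any result other than ok are assumptions of this sketch, not part of the standard interface.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Converts `in` from the native narrow multibyte encoding of `loc` to its wide encoding.
std::wstring widen(const std::string& in, const std::locale& loc) {
    using facet_t = std::codecvt<wchar_t, char, std::mbstate_t>;
    const facet_t& facet = std::use_facet<facet_t>(loc);
    std::mbstate_t state = std::mbstate_t();
    // Assumption: one wide character per input byte is enough room for the output.
    std::wstring out(in.size(), L'\0');
    const char* from_next = in.data();
    wchar_t* to_next = &out[0];
    const std::codecvt_base::result r =
        facet.in(state, in.data(), in.data() + in.size(), from_next,
                 &out[0], &out[0] + out.size(), to_next);
    // partial means truncated input or too little output space; error means invalid input.
    if (r != std::codecvt_base::ok)
        throw std::runtime_error("codecvt::in did not complete");
    assert(from_next == in.data() + in.size());
    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}
```

Called as `widen("Hello, world!", std::locale::classic())`, it returns the wide string L"Hello, world!"; passing `std::locale("")` instead uses the user-preferred locale as in the example above.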
Regarding platform support, the user-preferred locale of MSVC on Windows 11 implements conversion facets between any two of the five character types. However, because C++11 added the std::codecvt<char16_t, char, std::mbstate_t> and std::codecvt<char32_t, char, std::mbstate_t> facets to the standard, MSVC is forced to define char as UTF-8 in every conversion facet that does not involve wchar_t, which makes most of the char-related facets unusable. In addition, in the MSVC implementation the char32_t-related conversion facets may report success when converting characters outside the Unicode Basic Multilingual Plane even though the result is actually incorrect. Use them with caution.
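One way to act on that caution is to cross-check a facet's output against an independent decoder. The sketch below is such a reference: a hand-written decoder for a single UTF-8 sequence. The function name `decode_utf8_first` is invented for this example, and it is not the method used by any standard library implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decodes the first UTF-8 sequence in `s` into `out`; returns false on malformed input.
bool decode_utf8_first(const std::string& s, char32_t& out) {
    if (s.empty()) return false;
    const unsigned char b0 = static_cast<unsigned char>(s[0]);
    std::size_t len = 0;
    char32_t cp = 0;
    if (b0 < 0x80)                { len = 1; cp = b0; }
    else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
    else return false;  // stray continuation byte or invalid lead byte
    if (s.size() < len) return false;
    for (std::size_t i = 1; i < len; ++i) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
        cp = (cp << 6) | (b & 0x3F);
    }
    // Reject overlong forms, surrogates and values beyond U+10FFFF.
    static const char32_t min_for_len[] = {0, 0, 0x80, 0x800, 0x10000};
    if (cp < min_for_len[len]) return false;
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;
    if (cp > 0x10FFFF) return false;
    assert(len >= 1 && len <= 4);
    out = cp;
    return true;
}
```

For the four bytes F0 9F 98 80 the function yields U+1F600, a character outside the Basic Multilingual Plane; a facet that reports success but produces a different value for the same input should not be trusted.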
The user-preferred locale of GCC on Linux implements only the few conversion facets required by the standard, but its conversion results are reliable.