ICU 78.3  78.3
 All Data Structures Namespaces Files Functions Variables Typedefs Enumerations Enumerator Friends Macros Groups Pages
Data Structures | Typedefs | Enumerations | Functions | Variables
utfiterator.h File Reference

C++ header-only API: C++ iterators over Unicode strings (=UTF-8/16/32 if well-formed). More...

#include "unicode/utypes.h"
#include <iterator>
#include <string>
#include <string_view>
#include <type_traits>
#include "unicode/utf16.h"
#include "unicode/utf8.h"
#include "unicode/uversion.h"

Go to the source code of this file.

Data Structures

struct  U_HEADER_ONLY_NAMESPACE::prv::range_type< Range, typename >
 
struct  U_HEADER_ONLY_NAMESPACE::prv::range_type< Range, std::void_t< decltype(std::declval< Range >().begin()), decltype(std::declval< Range >().end())> >
 
struct  U_HEADER_ONLY_NAMESPACE::prv::is_basic_string_view< T >
 
struct  U_HEADER_ONLY_NAMESPACE::prv::is_basic_string_view< std::basic_string_view< Args...> >
 
class  U_HEADER_ONLY_NAMESPACE::prv::CodePointsIterator< CP32, skipSurrogates >
 
class  U_HEADER_ONLY_NAMESPACE::AllCodePoints< CP32 >
 A C++ "range" over all Unicode code points U+0000..U+10FFFF. More...
 
class  U_HEADER_ONLY_NAMESPACE::AllScalarValues< CP32 >
 A C++ "range" over all Unicode scalar values U+0000..U+D7FF & U+E000..U+10FFFF. More...
 
class  U_HEADER_ONLY_NAMESPACE::UnsafeCodeUnits< CP32, UnitIter, typename >
 Result of decoding a code unit sequence for one code point. More...
 
class  U_HEADER_ONLY_NAMESPACE::CodeUnits< CP32, UnitIter, typename >
 Result of validating and decoding a code unit sequence for one code point. More...
 
class  U_HEADER_ONLY_NAMESPACE::UTFIterator< CP32, behavior, UnitIter, LimitIter, typename >
 Validating iterator over the code points in a Unicode string. More...
 
class  U_HEADER_ONLY_NAMESPACE::UTFStringCodePoints< CP32, behavior, Range >
 A C++ "range" for validating iteration over all of the code points of a code unit range. More...
 
struct  U_HEADER_ONLY_NAMESPACE::UTFStringCodePointsAdaptor< CP32, behavior >
 
class  U_HEADER_ONLY_NAMESPACE::UnsafeUTFIterator< CP32, UnitIter, typename >
 Non-validating iterator over the code points in a Unicode string. More...
 
class  U_HEADER_ONLY_NAMESPACE::UnsafeUTFStringCodePoints< CP32, Range >
 A C++ "range" for non-validating iteration over all of the code points of a code unit range. More...
 
struct  U_HEADER_ONLY_NAMESPACE::UnsafeUTFStringCodePointsAdaptor< CP32 >
 

Typedefs

typedef enum UTFIllFormedBehavior UTFIllFormedBehavior
 Some defined behaviors for handling ill-formed Unicode strings. More...
 
template<typename Iter >
using U_HEADER_ONLY_NAMESPACE::prv::iter_value_t = typename std::iterator_traits< Iter >::value_type
 
template<typename Iter >
using U_HEADER_ONLY_NAMESPACE::prv::iter_difference_t = typename std::iterator_traits< Iter >::difference_type
 

Enumerations

enum  UTFIllFormedBehavior { UTF_BEHAVIOR_NEGATIVE, UTF_BEHAVIOR_FFFD, UTF_BEHAVIOR_SURROGATE }
 Some defined behaviors for handling ill-formed Unicode strings. More...
 

Functions

template<typename CP32 , UTFIllFormedBehavior behavior, typename UnitIter , typename LimitIter = UnitIter>
auto U_HEADER_ONLY_NAMESPACE::utfIterator (UnitIter start, UnitIter p, LimitIter limit)
 UTFIterator factory function for start <= p < limit. More...
 
template<typename CP32 , UTFIllFormedBehavior behavior, typename UnitIter , typename LimitIter = UnitIter>
auto U_HEADER_ONLY_NAMESPACE::utfIterator (UnitIter p, LimitIter limit)
 UTFIterator factory function for start = p < limit. More...
 
template<typename CP32 , UTFIllFormedBehavior behavior, typename UnitIter >
auto U_HEADER_ONLY_NAMESPACE::utfIterator (UnitIter p)
 UTFIterator factory function for a start or limit sentinel. More...
 
template<typename CP32 , typename UnitIter >
auto U_HEADER_ONLY_NAMESPACE::unsafeUTFIterator (UnitIter iter)
 UnsafeUTFIterator factory function. More...
 

Variables

template<typename Iter >
constexpr bool U_HEADER_ONLY_NAMESPACE::prv::forward_iterator
 
template<typename Iter >
constexpr bool U_HEADER_ONLY_NAMESPACE::prv::bidirectional_iterator
 
template<typename Range >
constexpr bool U_HEADER_ONLY_NAMESPACE::prv::range = range_type<Range>::value
 
template<typename T >
constexpr bool U_HEADER_ONLY_NAMESPACE::prv::is_basic_string_view_v = is_basic_string_view<T>::value
 
template<typename CP32 , UTFIllFormedBehavior behavior>
constexpr
UTFStringCodePointsAdaptor
< CP32, behavior > 
U_HEADER_ONLY_NAMESPACE::utfStringCodePoints
 Range adaptor function object returning a UTFStringCodePoints object that represents a "range" of code points in a code unit range, which validates while decoding. More...
 
template<typename CP32 >
constexpr
UnsafeUTFStringCodePointsAdaptor
< CP32 > 
U_HEADER_ONLY_NAMESPACE::unsafeUTFStringCodePoints
 Range adaptor function object returning an UnsafeUTFStringCodePoints object that represents a "range" of code points in a code unit range. More...
 

Detailed Description

C++ header-only API: C++ iterators over Unicode strings (=UTF-8/16/32 if well-formed).

Sample code:

#include <string_view>
#include <iostream>
#include "unicode/utypes.h"
using icu::header::utfIterator;
using icu::header::utfStringCodePoints;
using icu::header::unsafeUTFIterator;
using icu::header::unsafeUTFStringCodePoints;
int32_t rangeLoop16(std::u16string_view s) {
// We are just adding up the code points for minimal-code demonstration purposes.
int32_t sum = 0;
for (auto units : utfStringCodePoints<UChar32, UTF_BEHAVIOR_NEGATIVE>(s)) {
sum += units.codePoint(); // < 0 if ill-formed
}
return sum;
}
int32_t loopIterPlusPlus16(std::u16string_view s) {
auto range = utfStringCodePoints<char32_t, UTF_BEHAVIOR_FFFD>(s);
int32_t sum = 0;
for (auto iter = range.begin(), limit = range.end(); iter != limit;) {
sum += (*iter++).codePoint(); // U+FFFD if ill-formed
}
return sum;
}
int32_t backwardLoop16(std::u16string_view s) {
auto range = utfStringCodePoints<UChar32, UTF_BEHAVIOR_SURROGATE>(s);
int32_t sum = 0;
for (auto start = range.begin(), iter = range.end(); start != iter;) {
sum += (*--iter).codePoint(); // surrogate code point if unpaired / ill-formed
}
return sum;
}
int32_t reverseLoop8(std::string_view s) {
auto range = utfStringCodePoints<char32_t, UTF_BEHAVIOR_FFFD>(s);
int32_t sum = 0;
for (auto iter = range.rbegin(), limit = range.rend(); iter != limit; ++iter) {
sum += iter->codePoint(); // U+FFFD if ill-formed
}
return sum;
}
int32_t countCodePoints16(std::u16string_view s) {
auto range = utfStringCodePoints<UChar32, UTF_BEHAVIOR_SURROGATE>(s);
return std::distance(range.begin(), range.end());
}
int32_t unsafeRangeLoop16(std::u16string_view s) {
int32_t sum = 0;
for (auto units : unsafeUTFStringCodePoints<UChar32>(s)) {
sum += units.codePoint();
}
return sum;
}
int32_t unsafeReverseLoop8(std::string_view s) {
auto range = unsafeUTFStringCodePoints<UChar32>(s);
int32_t sum = 0;
for (auto iter = range.rbegin(), limit = range.rend(); iter != limit; ++iter) {
sum += iter->codePoint();
}
return sum;
}
char32_t firstCodePointOrFFFD16(std::u16string_view s) {
if (s.empty()) { return 0xfffd; }
auto range = utfStringCodePoints<char32_t, UTF_BEHAVIOR_FFFD>(s);
return range.begin()->codePoint();
}
std::string_view firstSequence8(std::string_view s) {
if (s.empty()) { return {}; }
auto range = utfStringCodePoints<char32_t, UTF_BEHAVIOR_FFFD>(s);
auto units = *(range.begin());
if (units.wellFormed()) {
return units.stringView();
} else {
return {};
}
}
template<typename InputStream> // some istream or streambuf
std::u32string cpFromInput(InputStream &in) {
// This is a single-pass input_iterator.
std::istreambuf_iterator bufIter(in);
std::istreambuf_iterator<typename InputStream::char_type> bufLimit;
auto iter = utfIterator<char32_t, UTF_BEHAVIOR_FFFD>(bufIter);
auto limit = utfIterator<char32_t, UTF_BEHAVIOR_FFFD>(bufLimit);
std::u32string s32;
for (; iter != limit; ++iter) {
s32.push_back(iter->codePoint());
}
return s32;
}
std::u32string cpFromStdin() { return cpFromInput(std::cin); }
std::u32string cpFromWideStdin() { return cpFromInput(std::wcin); }

Definition in file utfiterator.h.

Typedef Documentation

Some defined behaviors for handling ill-formed Unicode strings.

This is a template parameter for UTFIterator and related classes.

When a validating UTFIterator encounters an ill-formed code unit sequence, then CodeUnits.codePoint() is a value according to this parameter.

Draft:
This API may be changed in the future versions and was introduced in ICU 78
See Also
CodeUnits
UTFIterator
UTFStringCodePoints

Enumeration Type Documentation

Some defined behaviors for handling ill-formed Unicode strings.

This is a template parameter for UTFIterator and related classes.

When a validating UTFIterator encounters an ill-formed code unit sequence, then CodeUnits.codePoint() is a value according to this parameter.

Draft:
This API may be changed in the future versions and was introduced in ICU 78
See Also
CodeUnits
UTFIterator
UTFStringCodePoints
Enumerator
UTF_BEHAVIOR_NEGATIVE 

Returns a negative value (-1=U_SENTINEL) instead of a code point.

If the CP32 template parameter for the relevant classes is an unsigned type, then the negative value becomes 0xffffffff=UINT32_MAX.

Draft:
This API may be changed in the future versions and was introduced in ICU 78
UTF_BEHAVIOR_FFFD 

Returns U+FFFD Replacement Character.

Draft:
This API may be changed in the future versions and was introduced in ICU 78
UTF_BEHAVIOR_SURROGATE 

UTF-8: Not allowed; UTF-16: returns the unpaired surrogate; UTF-32: returns the surrogate code point, or U+FFFD if out of range.

Draft:
This API may be changed in the future versions and was introduced in ICU 78

Definition at line 149 of file utfiterator.h.