pcosmos.ca

Philippe Choquette's universe

ICU C++ use example on MacOS

When I copy files from my Mac to my NAS, file names containing composed accented characters are automatically decomposed in the copies. As a result, my backup tool treats those copies as files without originals. See Unicode Normalization Forms for examples of composed characters. As a solution, I wrote a program that renames all the original files to their decomposed form, so that the automatic decomposition no longer changes anything. Here are the first steps that led to this solution.

Installing icu4c with Homebrew:

brew install icu4c

Installing pkgconf with Homebrew:

brew install pkgconf

Making /opt visible in Finder:

sudo chflags nohidden /opt

Setting PKG_CONFIG_PATH to the right value:

PKG_CONFIG_PATH=/opt/homebrew/Cellar/icu4c@77/77.1/lib/pkgconfig
export PKG_CONFIG_PATH

Transliteration example:

#include <iostream>
#include <string>
#include <unicode/unistr.h>
#include <unicode/translit.h>

int main(void)
{
    std::string init("t\xC3\xA4st"); // täst

    icu::UnicodeString ustrc = icu::UnicodeString::fromUTF8(init.c_str());

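    // Print the UTF-16 code units of the composed string.
    for (int32_t i = 0; i < ustrc.length(); i++)
    {
        std::cout << std::hex << static_cast<unsigned>(ustrc.charAt(i)) << " ";
    }
    std::cout << std::endl;

    UErrorCode status = U_ZERO_ERROR;
    icu::Transliterator *myTrans = icu::Transliterator::createInstance("Any-NFD",
        UTRANS_FORWARD, status);
    if (U_FAILURE(status))
    {
        std::cerr << u_errorName(status) << std::endl;
        return 1;
    }

    // Decompose in place, then print the code units again. The string is
    // read through charAt() because a pointer obtained from getBuffer()
    // before the call may no longer be valid once the string is modified.
    myTrans->transliterate(ustrc);
    delete myTrans;
    for (int32_t i = 0; i < ustrc.length(); i++)
    {
        std::cout << std::hex << static_cast<unsigned>(ustrc.charAt(i)) << " ";
    }
    std::cout << std::endl;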

    std::string result;
    icu::StringByteSink<std::string> bs(&result);
    ustrc.toUTF8(bs);

    return 0;
}

Explanation:
0xC3 0xA4 is the UTF-8 encoding of the composed ä.
See Unicode Character ä
Any-NFD is a predefined transliterator ID that converts text in any script to Normalization Form D (NFD), the canonical decomposition.
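To check by hand that 0xC3 0xA4 really encodes U+00E4, here is a minimal sketch of a UTF-8 decoder. It is not part of the original program and does not use ICU; the function name decode_first_utf8 is made up for this illustration, and the code assumes well-formed input without validating it.

```cpp
#include <string>

// Decode the first code point of a UTF-8 string.
// Sketch only: assumes well-formed input and performs no validation.
// decode_first_utf8 is a hypothetical helper, not an ICU function.
char32_t decode_first_utf8(const std::string &s)
{
    const unsigned char b0 = static_cast<unsigned char>(s[0]);
    if (b0 < 0x80)
    {
        return b0; // single byte: 0xxxxxxx
    }

    // Number of continuation bytes announced by the lead byte.
    const int extra = (b0 >= 0xF0) ? 3 : (b0 >= 0xE0) ? 2 : 1;

    // Payload bits of the lead byte: 5, 4 or 3 bits.
    char32_t cp = b0 & (0x3F >> extra);

    // Each continuation byte (10xxxxxx) contributes 6 more bits.
    for (int i = 1; i <= extra; i++)
    {
        cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
    }
    return cp;
}
```

For 0xC3 0xA4, the lead byte contributes 0x03 and the continuation byte 0x24, so the result is (0x03 << 6) | 0x24 = 0xE4.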

To build:

c++ -o example example.cpp -std=c++17 `pkg-config --libs --cflags icu-uc icu-i18n`

The icu-uc parameter is necessary for the data types and the icu-i18n parameter is necessary to link with createInstance and transliterate.
Reference: How To Use ICU

The program displays:

74 e4 73 74
74 61 308 73 74

The first line contains e4 because 0x00E4 is the UTF-16 encoding of the composed ä.
The second line contains 61 308 because U+0061 followed by U+0308 is the canonical decomposition of ä: U+0061 is "a" and U+0308 is the combining trema (diaeresis) on its own. result will begin with 0x74 0x61 0xCC 0x88 because the UTF-8 encoding of U+0308 is 0xCC 0x88.
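The byte values above can be verified with a small UTF-8 encoder. This is only a sketch for checking the arithmetic: it does not use ICU, the name encode_utf8 is made up for this illustration, and it does not reject surrogates or values above U+10FFFF.

```cpp
#include <string>

// Encode one Unicode code point as UTF-8.
// Sketch only: no check for surrogates or values above U+10FFFF.
// encode_utf8 is a hypothetical helper, not an ICU function.
std::string encode_utf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80)
    {
        out += static_cast<char>(cp);                         // 0xxxxxxx
    }
    else if (cp < 0x800)
    {
        out += static_cast<char>(0xC0 | (cp >> 6));           // 110xxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));         // 10xxxxxx
    }
    else if (cp < 0x10000)
    {
        out += static_cast<char>(0xE0 | (cp >> 12));          // 1110xxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
        out += static_cast<char>(0xF0 | (cp >> 18));          // 11110xxx
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For U+0308, the top bits give 0xC0 | 0x0C = 0xCC and the low six bits give 0x80 | 0x08 = 0x88, which matches the two bytes following 0x61 in result. The same function maps U+00E4 back to 0xC3 0xA4, the composed form used in the input string.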
