I'm trying to write a function that runs the fribidi algorithm on a std::string and returns a reordered std::string. I want it to be safe for any std::string, and if something fails along the way, it should return the original std::string.
I saw many examples online that use std::wstring, but I wonder whether I can avoid that conversion. Here's my attempt (I may have forgotten some includes).
# fribidi-test.cpp
#include <cstring>
#include <iostream>
#include <string>
#include <stdio.h>
#define FRIBIDI_NO_DEPRECATED
#include <fribidi/fribidi.h>
std::string fribidi_str_convert(std::string string_orig) {
std::cerr << "dbg: orig: " + string_orig + "\n";
FriBidiChar fribidi_in_char;
FriBidiStrIndex fribidi_len = fribidi_charset_to_unicode(
FRIBIDI_CHAR_SET_UTF8,
string_orig.c_str(),
string_orig.size(),
&fribidi_in_char
);
fprintf(stderr, "len is %i\n", fribidi_len);
// https://github.com/fribidi/fribidi#api
// Let fribidi work out the main direction on its own (https://stackoverflow.com/q/58166995/4935114)
FriBidiCharType fribidi_pbase_dir = FRIBIDI_TYPE_LTR;
// Prepare output variable
FriBidiChar fribidi_visual_char;
fribidi_boolean stat = fribidi_log2vis(
/* input */
&fribidi_in_char,
fribidi_len,
&fribidi_pbase_dir,
/* output */
&fribidi_visual_char,
NULL,
NULL,
NULL
);
fprintf(stderr, "stat is: %d\n", stat);
if (stat) {
char string_formatted_ptr;
// Convert from fribidi unicode back to ptr
FriBidiStrIndex new_len = fribidi_unicode_to_charset(
FRIBIDI_CHAR_SET_UTF8,
&fribidi_visual_char,
fribidi_len,
&string_formatted_ptr
);
fprintf(stderr, "new_len is: %d\n", new_len);
if (new_len) {
fprintf(stderr, "string_formatted_ptr is: %s\n", &string_formatted_ptr);
std::string string_formatted_out(&string_formatted_ptr, new_len);
return string_formatted_out;
};
};
return string_orig;
};
int main() {
std::string orig = "אריק איינשטיין";
std::cerr << "main: orig: " + orig + "\n";
std::cerr << "main: transformed: " + fribidi_str_convert(orig) + "\n";
};
I compile and run it with:
g++ $(pkg-config --libs fribidi) fribidi-test.cpp -o fribidi-test && ./fribidi-test
My problem is that I'm getting a malformed output:
main: orig: ןייטשנייא קירא
dbg: orig: ןייטשנייא קירא
len is 14
stat is: 2
new_len is: 27
string_formatted_ptr is: אĐןייטשנייא קי
main: transformed: אĐןייטשנייא ק
That Đ character is not supposed to be there. What I want to get is:
main: orig: ןייטשנייא קירא
dbg: orig: ןייטשנייא קירא
len is 14
stat is: 2
new_len is: 27
string_formatted_ptr is: אריק איינשטיין
main: transformed: אריק איינשטיין
Is this related to UTF-16 encoding, and to the fact that the new length is 27, almost twice the original length?
Passing the address of a single FriBidiChar (&fribidi_in_char, &fribidi_visual_char) or of a single char (&string_formatted_ptr) as an output buffer is very wrong. You can't expect to store a string in one character. A char is a char. It is not a pointer and it is not a string. Remember to compile your programs with -fsanitize=undefined and also check them with valgrind.
Also, instead of }; just use }. There is no need for ; after } in these cases.
It's cstdio in C++, not stdio.h.
Prefer << string << string to << string + string, which (I think) reduces memory allocations.
The fribidi API is bad, because I do not see how to calculate the memory needed for fribidi_charset_to_unicode. Even the fribidi program (https://github.com/fribidi/fribidi/blob/cffa3047a0db9f4cd391d68bf98ce7b7425be245/bin/fribidi-main.c#L64) just uses a constant, very large buffer size. The fribidi program is also an example that does not use std::wstring, because it is written in C.
The following program uses a constant big buffer size, like the fribidi program:
and outputs:
Knowing that FriBidiChar is uint32_t, that fribidi internally uses UTF-32, and that wchar_t on Linux is UTF-32, it would be preferable to use std::wstring (or wchar_t) to know how much memory to allocate. You could also count code points in the UTF-8 input string and then precalculate the length of the UTF-8 representation of fribidi_visual_char to allocate memory for fribidi_unicode_to_charset.
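As a rough illustration of that last suggestion, the helpers below count the code points in a UTF-8 string and compute how many bytes a UTF-32 string needs once it is encoded as UTF-8. The names utf8_code_points, utf8_size_of_code_point, and utf8_byte_length are made up for this sketch and are not part of fribidi. The code assumes well-formed UTF-8 and does no validation, and std::u32string only stands in for a buffer of FriBidiChar.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count code points in a UTF-8 string by counting every
// byte that is NOT a continuation byte (continuation bytes look like 10xxxxxx).
// Assumes well-formed UTF-8; no validation is done.
std::size_t utf8_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

// Hypothetical helper: number of bytes one code point occupies in UTF-8.
std::size_t utf8_size_of_code_point(char32_t cp) {
    if (cp < 0x80) return 1;     // ASCII
    if (cp < 0x800) return 2;    // e.g. Hebrew letters
    if (cp < 0x10000) return 3;  // rest of the Basic Multilingual Plane
    return 4;                    // supplementary planes
}

// Hypothetical helper: total UTF-8 byte length of a UTF-32 string,
// not counting any terminating NUL.
std::size_t utf8_byte_length(const std::u32string& s) {
    std::size_t total = 0;
    for (char32_t cp : s) {
        total += utf8_size_of_code_point(cp);
    }
    return total;
}
```

For the string in the question, utf8_code_points gives 14, and 13 two-byte Hebrew letters plus one space come to 27 bytes, which matches the len and new_len values printed above. So the 27 is simply the UTF-8 byte count and has nothing to do with UTF-16. An input buffer of at least that many FriBidiChar elements, and an output buffer of utf8_byte_length(...) + 1 chars (the extra byte on the assumption that fribidi_unicode_to_charset writes a terminating NUL), would be sized to fit instead of relying on a large constant.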