How to get a character's position in a text file encoded in UTF-8 in C?


The C Standard specifies that ftell() returns the position of a character from the beginning of the file when it's opened in binary mode.

... obtains the current value of the file position indicator for the stream pointed to by stream. For a binary stream, the value is the number of characters from the beginning of the file. For a text stream, its file position indicator contains unspecified information, usable by the fseek function for returning the file position indicator for the stream to its position at the time of the ftell call; the difference between two such return values is not necessarily a meaningful measure of the number of characters written or read.

If the text file contains a multibyte character, such as ñ, then the position reported for any character after the ñ is greater than its column in the text file. To be specific, what I mean by position here is the corresponding column if one read the text file as a linear sequence of symbols.

For example, the string " ñ ñññ a ñ a" has 12 characters, but printing ftell() inside this loop:

void printPosition(FILE *file){
    
    int c;
    long i;
    while((c=fgetc(file)) != EOF){
        i = ftell(file);
        printf("%c %ld\n", c, i);  /* i is a long, so it needs %ld rather than %i */
    }
}

gives the output:

  1
├ 2
▒ 3
  4
├ 5
▒ 6
├ 7
▒ 8
├ 9
▒ 10
  11
a 12
  13
├ 14
▒ 15
  16
a 17

I tried opening in text/binary read mode and got the same result for both.

There is 1 answer

KamilCuk On

If your platform supports a UTF-8 compatible locale, you can use the wide-character functions to read the file one wide character at a time.

#include <stdio.h>
#include <wchar.h>
#include <locale.h>
#include <errno.h>
#include <string.h>
#include <stdlib.h>
#include <assert.h>
int main() {
    {
        char *r = setlocale(LC_ALL, "C.UTF-8");
        if (!r) {
            // setlocale() is not specified to set errno, so report the failure directly.
            fprintf(stderr, "Could not setlocale to UTF-8\n");
            return EXIT_FAILURE;
        }
    }
    // Create a temporary file with the requested content in your question.
    {
        const char str[] = " ñ ñññ a ñ a";
        FILE *f = fopen("/tmp/temp", "w");
        assert(f);
        size_t n = fwrite(str, 1, strlen(str), f);  // fwrite returns size_t
        assert(n == strlen(str));
        int r;
        r = fclose(f);
        assert(r == 0);
    }
    // Read the file using wide characters.
    {
        FILE *f = fopen("/tmp/temp", "r");
        assert(f);
        unsigned counter = 1;
        wint_t c;
        while ((c = fgetwc(f)) != WEOF) {
            printf("Character %lc at char position %u ftell=%ld\n", c, counter, (long)ftell(f));
            counter++;
        }
        int r = fclose(f);
        assert(r == 0);
    }
}

Running the program on godbolt ( https://godbolt.org/z/cdbrGKPss ) gives this output:

Character   at char position 1 ftell=1
Character ñ at char position 2 ftell=3
Character   at char position 3 ftell=4
Character ñ at char position 4 ftell=6
Character ñ at char position 5 ftell=8
Character ñ at char position 6 ftell=10
Character   at char position 7 ftell=11
Character a at char position 8 ftell=12
Character   at char position 9 ftell=13
Character ñ at char position 10 ftell=15
Character   at char position 11 ftell=16
Character a at char position 12 ftell=17
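
If no UTF-8 locale is available, a different technique (not the wide-character approach above) is to count code points directly from the bytes: in UTF-8, every byte of the form 10xxxxxx is a continuation byte, so any other byte starts a new character. The following is only a minimal sketch under the assumption that the input is well-formed UTF-8; the names is_continuation and print_char_positions are hypothetical helpers made up for illustration.

```c
#include <stdio.h>

/* Nonzero if byte b is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
static int is_continuation(int b) {
    return (b & 0xC0) == 0x80;
}

/* Reads `file` byte by byte. For each code point, prints its 1-based
   character position and the 0-based byte offset at which it starts.
   Assumes well-formed UTF-8 (no validation is performed).
   Returns the number of code points seen. */
long print_char_positions(FILE *file) {
    long chars = 0;   /* code points seen so far */
    long offset = 0;  /* bytes consumed so far */
    int c;
    while ((c = fgetc(file)) != EOF) {
        if (!is_continuation(c)) {
            chars++;
            printf("char %ld starts at byte %ld\n", chars, offset);
        }
        offset++;
    }
    return chars;
}
```

Called on the 17-byte sample file from the question, this function returns 12. The first ñ is reported as character 2 starting at byte 1, and the space after it as character 3 starting at byte 3.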

"if one read the text file as a linear sequence of symbols."

I am not sure that "linear sequence of symbols" makes sense in Unicode. The required reading on Unicode is The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). You might be interested in the libunistring and ICU libraries for Unicode handling in C.
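
To illustrate why a "symbol" is ambiguous: ñ can be stored as the single code point U+00F1, or as the letter n followed by the combining tilde U+0303. Both display as one symbol, but they contain a different number of code points. The sketch below assumes well-formed UTF-8 input, and utf8_code_points is a hypothetical helper name, not part of any library.

```c
#include <stddef.h>

/* Counts code points in a NUL-terminated UTF-8 string by skipping
   continuation bytes (10xxxxxx). Assumes well-formed UTF-8. */
size_t utf8_code_points(const char *s) {
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;  /* this byte starts a new code point */
    }
    return n;
}
```

With this helper, utf8_code_points("\xC3\xB1") is 1 (precomposed ñ), while utf8_code_points("n\xCC\x83") is 2 (n plus combining tilde), even though both render as the same visible character. Counting what a reader perceives as one symbol requires grapheme cluster segmentation, which is the kind of task the libraries mentioned above address.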