Determining memory allocation size for string input in C (scanf)


I want to get a string input from a user in C.

I know how to use scanf to get string inputs. I know how to use malloc and realloc to allocate space for a variable.

The problem is when the user enters a string, I don't necessarily know what size that will be that I will need to reserve space for.

For instance, if they write James I'd need to malloc(sizeof(char) * 5), but they might have written Bond, in which case I would only have had to malloc(sizeof(char) * 4).

Is it just the case that I should be sure to allocate enough space beforehand (e.g. malloc(sizeof(char) * 100))?

And then does scanf do any realloc trimming under the hood or is that a memory leak for me to fix?

There are 3 answers

0
chqrlie On

There are multiple approaches to this problem:

  • use an arbitrary maximum length, read the input into a local array and allocate memory based on actual input:

    #include <stdio.h>
    #include <string.h>
    
    char *readstr(void) {
        char buf[100];
        if (scanf("%99s", buf) == 1)
            return strdup(buf);
        else
            return NULL;
    }
    
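  • use library extensions that are not part of standard C, if supported and if allowed. For example, the GNU libc has an m modifier for exactly this purpose (it is also specified by POSIX.1-2008); the caller must free() the returned pointer: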

    #include <stdio.h>
    
    char *readstr(void) {
        char *p;
        if (scanf("%ms", &p) == 1)
            return p;
        else
            return NULL;
    }
    
  • read input one byte at a time and reallocate the destination array on demand. Here is a simplistic approach:

    #include <ctype.h>
    #include <stdio.h>
    #include <stdlib.h>
    
    char *readstr(void) {
        char *p = NULL;
        size_t i = 0;
        int c;
        while ((c = getchar()) != EOF) {
            if (isspace(c)) {
                if (i > 0) {
                    ungetc(c, stdin);
                    break;
                }
            } else {
                char *newp = realloc(p, i + 2);
                if (newp == NULL) {
                    free(p);
                    return NULL;
                }
                p = newp;
                p[i++] = c;
                p[i] = '\0';
            }
        }
        return p;
    }
    
0
David C. Rankin On

You are struggling with two misunderstandings. First, scanf() does not resize or otherwise modify the storage you give it (setting aside, for purposes of discussion, the non-standard "%a" modifier, later renamed "%m"). Second, you are forgetting to provide length + 1 characters of storage to leave room for the nul-terminating character.

In your statement "For instance if they write James I'd need to malloc(sizeof(char) * 5)": no, you would need malloc (6) to provide room for James\0. Note also that sizeof (char) is defined as 1 and should be omitted.
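To make the off-by-one concrete, here is a minimal sketch. The helper name copy_exact is made up for illustration; it simply allocates strlen() + 1 bytes, which for "James" is 6.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a heap copy of src that occupies exactly
 * strlen(src) + 1 bytes (the characters plus the terminating '\0').
 * Returns NULL if allocation fails; the caller frees the result. */
char *copy_exact (const char *src)
{
    size_t len = strlen (src);          /* "James" -> 5, '\0' is not counted */
    char *dst = malloc (len + 1);       /* +1 for the '\0': 6 bytes for "James" */

    if (dst)
        memcpy (dst, src, len + 1);     /* copy the characters and the '\0' */

    return dst;
}
```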

As to how to read a string, you generally want to avoid scanf(). Even when using scanf(), unless you are reading whitespace-separated words, you don't want the "%s" conversion specifier: it stops reading as soon as it encounters whitespace, making it impossible to read "James Bond". Further, you have the issue of what is left unread in stdin after your call to scanf().

When reading using "%s" the '\n' character is left in stdin unread. This is a pitfall that will bite you on your next attempted read if using an input function that does not ignore leading whitespace (that is, any character-oriented or line-oriented input function). These pitfalls, along with a host of others associated with scanf() use, are why new C programmers are encouraged to use fgets() to read user input.
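A small sketch of that leftover newline. It uses fscanf() on a FILE * instead of scanf() on stdin so it can be tried on any stream, and the helper name char_left_after_word is an assumption made for this example.

```c
#include <stdio.h>

/* Hypothetical helper: read one word from fp with "%s" (at most 63
 * characters, so word must point to at least 64 bytes), then return the
 * very next character still waiting in the stream.
 * Returns EOF if no word could be read. */
int char_left_after_word (FILE *fp, char *word)
{
    if (fscanf (fp, "%63s", word) != 1)
        return EOF;

    return fgetc (fp);      /* for the input "James\n" this is '\n' */
}
```

A line-oriented read made right after the "%s" conversion would see that '\n' first and return an empty line.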

With a sufficiently sized buffer (and if not, with a simple loop) fgets() will consume an entire line of input each time it is called, ensuring there is nothing left unread in that line. The only caveat is that fgets() reads and includes the trailing '\n' in the buffer it fills. You simply trim the trailing newline with a call to strcspn() (which also gives you the length of the string at the same time).
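The trim can be sketched as a tiny helper. The name trim_newline is made up here; the only library call is the standard strcspn().

```c
#include <string.h>

/* Hypothetical helper: overwrite the first '\n' in s (if any) with '\0'
 * and return the resulting length of the string. */
size_t trim_newline (char *s)
{
    size_t n = strcspn (s, "\n");   /* number of chars before the first '\n' */

    s[n] = '\0';                    /* if there was no '\n', s[n] is already '\0' */

    return n;
}
```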

As mentioned above, one approach to solve the "I don't know how many characters I have?" problem is to use a fixed-size buffer (character array) and then repeatedly call fgets() until the '\n' is found in the array. That way you can allocate final storage for the line based on the number of characters read into the fixed-size buffer. It doesn't matter if your fixed-size buffer is 10 and you have 100 characters to read; you simply call fgets() in a loop until the number of characters you read is less than a full fixed-size buffer's worth.

Now ideally, you would size your temporary fixed-size buffer so that your input fits the first time, eliminating the need to loop and reallocate, but if the cat steps on the keyboard -- you are covered.

Let's look at an example, similar in function to the CS50 get_string() function. It lets the caller supply a prompt for the user, then reads the input and allocates storage for the result, returning a pointer to the allocated block containing the string. The caller is then responsible for calling free() on it when done.

#define MAXC 1024       /* if you need a constant, #define one (or more) */

char *getstr (const char *prompt)
{
    char tmp[MAXC], *s = NULL;                      /* fixed size buf, ptr to allocate */
    size_t n = 0, used = 0;                         /* length and total length */
    
    if (prompt)                                     /* prompt if not NULL */
        fputs (prompt, stdout);
    
    while (1) { /* loop continually */
        if (!fgets (tmp, sizeof tmp, stdin))        /* read into tmp */
            return s;
        tmp[(n = strcspn (tmp, "\n"))] = 0;         /* trim \n, save length */
        if (!n)                                     /* if empty-str, break */
            break;
        void *tmpptr = realloc (s, used + n + 1);   /* always realloc to temp pointer */
        if (!tmpptr) {                              /* validate every allocation */
            perror ("realloc-getstr()");
            return s;
        }
        s = tmpptr;                                 /* assign new block to s */
        memcpy (s + used, tmp, n + 1);              /* copy tmp to s with \0 */
        used += n;                                  /* update total length */
        if (n + 1 < sizeof tmp)                     /* if tmp not full, break */
            break;
    }
    
    return s;       /* return allocated string, caller responsible for calling free */
}

Above, a fixed-size buffer of MAXC characters is used to read input from the user. A continual loop calls fgets() to read the input into the buffer tmp. The return of strcspn() is used as the index into tmp: it is the number of characters before the '\n' (the length of the input without the '\n'), so assigning 0 at that index nul-terminates the string there, overwriting the '\n' with the nul-terminating character '\0' (which is just plain old ASCII 0). The length is saved in n. If the line is empty after the removal of the '\n' there is nothing more to do and the function returns whatever is in s at that time.

If characters are present, a temporary pointer is used to realloc() storage for the new characters (+1). After validating that realloc() succeeded, the new characters are copied to the end of the storage and the total length of characters in the buffer is saved in used, which serves as the offset from the beginning of the string for the next copy. That repeats until you run out of characters to read, and the allocated block containing the string is returned (if no characters were input, NULL is returned).

(note: you may also want to pass a pointer to size_t as a parameter that can be updated to the final length before return to avoid having to calculate the length of the returned string again -- that is left to you)
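One way that variant might look, as a sketch only. It assumes a FILE * parameter in place of the prompt so it can read from any stream, and the names getstr_len and CHUNK are invented for this example.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK 1024      /* assumed temporary buffer size, same role as MAXC */

/* Sketch: read one line from fp into allocated storage and store its
 * length in *len (when len is not NULL). Returns NULL on EOF or on an
 * empty line; the caller is responsible for calling free. */
char *getstr_len (FILE *fp, size_t *len)
{
    char tmp[CHUNK], *s = NULL;
    size_t n = 0, used = 0;

    while (fgets (tmp, sizeof tmp, fp)) {           /* read into tmp */
        tmp[(n = strcspn (tmp, "\n"))] = 0;         /* trim \n, save length */
        if (!n)                                     /* empty piece, done */
            break;
        void *tmpptr = realloc (s, used + n + 1);   /* realloc to temp pointer */
        if (!tmpptr) {                              /* keep what was read so far */
            perror ("realloc-getstr_len()");
            break;
        }
        s = tmpptr;
        memcpy (s + used, tmp, n + 1);              /* copy tmp to s with \0 */
        used += n;                                  /* update total length */
        if (n + 1 < sizeof tmp)                     /* tmp not full, line complete */
            break;
    }

    if (len)
        *len = used;    /* final length, no strlen() needed by the caller */

    return s;
}
```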

Before looking at an example, let's add debug output to getstr() so it tells us how many characters were allocated in total. Just add the printf() below before the final return, e.g.

    }
    printf (" allocated: %zu\n", used?used+1:used); /* (debug output of alloc size) */
    
    return s;       /* return allocated string, caller responsible for calling free */
}

A short example that loops reading input until Enter is pressed on an empty line causing the program to exit after freeing all memory:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* insert getstr() function HERE */

int main (void) {
    
    for (;;) {
        char *s = getstr ("enter str: ");
        if (!s)
            break;
        puts (s);
        putchar ('\n');
        free (s);
    }
}

Example Use/Output

With MAXC at 1024 there isn't a chance of needing to loop unless the cat steps on the keyboard, so all input is read into tmp and then storage is allocated to exactly hold each input:

$ ./bin/fgetsstr
enter str: a
 allocated: 2
a

enter str: ab
 allocated: 3
ab

enter str: abc
 allocated: 4
abc

enter str: 123456789
 allocated: 10
123456789

enter str:
 allocated: 0

Setting MAXC at 2 or 10 is fine as well. The only thing that changes is the number of times you loop reallocating storage and copying the contents of the temporary buffer to your final storage. E.g. with MAXC at 10, the user wouldn't know the difference in:

$ ./bin/fgetsstr
enter str: 12345678
 allocated: 9
12345678

enter str: 123456789
 allocated: 10
123456789

enter str: 1234567890
 allocated: 11
1234567890

enter str: 12345678901234567890
 allocated: 21
12345678901234567890

enter str:
 allocated: 0

Above you have forced the while (1) loop to execute at least twice for each string of 9 characters or more, since a 10-byte buffer holds at most 9 characters plus the '\0' (the 20-character input takes three passes). So you want to set MAXC to some reasonable size to avoid looping, and a 1K buffer is fine considering you will have at minimum a 1M function stack on most x86 or x86_64 computers. You may want to reduce the size if you are programming for a micro-controller with limited storage.

While you could allocate for tmp as well, there really is no need, and using a fixed-size buffer is about as simple as it gets for sticking with standard C. If you have POSIX available, then getline() already provides auto-allocation for any size input you have. That is another good alternative to fgets(), but POSIX is not standard C (though it is widely available).
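For comparison, a sketch of the getline() route, assuming a POSIX.1-2008 system. The wrapper name getstr_posix and the FILE * parameter are choices made for this example; getline() itself is the POSIX function.

```c
#define _POSIX_C_SOURCE 200809L     /* ask for the POSIX.1-2008 getline() declaration */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>              /* ssize_t */

/* Sketch: read one line from fp with POSIX getline(), strip the trailing
 * '\n' and return the allocated string (caller frees).
 * Returns NULL on EOF or error. */
char *getstr_posix (FILE *fp)
{
    char *line = NULL;      /* getline() allocates when *lineptr is NULL */
    size_t cap = 0;         /* size of the allocation, managed by getline() */
    ssize_t nread = getline (&line, &cap, fp);

    if (nread == -1) {      /* EOF or error: the buffer must still be freed */
        free (line);
        return NULL;
    }
    if (nread > 0 && line[nread - 1] == '\n')
        line[nread - 1] = '\0';     /* trim the '\n' that getline() keeps */

    return line;
}
```

Note that the block getline() hands back may be larger than the string it holds, so this trades exact sizing for convenience.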

Another good alternative is simply looping with getchar(), reading a character at a time until the '\n' or EOF is reached. Here you just allocate some initial size for s, say 2 or 8, keep track of the number of characters used, double the size of the allocation when used == allocated, and keep going. You want to allocate in blocks rather than realloc() for every character added (we will omit the discussion of why that is less true today with a mmaped malloc() than it was in the past).
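A rough sketch of that character-at-a-time approach. It reads with fgetc() from a FILE * (getchar() is the same thing on stdin), starts at an assumed 8 bytes, and doubles one character early so there is always room for the '\0'. The name getstr_grow is made up, and unlike getstr() above it returns an empty string, not NULL, for an empty line.

```c
#include <stdio.h>
#include <stdlib.h>

/* Sketch: read up to '\n' or EOF from fp, doubling the allocation as
 * needed. Returns NULL only if nothing could be read before EOF or if
 * the initial allocation fails; the caller frees the result. */
char *getstr_grow (FILE *fp)
{
    size_t used = 0, allocated = 8;     /* assumed initial size */
    char *s = malloc (allocated);
    int c;

    if (!s)
        return NULL;

    while ((c = fgetc (fp)) != EOF && c != '\n') {
        if (used + 1 == allocated) {    /* full, keeping one byte for '\0' */
            void *tmp = realloc (s, allocated * 2);
            if (!tmp)                   /* on failure return what fits */
                break;
            s = tmp;
            allocated *= 2;
        }
        s[used++] = (char)c;
    }

    if (used == 0 && c == EOF) {        /* nothing read before EOF */
        free (s);
        return NULL;
    }
    s[used] = '\0';

    return s;
}
```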

Look things over and let me know if you have further questions.

1
Orion the Constellation On

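I personally use the malloc approach, but you need to mind one more thing: limit the number of characters accepted by %s in the scanf call to match your buffer. The field width does not count the terminating '\0', so it must be one less than the buffer size (99 for a 100-byte buffer).

char *string = malloc (100);    /* room for 99 characters + '\0' */
if (string != NULL)
    scanf ("%99s", string);     /* width is buffer size - 1 */

scanf does not shrink the allocation for you. You can reallocate the memory yourself after getting the string length with strlen and adding 1 for the terminator.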
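Put together, the read-then-shrink idea might look like the sketch below. It reads with fscanf() from a FILE * so it works on any stream, and the helper name read_word_trimmed is an assumption for this example.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: read one whitespace-delimited word (at most 99
 * characters) from fp into a 100-byte block, then shrink the block to
 * strlen + 1 bytes. Returns NULL on failure; the caller frees the result. */
char *read_word_trimmed (FILE *fp)
{
    char *string = malloc (100);        /* 99 characters + '\0' */

    if (!string)
        return NULL;

    if (fscanf (fp, "%99s", string) != 1) {     /* nothing read */
        free (string);
        return NULL;
    }

    char *fit = realloc (string, strlen (string) + 1);  /* trim to fit */

    return fit ? fit : string;  /* if shrinking fails, the original block is still valid */
}
```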