mkstemp() implementation for Windows to write temporary files

1.4k views Asked by At

I want to create temporary files in given temporary directory path in Windows through C++. mktemp() does the required job, but it only creates 26 unique files. mkstemp() is working fine in Linux but it is not there in Windows. So please help me to use mkstemp() functionality in Windows or suggest an alternative?

2

There are 2 answers

0
MSalters On

_mktemp (the MSVC name) will replace X with a letter, which is why you can get only 26 different names. There's also _tempnam which uses numbers instead. It should support up to 4 billion files.

0
GTB On

I've dusted off my old RIG (Reusable Interface Glue) library because I used to write OS abstraction layers many years ago; here is a performant, OS-agnostic implementation of mkostemps(), which is like mkstemp() but with optional filename suffix and optional, additional open flags, i.e.

inline int mkstemp(char *pathTmpl) {return rig::mkostemps(pathTmpl, 0, 0);}

The implementation of mkstemp() on Linux usually substitutes the 6 "template" X characters with case-sensitive, alphanumeric characters (so 2 * 26 + 10 = 62 values), and such implementations are sometimes used on Windows also. However, though Windows filenames are nowadays case-preserving, unique names are typically case-insensitive, so such algorithms wastefully attempt duplicate filenames (differing only in upper/lower case). My approach uses a-z and 0-9 for 36**6 possible filenames (which is 2**31 plus about 29 million, i.e. nearly 2.2 billion possible filenames).

About 20 years ago, I came up with a technique to generate a deterministic (though somewhat random looking) sequence of numbers that never repeats until every number in the range has been output. Years later, I stumbled upon similar code in pseudo-random-number generators, where it was dubbed a Weyl sequence. I've therefore styled my XOR-enhanced Weyl sequences X-Weyl sequences. The basic idea is to repeatedly add an odd number to an index into a power-of-2 sized range, with wraparound (i.e. modulus), and XOR the outputs with another randomly-selected constant to make the output appear less predictable.

The following code uses 2 X-Weyl sequences: one to iterate over the range [0, 2**31) for generating random filenames, and one to iterate over [0, 5] – actually 0 to 7, with 6 and 7 skipped — to "randomly" re-order the characters within the filename. I use std::random_device to obtain (hopefully) random entropy to "seed" the X-Weyl sequences; this is good for Windows and most Linux systems, but just in case it's ever built on a system with deterministic random_device output, I threw in a call to time() to at least ensure unique starting conditions when run repeatedly.

// Excerted from RIG - Reusable Interface Glue, Copyright (c) 1995 - 2023, GTB.
// Portions below are freely redistributable, with no warranties.
#include <cinttypes>
#include <climits>
#include <cstring>
#include <errno.h>
#include <time.h>
#include <fcntl.h>
#if defined(_WIN32)
  #include <io.h>
  #define TMP_OPEN_MODE_    (_S_IREAD | _S_IWRITE)
#else
  #include <unistd.h>
  #define TMP_OPEN_MODE_    (S_IRUSR | S_IWUSR)
#endif
#include <random>

// Number of elements in an array:
#define rig_COUNT_OF(_array)            (sizeof(_array) / sizeof((_array)[0]))
// Create a bit mask of the specified number of (least-significant) bits. Must supply a positive
// (non-zero) number of bits, and gives less-efficient code if not a constant.
#define rig_BIT_MASK_EX(_type, _count)                                      \
    ((_type)((_type)((((_type)1 << ((_count) - 1)) - 1) << 1) | (_type)0x1))

namespace rig
{
    // A Weyl sequencer with an XOR mask to optionally "twist" the ordering to be less linear:
    // A Weyl sequence produces every unique value in 2 to the N exactly once before repeating.
    template <typename UTYPE_, unsigned N_BITS_ = CHAR_BIT * sizeof(UTYPE_)> class XWeylSequence
    {
        UTYPE_  prevVal_, delta_, xor_;
        static UTYPE_ Val_(UTYPE_ val)  {return val & rig_BIT_MASK_EX(UTYPE_, N_BITS_);}
      public:
        XWeylSequence(UTYPE_ seedValue, UTYPE_ sequenceStep, UTYPE_ xorMask = 0) :
            prevVal_(Val_(seedValue)), delta_(Val_(sequenceStep) | 0x1), xor_(Val_(xorMask))  {}
        inline void SetSeed(UTYPE_ seedValue) {prevVal_ = Val_(seedValue);}
        inline UTYPE_ Next()  {return Val_((prevVal_ += delta_)) ^ xor_;}
    };
    
    class RandomSeed
    {
        std::random_device  rng_;
        unsigned int        xtra_;  // In case random_device is only pseudo-random
        inline uint32_t Entropy_() {return (uint32_t)(rng_() | xtra_);}
      public:
        RandomSeed() : xtra_((unsigned int)time(nullptr))  {}
        inline uint32_t operator()()
            {return Entropy_() ^ ((sizeof(unsigned) < 4) ? Entropy_() << (CHAR_BIT * 2) : 0);}
    };
    
    int mkostemps(char *pathTmpl, int sfxLen = 0, int oflags = 0)
    {   // Validate input parameters:
        static const char XPatt[] = "XXXXXX";
        char *tmplP = &pathTmpl[pathTmpl ? strlen(pathTmpl) : 0] - sfxLen - (sizeof(XPatt) - 1);
        if ((sfxLen < 0) || (tmplP < pathTmpl) || memcmp(tmplP, XPatt, sizeof(XPatt) - 1))
        {
            errno = EINVAL;
            return -1;
        }
        
        // Each X is replaced with one of 36 values (case-INSENSITIVE alphanumeric characters),
        // giving 36**6 possible values (slightly more than 2**31).
        static const char encodingSet[36 + 1] = "abcdefghijkLmnopqrstuvwxyz0123456789";
        RandomSeed rng;
        uint32_t r[4];
        for (unsigned int idx = 0; idx < rig_COUNT_OF(r); ++idx)
            r[idx] = rng();
        XWeylSequence<uint32_t, 31> seq(r[3], r[2], r[1]);             // Step thru 2**31 values
        XWeylSequence<uint8_t, 3> out(r[0], r[0] >> (6 - 1), r[0] >> 3); //Step thru out indices
        uint32_t baseOffs = r[0] >> 8;  // Capture most of the gap between 36**6 and 2**31
        unsigned long tryLimit = std::max(
      #ifdef TMP_MAX
                                           (unsigned long)TMP_MAX +
      #endif
                                           0ul, 1ul << 18);     // (Linux tries < 2**18 times)
        do  // Follow a randomly-selected X-Weyl sequence until it produces a unique filename:
        {
            uint32_t rv = seq.Next() + baseOffs;
            for (unsigned cnt = 8; cnt; --cnt)
            {   // Randomly-selected order of output indices:
                unsigned idx = out.Next();
                if (idx < 6)    // Operating on [0-7], but only [0-5] are valid indices
                {
                    tmplP[idx] = encodingSet[rv % 36];
                    rv /= 36;
                }
            }
            // Try to create the file:
            int fd = open(pathTmpl, oflags | O_CREAT | O_EXCL | O_RDWR, TMP_OPEN_MODE_);
            if ((fd >= 0) || (errno != EEXIST))
                return fd;
        } while (--tryLimit);  // Retry so long as error EEXIST is returned, until limit reached
        return -1;
    }
} // end namespace rig

I just spent a little time instrumenting the code because I was curious how good it is. I quickly realized testing billions of filenames with actual files on disk is prohibitively slow, due to operating-system search for existing filenames. For instrumentation purposes, I ran it with a 512 MB bitmap marking used filenames so I could see how quickly this code finds unique names; I ran it until it gave up with 86% of the possible filename variations consumed after 5 hours. [Note that with real files involved, I had only reached 210 million files after running overnight, so it would take continuous file creation for well over a week to actually exhaust the unique filename supply, assuming the OS didn't slow to a crawl as the directory size exploded.]

At the 210 million file mark, a histogram of how many tries it took to generate a unique filename looked like this:

After 210501632 temp files created:
1:  197340597
2:  12211367
3:  873750
4:  69692
5:  5643
6:  542
7:  36
8:  5

So after getting well underway, 99.5% of the time a unique filename can be generated in just 2 tries. With about a billion filenames created:

After 1000079360 temp files created:
1:      650706306
2:      206972880
3:      79540728
4:      33792156
5:      14654019
6:      7133227
7:      3483588
8:      1776113
9:      912556
10:     494348
11:     269506
12:     149952
13:     81034
14:     46228
15:     25989
16:     15457
17:     9362
18:     5900
19:     3532
20:     2212
21:     1351
22:     887
23:     566
24:     355
25:     258
26:     195
27:     143
28:     97
29:     63
30:     52
31:     49
32:     41
33:     38
34:     22
35:     22
36:     16
37:     14
38:     16
39:     11
40:     9
41:     10
42:     7
43:     8
44:     1
45:     7
46:     3
48:     1
49:     2
50:     3
51:     1
52:     2
53:     2
54:     1
55:     4
56:     1
57:     1
60:     1
61:     1
62:     1
64:     1
66:     1
83:     1
85:     1
125:    1

So ~99% of unique filenames took no more than 5 tries with a billion unique filenames! Even for the long tail (where once it even required 85 tries, and another time 125), the filename generation is still lightning fast (I noted something like 130,000 unique names/second on a debug build); virtually all the overhead is just the open() call to attempt to create a unique file.