How can I write code to download a file in parallel?


I would like to download a file in parallel: for example, if the file size is 54 kB, I would like to download its contents in blocks of 10 kB.

In addition, I want no more than 5 requests in flight at once. But how? I thought of using fork(), but I don't really understand how to apply it here.

1-10 kB: first request
11-20 kB: second request
21-30 kB: third request
31-40 kB: fourth request
41-50 kB: fifth request

51-54 kB: waits until one of the five requests above ends, then it executes.

I don't care about the method used to get the data (recv() etc.); I just want to know how to implement the concurrency (ideally with fork()).


1 Answer

autistic

There are some readily available software libraries which will provide this functionality. The main one I can think of is curl. You can find an easy introduction to the curl multi library here.

It's usually best to avoid reinventing the wheel unless you have a very good reason (such as improving the world of technology, or for academic research).

For the sake of academic research, and since a "link-only" answer wouldn't suffice, I'll elaborate on one of the many possible ways you could go about multiplexing sockets.


Non-blocking sockets

The first, and currently most portable, method is to use non-blocking sockets and/or non-blocking socket calls. However, it's important to realise (especially when using non-blocking socket calls, as opposed to setting O_NONBLOCK on the file descriptor) that some things will still block. For example, you can't get connect to return immediately unless you set the file descriptor to non-blocking mode, and getaddrinfo (and similar standard name-resolution functions) will, of course, block too.

When you use non-blocking descriptors or non-blocking calls, the functions return immediately. If no data is ready, they indicate this through their return values; if data is ready to be processed, that shows through the return value as well.

There are two ways (that I know of) to ensure non-blocking socket calls (including connect).

  1. For Unix-like systems, call fcntl(socket_fd, F_SETFL, fcntl(socket_fd, F_GETFL, 0) | O_NONBLOCK). Following that, all calls to read, write, accept and connect on that descriptor will return immediately, without delay. connect has a few error codes (in errno) specifically for this, such as EALREADY, EINPROGRESS, EISCONN and EWOULDBLOCK, which you'll want to check for: once non-blocking mode is enabled, some error return values are actually success return values in disguise, so you need to check errno.
  2. For Windows systems, call ioctlsocket(socket_fd, FIONBIO, (u_long[]){1}). The same semantics apply as described above, except that the error codes won't be errno codes (they'll be WSAGetLastError() codes instead) and they have different values. Many of them have similar names, however, so in my projects I usually use something like:

    #ifdef _WIN32
    #define set_nonblock(fd) (ioctlsocket(fd, FIONBIO, (u_long[]){1}) == 0)
    #define EAGAIN           WSAEWOULDBLOCK
    #define EWOULDBLOCK      WSAEWOULDBLOCK
    #define EISCONN          WSAEISCONN
    #define EINPROGRESS      WSAEINPROGRESS
    #define EINVAL           WSAEINVAL
    #define EALREADY         WSAEALREADY
    #else
    #define set_nonblock(fd) (fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK) != -1)
    #endif
    

Not worthy of mention

Using non-blocking sockets alone, I wouldn't be surprised at all if you managed to maintain several thousand connections with a single thread, on a variety of systems, with little tweaking necessary. However, this model is not ideal: you need a busy loop that cycles through each socket, testing it for events on every pass, rather than having the OS wake your code when an event arrives.

We know that in order to deliver events to the application, the kernel needs time to process them, so we can give it some using sleep(0); as a quick fix, for example. That alone will see CPU use drop from near 100% to under 10%. However, there is another way to multiplex numerous (blocking or not) sockets: a time-out-interrupted function such as select, which returns immediately when data is available on any of the sockets, or waits until the time-out expires.

select has strong benefits, but there are drawbacks too; namely, the sets are typically restricted to a low number of sockets (FD_SETSIZE, commonly 1024 on Unix-like systems and 64 on Windows), so to support larger numbers you'll need a loop within a loop, as the limit runs out quickly. Additionally, it doesn't solve the connect blocking problem (whereas the O_NONBLOCK and FIONBIO methods do).

Thus, I'm not going to talk any more about select; I'll describe the other options available to you. Another example with similar limitations is poll; I won't talk about that, either. If you want to know about that, there's plenty on the internet about it...


Note that everything from this point on is quite non-portable (though you might find ways to wrap them all into a common interface, like curl multi does).


Asynchronous socket calls will begin a connection (or other operation), then return immediately like the non-blocking socket calls, except that they'll also raise a signal or call a function you specify when the operation completes. This puts the OS in control of notifying your code when events arrive, rather than your code repeatedly asking the OS. It should be clear that asynchronous sockets are ideal as far as optimisation goes, but they're not portable. There are various options per OS (for example, POSIX AIO on Unix-like systems, and overlapped I/O with completion routines or I/O completion ports on Windows).

All of these have something in common: they call a function (or raise a signal, which you could translate into a function call) upon success or failure. Beyond that, however, their interfaces aren't particularly similar to one another.

Typically, I haven't bothered writing any kind of wrapper for them, as I find the non-blocking sockets I mentioned at the beginning of this answer more than adequate nowadays. What's important is that I don't need to port it to every system, because... I'm too lazy for that! I'll only optimise for a system when someone shows me it's slow on that system. Otherwise we end up digging ourselves into supporting a torrent of systems people might never even run our software on...