How it is now
I currently have a script running under Windows that frequently retrieves recursive file trees from a list of servers.
I use an AutoIt (job manager) script to execute 30 parallel instances of lftp (still on Windows), each doing this:
lftp -e "find .; exit" <serveraddr>
The file used as input for the job manager is a plain text file and each line is formatted like this:
<serveraddr>|...
where "..." is unimportant data. I need to run multiple instances of lftp in order to achieve maximum performance, because single instance performance is determined by the response time of the server.
Each lftp.exe instance pipes its output to a file named
<serveraddr>.txt
How it needs to be
Now I need to port this whole thing over to a dedicated Linux server (Ubuntu, with lftp installed). From my previous, very(!) limited experience with Linux, I guess this will be quite simple.
What do I need to write, and with what? For example, do I still need a job manager script, or can this be done in a single script? How do I read from the file (I guess this will be the easy part), and how do I keep a maximum of 30 instances running (maybe even with a timeout, because extremely unresponsive servers can clog the queue)?
Thanks!
Parallel processing
I'd use GNU parallel. It isn't installed by default, but it is available from the default package repositories of most Linux distributions. It works like this:
parallel echo ::: arg1 arg2
will execute
echo arg1
and
echo arg2
in parallel.
So the easiest approach is to create a script that handles a single server in bash/perl/python, whatever suits your fancy, and execute it like this:
parallel ./script ::: server1 server2
The script could look like this:
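The script itself did not survive in this copy. A minimal sketch of what it might look like, assuming it is called with the server address as its only argument and reuses the lftp command from the question. The function name list_server is made up for illustration, and the $server-files.txt output name is taken from the caveat at the end of this answer:

```shell
#!/usr/bin/env bash
# Sketch of ./script: list one server's file tree into its own file.
# list_server is a hypothetical helper name, not part of lftp.

list_server() {
    local server="$1"
    # Same lftp command as on Windows; one output file per server
    # so that parallel runs never write to the same file.
    lftp -e "find .; exit" "$server" > "${server}-files.txt"
}

# Only run when a server address was given on the command line,
# e.g. ./script ftp.example.com
if [ "$#" -gt 0 ]; then
    list_server "$1"
fi
```

Remember to make it executable (chmod +x script) before handing it to parallel.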
lftp is available for Linux as well, so you don't need to change the FTP client.
To run a maximum of 30 instances at a time, pass -j30, like this:
parallel -j30 echo ::: 1 2 3
Reading the file list
Now how do you transform a specification file containing <server>|... entries into GNU parallel arguments? Easy: first, filter the file with sed so that it contains just the host names.
sed is used to replace things using regular expressions, and more. The substitution strips everything (.*) from the first | up to the line end ($). (While | normally means the alternation operator in regular expressions, in sed it needs to be escaped to work like that; otherwise it is just a plain |.)
So now you have a list of servers. How do you pass them to your script? With xargs!
xargs appends each line of its input (more precisely, each whitespace-separated item) as an additional argument to your executable.
So in your case you should do
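The command itself is missing from this copy. A minimal sketch of the pipeline described above, assuming the input file is called servers.txt and the per-server script is ./script (both names are placeholders, as are the function names server_names and run_all):

```shell
#!/usr/bin/env bash
# Sketch only: servers.txt and ./script are placeholder names.

# Strip everything from the first "|" to the end of each line,
# leaving one server address per line.
server_names() {
    sed 's/|.*$//' "$1"
}

# xargs appends the server names after ":::" so that GNU parallel
# starts ./script once per server, at most 30 at a time.
run_all() {
    server_names "$1" | xargs parallel -j30 ./script :::
}

# Example: run_all servers.txt
```

GNU parallel can also read its arguments from standard input, so piping the sed output straight into parallel -j30 ./script should behave the same way without xargs.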
Caveats
Be sure not to save the results to the same file in each parallel task, otherwise the file will get corrupted: coreutils are simple and don't implement any locking mechanisms unless you implement them yourself. That's why I redirected the output to $server-files.txt rather than files.txt.