LSF serial jobs on an HPC cluster perform worse than local sequential execution

I'm learning how to use HPC on our lab's cluster, which uses LSF. I tried a simple set of serial jobs, each of which counts the word frequencies in a text file. I wrote a Python script for the counting, named count_word_freq.py, and a job file named myjob.job as follows:

python count_word_freq.py --in ~/books/1.txt --out ~/freqs/freq1.txt
python count_word_freq.py --in ~/books/2.txt --out ~/freqs/freq2.txt
python count_word_freq.py --in ~/books/3.txt --out ~/freqs/freq3.txt
python count_word_freq.py --in ~/books/4.txt --out ~/freqs/freq4.txt
python count_word_freq.py --in ~/books/5.txt --out ~/freqs/freq5.txt

and an LSF script to submit the serial jobs to the self-scheduler:

#!/bin/bash
#BSUB -J test01
#BSUB -P acc_pandeg01a
#BSUB -q alloc
#BSUB -W 20
#BSUB -n 20
#BSUB -m manda
#BSUB -o %J.stdout
#BSUB -eo %J.stderr
#BSUB -L /bin/bash
module load python
module load py_packages
module load selfsched
# And run the program; output will be on stdout
mpirun selfsched < myjobs001.jobs

The Python code is as follows:

def readBookAsFreqDict(infile):
    dic = {}
    with open(infile,"r") as file:
        for line in file:
            contents = line.split(" ")
            for cont in contents:
                if str.isalpha(cont):
                    if cont not in dic.keys():
                        dic[cont] = 1
                    else:
                        dic[cont] = dic[cont] + 1
    return dic

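import argparse
import time as T

if __name__ == "__main__":
    start = T.time()
    parser = argparse.ArgumentParser()
    # The job file passes --in/--out; "in" is a Python keyword,
    # so the values are stored under explicit dest names.
    parser.add_argument('--in', dest='infile', type=str, help='input file')
    parser.add_argument('--out', dest='outfile', type=str, help='output file')
    args = parser.parse_args()
    dic = readBookAsFreqDict(args.infile)
    # Write one "word:count" pair per line; the with-block closes the file.
    with open(args.outfile, "w") as outfile:
        for key, freq in dic.items():
            outfile.write(key + ":" + str(freq) + "\n")
    end = T.time()
    # Elapsed wall-clock time in seconds for this single file.
    print(end - start)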
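
For comparison, here is a minimal sketch of the same counting logic using `collections.Counter` from the standard library. It assumes Python 3; the names `count_words` and `write_sample` and the sample text are hypothetical and not part of the original script. It keeps the original rule of splitting on single spaces and counting only purely alphabetic tokens.

```python
import os
import tempfile
from collections import Counter


def count_words(infile):
    """Count purely alphabetic, space-separated tokens in a text file."""
    counts = Counter()
    with open(infile, "r") as handle:
        for line in handle:
            # Same tokenization as the original: split on single spaces,
            # keep only tokens made entirely of letters.
            counts.update(tok for tok in line.split(" ") if tok.isalpha())
    return counts


def write_sample(text):
    """Write text to a temporary file and return its path (demo helper)."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    return path


# Hypothetical two-line sample standing in for one of the book files.
sample_path = write_sample("the cat and the hat\nthe end")
freqs = count_words(sample_path)
os.remove(sample_path)
```

As in the original function, the last word on each line still carries its trailing newline, so `isalpha()` rejects it; in the sample above, `hat` is not counted for that reason.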

The 5 input texts are all about the same size, around 3.5 MB. The CPU time reported for this serial job is 980 s, which is worse than running the five commands sequentially.

To my understanding, the self-scheduler automatically assigns the 5 jobs to free nodes, which should take less time than running them sequentially. Is the slowdown because the execution time of each job is too short compared to the time needed to find a free node? Are there any other approaches that could make this faster?
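
The dispatch pattern described above can be sketched locally as a work queue in which each idle worker pulls the next job until none remain. This is only an illustration under assumptions: it is not how `selfsched` is implemented, and the names `run_self_scheduled` and `fake_job` are hypothetical. Threads stand in for the MPI ranks, and a short sleep stands in for one `count_word_freq.py` run.

```python
import queue
import threading
import time


def run_self_scheduled(jobs, n_workers, run_job):
    """Run every job once; each idle worker pulls the next pending job."""
    pending = queue.Queue()
    for job in jobs:
        pending.put(job)

    results = {}
    lock = threading.Lock()

    def worker(worker_id):
        while True:
            try:
                job = pending.get_nowait()
            except queue.Empty:
                return  # no work left for this worker
            outcome = run_job(job)
            with lock:
                # Record which worker handled the job and what it returned.
                results[job] = (worker_id, outcome)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def fake_job(name):
    """Stand-in for one word-count run: sleep briefly, then report the name."""
    time.sleep(0.05)
    return "done:" + name


# Five jobs and twenty workers, mirroring the 5-line job file and "-n 20".
jobs = ["1.txt", "2.txt", "3.txt", "4.txt", "5.txt"]
start = time.time()
results = run_self_scheduled(jobs, n_workers=20, run_job=fake_job)
elapsed = time.time() - start
```

In this sketch, with 5 jobs and 20 workers, at most 5 workers ever receive a job; the others find the queue empty and exit immediately.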

Thank you!
