C++ libm-based program taking too much time on bare-metal Ubuntu Server 16 compared to VM Ubuntu Server 12

I am trying to run a math-intensive C++ program on Ubuntu Server, and surprisingly Ubuntu Server 16 running on a bare-metal Core i7 6700 is taking more time than a dual-core Ubuntu Server 12.04.5 running in a VM over Windows 10 on the same machine. It is totally surprising to see this result. I am using GCC version 5.4.1 on both. I also tried compiling with -Ofast and -ffast-math, but it didn't make any difference. I also tried fetching the latest GCC 7.2 on the bare metal, but again it didn't make any difference whatsoever. I also tried fetching the latest libm (glibc), with no difference in the numbers at all. Can someone please help me figure out where things are going wrong?

Also, running callgrind over the program (I am using a third-party .so library, so I have no control over it), I see most of the time being spent in libm. The only difference between the two environments other than the server version is the libm version: on the VM, which performed well, it is 2.15, and on the bare metal, which takes more time, it is 2.23. Any suggestions will be greatly appreciated. Thanks.
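
To rule out the freshly fetched glibc not actually being picked up at runtime, the process can print the glibc it loaded. A minimal standalone check, using glibc's gnu_get_libc_version():

#include <gnu/libc-version.h>
#include <cstdio>

int main() {
    // prints the version of the glibc (and hence libm) this process loaded, e.g. "2.23"
    std::printf("glibc %s\n", gnu_get_libc_version());
    return 0;
}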

The build command is:

g++ -std=c++14 -O3 -o scicomplintest EuroFutureOption_test.cpp -L. -lFEOption

The program calculates option greeks for a set of 22 strike prices using a library whose source code isn't available. However, I would be able to answer any questions w.r.t. the test code.

I have simplified the latency calculation using the class below:

typedef std::chrono::high_resolution_clock::time_point TimePoint;
typedef std::chrono::high_resolution_clock SteadyClock; // NB: high_resolution_clock is not guaranteed to be a steady clock

template <typename precision = std::chrono::microseconds>
class EventTimerWithPrecision
{
public:
    EventTimerWithPrecision() { _beg = SteadyClock::now(); }

    long long elapsed() {
        return std::chrono::duration_cast<precision>(SteadyClock::now() - _beg).count();
    }

    void reset() { _beg = SteadyClock::now(); }

private:
    TimePoint _beg;
};

typedef EventTimerWithPrecision<> EventTimer;
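
For clarity, this is how the timer is used (illustrative snippet only, not part of the test program):

EventTimer t;               // starts the clock at construction
// ... code under measurement ...
long long us = t.elapsed(); // microseconds since construction (or the last reset())
t.reset();                  // restart for the next measurement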

Now I am getting the times as below:

Ubuntu Server 12.04.5 on VM with dual core (over Windows 10):
siril@ubuntu:/media/sf_workshare/scicompeurofuturestest$ ./scicomplintest
Mean time: 61418 us
Min time: 44990 us
Max time: 79033 us

Ubuntu Server 16 on Core i7 6700 bare metal:
Mean time: 104888 us
Min time: 71015 us
Max time: 125928 us

On Windows 10 (MSVC 14) on Core i7 6700 bare metal:
D:\workshare\scicompeurofuturestest\x64\Release>scicompwintest.exe
Mean time: 53322 us
Min time: 39655 us
Max time: 64506 us

I can understand Windows 10 performing faster than Linux in a VM, but why is the bare-metal Ubuntu so slow?

Unable to reach any conclusion, I am pasting the whole test code below. Please help (I am really curious to know why it behaves this way).

#include <iostream>
#include <vector>
#include <numeric>
#include <algorithm>

#include "FEOption.h"
#include <chrono>

#define PRINT_VAL(x) std::cout << #x << " = " << (x) << std::endl

typedef std::chrono::high_resolution_clock::time_point TimePoint;
typedef std::chrono::high_resolution_clock SteadyClock;

template <typename precision = std::chrono::microseconds>
class EventTimerWithPrecision
{
public:
    EventTimerWithPrecision() { _beg = SteadyClock::now(); }

    long long elapsed() {
    return  std::chrono::duration_cast<precision>(SteadyClock::now() - _beg).count();
    }

    void reset() { _beg = SteadyClock::now(); }

private:
    TimePoint _beg;
};

typedef EventTimerWithPrecision<> EventTimer;

int main(){
    int cnt, nWarmup = 10, nTimer = 100000;
    double CompuTime;

    // Option Parameters
    // one entry per option: 6 expirations x 7 strikes (42 options)
    double Omega[] = {
        -1, -1, -1, 1, 1, 1, 1,
        -1, -1, -1, 1, 1, 1, 1,
        -1, -1, -1, 1, 1, 1, 1,
        -1, -1, -1, 1, 1, 1, 1,
        -1, -1, -1, 1, 1, 1, 1,
        -1, -1, -1, 1, 1, 1, 1
    };
    double Strike[] = {
        92.77434863, 95.12294245, 97.5309912,  100, 102.5315121, 105.1271096, 107.7884151,
        89.93652726, 93.17314234, 96.52623599, 100, 103.598777,  107.327066,  111.1895278,
        85.61884708, 90.16671558, 94.95615598, 100, 105.311761,  110.90567,   116.796714,
        80.28579206, 86.38250571, 92.9421894,  100, 107.5937641, 115.7641807, 124.5550395,
        76.41994703, 83.58682355, 91.4258298,  100, 109.3782799, 119.6360811, 130.8558876,
        73.30586976, 81.30036598, 90.16671558, 100, 110.90567,   123.0006763, 136.4147241
    };

    double Expiration[] = {
        7,   7,   7,   7,   7,   7,   7,
        14,  14,  14,  14,  14,  14,  14,
        30,  30,  30,  30,  30,  30,  30,
        60,  60,  60,  60,  60,  60,  60,
        90,  90,  90,  90,  90,  90,  90,
        120, 120, 120, 120, 120, 120, 120
    };

    int TradeDaysPerYr = 252;

    // Market Parameters
    double ValueDate = 0;
    double Future = 100;
    double annualSigma = 0.3;
    double annualIR = 0.05;

    // Numerical Parameters
    int GreekSwitch = 2;
    double annualSigmaBump = 0.01;
    double annualIRBump = 0.0001;
    double ValueDateBump = 1;

    double PV;
    double Delta;
    double Gamma;
    double Theta;
    double Vega;
    double Rho;

    sciStatus_t res;

    int nData = sizeof(Strike) / sizeof(double);
    std::vector<long long> v(nData);

    for (int i = 0; i < nData; i++)
    {

        // warm-up runs; the cnt*1.0e-16 term perturbs the input slightly on each call
        for (cnt = 0; cnt < nWarmup; ++cnt){
            res = EuroFutureOptionFuncC(annualIR, annualSigma, Omega[i], ValueDate, Expiration[i], Future, Strike[i], TradeDaysPerYr, annualIRBump + cnt*1.0e-16,
                annualSigmaBump, ValueDateBump, GreekSwitch,
                &PV,
                &Delta,
                &Gamma,
                &Theta,
                &Vega,
                &Rho
                );
            if (res != SCI_STATUS_SUCCESS) {
                std::cout << "Failure with error code " << res << std::endl;
                return -1;
            }
        }
        EventTimer sci;

        // timed runs
        for (cnt = 0; cnt < nTimer; ++cnt){
            res = EuroFutureOptionFuncC(annualIR, annualSigma, Omega[i], ValueDate, Expiration[i], Future, Strike[i], TradeDaysPerYr, annualIRBump + cnt*1.0e-16,
                annualSigmaBump, ValueDateBump, GreekSwitch,
                &PV,
                &Delta,
                &Gamma,
                &Theta,
                &Vega,
                &Rho
                );
            if (res != SCI_STATUS_SUCCESS) {
                std::cout << "Failure with error code " << res << std::endl;
                return -1;
            }
        }

        v[i] = sci.elapsed();
    }

    long long sum = std::accumulate(v.begin(), v.end(), 0LL); // 0LL: accumulate in long long, not int
    long long mean_t = sum / (long long)v.size();
    long long max_t = *std::max_element(v.begin(), v.end());
    long long min_t = *std::min_element(v.begin(), v.end());

    std::cout << "Mean time: " << mean_t << " us" << std::endl;
    std::cout << "Min time: " << min_t << " us" << std::endl;
    std::cout << "Max time: " << max_t << " us" << std::endl;
    std::cout << std::endl;

    PRINT_VAL(PV);
    PRINT_VAL(Delta);
    PRINT_VAL(Gamma);
    PRINT_VAL(Theta);
    PRINT_VAL(Vega);
    PRINT_VAL(Rho);

    return 0;
}

The callgrind graph is as follows: [callgrind graph image]

More updates: I tried -fopenacc and -fopenmp on both the bare-metal and the VM Ubuntu with the same g++ 7.2. The VM showed a little improvement, but the bare-metal Ubuntu shows the same numbers again and again. Also, since the majority of the time is spent in libm, is there any way to upgrade that library (glibc)? I don't see any new version of it in apt-cache though.
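
For reference, on Ubuntu libm.so.6 is shipped by the libc6 package, so the installed and candidate versions can be checked with:

apt-cache policy libc6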

I used callgrind and plotted a graph using dot. According to that, the program spends 42.27% of its time in libm's exp (version 2.23) and 15.18% in libm's log.
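
Since exp and log dominate, a small standalone microbenchmark (my own sketch, independent of the third-party library) can be compiled with the same g++ -std=c++14 -O3 on both machines to compare libm directly:

#include <chrono>
#include <cmath>
#include <cstdio>

int main() {
    const int N = 10000000;
    volatile double sink = 0.0; // volatile so the loop isn't optimized away
    auto beg = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i) {
        double x = 1.0 + i * 1.0e-9; // varying input defeats constant folding
        sink = std::exp(x) + std::log(x);
    }
    auto end = std::chrono::steady_clock::now();
    std::printf("%lld us for %d exp+log pairs\n",
        (long long)std::chrono::duration_cast<std::chrono::microseconds>(end - beg).count(), N);
    return 0;
}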

I finally found a similar post (so pasting it here for others): The program runs 3 times slower when compiled with g++ 5.3.1 than the same program compiled with g++ 4.8.4, the same command

The problem, as suspected, was in the libs (according to the post). Setting LD_BIND_NOW brought the execution times down drastically (they are now lower than on the VM). That post also has a couple of links to bugs that were filed against that version of glibc. I will go through them and give more details here. Thanks for all the valuable inputs.
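
For anyone else hitting this: LD_BIND_NOW makes the dynamic linker resolve all symbols at program startup instead of lazily on first use, and no rebuild is needed:

LD_BIND_NOW=1 ./scicomplintest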
