Regression with a small dataset


We examined a piece of software that was supposedly used for cracking. We found that its running time depends strongly on the input length N, especially when N is greater than 10-15. During our tests, we recorded the following running times.

N = 2 - 16.38 seconds 
N = 5 - 16.38 seconds 
N = 10 - 16.44 seconds 
N = 15 - 18.39 seconds 
N = 20 - 64.22 seconds 
N = 30 - 65774.62 seconds

Task: find the program's running time for the following three cases: N = 25, N = 40 and N = 50.

I tried polynomial regression, but the predictions change drastically depending on the degree (2, 3, ...).

# Importing the libraries 
import numpy as np 
import matplotlib.pyplot as plt 

# Importing the dataset 
X = np.array([[2],[5],[10],[15],[20],[30]])
X_predict = np.array([[25], [40], [50]])
y = np.array([[16.38],[16.38],[16.44],[18.39],[64.22],[65774.62]])
#y = np.array([[16.38/60],[16.38/60],[16.44/60],[18.39/60],[64.22/60],[65774.62/60]])


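# Fitting Polynomial Regression to the dataset
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# The outputs quoted below were obtained with degree = 2 and degree = 5
poly = PolynomialFeatures(degree = 11)
X_poly = poly.fit_transform(X)

# Fit an ordinary linear model on the polynomial features
lin2 = LinearRegression()
lin2.fit(X_poly, y)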

# Visualising the Polynomial Regression results 
plt.scatter(X, y, color = 'blue') 

plt.plot(X, lin2.predict(poly.transform(X)), color = 'red')
plt.title('Polynomial Regression') 


plt.show() 

# Predicting a new result with Polynomial Regression 
lin2.predict(poly.transform(X_predict))

For degree 2 the results were

array([[ 32067.76147835],
       [150765.87808383],
       [274174.84800471]])

For degree 5 the results were

array([[  10934.83739791],
       [ 621503.86217946],
       [2821409.3915933 ]])

2 Answers

1
James Phillips On Best Solutions

After an equation search, I was able to fit the data to the equation "seconds = a * exp(b * N) + Offset" with fitted parameters a = 2.5066753490350954E-05, b = 7.2292352155213369E-01, and Offset = 1.6562196782144639E+01, giving RMSE = 0.2542 and R-squared = 0.99999. This combination of data and equation is extremely sensitive to the initial parameter estimates. The fit should interpolate with high accuracy within the data range, and since the equation is simple it will likely extrapolate well outside the data range. As I understand your description, if different computer hardware is used, or if the cracking algorithm is parallelized, the fitted parameters would no longer match.
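To make the fit above concrete, here is a minimal sketch that only evaluates the quoted equation with the quoted parameters; it does not redo the equation search. The function name `predicted_seconds` and the variable names are illustrative, not part of the original answer. It recomputes the RMSE on the six measurements and prints estimates for N = 25, 40 and 50.

```python
import numpy as np

# Parameters copied from the answer above (not re-fitted here).
A = 2.5066753490350954e-05
B = 7.2292352155213369e-01
OFFSET = 1.6562196782144639e+01

def predicted_seconds(n):
    """Evaluate seconds = a * exp(b * N) + Offset (hypothetical helper name)."""
    return A * np.exp(B * np.asarray(n, dtype=float)) + OFFSET

# Measurements from the question.
n_obs = np.array([2, 5, 10, 15, 20, 30])
t_obs = np.array([16.38, 16.38, 16.44, 18.39, 64.22, 65774.62])

# Root-mean-square error of the model on the observed points.
residuals = predicted_seconds(n_obs) - t_obs
rmse = float(np.sqrt(np.mean(residuals ** 2)))
print("RMSE on the six measurements:", round(rmse, 4))

# Estimates for the three requested input lengths.
targets = [25, 40, 50]
for n, t in zip(targets, predicted_seconds(targets)):
    print(f"N = {n}: about {t:.4g} seconds")
```

N = 40 and N = 50 lie well beyond the largest measured input, so those two numbers are extrapolations and inherit any error in b exponentially.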


1
CrafterKolyan On

Since this program is used for cracking, it probably relies on some form of brute force, which leads to exponential running time, so it is much better to look for a solution of the form

y = a + b * c^n

for example:

16.38 + 2.01^n / 20000

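You can also try predicting log(time) instead of time in LinearRegression. Strictly, it is log(y - a) that is linear in n, because log(y - a) = log(b) + n * log(c), so subtract the constant offset before taking the logarithm.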
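The log-transform idea can be sketched as follows. This is one possible reading of the suggestion, under a loudly stated assumption: the constant term a is taken to be the smallest measured time (16.38 s), which is a guess rather than something fitted. Points where time equals a are dropped, since log(0) is undefined. `np.polyfit` is used here for the straight-line fit; `LinearRegression` would give the same line.

```python
import numpy as np

# Measurements from the question.
n_obs = np.array([2, 5, 10, 15, 20, 30], dtype=float)
t_obs = np.array([16.38, 16.38, 16.44, 18.39, 64.22, 65774.62])

# Assumption: the constant offset a equals the smallest observed time.
a = t_obs.min()

# log(t - a) only exists where t > a, so the two baseline points drop out.
mask = t_obs > a
slope, intercept = np.polyfit(n_obs[mask], np.log(t_obs[mask] - a), 1)

# Back-transform: log(t - a) = log(b) + n * log(c)
b = np.exp(intercept)
c = np.exp(slope)
print(f"a = {a:.2f}, b = {b:.3g}, c = {c:.3f}")

def predict(n):
    """Evaluate y = a + b * c^n with the estimates above."""
    return a + b * c ** np.asarray(n, dtype=float)

for n in (25, 40, 50):
    print(f"N = {n}: about {float(predict(n)):.4g} seconds")
```

Least squares on the log scale weights relative rather than absolute errors, so the small times near the baseline influence the slope as much as the single very large measurement. Only four points enter the fit, so the estimates for N = 40 and N = 50 should be read as rough orders of magnitude.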