# Regression with a small dataset

We examined a piece of software that was supposedly used for cracking. We discovered that its running time depends strongly on the input length N, especially when N is greater than 10-15. During our tests, we recorded the following running times.

``````
N = 2 - 16.38 seconds
N = 5 - 16.38 seconds
N = 10 - 16.44 seconds
N = 15 - 18.39 seconds
N = 20 - 64.22 seconds
N = 30 - 65774.62 seconds
``````

Task: find the program's running times for the following three cases: N = 25, N = 40 and N = 50.

I tried polynomial regression, but the predictions vary widely with the degree of the polynomial (2, 3, ...).

``````
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Importing the dataset (N values taken from the measurements listed above)
X = np.array([[2], [5], [10], [15], [20], [30]])
X_predict = np.array([[25], [40], [50]])
y = np.array([[16.38], [16.38], [16.44], [18.39], [64.22], [65774.62]])
#y = np.array([[16.38/60],[16.38/60],[16.44/60],[18.39/60],[64.22/60],[65774.62/60]])

# Fitting Polynomial Regression to the dataset
# (the degree was changed between runs: 2, 3, ...)
poly = PolynomialFeatures(degree = 11)
X_poly = poly.fit_transform(X)

lin2 = LinearRegression()
lin2.fit(X_poly, y)

# Visualising the Polynomial Regression results
plt.scatter(X, y, color = 'blue')

# transform (not fit_transform): the feature expansion is already fitted
plt.plot(X, lin2.predict(poly.transform(X)), color = 'red')
plt.title('Polynomial Regression')

plt.show()

# Predicting a new result with Polynomial Regression
lin2.predict(poly.transform(X_predict))
``````

For degree 2 the results were

``````
array([[ 32067.76147835],
[150765.87808383],
[274174.84800471]])
``````

For degree 5 the results were

``````
array([[  10934.83739791],
[ 621503.86217946],
[2821409.3915933 ]])
``````

Using an equation search, I was able to fit the data to the equation "seconds = a * exp(b * N) + Offset" with fitted parameters a = 2.5066753490350954E-05, b = 7.2292352155213369E-01, and Offset = 1.6562196782144639E+01, giving RMSE = 0.2542 and R-squared = 0.99999. This combination of data and equation is extremely sensitive to the initial parameter estimates. The fit should interpolate with high accuracy within the data range, and since the equation is simple it will likely also extrapolate well outside it. As I understand your description, this solution would no longer apply if different computer hardware is used or if the cracking algorithm is parallelized.
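To make the fitted equation concrete, here is a minimal sketch that plugs the quoted parameters into "seconds = a * exp(b * N) + Offset", recomputes the RMSE on the six measurements, and evaluates the three requested cases. The function name `model` and the variable names are my own choices for illustration; only the data and the three parameter values come from the text above.

```python
import numpy as np

# Measured data from the question
N = np.array([2, 5, 10, 15, 20, 30], dtype=float)
seconds = np.array([16.38, 16.38, 16.44, 18.39, 64.22, 65774.62])

# Fitted parameters as quoted in the answer
a = 2.5066753490350954e-05
b = 7.2292352155213369e-01
offset = 1.6562196782144639e+01

def model(n):
    """Evaluate seconds = a * exp(b * N) + Offset (helper name is illustrative)."""
    return a * np.exp(b * np.asarray(n, dtype=float)) + offset

# Goodness of fit on the measured points
residuals = model(N) - seconds
rmse = np.sqrt(np.mean(residuals ** 2))
print(f"RMSE on the six measurements: {rmse:.4f}")

# Predictions for the three requested input lengths
cases = [25, 40, 50]
predictions = model(cases)
for n, t in zip(cases, predictions):
    print(f"N = {n}: {t:.4g} seconds")
```

Because the model is exponential, the values for N = 40 and N = 50 lie far outside the measured range and should be read as order-of-magnitude estimates.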

As this program is used for cracking, it may use some sort of brute force, which leads to exponential running time, so it is much better to look for a solution of the form

y = a + b * c^n

for example:

16.38 + 2.01^n / 20000
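As a quick sanity check, the example formula can be compared against the measurements. This is only a sketch: the helper name `example_formula` is made up for illustration, and the constants are the ones from the example above, not the result of a fit.

```python
import numpy as np

# Measured data from the question
n = np.array([2, 5, 10, 15, 20, 30], dtype=float)
measured = np.array([16.38, 16.38, 16.44, 18.39, 64.22, 65774.62])

def example_formula(n):
    """y = a + b * c^n with a = 16.38, b = 1/20000, c = 2.01 (values from the example)."""
    return 16.38 + 2.01 ** np.asarray(n, dtype=float) / 20000

estimated = example_formula(n)
relative_error = np.abs(estimated - measured) / measured

for ni, m, e, r in zip(n, measured, estimated, relative_error):
    print(f"N = {ni:>4.0f}: measured {m:>10.2f}, formula {e:>10.2f}, relative error {r:.1%}")

# The same formula applied to the three requested cases
requested = example_formula([25, 40, 50])
print(requested)
```

The hand-picked constants are rough, so a proper fit of a, b and c would be expected to track the data more closely.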

You can try predicting `log(time)` instead of `time` in `LinearRegression`.
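A minimal sketch of that idea, assuming the same six measurements as in the question: fit a straight line to `log(time)` and exponentiate the predictions to get back to seconds. The variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Measured data from the question
X = np.array([[2], [5], [10], [15], [20], [30]])
y = np.array([16.38, 16.38, 16.44, 18.39, 64.22, 65774.62])
X_predict = np.array([[25], [40], [50]])

# Fit a straight line to log(time); this corresponds to
# time = exp(intercept) * exp(slope * N)
log_model = LinearRegression()
log_model.fit(X, np.log(y))

# Transform the predictions back from log-seconds to seconds
predicted_seconds = np.exp(log_model.predict(X_predict))

print("slope:", log_model.coef_[0], "intercept:", log_model.intercept_)
for n, t in zip(X_predict.ravel(), predicted_seconds):
    print(f"N = {n}: {t:.4g} seconds")
```

Note that the roughly constant 16 seconds at small N means `log(time)` is not exactly linear in N. If the running time has the form a + b * c^N, it is `log(time - a)` that is linear, so subtracting an assumed baseline before taking the logarithm may be worth trying.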