Connection Time in Google Colab Pro+


I am running experiments with MLflow and DagsHub on a public time-series dataset, 3W (https://github.com/petrobras/3W).

I need to run 100 trials of a grid search with the Optuna package to find the best set of parameters, and some of the tracking calls have been failing.

When I run the script with fewer Optuna trials, it works perfectly, but it fails when the run lasts more than 4 (four) hours, or 10 trials. The code runs in Colab with Optuna, MLflow and DagsHub.
Detail: I upgraded from Colab Pro to Colab Pro+ to use background execution, where notebooks continue running even after I close the browser tab, as long as I have compute units available. I currently have 689.98 compute units.

In summary, some trials failed, as shown below: [image: summary of failed trials]

Link to mlflow results: text

And this is the failure message from Colab on the last trial:

%%python drive/MyDrive/Mestrado\\ UFRJ/3W/toolkit/MAIS/training/multiclass/tune_lgbm_colab3.py tune -t drive/MyDrive/Mestrado\\ UFRJ/3W/toolkit/MAIS/dataset/folds -T drive/MyDrive/Mestrado\\ UFRJ/3W/toolkit/MAIS/dataset/folds -e multi_mixed_select_mrl_nonan -n 30

Error message:

[2023-11-17 01:14:47,193 - tune_lgbm_colab - INFO] model.predict
[W 2023-11-17 01:15:06,354] Trial 22 failed with parameters: {'level': 10, 'importance_percentile': 0.6893999015182921, 'normal_balance': 4, 'subsample': 0.2, 'feature_fraction': 0.45000000000000007, 'lambda_l1': 4.473491485855139, 'lambda_l2': 0.646775882590484, 'num_leaves': 127} because of the following error: MlflowException('API request to endpoint /api/2.0/mlflow/runs/update failed with error code 400 != 200. Response body: '"repo not associated with run"'').
Traceback (most recent call last):
  File "/content/drive/MyDrive/Mestrado UFRJ/3W/toolkit/MAIS/training/multiclass/tune_lgbm_colab3.py", line 275, in objective
    log_results(results)
  File "/content/drive/MyDrive/Mestrado UFRJ/3W/toolkit/MAIS/training/multiclass/tune_lgbm_colab3.py", line 145, in log_results
    mlflow.log_metric("score-std", np.std(results["scores"]))
  File "/usr/local/lib/python3.10/dist-packages/mlflow/tracking/fluent.py", line 771, in log_metric
    return MlflowClient().log_metric(
  File "/usr/local/lib/python3.10/dist-packages/mlflow/tracking/client.py", line 766, in log_metric
    return self._tracking_client.log_metric(
  File "/usr/local/lib/python3.10/dist-packages/mlflow/tracking/_tracking_service/client.py", line 298, in log_metric
    self.store.log_metric(run_id, metric)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/store/tracking/rest_store.py", line 198, in log_metric
    self._call_endpoint(LogMetric, req_body)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/store/tracking/rest_store.py", line 59, in _call_endpoint
    return call_endpoint(self.get_host_creds(), endpoint, method, json_body, response_proto)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/utils/rest_utils.py", line 210, in call_endpoint
    response = verify_rest_response(response, endpoint)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/utils/rest_utils.py", line 148, in verify_rest_response
    raise MlflowException(
mlflow.exceptions.MlflowException: API request to endpoint /api/2.0/mlflow/runs/log-metric failed with error code 400 != 200. Response body: '"repo not associated with run"'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py", line 200, in _run_trial
    value_or_values = func(trial)
  File "/content/drive/MyDrive/Mestrado UFRJ/3W/toolkit/MAIS/training/multiclass/tune_lgbm_colab3.py", line 270, in objective
    with mlflow.start_run(nested=True, run_name=f"trial - {trial.number} - cv"):
  File "/usr/local/lib/python3.10/dist-packages/mlflow/tracking/fluent.py", line 190, in __exit__
    end_run(RunStatus.to_string(status))
  File "/usr/local/lib/python3.10/dist-packages/mlflow/tracking/fluent.py", line 446, in end_run
    MlflowClient().set_terminated(_last_active_run_id, status)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/tracking/client.py", line 1909, in set_terminated
    self._tracking_client.set_terminated(run_id, status, end_time)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/tracking/_tracking_service/client.py", line 575, in set_terminated
    self.store.update_run_info(
  File "/usr/local/lib/python3.10/dist-packages/mlflow/store/tracking/rest_store.py", line 151, in update_run_info
    response_proto = self._call_endpoint(UpdateRun, req_body)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/store/tracking/rest_store.py", line 59, in _call_endpoint
    return call_endpoint(self.get_host_creds(), endpoint, method, json_body, response_proto)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/utils/rest_utils.py", line 210, in call_endpoint
    response = verify_rest_response(response, endpoint)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/utils/rest_utils.py", line 148, in verify_rest_response
    raise MlflowException(
mlflow.exceptions.MlflowException: API request to endpoint /api/2.0/mlflow/runs/update failed with error code 400 != 200. Response body: '"repo not associated with run"'
[W 2023-11-17 01:15:06,358] Trial 22 failed with value None.
Traceback (most recent call last):
  File "/content/drive/MyDrive/Mestrado UFRJ/3W/toolkit/MAIS/training/multiclass/tune_lgbm_colab3.py", line 275, in objective
    log_results(results)
  File "/content/drive/MyDrive/Mestrado UFRJ/3W/toolkit/MAIS/training/multiclass/tune_lgbm_colab3.py", line 145, in log_results
    mlflow.log_metric("score-std", np.std(results["scores"]))
  File "/usr/local/lib/python3.10/dist-packages/mlflow/tracking/fluent.py", line 771, in log_metric
    return MlflowClient().log_metric(
  File "/usr/local/lib/python3.10/dist-packages/mlflow/tracking/client.py", line 766, in log_metric
    return self._tracking_client.log_metric(
  File "/usr/local/lib/python3.10/dist-packages/mlflow/tracking/_tracking_service/client.py", line 298, in log_metric
    self.store.log_metric(run_id, metric)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/store/tracking/rest_store.py", line 198, in log_metric
    self._call_endpoint(LogMetric, req_body)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/store/tracking/rest_store.py", line 59, in _call_endpoint
    return call_endpoint(self.get_host_creds(), endpoint, method, json_body, response_proto)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/utils/rest_utils.py", line 210, in call_endpoint
    response = verify_rest_response(response, endpoint)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/utils/rest_utils.py", line 148, in verify_rest_response
    raise MlflowException(
mlflow.exceptions.MlflowException: API request to endpoint /api/2.0/mlflow/runs/log-metric failed with error code 400 != 200. Response body: '"repo not associated with run"'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/content/drive/MyDrive/Mestrado UFRJ/3W/toolkit/MAIS/training/multiclass/tune_lgbm_colab3.py", line 496, in <module>
    cli(obj={})
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/content/drive/MyDrive/Mestrado UFRJ/3W/toolkit/MAIS/training/multiclass/tune_lgbm_colab3.py", line 457, in tune
    study = hyperparameter_search(
  File "/content/drive/MyDrive/Mestrado UFRJ/3W/toolkit/MAIS/training/multiclass/tune_lgbm_colab3.py", line 300, in hyperparameter_search
    study.optimize(objective, config["num_trials"], callbacks=[mlflow_callback])
  File "/usr/local/lib/python3.10/dist-packages/optuna/study/study.py", line 451, in optimize
    _optimize(
  File "/usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py", line 66, in _optimize
    _optimize_sequential(
  File "/usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py", line 163, in _optimize_sequential
    frozen_trial = _run_trial(study, func, catch)
  File "/usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py", line 251, in _run_trial
    raise func_err
  File "/usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py", line 200, in _run_trial
    value_or_values = func(trial)
  File "/content/drive/MyDrive/Mestrado UFRJ/3W/toolkit/MAIS/training/multiclass/tune_lgbm_colab3.py", line 270, in objective
    with mlflow.start_run(nested=True, run_name=f"trial - {trial.number} - cv"):
  File "/usr/local/lib/python3.10/dist-packages/mlflow/tracking/fluent.py", line 190, in __exit__
    end_run(RunStatus.to_string(status))
  File "/usr/local/lib/python3.10/dist-packages/mlflow/tracking/fluent.py", line 446, in end_run
    MlflowClient().set_terminated(_last_active_run_id, status)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/tracking/client.py", line 1909, in set_terminated
    self._tracking_client.set_terminated(run_id, status, end_time)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/tracking/_tracking_service/client.py", line 575, in set_terminated
    self.store.update_run_info(
  File "/usr/local/lib/python3.10/dist-packages/mlflow/store/tracking/rest_store.py", line 151, in update_run_info
    response_proto = self._call_endpoint(UpdateRun, req_body)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/store/tracking/rest_store.py", line 59, in _call_endpoint
    return call_endpoint(self.get_host_creds(), endpoint, method, json_body, response_proto)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/utils/rest_utils.py", line 210, in call_endpoint
    response = verify_rest_response(response, endpoint)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/utils/rest_utils.py", line 148, in verify_rest_response
    raise MlflowException(
mlflow.exceptions.MlflowException: API request to endpoint /api/2.0/mlflow/runs/update failed with error code 400 != 200. Response body: '"repo not associated with run"'

CalledProcessError                        Traceback (most recent call last)
in <cell line: 1>()
      1 get_ipython().run_cell_magic('python', 'drive/MyDrive/Mestrado\ UFRJ/3W/toolkit/MAIS/training/multiclass/tune_lgbm_colab3.py tune -t drive/MyDrive/Mestrado\ UFRJ/3W/toolkit/MAIS/dataset/folds -T drive/MyDrive/Mestrado\ UFRJ/3W/toolkit/MAIS/dataset/folds -e multi_mixed_select_mrl_nonan -n 30', '')

4 frames in shebang(self, line, cell)

/usr/local/lib/python3.10/dist-packages/IPython/core/magics/script.py in shebang(self, line, cell)
    243             sys.stderr.flush()
    244         if args.raise_error and p.returncode!=0:
    245             raise CalledProcessError(p.returncode, cell, output=out, stderr=err)
    246
    247     def _run_script(self, p, cell, to_close):

CalledProcessError: Command 'b' \n'' returned non-zero exit status 1.
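The Optuna warning lines in the output above follow a regular pattern, so a saved copy of the log can be scanned to list which trials failed, when, and against which MLflow endpoint. The following is only a minimal sketch using the standard library; `summarize_failures` is a hypothetical helper name, and the `sample` string is a shortened copy of the warning lines from this post, not a full log.

```python
import re
from datetime import datetime

# Matches Optuna warning lines of the form seen in the log above, e.g.
# "[W 2023-11-17 01:15:06,354] Trial 22 failed with parameters: {...} because of ..."
FAILED_TRIAL = re.compile(
    r"\[W (?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+\] "
    r"Trial (?P<number>\d+) failed with parameters"
)
# Matches the REST endpoint and status code named in the exception message.
ENDPOINT = re.compile(
    r"API request to endpoint (?P<endpoint>\S+) failed with error code (?P<code>\d+)"
)


def summarize_failures(log_text):
    """Return one dict per failed trial: number, timestamp, endpoint, status code."""
    failures = []
    for line in log_text.splitlines():
        trial = FAILED_TRIAL.search(line)
        if trial is None:
            continue  # not a "Trial N failed with parameters" warning
        endpoint = ENDPOINT.search(line)
        failures.append({
            "trial": int(trial.group("number")),
            "time": datetime.strptime(trial.group("ts"), "%Y-%m-%d %H:%M:%S"),
            "endpoint": endpoint.group("endpoint") if endpoint else None,
            "status": int(endpoint.group("code")) if endpoint else None,
        })
    return failures


# Shortened sample built from the log lines in this post (parameters truncated).
sample = (
    "[2023-11-17 01:14:47,193 - tune_lgbm_colab - INFO] model.predict\n"
    "[W 2023-11-17 01:15:06,354] Trial 22 failed with parameters: {'level': 10} "
    "because of the following error: MlflowException('API request to endpoint "
    "/api/2.0/mlflow/runs/update failed with error code 400 != 200.').\n"
    "[W 2023-11-17 01:15:06,358] Trial 22 failed with value None.\n"
)

for failure in summarize_failures(sample):
    print(failure["trial"], failure["time"].isoformat(),
          failure["endpoint"], failure["status"])
```

Comparing the timestamps of the failed trials with the start time of the run would show whether the failures really begin around the four-hour mark.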

If you could shed some light on this, I would be very grateful!

I tried running with fewer trials and it worked fine.
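Since the run only breaks when a tracking call raises, one thing that could be tried is to retry the tracking calls and, if they still fail, keep the trial result instead of aborting. The sketch below is not a confirmed fix: it assumes the 400 response is transient, which nothing in the log proves. `with_retries`, `log_safely` and `flaky_log_metric` are hypothetical names; `flaky_log_metric` only stands in for a call such as mlflow.log_metric.

```python
import time


def with_retries(call, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run call(); on an exception wait base_delay * 2**attempt and try again."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the original error
            sleep(base_delay * 2 ** attempt)


def log_safely(call, **retry_options):
    """Retry call(); if it still fails, report the error instead of raising.

    Returns True when the call succeeded, False otherwise.
    """
    try:
        with_retries(call, **retry_options)
        return True
    except Exception as error:
        print(f"tracking call failed, continuing: {error}")
        return False


# Hypothetical stand-in for a tracking call: it fails twice, then succeeds.
calls = {"count": 0}


def flaky_log_metric():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated tracking error")


ok = log_safely(flaky_log_metric, attempts=3, base_delay=0.0)
print(ok, calls["count"])
```

The trade-off is that a metric may be missing from the tracking server while the Optuna study keeps its value. Separately, Optuna's study.optimize accepts a catch argument (a tuple of exception types) so that a trial raising one of those exceptions is marked as failed and the study continues with the next trial.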
