Is there a way to restrict sklearn from downloading datasets?


Following the recently publicized security issue in Python's tarfile module (https://www.theregister.com/2022/09/22/python_vulnerability_tarfile/), we are using the Creosote tool (https://github.com/advanced-threat-research/Creosote) to check for vulnerabilities in our code and in the packages installed in the Python virtual environment.

The following is the report generated by the Creosote tool:

 ::::::::  :::::::::  :::::::::: ::::::::   ::::::::   :::::::: ::::::::::: :::::::::: 
:+:    :+: :+:    :+: :+:       :+:    :+: :+:    :+: :+:    :+:    :+:     :+:        
+:+        +:+    +:+ +:+       +:+    +:+ +:+        +:+    +:+    +:+     +:+        
+#+        +#++:++#:  +#++:++#  +#+    +:+ +#++:++#++ +#+    +:+    +#+     +#++:++#   
+#+        +#+    +#+ +#+       +#+    +#+        +#+ +#+    +#+    +#+     +#+        
#+#    #+# #+#    #+# #+#       #+#    #+# #+#    #+# #+#    #+#    #+#     #+#        
 ########  ###    ### ########## ########   ########   ########     ###     ########## 
 
Starting scan of:venv/
        Scanning for Vulnerabilities:
                Error reading file:venv/lib/python3.10/site-packages/joblib/test/test_func_inspect_special_encoding.py
                        'utf-8' codec can't decode byte 0xa4 in position 64: invalid start byte
                Scan Completed

4 files with vulns:     0 vulns, 0 probable vulns, and 4 potential vulns found
        venv/lib/python3.10/site-packages/pip/_vendor/distlib/util.py
                Found potential vulns on lines: 1252
        venv/lib/python3.10/site-packages/sklearn/datasets/_lfw.py
                Found potential vulns on lines: 111
        venv/lib/python3.10/site-packages/sklearn/datasets/_twenty_newsgroups.py
                Found potential vulns on lines: 77
        venv/lib/python3.10/site-packages/dateutil/zoneinfo/rebuild.py
                Found potential vulns on lines: 24

As you can see, the report flags potential vulnerabilities in the sklearn/datasets subpackage. Is there a way to prevent sklearn from downloading datasets?

Or, more generally, how can we fix this vulnerability to avoid any production issues?


1 Answer

Alexander L. Hayes

scikit-learn does not download datasets by default; a download only happens when you call one of its fetch_* functions. So there are a few options.

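Option 0: The exploit can only write files where the running process is allowed to write, so it looks most dangerous when Python runs with administrator privileges. Avoid running scripts as root, e.g.: sudo python something.py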
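
If a script might be started with elevated privileges by accident, a small guard at the top can enforce this. The following is only a sketch: the function name refuse_if_root and its euid parameter are made up for illustration, and os.geteuid only exists on Unix-like systems, so the check is skipped elsewhere.

```python
import os


def refuse_if_root(euid=None):
    """Raise PermissionError when the effective user ID is 0 (root).

    `euid` is a hypothetical override, useful mainly for testing.
    """
    if euid is None:
        # os.geteuid is not available on Windows; do nothing there.
        geteuid = getattr(os, "geteuid", None)
        if geteuid is None:
            return
        euid = geteuid()
    if euid == 0:
        raise PermissionError("refusing to run as root; use an unprivileged user")
```

Calling refuse_if_root() before any dataset or archive handling makes the script fail early instead of extracting files as root.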

Option 1: Don't run code that calls the flagged fetchers, such as:

from sklearn.datasets import fetch_lfw_people, fetch_20newsgroups
lfw = fetch_lfw_people()
news = fetch_20newsgroups()
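
If some code path does need these datasets, the fetchers accept a download_if_missing parameter; with download_if_missing=False they raise an OSError when the data is not already in the local cache instead of downloading it. A minimal sketch of a wrapper that forces this (the name fetch_offline is hypothetical, not part of scikit-learn):

```python
def fetch_offline(fetcher, **kwargs):
    """Call a scikit-learn fetch_* function without allowing a download.

    `fetcher` is any callable that accepts `download_if_missing`,
    e.g. sklearn.datasets.fetch_20newsgroups. An OSError from the
    fetcher (data not cached locally) is allowed to propagate.
    """
    kwargs["download_if_missing"] = False  # override whatever the caller passed
    return fetcher(**kwargs)
```

For example, fetch_offline(fetch_20newsgroups, subset="train") would either load the cached copy or fail, but never touch the network or unpack a freshly downloaded archive.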

Option 2: I'm not familiar with Creosote, but there are uses of tarfile in scikit-learn that do not appear to have been flagged, e.g. in fetch_california_housing. If some fetch_* functions have potential vulnerabilities, they should be debugged and patched upstream.
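
To cross-check what a scanner reports, you can list extraction calls yourself. This is a rough sketch and not how Creosote works internally (I have not checked); the function names are made up, and it matches any .extract() or .extractall() method call, so zipfile and unrelated objects will show up as well.

```python
import ast
from pathlib import Path


def find_extract_calls(source, filename="<string>"):
    """Return sorted (line, method) pairs for .extract()/.extractall() calls."""
    hits = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr in {"extract", "extractall"}
        ):
            hits.append((node.lineno, node.func.attr))
    return sorted(hits)


def scan_tree(root):
    """Map each .py file under `root` to its extraction calls, if any."""
    results = {}
    for path in Path(root).rglob("*.py"):
        try:
            source = path.read_text(encoding="utf-8")
            hits = find_extract_calls(source, str(path))
        except (SyntaxError, ValueError):
            # Skip files that cannot be decoded or parsed, such as the
            # joblib test file that Creosote also failed to read.
            continue
        if hits:
            results[str(path)] = hits
    return results
```

Running scan_tree("venv/lib/python3.10/site-packages/sklearn") gives a list of candidate lines to review by hand; each hit still needs a human to decide whether the archive source is trusted.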

Option 3: If the mere presence of this code in the package is considered dangerous by your organization, modify the package and build wheels that comply with your organization's security policies.
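
As an illustration of the kind of change such a patch might make, here is a sketch of an extraction helper that rejects members resolving outside the destination directory. The name safe_extract is hypothetical, and rejecting all links is a conservative assumption of mine rather than anything scikit-learn does. Where the interpreter provides tarfile.data_filter (Python 3.12+ and some security backports), the built-in "data" filter is applied as well.

```python
import os
import tarfile


def safe_extract(archive_path, dest_dir):
    """Extract a tar archive, refusing members that escape dest_dir."""
    dest_root = os.path.realpath(dest_dir)
    with tarfile.open(archive_path) as tar:
        members = tar.getmembers()
        for member in members:
            # Conservative choice: refuse symbolic and hard links outright.
            if member.issym() or member.islnk():
                raise ValueError(f"link member not allowed: {member.name}")
            target = os.path.realpath(os.path.join(dest_root, member.name))
            # commonpath differs from dest_root when target is outside it.
            if os.path.commonpath([dest_root, target]) != dest_root:
                raise ValueError(f"member escapes destination: {member.name}")
        if hasattr(tarfile, "data_filter"):
            tar.extractall(dest_root, members=members, filter="data")
        else:
            tar.extractall(dest_root, members=members)
```

All members are validated before anything is written, so a malicious archive is rejected as a whole rather than partially extracted.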