Problem
In Python, my task is to find an XML file on an external SFTP server that contains the tag <myTag>BLA</myTag>.
The problem is that there could be more than 5000 files on the server.
Is there a way to do it efficiently?
Right now I am:
- using pysftp to connect to the SFTP server
- gathering all filenames from the server and filtering them to keep only the .xml ones
- going over the files with conn.getfo(file_name, file) to download each one and search its content.
Pseudo code
import stat
import tempfile

import pysftp

cnopts = pysftp.CnOpts()
cnopts.hostkeys = None
conn = pysftp.Connection('...', username='...', password='..', port=22, cnopts=cnopts)
root = '...'  # base directory to search

def get_all_filenames(root, skip_path=None):
    # Walk the remote tree recursively and collect the paths of regular files
    all_file_names = []
    for file_attr in conn.listdir_attr(root):
        filepath = f'{root}/{file_attr.filename}'.replace('//', '/')
        if stat.S_ISREG(file_attr.st_mode):
            all_file_names.append(filepath)
        elif stat.S_ISDIR(file_attr.st_mode):
            if skip_path is None or skip_path != filepath:
                all_file_names += get_all_filenames(filepath, skip_path)
    return all_file_names

for file_name in [f_name for f_name in get_all_filenames(root) if f_name.endswith('.xml')]:
    with tempfile.TemporaryFile(mode='wb+') as file:
        try:
            conn.getfo(file_name, file)
            file.seek(0)
            content = file.read()
            # find right tag...
        except IOError:
            pass  # skip files that cannot be downloaded
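The # find right tag... step can be sketched as a plain substring match on the downloaded bytes (an xml.etree.ElementTree parse would be the more robust alternative, but a substring check is enough to short-list candidate files):

def contains_tag(content: bytes) -> bool:
    # Look for the literal tag in the raw file content
    return b'<myTag>BLA</myTag>' in content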
This code takes around 0.15 s per file, so with 5000 files it takes roughly 12.5 minutes to go over all of them.
Question
How can I optimize this?
Prerequisites
- The SFTP server is outside of my domain
- I do not have SSH permissions
- Downloading and keeping all files on my own server (file sync) is not a viable option
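For example, would it be viable to open several SFTP connections and scan the files in parallel? A rough sketch of what I have in mind (HOST, USER, PASSWORD and xml_file_names are placeholders, and it assumes the server tolerates multiple simultaneous sessions):

import io
from concurrent.futures import ThreadPoolExecutor

import pysftp

# Placeholders -- not the real values
HOST, USER, PASSWORD = '...', '...', '...'
xml_file_names = []  # the .xml paths gathered as above

def scan_chunk(file_names):
    # Each worker opens its own connection and checks its share of the files
    cnopts = pysftp.CnOpts()
    cnopts.hostkeys = None
    found = []
    with pysftp.Connection(HOST, username=USER, password=PASSWORD, port=22, cnopts=cnopts) as sftp:
        for name in file_names:
            buf = io.BytesIO()
            sftp.getfo(name, buf)
            if b'<myTag>BLA</myTag>' in buf.getvalue():
                found.append(name)
    return found

n_workers = 8  # assumption: the server allows this many parallel sessions
chunks = [xml_file_names[i::n_workers] for i in range(n_workers)]
with ThreadPoolExecutor(max_workers=n_workers) as pool:
    matches = [name for chunk in pool.map(scan_chunk, chunks) for name in chunk]

Would that be a reasonable direction given the constraints above, or is there a better way?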