I use fsspec, which relies on the built-in capabilities of paramiko, but I could not find a way to paginate the response.
Is there a way to get that functionality here?
The use case: every directory has 100,000 files, and listing all of them in memory at once is, I suppose, a bad idea.
There is `sftp.listdir_iter` in paramiko, but do we have that capability in fsspec?
`listdir_iter` would provide a more direct way to achieve pagination, since it returns an iterator and lets you retrieve items one by one.

But you could also consider `listdir_attr`, which loads all items at once; you then slice the list to get the desired page, which would be faster. That means you can implement pagination by slicing the returned list of `SFTPAttributes` objects.
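The slicing approach can be sketched as follows. This is a minimal sketch, not code from the original answer: `list_page` is a hypothetical helper, `sftp` is assumed to be a connected `paramiko.SFTPClient`, and pages are assumed to be numbered from 1.

```python
def list_page(sftp, path, page, page_size):
    """Return one page of directory entries by slicing the full listing.

    `sftp` is assumed to be a connected paramiko.SFTPClient (setup not
    shown here). Pages are numbered from 1.
    """
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    # listdir_attr() fetches every SFTPAttributes object for the directory.
    entries = sftp.listdir_attr(path)
    # The listing comes back in arbitrary order, so sort for stable pages.
    entries.sort(key=lambda entry: entry.filename)
    start = (page - 1) * page_size
    return entries[start:start + page_size]
```

With an open client you would call something like `list_page(sftp, "/data", page=2, page_size=100)` and read `entry.filename` from each result.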
This approach is slightly more efficient than the one using `listdir_iter`, since it avoids iterating through the items one by one. However, it still loads all the `SFTPAttributes` objects in memory before slicing the list. This memory overhead might not be an issue unless you have a very large number of files and limited memory resources.

To use `listdir_iter` with fsspec, you can create a custom `PaginatedSFTPFileSystem` class that inherits from `SFTPFileSystem`. The custom class accesses the underlying paramiko SFTP client through the `self.ftp` attribute and calls its `listdir_iter` method directly. By accessing the paramiko SFTP client in this way, you can use `listdir_iter` to implement pagination, even though it is not part of `fsspec`.

Using `sshfs` (an implementation of fsspec for the SFTP protocol using asyncssh), I do not see a `SSHFS.listdir`-like method. But `sshfs` also has a lot of other basic filesystem operations, such as `mkdir`, `touch` and `find`. You might therefore try the `find` method, which is inherited from the `AbstractFileSystem` class in `fsspec`, for pagination, wrapping it in a small custom implementation for use in your project.
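A rough sketch of `find`-based pagination follows. `iter_find_pages` is a hypothetical helper, and `fs` is assumed to be any fsspec-style filesystem object whose `find(path, maxdepth=..., detail=False)` returns a list of paths. The `maxdepth=1` default is an assumption added here to keep the listing to a single directory level.

```python
def iter_find_pages(fs, path, page_size, maxdepth=1):
    """Yield successive pages (lists of paths) found under `path`.

    `fs` is assumed to be an fsspec filesystem such as the one provided
    by sshfs; only its `find` method is used.
    """
    # With detail=False, find() returns a plain list of path strings.
    paths = fs.find(path, maxdepth=maxdepth, detail=False)
    # Slice the complete list into consecutive chunks of page_size.
    for start in range(0, len(paths), page_size):
        yield paths[start:start + page_size]
```

A caller would then loop with `for page in iter_find_pages(fs, "/data", 100): ...` and process each chunk of paths in turn.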
This implementation uses the `find` method with the `detail` parameter set to `False` to get a list of file paths. Then, it implements pagination by slicing the list of items.
Again, this approach loads all the items into memory before slicing the list, which may be inefficient for very large directories.
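For very large directories, the `listdir_iter` route described earlier avoids building the full list. The following is a minimal sketch under stated assumptions: `iter_listing_pages` is a hypothetical helper, and `sftp` is assumed to be a paramiko `SFTPClient`, for instance the `ftp` attribute of an fsspec `SFTPFileSystem`.

```python
from itertools import islice


def iter_listing_pages(sftp, path, page_size):
    """Lazily yield pages of SFTPAttributes taken from listdir_iter.

    `sftp` is assumed to be a paramiko.SFTPClient, e.g. `fs.ftp` on an
    fsspec SFTPFileSystem instance. Only one page is held at a time.
    """
    if page_size < 1:
        # Generator body: this is raised when iteration starts.
        raise ValueError("page_size must be positive")
    entries = iter(sftp.listdir_iter(path))  # lazy iterator over entries
    while True:
        # Pull at most page_size entries from the shared iterator.
        page = list(islice(entries, page_size))
        if not page:
            return
        yield page
```

Because pages are produced on demand, you can stop iterating early instead of materializing the whole listing in one list.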
I suppose you can pass an existing `SFTPFileSystem` object to your custom `PaginatedSFTPFileSystem` class and use its underlying SFTP connection. To do this, you can modify the custom class to accept an `SFTPFileSystem` object during initialization and use its `ftp` attribute for listing the directory items. You then create an `SFTPFileSystem` object and pass it to `PaginatedSFTPFileSystem`. The custom class will use the SFTP connection from the existing `SFTPFileSystem` object, eliminating the need to provide the `host`, `username`, and `password` again.

Corralien suggests in the comments to use `walk(path, maxdepth=None, topdown=True, **kwargs)`. You can use this method with your custom `PaginatedSFTPFileSystem` class, as it inherits from `SFTPFileSystem`, which in turn inherits from `AbstractFileSystem`. This means that the `walk` method is available to your custom class. However, that might not be the most suitable choice for pagination, as it returns files and directories in a nested structure, making it harder to paginate the results in a straightforward manner.
If you need pagination for only the top-level directories, you can modify the custom `PaginatedSFTPFileSystem` class to include a custom implementation of the `walk` method with pagination support for the top level.
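One way to sketch top-level pagination on top of `walk` is shown below. `walk_top_level_page` is a hypothetical helper written as a plain function rather than a method override, and `fs` is assumed to be an fsspec filesystem whose `walk(path, maxdepth=1)` yields `(root, dirs, files)` tuples of names.

```python
def walk_top_level_page(fs, path, page, page_size):
    """Return one page of the names found directly under `path`.

    `fs` is assumed to be an fsspec filesystem. Only the first tuple
    produced by walk() is used, so subdirectories are never descended.
    """
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    # Fall back to empty listings if walk() yields nothing for this path.
    root, dirs, files = next(iter(fs.walk(path, maxdepth=1)), (path, [], []))
    names = sorted(dirs) + sorted(files)  # directories first, then files
    start = (page - 1) * page_size
    return names[start:start + page_size]
```

Sorting directories ahead of files is a design choice made for this sketch; any stable ordering works as long as it is the same for every page request.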
Again, that would only paginate the top-level directories and files, not those within the subdirectories.
If you need pagination for files and directories at all levels, consider using the `find` method or the custom `listdir_paginated` method, as shown in previous examples.

As noted by mdurant in the comments:
See Instance/Listing caching.
Depending on your use case, you might need to pass `skip_instance_cache=True` or `use_listings_cache=False`. Keep in mind that, if you use the same arguments to create a `PaginatedSFTPFileSystem` instance, fsspec will return the cached `SFTPFileSystem` instance.
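The instance cache can be observed without an SFTP server. The sketch below uses fsspec's built-in in-memory filesystem as a stand-in, which is an assumption made here purely for illustration; the `skip_instance_cache` keyword is the one mentioned above.

```python
import fsspec

# Same protocol and same arguments: fsspec hands back the cached instance.
fs_a = fsspec.filesystem("memory")
fs_b = fsspec.filesystem("memory")
same_instance = fs_a is fs_b

# skip_instance_cache=True bypasses the instance cache for this call.
fs_c = fsspec.filesystem("memory", skip_instance_cache=True)
fresh_instance = fs_c is not fs_a

print(same_instance, fresh_instance)
```

For an SFTP filesystem, a fresh instance created this way would also open its own connection instead of sharing the cached one.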
If you want to force the creation of a new SFTP session, you can also do so by passing a unique argument when creating the `PaginatedSFTPFileSystem` instance. For example, you can add a dummy argument that takes a unique value each time you want to create a new session.
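With that approach, two instances such as `fs1` and `fs2` will have separate SFTP sessions, despite having the same host, username, and password, because the unique dummy arguments force `fsspec` to create new instances instead of reusing the cached one. Since `SFTPFileSystem` forwards its keyword arguments to paramiko when connecting, the custom class should consume the dummy argument itself rather than pass it on.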