How to restrict publicly available downloads on a data repository site?


I'm creating a platform whereby users upload and download data. The amount of data uploaded isn't trivial---this could be on the order of GB.

Users should be able to download a subset of this data via hyperlinks.

If I'm not mistaken, my AWS account will be charged for the data egress when these files are downloaded. If that's true, I'm concerned about two related scenarios:

  1. Users who abuse this and constantly click the download hyperlinks (more than is reasonable)
  2. More concerning, bots that hit the download links every few seconds.

I had planned to make the downloads accessible to anyone who visits the website as a public resource. Naturally, if users logged in to the platform, I could easily restrict the amount of data downloaded over a period of time.

For public websites, how could I stop users from downloading too much? Could I use IP addresses maybe?

Any insight appreciated.


There are 2 answers

Marcin

IP addresses can be easily changed, so they're a poor control, but probably better than nothing.

For robots, use a CAPTCHA. This is an effective way of preventing automated scraping of your links.
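As a rough sketch of the server-side half of that check, assuming Google reCAPTCHA (other CAPTCHA services work similarly) and that the browser posts the solved token to a hypothetical download endpoint:

```python
import os
import requests

# Secret key issued when you register the site with the CAPTCHA provider.
RECAPTCHA_SECRET = os.environ["RECAPTCHA_SECRET"]

def captcha_passed(token: str, client_ip: str) -> bool:
    """Verify a reCAPTCHA token before releasing a download link."""
    resp = requests.post(
        "https://www.google.com/recaptcha/api/siteverify",
        data={"secret": RECAPTCHA_SECRET, "response": token, "remoteip": client_ip},
        timeout=5,
    )
    return resp.json().get("success", False)
```

Your download handler would only return (or redirect to) the file when this check succeeds.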

In addition, you could consider providing access to your links through API Gateway. The gateway has throttling limits which you can set (e.g. 10 invocations per minute). This way you can ensure that you will not go over some pre-defined limit.
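A minimal sketch of setting such a throttle on a stage with boto3; the API id and stage name below are placeholders for your own:

```python
import boto3

apigw = boto3.client("apigateway")

# Hypothetical REST API id and stage name serving the download endpoint.
apigw.update_stage(
    restApiId="abc123",
    stageName="prod",
    patchOperations=[
        # Throttle every method on the stage to roughly 10 requests per minute.
        {"op": "replace", "path": "/*/*/throttling/rateLimit", "value": "0.17"},
        {"op": "replace", "path": "/*/*/throttling/burstLimit", "value": "5"},
    ],
)
```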

On top of this, you could use S3 pre-signed URLs. They have an expiration time, which you can keep short. This also prevents users from sharing links, since the links expire after the set time. In this scenario, users would obtain the S3 pre-signed URLs through a Lambda function, which would be invoked from API Gateway.
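A minimal sketch of such a Lambda handler behind an API Gateway proxy integration; the bucket name and the `key` query parameter are assumptions for illustration:

```python
import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # The object key is taken from the request's query string (hypothetical layout).
    key = event["queryStringParameters"]["key"]
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "my-data-bucket", "Key": key},
        ExpiresIn=60,  # the link stops working after 60 seconds
    )
    # Redirect the caller straight to the short-lived S3 URL.
    return {"statusCode": 302, "headers": {"Location": url}, "body": ""}
```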

John Rotenstein

You basically need to decide whether your files are accessible to everyone in the world (like a normal website), or whether they should only be accessible to logged-in users.

As an example, let's say that you were running a photo-sharing website. Users want their photos to be private, but they want to be able to access their own photos and share selected photos with other specific users. In this case, all content should be kept private by default. The flow would then be:

  • Users login to the application
  • When a user wants a link to one of their files, or if the application wants to use an <img> tag within an HTML page (e.g. to show photo thumbnails), the application can generate an Amazon S3 pre-signed URL, which is a time-limited URL that grants temporary access to a private object
  • The user can follow that link, or the browser can use the link within the HTML page
  • When Amazon S3 receives a request via the pre-signed URL, it verifies that the URL was correctly signed and that the expiry time has not passed. If so, it provides access to the file.
  • When a user shares a photo with another user, your application can track this in a database. If a user requests to see a photo for which they have been granted access, the application can generate a pre-signed URL.

It basically means that your application is in control of which users can access which objects stored in Amazon S3.
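A sketch of that flow on the application side, assuming a hypothetical `db.is_allowed` lookup and bucket name standing in for however your application records ownership and sharing:

```python
import boto3

s3 = boto3.client("s3")

def photo_url(viewer_id, photo_key, db):
    """Return a short-lived link only if the viewer owns the photo or it was shared with them."""
    # db.is_allowed is a placeholder for your application's own permission check.
    if not db.is_allowed(viewer_id, photo_key):
        return None  # not the owner and not shared: no link is issued
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "photo-sharing-bucket", "Key": photo_key},
        ExpiresIn=300,  # the URL stops working after 5 minutes
    )
```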

Alternatively, if you choose to make all content in Amazon S3 publicly accessible, there is no capability to limit the downloads of the files.