I'm creating a platform where users upload and download data. The amount of data uploaded isn't trivial; it could be on the order of gigabytes.
Users should be able to download a subset of this data via hyperlinks.
If I'm not mistaken, my AWS account will be charged for egress when these files are downloaded. If that's true, I'm concerned about two related scenarios:
- Users who abuse this and constantly click the download hyperlinks (more often than is reasonable)
- More concerning, bots that would hit the download links every few seconds.
I had planned to make the downloads accessible to anyone who visits the website as a public resource. Naturally, if users logged in to the platform, I could easily restrict the amount of data downloaded over a period of time.
For a public website, how could I stop users from downloading too much? Could I use IP addresses, maybe?
Any insight appreciated.
IP addresses can be easily changed, so they're a poor control, but probably better than nothing.
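If you do go the IP route, here's a minimal sketch of a per-IP rate limiter you could run in your own backend before serving a link. All the names and limits here (MAX_DOWNLOADS_PER_HOUR, the window size) are illustrative assumptions, not anything AWS-specific:

```python
import time
from collections import defaultdict

MAX_DOWNLOADS_PER_HOUR = 20     # illustrative limit
WINDOW_SECONDS = 3600           # sliding one-hour window

_requests = defaultdict(list)   # ip -> list of request timestamps

def allow_download(ip: str) -> bool:
    """Return True if this IP is still under its hourly download quota."""
    now = time.time()
    # Drop timestamps that have fallen out of the sliding window.
    _requests[ip] = [t for t in _requests[ip] if now - t < WINDOW_SECONDS]
    if len(_requests[ip]) >= MAX_DOWNLOADS_PER_HOUR:
        return False
    _requests[ip].append(now)
    return True
```

Keep in mind this only slows down casual abuse; anyone behind a rotating proxy or NAT will defeat (or be unfairly caught by) it.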
For robots, use a CAPTCHA. It's an effective way of preventing automated scraping of your links.
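For example, if you went with Google reCAPTCHA, your backend would verify the token against the siteverify endpoint before handing out a link. The secret key and surrounding structure below are placeholders; only the endpoint and its `secret`/`response`/`remoteip`/`success` fields come from the documented API:

```python
import json
import urllib.parse
import urllib.request

RECAPTCHA_SECRET = "your-secret-key"   # placeholder

def captcha_passed(token: str, remote_ip: str) -> bool:
    """Ask Google's siteverify endpoint whether the CAPTCHA token is valid."""
    data = urllib.parse.urlencode({
        "secret": RECAPTCHA_SECRET,
        "response": token,
        "remoteip": remote_ip,
    }).encode()
    req = urllib.request.Request(
        "https://www.google.com/recaptcha/api/siteverify", data=data)
    with urllib.request.urlopen(req) as resp:
        result = json.load(resp)
    return result.get("success", False)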
In addition, you could consider serving your links through API Gateway. The gateway has throttling limits that you can set (e.g. 10 invocations per minute), so you can ensure you never exceed a pre-defined rate.
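As a rough sketch, a boto3 usage plan can attach both a per-second throttle and a daily quota to a stage. The API id, stage name, and numbers below are placeholders; note that API Gateway expresses the throttle rate in requests per second and the quota per day/week/month:

```python
import boto3

apigw = boto3.client("apigateway")

usage_plan = apigw.create_usage_plan(
    name="download-throttle",
    apiStages=[{"apiId": "a1b2c3d4e5", "stage": "prod"}],  # placeholder API/stage
    throttle={
        "rateLimit": 1.0,    # steady-state requests per second
        "burstLimit": 5,     # short bursts allowed above the steady rate
    },
    quota={
        "limit": 1000,       # hard cap on requests per period
        "period": "DAY",
    },
)
```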
On top of this, you could use S3 pre-signed URLs. They have an expiration time, so you could make each link valid for only a short window. This also prevents users from sharing links, as they would expire after a set time. In this scenario, users would obtain the pre-signed URLs from a Lambda function invoked through API Gateway.
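A hedged sketch of that Lambda function is below. The bucket name, the one-minute TTL, and the assumption that the object key arrives as a `key` query-string parameter are all placeholders; the `generate_presigned_url` call itself is standard boto3:

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "my-data-bucket"     # placeholder bucket name
URL_TTL_SECONDS = 60          # link expires after one minute

def lambda_handler(event, context):
    # Assume the object key arrives as a query-string parameter, e.g. ?key=datasets/foo.csv
    key = (event.get("queryStringParameters") or {}).get("key")
    if not key:
        return {"statusCode": 400, "body": json.dumps({"error": "missing key"})}

    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": BUCKET, "Key": key},
        ExpiresIn=URL_TTL_SECONDS,
    )
    return {"statusCode": 200, "body": json.dumps({"url": url})}
```

With the bucket kept private, the pre-signed URL becomes the only way to download, so every download has to pass through your throttled API Gateway + Lambda path first.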