How can I scrape an image that doesn't have an extension?

285 views Asked by At

Sometimes I come across an image that I can't scrape so that it can be saved. An example of this is:

https://s3.amazonaws.com/plumdistrict.com-production/perks/12321/image/original.?1325898487

When I hit the url from Internet Explorer I see the image but when I try to get it from the code below I get the following error message "System.Net.WebException The remote server returned an error: (403) Forbidden" error with GetResponse:

string url = "https://s3.amazonaws.com/plumdistrict.com-production/perks/12321/image/original.?1325898487";
WebRequest request = WebRequest.Create(url);
WebResponse response = request.GetResponse();

Any ideas on how to get this image?

Edit:

I am able to get to save images that do have extensions. For example I can scrape the following image just fine:

https://s3.amazonaws.com/plumdistrict.com-production/perks/12659/image/original.jpg?1326828951

2

There are 2 answers

0
DigitalMan On

Well, it looks like it's being generated from a script (possibly being retrieved from a database). The server should be sending a file/content type to go along with that... but it doesn't seem to be, which I believe is a violation of standards.

My Linux box knows full well that that's a JPEG image once it's on my hard drive, because it examines file headers rather than relying on extensions. Perhaps there is a tool to do the same in Windows?

Edit: Actually, on further contemplation, it seems odd that you'd get a 403 for that. Perhaps the server is actually blocking you from retrieving the file in that manner.

3
parasietje On

Although HTTP is originally supposed to be stateless, there are a lot of implementations that rely on it being stateless. I could configure my webserver to only accept requests for "http://mydomain.com/sexy_avatar.jpg" if you provide a cookie proving you were logged in. If not, I send you a redirect 303 to "http://mydomain.com/avatar_for_public_use.jpg".

Amazon could be doing the same. Try to load the web page using Chrome, and look at the Network view in developer mode (CTRL+SHIFT+J) to see all headers supplied to the website. Maybe you even need to do a full navigation in the same session before you are allowed to see the image. This is certainly the case in many web applications I have developed :-)