I am trying to download all the images off of a site, but I'm not sure if this is the best way, as I have tried setting a user agent and referrer to no avail. The 403 error only occurs when trying to download the images from their src URLs; the page that lists all the images in one place doesn't show any errors and gives me the src of each image. Is there a way to download the images without visiting the src page, or a better way to do this entirely? Here is my code so far.
import org.jsoup.Connection.Response;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Iterator;

private static void getPages() throws IOException {
    // Parse the chapter page and grab every <img> tag.
    Document doc = Jsoup.connect("https://manganelo.com/chapter/read_bleach_manga_online_for_free2/chapter_686")
            .get();
    Elements media = doc.getElementsByTag("img");
    System.out.println(media);
    Iterator<Element> ie = media.iterator();
    int i = 1;
    while (ie.hasNext()) {
        // Request each image src directly -- this is where the 403 comes back.
        Response resultImageResponse = Jsoup.connect(ie.next().attr("src")).ignoreContentType(true)
                .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0")
                .referrer("www.google.com").timeout(120000).execute();
        FileOutputStream out = new FileOutputStream(new java.io.File("image #" + i++ + ".jpg"));
        out.write(resultImageResponse.bodyAsBytes());
        out.close();
    }
}
You have a few problems with your suggested approach:
you're trying to use JSoup to download the file content itself... JSoup is meant for parsing text/HTML and won't hand you the image content directly. To download the image content you need to make a plain HTTP request
to download the images you also need to copy the request a browser would make. Open Chrome, open the developer tools and go to the Network tab. Enter the URL of the page you want to scrape images from and you'll see a bunch of requests being made. There'll be an individual request for each image somewhere in the view... if you click on the one labelled 1.jpg you'll see the request made to download the first image. You'll then need to copy all the headers used to make that request (note that both request AND response headers are shown in this view). Once you've replicated the request successfully, you can start testing which headers/cookies are actually required; I found the only real requirement was the referer header.
I've stripped out most of what you might need/want, but something similar to the below is what you're after. I've pulled the comic book images in their entirety at full quality. I introduced a small sleep timer so as not to overload the server, since sometimes you'll get rate limited. Even without it you should be fine, but you don't want to get blocked for a lengthy period of time, so the slower you let the requests come back the better. You could even make the requests in parallel.
I'm almost certain you could cut back even more on the code below to get a cleaner result... but it works, and I'm assuming that's more than enough.
Interesting question.
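Something along the lines of the sketch below is what I mean. To be clear, it's a minimal illustration rather than a drop-in solution: it assumes Java 11+'s java.net.http.HttpClient for the image requests (Jsoup is only used to parse the page), and the class name, 500 ms sleep and file naming are placeholders you'd adjust.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

public class ChapterDownloader {

    private static final String CHAPTER_URL =
            "https://manganelo.com/chapter/read_bleach_manga_online_for_free2/chapter_686";

    public static void main(String[] args) throws IOException, InterruptedException {
        // Jsoup is only used to parse the chapter page and collect the image URLs.
        Document doc = Jsoup.connect(CHAPTER_URL).get();

        HttpClient client = HttpClient.newHttpClient();
        int i = 1;

        for (Element img : doc.getElementsByTag("img")) {
            String src = img.absUrl("src");

            // Replicate the browser's request; the important header is the referer,
            // which has to point back at the chapter page rather than google.com.
            HttpRequest request = HttpRequest.newBuilder(URI.create(src))
                    .header("User-Agent",
                            "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0")
                    .header("Referer", CHAPTER_URL)
                    .GET()
                    .build();

            HttpResponse<byte[]> response =
                    client.send(request, HttpResponse.BodyHandlers.ofByteArray());

            Files.write(Path.of("image_" + i++ + ".jpg"), response.body());

            // Small pause so the server isn't hammered and we don't get rate limited.
            Thread.sleep(500);
        }
    }
}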
I'd also want to ensure I set the right file extension based on the content type, as I believe some images were coming back in .png format rather than .jpeg. I'm also fairly sure the write to file can be cleaned up to be simpler/clearer, rather than reading the whole body in as a byte array.
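As a rough sketch of both of those points (the saveImage helper is hypothetical and reuses the HttpClient and HttpRequest from above), you could pick the extension from the Content-Type header and stream the body straight to disk:

import java.io.IOException;
import java.io.InputStream;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class ImageSaver {

    // Streams one image straight to a file, choosing the extension from the
    // Content-Type header instead of assuming .jpg.
    static Path saveImage(HttpClient client, HttpRequest request, int index)
            throws IOException, InterruptedException {
        HttpResponse<InputStream> response =
                client.send(request, HttpResponse.BodyHandlers.ofInputStream());

        String contentType = response.headers()
                .firstValue("Content-Type").orElse("image/jpeg");
        String ext = contentType.contains("png") ? ".png" : ".jpg";

        Path target = Path.of("image_" + index + ext);
        try (InputStream in = response.body()) {
            // Copy the response body directly to disk, no intermediate byte[].
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }
}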