- I want to allow image crawling on my site from a couple of different bots and exclude all others.
- I want images in at least one folder to be exempt from blocking for all requests.
- I don't want to block image requests from visitors on my own site.
- I don't want to include my domain name in the .htaccess file for portability.
The reason I ask this here, rather than simply testing the following code myself, is that I work on my own and have no colleagues to ask or external resources to test from. I think what I've got is correct, but I find .htaccess rules extremely confusing, and I don't know what I don't know at this point.
RewriteCond %{HTTP_REFERER} !^$ [OR]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?bing\..+$ [NC,OR]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?facebook\..+$ [NC,OR]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?google\..+$ [NC,OR]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?instagram\..+$ [NC,OR]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?linkedin\..+$ [NC,OR]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?reddit\..+$ [NC,OR]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?twitter\..+$ [NC,OR]
RewriteCond %{REQUEST_URI} !^/cross-origin-resources/ [NC,OR]
RewriteCond %{HTTP_HOST}@@%{HTTP_REFERER} !^([^@]*)@@https?://\1/.* [NC]
RewriteRule \.(bmp|gif|jpe?g|png|webp)$ - [F,L,NC]
I've tested it on htaccess tester, and it looks good, but it does complain about the second-to-last line when tested with the following URL: http://www.example.co.uk/poignant/foo.webp
You have the logic in reverse. As written, these conditions (`RewriteCond` directives) will always be successful and the request will always be blocked.

You have a series of negated conditions that are OR'd. These would only fail (ie. not block the request) if all the conditions match, which is impossible. (eg. The `Referer` header cannot be both `bing` and `facebook`.)

You need to remove the `OR` flag on all your `RewriteCond` directives, so they are implicitly AND'd.

Incidentally, the suggestion in comments from @StephenOstermiller to combine the `HTTP_REFERER` checks into one (which is a good one) is the equivalent of having the individual conditions AND'd, not OR'd (as you had posted initially).

Once you've corrected the OR/AND as stated above, this rule will likely allow ALL bots to crawl your site's images, because bots generally do not send a `Referer` header. These directives are not really about "crawling"; they allow certain websites to display your images on their domain (ie. hotlinking). This is probably the intention; however, it's not what you are stating in point #1.

(To block bots from crawling your site you would need to check the `User-Agent` request header, ie. `HTTP_USER_AGENT` - which would probably be better done in a separate rule.)

Minor point, but the `+$` at the end of the regex is superfluous. There's no need to match the entire `Referer` when you are only interested in the hostname. These sites probably have a Referrer-Policy set that prevents the URL-path being sent (by the browser) in the `Referer` header anyway, but the `+$` is still unnecessary.

In comments, you were asking what the last condition (the `%{HTTP_HOST}@@%{HTTP_REFERER}` check) does. This satisfies points #3 and #4 in your list, so it is certainly needed. It ensures that the requested `Host` header (`HTTP_HOST`) matches the hostname in the `Referer`, ie. that the request is coming from the same site. The alternative is to hardcode your domain in the condition, which you are trying to avoid.
(Again, the trailing `.*` on the regex is unnecessary and should be removed.)

This is achieved by using an internal backreference `\1` in the regex against the `HTTP_REFERER` that matches `HTTP_HOST` in the TestString (first argument). The `@@` string is just an arbitrary separator that does not occur in the `HTTP_HOST` or `HTTP_REFERER` server variables.

This is clearer if you expand the TestString to see what is being matched. For example, if you make an internal request to `https://example.com/myimage.jpg` from your homepage (ie. `https://example.com/`) then the TestString in the `RewriteCond` directive is `example.com@@https://example.com/`.

This is then matched against the regex `^([^@]*)@@https?://\1/` (the `!` prefix on the CondPattern is an operator and is part of the argument, not the regex).

- `([^@]*)` - the first capturing group captures `example.com` (the value of `HTTP_HOST`).
- `@@https?://` - simply matches `@@https://` in the TestString (part of the `HTTP_REFERER`).
- `\1` - this is an internal backreference, so it must match the value captured by the first capturing group. In this example, it must match `example.com`. And it does, so there is a successful match.
- The `!` prefix on the CondPattern (not strictly part of the regex) negates the whole expression, so the condition is successful when the regex does not match.

So, in the above example, the regex matches and so the condition fails (because it's negated), the rule is not triggered and the request is not blocked.
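The matching above can be sketched in Python (illustrative host/referrer values only; Apache's mod_rewrite uses PCRE, but the backreference behaves the same for this pattern):

```python
import re

# Sketch: the CondPattern ^([^@]*)@@https?://\1/ reproduced in Python,
# to show how the \1 backreference ties the Referer host to HTTP_HOST.
pattern = re.compile(r"^([^@]*)@@https?://\1/")

# Internal request: Host and Referer hostname agree, so the regex
# matches - and the negated condition therefore does NOT block it.
internal = "example.com@@https://example.com/"
assert pattern.match(internal) is not None

# External request: hostnames differ, so \1 fails to match - the
# negated condition succeeds and the request would be blocked.
external = "example.com@@https://external-site.example/"
assert pattern.match(external) is None
```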
However, if a request is made to `https://example.com/myimage.jpg` from an external site, eg. `https://external-site.example/`, then the TestString in the `RewriteCond` directive is `example.com@@https://external-site.example/`.

Following the steps above, the regex fails to match (because `external-site.example` does not match `example.com`). The negated condition is therefore successful and the rule is triggered, so the request is blocked. (Unless one of the other conditions failed.)

Note that with the condition as written, `www.example.com` is different to `example.com`. For example, if you were on `example.com` and you used an absolute URL to your image with `www.example.com`, then the regex will fail to match and the request will be blocked. This could perhaps be incorporated into the regex, but it is very much an edge case and can be avoided with a canonical 301 redirect earlier in the config.

The first condition (`RewriteCond %{HTTP_REFERER} !^$`) allows an empty (or not present) `Referer` header. You "probably" do need this. It allows bots to crawl your images, it permits direct requests to images, and it also allows users who have chosen to suppress the `Referer` header to view your images on your site.

HOWEVER, it's also possible these days for a site to set a Referrer-Policy that completely suppresses the `Referer` header being sent (by the browser) and so bypasses your hotlink protection.

Minor point, but the `L` flag is not required when the `F` flag is used (it is implied).

Are you really serving `.bmp` images?!

Aside: Sites don't necessarily "hotlink"
Some of these external sites (Bing, Facebook, Google, Instagram, LinkedIn, Reddit, Twitter, etc.) don't necessarily "hotlink" images anyway. They often make their own (resized/compressed) "copy" of the image instead (a bot makes the initial request to retrieve the image - with no `Referer` - so the request is not blocked).

So, explicitly permitting some of these sites in your "hotlink-protection" script might not be necessary anyway.
Summary
Taking the above points into consideration, the directives should look more like this:
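A sketch of the corrected rules, applying the points above: the `OR` flags removed (so the conditions are AND'd), the superfluous `+$` and trailing `.*` dropped, and the redundant `L` flag removed. The hostname list is unchanged from the question; do test this before deploying.

```apache
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?bing\. [NC]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?facebook\. [NC]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?google\. [NC]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?instagram\. [NC]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?linkedin\. [NC]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?reddit\. [NC]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?twitter\. [NC]
RewriteCond %{REQUEST_URI} !^/cross-origin-resources/ [NC]
RewriteCond %{HTTP_HOST}@@%{HTTP_REFERER} !^([^@]*)@@https?://\1/ [NC]
RewriteRule \.(bmp|gif|jpe?g|png|webp)$ - [F,NC]
```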