Validating HTTP requests using Apache's THE_REQUEST variable

I’m currently experimenting with a few rule conditions to explicitly whitelist the resources I want clients to be able to retrieve on my server. The initial target for this exercise was my onion site which has an issue with misbehaving (poorly written) Tor bots, but I thought it would be fun to extend the experiment to paranoidpenguin.net.

Apache’s THE_REQUEST variable gives us the full HTTP request line sent by the browser to the server, which includes the request method, the request URI and the protocol version. Example: GET /category/gnu-linux/ HTTP/2.0

With my WordPress installation, I want to match and allow the following conditions:

  1. HTTP request methods: GET and HEAD.
  2. Request URI: Letters, numbers, forward slash, dash, underscore and the number sign. An optional condition will cover accessing static resources like multimedia, fonts and so on.
  3. Protocol versions: HTTP/1.1 and HTTP/2.0.

HTTP request method

The first capture group will simply match the GET or HEAD method at the beginning of the line:

^(GET|HEAD)
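A quick way to poke at this fragment outside of Apache is a few lines of Python. This is only a sketch: the helper name method_allowed is made up for illustration, and Python's re module stands in for the PCRE engine Apache uses. Note that the fragment on its own is just a prefix check, so a bogus method like GETX still passes here; it is the space that follows the method in the complete expression that rules it out.

```python
import re

# The method fragment on its own, exactly as written above.
METHOD = re.compile(r"^(GET|HEAD)")


def method_allowed(request_line: str) -> bool:
    """Return True when the request line starts with GET or HEAD (prefix check only)."""
    return METHOD.search(request_line) is not None


for line in ("GET / HTTP/1.1", "HEAD / HTTP/1.1", "POST /xmlrpc HTTP/1.1", "GETX / HTTP/1.1"):
    print(f"{line!r}: {method_allowed(line)}")
```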

Request URI

The next capture group requires the resource to begin with a slash and only contain letters, numbers, slashes, dashes, underscores and the number sign.

The following optional capture group(s) will match static content (files) and allow access to the .well-known folder used by Let’s Encrypt when provisioning SSL certificates.

(/|/[a-zA-Z0-9/\-_#]+)(\.jpe?g|\.gif|\.png|\.woff2?|\.ttf|\.xml(\.gz)?|\.xsl|\.ico|\.css|\.txt|\.html(\.gz)?|\.well-known[a-zA-Z0-9/\-_]+(\.txt)?)?
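To see what the URI part accepts, the fragment can be tried against a handful of paths in Python. Treat this as a rough sketch: uri_allowed and the sample paths are hypothetical, and fullmatch is used here only because the fragment has no anchors of its own. One thing the sketch makes visible is that a dot is only accepted as part of a listed extension (or .well-known), so a name like style.min.css is rejected.

```python
import re

# URI fragment copied from the expression above, split over lines for readability.
URI = re.compile(
    r"(/|/[a-zA-Z0-9/\-_#]+)"
    r"(\.jpe?g|\.gif|\.png|\.woff2?|\.ttf|\.xml(\.gz)?|\.xsl|\.ico|\.css|\.txt"
    r"|\.html(\.gz)?|\.well-known[a-zA-Z0-9/\-_]+(\.txt)?)?"
)


def uri_allowed(uri: str) -> bool:
    """Return True when the whole URI is covered by the fragment."""
    return URI.fullmatch(uri) is not None


# Hypothetical sample paths, not taken from real logs.
for path in (
    "/",
    "/category/gnu-linux/",
    "/wp-content/uploads/2019/01/tux.png",
    "/sitemap.xml.gz",
    "/.well-known/acme-challenge/abc_123",
    "/wp-login.php",
    "/style.min.css",
):
    print(f"{path}: {uri_allowed(path)}")
```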

Protocol versions

The last condition will simply match protocol versions HTTP/1.1 or HTTP/2.0:

HTTP/(1\.1|2\.0)

Chaining it all together

Below you’ll find the finalized regular expression as well as an image showing the result when applied against a list of client requests:

^(GET|HEAD)\ (/|/[a-zA-Z0-9/\-_#]+)(\.jpe?g|\.gif|\.png|\.woff2?|\.ttf|\.xml(\.gz)?|\.xsl|\.ico|\.css|\.txt|\.html(\.gz)?|\.well-known[a-zA-Z0-9/\-_]+(\.txt)?)?\ HTTP/(1\.1|2\.0)$

Validating client requests using Apache’s THE_REQUEST variable. Matches are highlighted.
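The same kind of check can be reproduced with a few lines of Python. This is a minimal sketch, assuming Python's re module behaves like Apache's PCRE for this particular pattern; the helper is_allowed and the sample request lines are invented for illustration and are not taken from my logs.

```python
import re

# The finalized expression, split over several lines for readability.
THE_REQUEST = re.compile(
    r"^(GET|HEAD)\ (/|/[a-zA-Z0-9/\-_#]+)"
    r"(\.jpe?g|\.gif|\.png|\.woff2?|\.ttf|\.xml(\.gz)?|\.xsl|\.ico|\.css|\.txt"
    r"|\.html(\.gz)?|\.well-known[a-zA-Z0-9/\-_]+(\.txt)?)?"
    r"\ HTTP/(1\.1|2\.0)$"
)


def is_allowed(request_line: str) -> bool:
    """Return True when the request line matches the whitelist expression."""
    return THE_REQUEST.search(request_line) is not None


# Hypothetical client requests: the first five should pass, the rest should not.
SAMPLES = [
    "GET / HTTP/1.1",
    "GET /category/gnu-linux/ HTTP/2.0",
    "HEAD /feed/ HTTP/1.1",
    "GET /wp-content/uploads/2019/01/tux.png HTTP/2.0",
    "GET /.well-known/acme-challenge/abc_123 HTTP/1.1",
    "POST /xmlrpc HTTP/1.1",           # method not whitelisted
    "GET / HTTP/1.0",                  # protocol version not whitelisted
    "GET /../../etc/passwd HTTP/1.1",  # dots outside a listed extension
    "GETX / HTTP/1.1",                 # no space directly after GET
]

for line in SAMPLES:
    print("ALLOW" if is_allowed(line) else "BLOCK", line)
```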

Mod_rewrite

The following mod_rewrite rule will block any request not matching the regex while implementing exceptions for 127.0.0.1 (localhost) and my public IP. Obviously I don’t whitelist 127.0.0.1 for my onion site ;-)

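<IfModule mod_rewrite.c>
  # Block the request when THE_REQUEST does not match the whitelist ...
  RewriteCond %{THE_REQUEST} !^(GET|HEAD)\ (/|/[a-zA-Z0-9/\-_#]+)(\.jpe?g|\.gif|\.png|\.woff2?|\.ttf|\.xml(\.gz)?|\.xsl|\.ico|\.css|\.txt|\.html(\.gz)?|\.well-known[a-zA-Z0-9/\-_]+(\.txt)?)?\ HTTP/(1\.1|2\.0)$
  # ... and the client is neither localhost nor my public IP (1.2.3.4 is a placeholder).
  RewriteCond %{REMOTE_ADDR} !^127\.0\.0\.1$
  RewriteCond %{REMOTE_ADDR} !^1\.2\.3\.4$
  # Respond with 403 Forbidden and stop rewriting.
  RewriteRule ^ - [F]
</IfModule>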
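Consecutive RewriteCond lines are ANDed together, so a request is only blocked when it fails the regex and the client is not one of the excepted addresses. The decision can be sketched in Python like this; it is an illustration of the logic, not something Apache runs, and is_blocked, ALLOWED_CLIENTS and the 203.0.113.7 test address are names and values I made up (1.2.3.4 is the same placeholder used in the rule).

```python
import re

THE_REQUEST = re.compile(
    r"^(GET|HEAD)\ (/|/[a-zA-Z0-9/\-_#]+)"
    r"(\.jpe?g|\.gif|\.png|\.woff2?|\.ttf|\.xml(\.gz)?|\.xsl|\.ico|\.css|\.txt"
    r"|\.html(\.gz)?|\.well-known[a-zA-Z0-9/\-_]+(\.txt)?)?"
    r"\ HTTP/(1\.1|2\.0)$"
)

# Clients that bypass the check: localhost and the placeholder public IP.
ALLOWED_CLIENTS = {"127.0.0.1", "1.2.3.4"}


def is_blocked(request_line: str, client_ip: str) -> bool:
    """Mirror the rule: block only if the request is invalid AND the client is not excepted."""
    invalid_request = THE_REQUEST.search(request_line) is None
    return invalid_request and client_ip not in ALLOWED_CLIENTS


print(is_blocked("POST /xmlrpc.php HTTP/1.1", "203.0.113.7"))  # invalid request, unknown client
print(is_blocked("POST /xmlrpc.php HTTP/1.1", "127.0.0.1"))    # invalid request, excepted client
print(is_blocked("GET /feed/ HTTP/1.1", "203.0.113.7"))        # valid request, unknown client
```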

This is by no means meant as a substitute for a web application firewall (WAF), but it does save some cycles otherwise spent looking for invalid resources on my VPS.

Disclaimer: The regular expression disallows query strings and direct referencing of PHP files. This would likely be an issue for regular WordPress installations depending on configuration, theme and installed plugins. The regular expression is used within a virtual host configuration as opposed to using .htaccess.
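To get a feel for what that means in practice, here is a small Python sketch that runs a few typical WordPress-style requests through the expression. The request lines are hypothetical examples and would_pass is a made-up helper name; whether any of these matter depends on your own setup.

```python
import re

THE_REQUEST = re.compile(
    r"^(GET|HEAD)\ (/|/[a-zA-Z0-9/\-_#]+)"
    r"(\.jpe?g|\.gif|\.png|\.woff2?|\.ttf|\.xml(\.gz)?|\.xsl|\.ico|\.css|\.txt"
    r"|\.html(\.gz)?|\.well-known[a-zA-Z0-9/\-_]+(\.txt)?)?"
    r"\ HTTP/(1\.1|2\.0)$"
)


def would_pass(request_line: str) -> bool:
    """Return True when the request line would get past the whitelist."""
    return THE_REQUEST.search(request_line) is not None


# Hypothetical requests a stock WordPress site might see.
for line in (
    "GET /?s=tor HTTP/1.1",                          # search via query string
    "GET /wp-login.php HTTP/1.1",                    # direct PHP file
    "GET /wp-admin/admin-ajax.php?action=x HTTP/1.1",  # PHP file plus query string
    "GET /category/gnu-linux/ HTTP/1.1",             # plain permalink, for contrast
):
    print("passes " if would_pass(line) else "blocked", line)
```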

Btw, if you’re tempted to run “invalid requests” against blog.paranoidpenguin.net to test my implementation, then please be advised that producing error codes will trigger fail2ban and your IP will (temporarily) become the latest addition to my firewall.