A variety of user-agents that begin with “Java” are likely visiting your website. These visits come from programs written in Java whose developers did not bother to change the default user-agent string. Here is a list of the Java user-agents I have encountered:
Java/1.4.1_04
Java/1.5.0_02
Java/1.5.0_06
Java/1.5.0_14
Java/1.6.0_02
Java/1.6.0_03
Java/1.6.0_04
Java/1.6.0_07
Java/1.6.0_11
Java/1.6.0_12
Java/1.6.0-oem
I will maintain this list simply for kicks. There is no need to collect an exhaustive list of these user-agent strings in order to block them. As I have mentioned before, I prefer to ban non-human visitors based on a combination of an IP address and a user-agent string.
URL rewrite rules
Here are URL rewriting conditions and rules that match any user-agent beginning with “Java” and deliver a 403 Forbidden response for any HTTP request to your server:
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^Java
RewriteRule .* - [F]
The condition matches any user-agent string that begins with “Java”, no matter what comes after it. The rewrite rule answers any requested location with a 403 Forbidden response code; the URL is not changed and no document is delivered.
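If you want to follow the IP address plus user-agent approach I mentioned above, you can stack conditions, which mod_rewrite combines with an implicit AND. The address below is only a placeholder from the documentation range; substitute the addresses you have actually caught in your logs:
# Refuse the request only when both the address and the user-agent match
RewriteCond %{REMOTE_ADDR} ^192\.0\.2\.15$
RewriteCond %{HTTP_USER_AGENT} ^Java
RewriteRule .* - [F]
Consecutive RewriteCond directives must all be true before the rule fires, so a legitimate Java client coming from a different address is left alone.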
IIS7 URL Rewrite web.config
<rule name="no-java-bots" stopProcessing="true">
<match url="(.*)" />
<conditions>
<add input="{HTTP_USER_AGENT}" pattern="^Java/.*" />
</conditions>
<action type="AbortRequest" />
</rule>
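The AbortRequest action drops the connection without sending a response. If you would rather return the same 403 Forbidden that the Apache rule sends, a CustomResponse action works instead. Here is a sketch showing where the rule sits in web.config; the rule name and status description are just examples:
<configuration>
  <system.webServer>
    <rewrite>
      <rules>
        <rule name="no-java-bots" stopProcessing="true">
          <match url="(.*)" />
          <conditions>
            <add input="{HTTP_USER_AGENT}" pattern="^Java/.*" />
          </conditions>
          <!-- CustomResponse returns 403 Forbidden instead of aborting the connection -->
          <action type="CustomResponse" statusCode="403" statusReason="Forbidden" statusDescription="Java default user-agents are not served" />
        </rule>
      </rules>
    </rewrite>
  </system.webServer>
</configuration>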
Why block Java bots?
Bots with a well-defined purpose will typically identify themselves with a unique name. These Java user-agents are either not interested in stating their purpose or not ready to publish a name and take ownership of their crawling activity. Either way, they waste bandwidth. Test your new application on someone else’s website. Play with your shady crawler somewhere else. Come back when you are willing to identify yourself.