A variety of user-agents that begin with “Java” are likely visiting your website. Visits with this type of user-agent come from programs written in Java whose developers did not change the default user-agent string value. Here is a list of the Java user-agents I have encountered:
Java/1.4.1_04
Java/1.5.0_02
Java/1.5.0_06
Java/1.5.0_14
Java/1.6.0_02
Java/1.6.0_03
Java/1.6.0_04
Java/1.6.0_07
Java/1.6.0_11
Java/1.6.0_12
Java/1.6.0-oem
I will maintain this list simply for kicks. There is no need to collect an exhaustive list of these user-agent strings in order to block them. As I have mentioned before, I prefer to ban non-human visitors based on a combination of an IP address and a user-agent string.
URL rewrite rules
Here are URL rewriting conditions and rules that will match any user-agent string beginning with “Java” and deliver a 403 Forbidden response for any HTTP request to your server:
RewriteCond %{HTTP_USER_AGENT} ^Java
RewriteRule ^/(.*)$ /$1 [F]
The condition matches any user-agent string that begins with “Java”, no matter what comes after it. The rewrite rule answers whatever location was requested with a 403 Forbidden response code. The URL is not changed and no document is delivered.
IIS7 URL Rewrite web.config
<rule name="no-java-bots" stopProcessing="true">
  <match url="(.*)" />
  <conditions>
    <add input="{HTTP_USER_AGENT}" pattern="^Java/.*" />
  </conditions>
  <action type="AbortRequest" />
</rule>
Why block Java bots?
Bots with a well-defined purpose will typically identify themselves with a unique name. These Java user-agents are either not interested in identifying their purpose or not ready to publish their name and take ownership of the crawling activities. Both cases are a waste of bandwidth. Test your new application on someone else’s website. Play with your shady crawler on someone else’s website. Come back when you are willing to identify yourself.
Comments
Thanks for the Java-bot explanations. Recently I too have found these and other robots on my site.
Many robots steal the content of a site’s pages, so the decision to block them is correct.
But I do it a little differently, because
rewrite rules are not convenient for blocking ranges of IP addresses, so I do it with a PHP script, like this:
$block = array(
    "84.120.0.0-84.123.255.255",
    "122.198.0.0-122.198.255.255",
    "205.209.128.0-205.209.191.255"
);

function checkIP($ip) {
    global $block;
    // Convert the visitor's address to an integer for range comparisons.
    $IP = ip2long($ip);
    for ($i = 0; $i < count($block); $i++) {
        list($begin, $end) = explode("-", $block[$i]);
        $b_IP = ip2long($begin);
        $e_IP = ip2long($end);
        if ($IP >= $b_IP && $IP <= $e_IP) return true;
    }
    return false;
}
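A usage sketch for this function, assuming it runs before any output is sent; answering with a 403 is just one way to refuse the request:
<?php
// Hypothetical front-controller check against the blocked ranges above.
if (checkIP($_SERVER['REMOTE_ADDR'])) {
    header('HTTP/1.1 403 Forbidden');
    exit;
}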
“Manually” blocking IPs and user-agents is not the best practice, so I use robot detection by pseudo-picture loading and JavaScript evaluation. But the Java bots loaded all the pseudo-pictures and evaluated the JavaScript!
One way to detect Java bots is by the User-Agent field, but it is not so difficult to change this field.
What to do in this case?
Yuri, rewrite rules can be implemented to block IP address ranges:
RewriteCond %{REMOTE_ADDR} 213\.93\.196\.\d\d?\d?
RewriteCond %{HTTP_USER_AGENT} ^Java
RewriteRule ^/(.*)$ /$1 [F]
\d represents a single digit in regular expressions, and a question mark (?) makes the preceding character optional.
[…] access my web site? I’ve also decided to block access to my web site by Java user agents. See How To Block Java User-Agents for someone else’s similar approach to the Java […]
Why single out Java bots and not Silverlight, Flash, and unknown browsers as well?
You will realize how stupid your paranoia is when people change the user agent to “MSIE 9, Win7 x64”, and are able to continue crawling your site.
If you place a web site on the open internet, it’s there to be accessed by any user agent, not just your preference of browsers.
I say F’ you to people like you and your ilk, who don’t have a clue about what the open internet is all about.
FC
Fernando:
Because Java bots clog up my error logs and Java bots are used in SQL injection attacks. When other user agents abuse my websites, I will block them, too.
This isn’t stupid or paranoid. It’s been successful for years; look at the date on this post.
Sure, idiots can change their user-agents, and I can use other criteria to block their malicious intent. There’s no way to escape a server log, and analysis of logs is what helps me create solutions like this.
You are wrong about not being able to choose who accesses my website. I can block whatever I want using mechanisms that are built into any modern web server.
You would be surprised to learn that lots of servers use whitelists to filter traffic, a step further than the blacklists I maintain.
Why should I tolerate attacks when the tools to block the least sophisticated are so easy to use?
You both have valid points and a bit of name calling. I’m curious as to who has the stronger case.
There is always some annoying kid that has to start name calling and acting like he is all knowing.
Love the post, it is accurate and useful.
I’m testing with this option:
RewriteCond %{HTTP_USER_AGENT} Java.*
RewriteRule ^/(.*)$ /$1 [F]
This bot only uses 3 MB, but that is increasing.
@Corey Thanks for posting this.
I was getting some unusual traffic on my site from user agent “Java/1.6.0_31” and your blog was the first thing on Google search results.
Hi, I am a Java bot and I found this site in searching for why I got blocked. What can I do to appeal this unfair decision and be allowed once again? Thanks
Java bots are being used to attack my servers. They carry out many different kinds of attacks. Like the author said above, “Come back when you are willing to identify yourself”. I use fail2ban to block unwanted bots. If you don’t want to get blocked, then just change your User-Agent string to something else. It’s easy to do. If you decide to perform malicious actions on my website, I’ll single out your attack in another way. I have several different ways that bad bots are blocked from my servers. These bots have brought my server to a crawl in the past, but not anymore. :)
Re comment: “…not just your preference”
We wouldn’t have websites for long if we let the herds trample at will.
My comment: I was looking to see if anyone had a method (Apache) for UAs using fake combinations of OS and browsers.
I only know how to deal with one or two of a similar type at a time, like ^Java or “MSIE 4,5,6”. I’ve been seeing a lot of (Windows; Mac; Chrome; Firefox; Safari;) nonsense and I’m not sure how to tackle it.
Thanks for the hint.
Many bots only cause traffic that I have to pay for and give no value back. Some are crawling more than necessary.
So I block them too.
There are some other bad bots around.
Have you considered setting a trap for the bad bots?
These types of bots do not tend to obey robots.txt or nofollow directives. You can catch them and log them in the following way:
1) Write a script to log the IP and User-Agent of visitors to the page. Very simple to do in PHP (see the sketch after this list).
2) Add a link to the logging script URL on the homepage of your site. Make it hidden with CSS, or use a space as the anchor text so human visitors can’t click on it.
3) Add rel="nofollow" to the link and disallow it in robots.txt.
4) The bad bots will follow the link, and their IP and UA will be logged. You can do a whois on the IP and decide whether you want to ban them or not.
Simple!
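A minimal sketch of the kind of logging script described in step 1, assuming a trap URL such as /trap.php and a writable log file; the path and format are placeholders:
<?php
// trap.php - hypothetical logging endpoint reached only via the hidden link.
// Anything that requests it has ignored robots.txt and the nofollow hint.
$line = sprintf("%s %s %s\n",
    date('c'),
    $_SERVER['REMOTE_ADDR'],
    isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '-');
// Append outside the web root if possible.
file_put_contents('/var/log/bad-bots.log', $line, FILE_APPEND | LOCK_EX);
// Send back an empty response so the bot gets nothing useful.
header('HTTP/1.1 204 No Content');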
Sam has a very valid point.
I do something similar; I also have a trap (a honeypot) for these bots.
I give Java bots lower rights on my home page, which means
they will be able to read (GET request) all content, except for form field names (which will be false names in that case) and email addresses (which are also false for them).
But posting: I give them the URL to my honeypot in the action field of the form, so if they do POST, they will end up in my honeypot…
Otherwise they can crawl my home page all they want and harvest all the content they want…
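A rough sketch of how that form-action swap could look, assuming detection by User-Agent prefix only; the handler URLs and field names here are invented for illustration:
<?php
// Hypothetical check: does this visitor identify itself as a Java bot?
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$isJavaBot = (stripos($ua, 'Java/') === 0);
// Suspected bots get the honeypot action and a decoy field name,
// so anything they harvest or POST is worthless.
$action = $isJavaBot ? '/honeypot.php' : '/contact.php';
$emailField = $isJavaBot ? 'email_decoy' : 'email';
?>
<form method="post" action="<?php echo $action; ?>">
  <input type="text" name="<?php echo $emailField; ?>" />
  <input type="submit" value="Send" />
</form>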
I actually don’t block bots anymore at all. Bandwidth is a lot cheaper 5 years later, and APIs have become an essential part of my programming life. Some of these bots are companies hitting our API with an improper user-agent.
They can be legit bots, but in that case they should identify themselves.
If it should be some amateurish search engine bot, for example, there is nothing wrong in obfuscating form field names and email addresses for them, since they will have absolutely nothing to use them for… Google, for example, would never index email addresses, and it definitely would never make a POST.
Java user agents have for many years been harvester bots collecting information about a web site, which they then share with thousands of spam bots. It’s why you can sometimes see lots of POSTs to your forms from a lot of IPs, but with no GET before them – the harvester has already read the page for them.
If you are letting every bot in on your page, then what technique do you use to make sure, you are not getting auto-submitted spam?
All it takes is one legit partner that is too big and inflexible of a company to change their user agent, and your Java bot rules have to have an exception list by IP. It’s not worth it to me anymore.
“Google, for example … would never make a POST”
Google crawls AJAX these days, so that’s a dangerous assumption.
The safest, and probably 100% effective, way to block Java bots is to use mod_security:
SecAuditEngine On
SecRule REQUEST_HEADERS:User-Agent “@rx java” “phase:1,deny”
SecRule REQUEST_HEADERS:User-Agent “@rx Java” “phase:1,deny”
All the above options can be bypassed by morphic DNS!
Blocking silly user agents can only be cosmetic – they still appear in the logs, albeit with a 403 response, so this merely reduces the clutter rather than eliminating it.
For anyone else who is concerned about subsequent spam, how about booby-trapping your evaluation scripts? Both PHP and CGI/Perl provide the necessary means – and with a properly set up blacklist scheme you can keep those little suckers at bay for a certain period of time.
I for my part am offering some web forms (I don’t even bother with obfuscating field names in the forms), and so far no spammer has come through despite repeated attempts. Instead they regularly run straight into the traps I have laid out for them, which also spares me from incorporating captchas into the forms – leaving those out is necessary in order to keep my forms barrier-free.
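A minimal sketch of the kind of time-limited blacklist described above, assuming a flat file of “IP timestamp” lines and a made-up 24-hour ban; a real setup would need pruning and more careful locking:
<?php
// Hypothetical guard at the top of a form handler.
define('BLACKLIST_FILE', '/var/tmp/form-blacklist.txt'); // placeholder path
define('BAN_SECONDS', 86400); // assumed ban period: 24 hours

function is_banned($ip) {
    if (!file_exists(BLACKLIST_FILE)) return false;
    foreach (file(BLACKLIST_FILE, FILE_IGNORE_NEW_LINES) as $entry) {
        list($bannedIp, $when) = explode(' ', $entry, 2);
        if ($bannedIp === $ip && (time() - (int)$when) < BAN_SECONDS) return true;
    }
    return false;
}

function ban($ip) {
    // Called from a booby-trapped script when a bot trips it.
    file_put_contents(BLACKLIST_FILE, $ip . ' ' . time() . "\n", FILE_APPEND | LOCK_EX);
}

if (is_banned($_SERVER['REMOTE_ADDR'])) {
    header('HTTP/1.1 403 Forbidden');
    exit;
}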
I kept getting scanned by Java/1.6.0_22, so I blocked it in the robots.txt file, and that seemed to do the trick. Now I am getting scanned by blank/empty user-agents, Java/1.6.0_04 and curl/7.26.0, so would that work properly in the .htaccess as well, using [OR] with the first few lines?
An alternative point of view… We are integrating with a company’s SOAP service, and the default behavior when initializing the WS client (extending javax.xml.ws.Service) is to make a GET request on the endpoint’s WSDL as the client initializes. The WS client sends a user-agent with a value of Java/…
The company’s policy of blocking Java traffic made the integration more painful than it needed to be. Either we needed to spoof the user-agent with “foobar Java/…” or instead initialize with a static WSDL – and then override the wsdlsoap:address location with the proper URI for the environment we are pointing to (e.g. configure the BindingProvider after the client was initialized).
Blocking Java traffic is a frustrating response to folks who are legitimately using Java to build business services against these sites. I wish there were a better solution than a blanket block.
This post is almost 10 years old, and I have yet to encounter many Java* user agents that were worthwhile. Choose a descriptive name for your crawler, and if you want to identify the software you’re using as Java, leave that bit at the end as you suggest.