Blacklisting via Ionic’s Isapi Rewrite Filter

In IIS, banning IP addresses from accessing a website is fairly easy. I rarely do this, however, because I prefer to use a combination of an IP address and a user agent string to identify bad bots that are likely scraping my content or attempting to harvest email addresses.

I avoid blocking on an IP address alone whenever I can. IP addresses can be forged and changed, so I prefer to rely on an IP address and user agent string combination to identify the culprit that I want to exile. This approach is not foolproof, but I find it to be much more reliable.

Scalability is also an issue. Using an ISAPI filter to process requests for every website on the server from a single file sure makes life easy. The Microsoft IIS configuration console is a mouse-click nightmare on a server with a couple hundred websites.

I use Ionic’s Isapi Rewrite Filter to change the URL structure of websites to be more search engine friendly. This filter uses the PCRE library, and the use of regular expressions is always a huge plus. The rewriting rules are maintained inside one .ini file, so tweaks and updates are a breeze.

Here is a rewrite rule for Ionic’s filter that will let you block access to every site on your server based upon an IP address and user agent string match. In this particular case, I am blocking an email address harvester with IP 24.132.226.94 and user agent Java/1.6.0-oem.


RewriteCond %{REMOTE_ADDR} 24\.132\.226\.94
RewriteCond %{HTTP_USER_AGENT} Java/1\.6\.0-oem
RewriteRule ^/(.*)$ /$1 [F]

The two conditions use server variables to test the visitor’s IP address and user agent string against regular expressions, and both must match for the rule to apply. The final line is the rewrite rule, which matches any file on any website. The [F] flag tells Ionic’s filter to return an HTTP status code of 403 Forbidden.
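The matching logic can be checked outside the server. The following is a minimal Python sketch, not part of the filter itself; it assumes that Python's `re` module treats these simple patterns the same way PCRE does, and the function names `is_forbidden` and `status_for` are hypothetical.

```python
import re

# Patterns copied from the two RewriteCond lines.
BLOCKED_IP = re.compile(r"24\.132\.226\.94")
BLOCKED_AGENT = re.compile(r"Java/1\.6\.0-oem")


def is_forbidden(remote_addr: str, user_agent: str) -> bool:
    """Return True only when BOTH patterns match, like chained conditions."""
    ip_hit = BLOCKED_IP.search(remote_addr) is not None
    agent_hit = BLOCKED_AGENT.search(user_agent) is not None
    return ip_hit and agent_hit


def status_for(remote_addr: str, user_agent: str) -> int:
    """Hypothetical helper: 403 for a blocked request, 200 otherwise."""
    return 403 if is_forbidden(remote_addr, user_agent) else 200
```

Because the IP pattern has no `^` or `$` anchors, it also matches inside a longer address such as 124.132.226.94. Writing it as `^24\.132\.226\.94$` would limit the match to exactly one address.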

Regular expressions provide the capability to block a range of IP addresses and partial user agent matches. If I wanted to match on any version of this Java-based robot, I could expand the second condition to something like this:


RewriteCond %{HTTP_USER_AGENT} Java/\d\.\S*

Similarly, wildcard matches on IP addresses can be used to block ranges of IPs instead of a single address.
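As an illustration, here is a small Python sketch of both ideas: a range pattern and a partial user agent pattern. The range 24.132.226.0 to 24.132.226.255 is a hypothetical example, the function names are made up for this sketch, and Python's `re` is assumed to behave like PCRE for these patterns.

```python
import re

# Hypothetical range: every address from 24.132.226.0 to 24.132.226.255.
# The ^ and $ anchors stop the pattern from matching inside a longer address.
RANGE_PATTERN = re.compile(r"^24\.132\.226\.\d{1,3}$")

# Partial user agent match: any Java version string.
# The dot after the major version digit is escaped so it matches a literal dot.
AGENT_PATTERN = re.compile(r"Java/\d\.\S*")


def in_blocked_range(remote_addr: str) -> bool:
    """True when the address falls inside the example range."""
    return RANGE_PATTERN.search(remote_addr) is not None


def is_java_agent(user_agent: str) -> bool:
    """True when the user agent contains a Java version string."""
    return AGENT_PATTERN.search(user_agent) is not None
```

In a rule file, the range pattern would go in a `%{REMOTE_ADDR}` condition in place of the single address.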

The Microsoft vs *NIX server debate will never die. I use both every day, and I find that the biggest advantage that the open source server environment has over Microsoft is the interface. Using Ionic’s ISAPI filter allows me to control the URL structure and blacklist for all of my websites easily and efficiently.

I see this method of blocking IPs or blacklisting bots based on IP address and user agent as a great way to simulate an .htaccess approach to the same problem on a Microsoft server.

UPDATE:
As of May 2009, I am using the rules below to block all Java bots. I know earlier in this post I favored an IP address and user-agent combination, but my IP address list grew to more than 100 entries before I abandoned that method. There are no useful Java bots. Useful bots have useful names.


RewriteCond %{HTTP_USER_AGENT} Java.* [OR]
RewriteCond %{HTTP_USER_AGENT} ^Java.*
RewriteRule ^/(.*)$ /$1 [F]
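To see which user agents these two conditions catch, here is a minimal Python sketch. It assumes `re` handles these patterns like PCRE and that matching is case-sensitive because no [I] flag is set; the user agent strings in the usage note are made-up examples.

```python
import re

# Patterns copied from the two conditions, which are joined by [OR].
COND_ANYWHERE = re.compile(r"Java.*")   # "Java" anywhere in the string
COND_AT_START = re.compile(r"^Java.*")  # "Java" at the very start


def is_blocked(user_agent: str) -> bool:
    """True when either condition matches, mirroring the [OR] flag."""
    return bool(COND_ANYWHERE.search(user_agent) or COND_AT_START.search(user_agent))
```

Any string that matches the anchored pattern also matches the unanchored one, so the first condition alone decides the outcome. Note that a name such as `JavaFetcher/2.0` is caught as well, while a lowercase `java/1.6` is not.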


23 comments

  1. Great, great.

    I’ve got a question, if I may: are you the owner of this website?

    Don’t worry, I am asking because I can see Apache/1.3.34 running on this website, and I’m looking for a way to fake the server name for security reasons.

    I know of ServerMask, which can do that, but it doesn’t work under IIS 7…

    Sorry for the aside… but if you can answer, that would be great ;)

    Thanks

  2. Yes, this is my site, and no, there are no tricks going on here. I am a huge WordPress fan, so any blogs I set up are running on Apache.

    The Microsoft servers I use are still running IIS 6. Obviously, I am curious about 7’s new URL Rewriting feature, but I hesitate before upgrades like that. Yesterday, I tried to update an XP machine to SP3 and it blue screened during the install.

  3. IIRF can rewrite headers, including the “Server:” header that is returned to the requesting client (or browser). IIRF can change the server name based on the incoming IP, the incoming user agent, the URL, whatever.
    Check it out. http://www.codeplex.com/IIRF

  4. Thanks for your thoughts, Andy. I have thought about Guyty’s comment a lot since he inquired about faking the server header, and it is an interesting approach to security.

    Most of the attempts to compromise my sites are SQL injection through input forms. These attacks can easily be automated, so I am not sure masking the type of server will divert them.

    I agree that managing traffic is essential to maintaining websites, and since you shared a post I enjoyed I will return the favor. I hope you are not offended by a few bad words.

    Impact On Your Bandwidth Will Be Minimal My Ass
    http://incredibill.blogspot.com/2008/05/impact-on-your-bandwidth-will-be.html

  5. I was searching for “iirf ip list” and came to this post.

    I think this solution is good for just a few IPs.

    In my case there are about 200 IPs on an allow list, and all other IPs are banned from visiting the IIS 6.0 server. I don’t know if there is a good solution in IIRF for this case, or some other way. Thanks anyway.

  6. IIRF can have 200 IPs or more on a whitelist, yes.

    You’d need to chain a set of RewriteCond statements together. Try it out, it’s free.

  7. hi Corey,

    I am currently using IIRF, but I can’t seem to make it work. I would just like to make my URL show my domain mysite.com whenever the browser navigates to mysite.com/home/default.htm. I would really just like to strip the path so that only my domain appears.

    Here is my .ini code:
    RewriteRule ^\/home/default.htm \/

    Can you tell me what I am missing here? FYI, I am currently using IIS 6.0. Thanks in advance!

  8. Daniel,

    Here are some rules you can try and modify to work for your site:

    # match mysite.com subdomains such as www.mysite.com
    RewriteCond %{SERVER_NAME} ([^\.]+)\.mysite\.com$ [I]
    # rewrite home page requests to /home/default.htm
    RewriteRule ^/?$ /home/default.htm

  9. Hi Corey,

    Thanks for your reply. I tried your approach, but it still resulted in an error page. Actually, I don’t want to redirect the page, I just want to mask it, so if a user browses to my domain (mysite.com) he/she will not be redirected to (mysite.com/home/default.html).

    My goal is really not to show the /home/default.html path (more like to strip it from the address/URL bar). Is this possible in IIRF? Thanks again.

  10. I tested the rules on one of my domains before publishing them for you. If you are getting an error, you may have a configuration problem.

    RewriteCond %{SERVER_NAME} ([^\.]+)\.mysite\.com$ [I]

    This first line means check all requests for this domain. My server has a few hundred sites that are all configured to use the IIRF ISAPI filter (right-click the website profile, then Properties, then the ISAPI Filters tab). This condition identifies the site I want to modify from the bunch.

    RewriteRule ^/?$ /home/default.htm

    This rule says match requests on the root for empty string or a slash…

    ^ beginning of string
    /? optional one slash
    $ end of string

    …and rewrite that location to /home/default.htm. The result is that any time the root level of the domain is requested, mysite.com or mysite.com/, the server returns the contents of /home/default.htm for that request. The user sees just mysite.com/ in their browser’s address bar, but they are looking at /home/default.htm.

    I believe this is what you are trying to accomplish based on your comments.

  11. Hi Corey,

    Thanks again for your reply. Yes, that’s what I want to accomplish: to show users only my domain. However, the home/default.html path is not a ‘physical’ file on my server; it is based on MCMS. I think your approach will only work if there is an existing physical file at home/default.html. I saw the generated logs, and it is clearly looking for the physical file. Is my goal still possible with my current setup? Kindly advise. Thanks.

  12. Any ideas? It would be great if this is possible, if not, I will have to find some other method to accomplish this. Thanks!

  13. I don’t have any e-mails for them to find on the site, but they keep hitting. Is it really a problem? I’m more worried about the CPU load a lot of extra checking might cause.
    And is that all these Java bots are doing? Maybe they are part of some sort of blog site that lists stuff, or other search/sort type of programs.
    At first I thought maybe they were some sort of cache system for ISPs so they could keep a local copy and save bandwidth on their network.

  14. Jim, it is impossible to predict why Java bots are hitting your sites. Many Java developers do not specify a custom user-agent when designing their crawlers, so their requests reach your server with these default user-agents.

    I think we agree that the extra load on our web servers caused by Java bots is useless, and blocking these robots is so easy that it simply makes sense.

    I am about to update the post with some simplified rewrite rules that I have been using for a couple of weeks.

  15. Corey, thanks for the response.

    I found a site “Project Honey Pot” that tracks these things and confirms them by seeing if they spam or not, it might be of use (it’s possible I found the link on this blog somewhere).
    http://www.projecthoneypot.org/harvester_useragents.php

    At this point I’m not getting hit that hard by these “bots”, but if it becomes abusive, to the point it starts taking a lot of CPU then I will have to do something about it and appreciate the info.

    Bandwidth for the few K that they take isn’t a problem (yet). What I am way more worried about is losing some possible indexing that could send traffic to my site.

    All they seem to do so far is load the first level pages and go no deeper, they load no images.

    Yes, they should properly identify themselves, but maybe they don’t want to because people might use that ID to “game” their system.

    If I ever do start to block, I’m probably going to start with the IP list from that “honey pot” place since those are confirmed and see how that goes.

    Plus, if these guys who do this get a clue, they can just change the agent string randomly to any number of known browser types. After that the only way you could tell is by their behavior, which may not be the best indication, or by the honey pot method.

  16. Jim,

    These bots were cluttering up my error logging database table for 404 and 500 errors on a server holding about 250 websites.

    Like you, I took small steps initially in case any of the bots turned out to be legit. I have found no trace of any negative consequences.

    I am due to write a new post about all the different useless user-agents that I am blocking without using an IP address match as well. The amount of crap out there is astounding.

  17. @daniel, yes it’s possible to do what you want. You need a RedirectRule, to redirect from /default/page.htm (in case anyone types it in), and then you need a RewriteRule to rewrite mysite.com to /default/page.htm or whatever it was you wanted to be your default page.

    This is all in the readme.

  18. Hi, I identify robots by looking at my daily logfiles and seeing what they are doing. My major problem is double-sided. I use twitter to output crime news stuff every minute (during a good crime day peak period) making me a prime target. But it is also how I get the most traffic to the landing page (dynamic URL).

    At first it is painful to go through the logfiles every day, but it gets very easy once you have blocked so many of the bots. The pattern is easy to find by sorting with Excel. I did write a log analyzer because the free ones were not specific enough for my kind of traffic, but I would rather use Excel and sort on columns to see the patterns, because the patterns may change and my script isn’t smart enough to see that.

    What I see mostly from my logfiles are:

    – same IP address hitting at minute intervals to the same URL.
    – several different IP addresses in the range of xxx.xxx.xx.* or xxx.xxx.*.* hitting the server at staggered intervals so in the log file, it looks like a human is hitting the URL. But when looking at the range of IP addresses also sorted by time, you can tell that it’s a robot doing the equivalent of a per-minute hit to the same URL.

    Honeypotting your pages so the log file reflects how the person is hitting the page is important. That way it supports your guess that what you are seeing is more than likely a robot.

    I also wanted to caution that not all robots are Java. There are a lot of languages out there; that’s why I filter by behaviour.

    My site is relatively new and so is my use of IIRF. I would like to let one IP address out of a range of addresses use my site, like 127.0.10.11 when I have blocked the range 127.0.10.*. The reason is that I block a range of IP addresses when I see that the robot is using more than three IP addresses. Does anyone know how to do that? Thank you.

  19. debi:

    You can definitely use IIRF to block ranges with exclusions. Here’s an example I wrote for the Wget bot:

    #wget
    RewriteCond %{HTTP_USER_AGENT} Wget.*
    #allow ip range 162.69.226.0 to 162.69.226.24
    RewriteCond %{REMOTE_ADDR} ^(?!162\.69\.226\.([0-9]|1[0-9]|2[0-4])$)([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})(.*)$
    #allow 64.170.133.110
    RewriteCond %{REMOTE_ADDR} ^(?!64\.170\.133\.110$)([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})(.*)$
    RewriteRule ^/(.*)$ /$1 [F]

    These rules will block all Wget bots except when the IP address matches the exceptions.
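The exclusion technique in the last comment relies on negative lookahead. Here is a minimal Python sketch of it, assuming `re` handles lookahead the same way PCRE does. The patterns are trimmed to the lookahead part, and a `$` is placed inside each lookahead so that only the exact last octets 0 to 24 are exempt; without it, a single leading digit satisfies the lookahead and the whole 162.69.226.* range is exempt. The function name is hypothetical.

```python
import re

# The condition matches only when the address is NOT in 162.69.226.0-24.
# The $ inside the lookahead forces the whole last octet to be compared.
NOT_ALLOWED_RANGE = re.compile(r"^(?!162\.69\.226\.([0-9]|1[0-9]|2[0-4])$)")

# The condition matches only when the address is NOT 64.170.133.110.
NOT_ALLOWED_SINGLE = re.compile(r"^(?!64\.170\.133\.110$)")

WGET_AGENT = re.compile(r"Wget")


def is_forbidden(remote_addr: str, user_agent: str) -> bool:
    """True when all three conditions match, like chained RewriteCond lines."""
    return (
        WGET_AGENT.search(user_agent) is not None
        and NOT_ALLOWED_RANGE.search(remote_addr) is not None
        and NOT_ALLOWED_SINGLE.search(remote_addr) is not None
    )
```

With these patterns, a Wget request from 162.69.226.24 passes, while one from 162.69.226.25 is refused.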
