Category: Robots

Web crawling robots are pieces of software that copy and manipulate the data available on the World Wide Web. I run many websites, and therefore have a love/hate relationship with the robots that reach out to me each day.

  • Allow robots to crawl your wp-content folder

    An alternate title for this post could be, “How disallowing robots from your wp-content folder could cost you mobile rankings in Google.”

    On April 21st, 2015, Google is going to change the way it ranks sites for users on mobile devices. By blocking Googlebot from your plugins folder, you could be preventing Google from deciding that your site is mobile-friendly. If you are skeptical about this Google-is-changing statement I have made or want to dive into the details, read this.

    So, why?

    Why does Google need to crawl your plugins folder? Plugins often contain CSS or JS files, and those files are necessary to understand what the page actually looks like. Google Webmaster Tools told me I was preventing Googlebot from crawling some CSS files that it wanted to fetch. Robots need to download all CSS and JavaScript files or they cannot determine if a page is friendly to mobile users.

    I found this line in my client’s robots.txt:
    Disallow: /wp-content/plugins/

    Why would this line be in robots.txt at all? My client lives on GoDaddy Managed WordPress Hosting, and that service creates a robots.txt file that looks like this (as of the date I published this post):

    User-agent: *
    Crawl-delay: 1
    Disallow: /wp-content/plugins/
    Disallow: /wp-admin/

    There are a bunch of blogs that discuss the “ideal WordPress robots.txt file” that recommend blocking the plugins folder, and some plugins alter robots.txt to block this directory, too. Before February 2015, even Yoast SEO did this. It’s no longer a good idea.
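    To see the effect of that one Disallow line, here is a minimal sketch using Python's standard urllib.robotparser. It is only an approximation: Python's parser is not Googlebot's parser, and the stylesheet URL and the can_crawl helper are hypothetical names made up for illustration. The first robots.txt is the one quoted above; the second simply drops the plugins rule.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt quoted in this post.
HOSTED_ROBOTS = """\
User-agent: *
Crawl-delay: 1
Disallow: /wp-content/plugins/
Disallow: /wp-admin/
"""

# The same file with the plugins rule removed.
FIXED_ROBOTS = """\
User-agent: *
Crawl-delay: 1
Disallow: /wp-admin/
"""

# Hypothetical plugin stylesheet, for illustration only.
CSS_URL = "http://example.com/wp-content/plugins/some-plugin/style.css"


def can_crawl(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    print("hosted file allows plugin CSS:", can_crawl(HOSTED_ROBOTS, CSS_URL))
    print("fixed file allows plugin CSS: ", can_crawl(FIXED_ROBOTS, CSS_URL))
```

    With the plugins rule in place the stylesheet is off limits to every robot covered by "User-agent: *"; without it, only /wp-admin/ stays blocked.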

  • How to Block Java user-agents

    A variety of user-agents that begin with “Java” are likely visiting your website. Requests carrying this type of user-agent come from programs written in Java whose developers did not change the default user-agent string. Here is a list of the Java user-agents I have encountered:


    Java/1.4.1_04
    Java/1.5.0_02
    Java/1.5.0_06
    Java/1.5.0_14
    Java/1.6.0_02
    Java/1.6.0_03
    Java/1.6.0_04
    Java/1.6.0_07
    Java/1.6.0_11
    Java/1.6.0_12
    Java/1.6.0-oem

    I will maintain this list simply for kicks. There is no need to collect an exhaustive list of these user-agent strings in order to block them. As I have mentioned before, I prefer to ban non-human visitors based on a combination of an IP address and a user-agent string.

    URL rewrite rules

    Here are a URL rewriting condition and rule that match any user-agent that begins with “Java” and deliver a 403 Forbidden response for any HTTP request to your server:


    RewriteCond %{HTTP_USER_AGENT} ^Java
    RewriteRule ^/(.*)$ /$1 [F]

    The condition matches any user-agent string that begins with “Java”, no matter what comes later. The rewrite rule matches every requested URL, and the [F] flag answers it with a 403 Forbidden response code. The URL is not changed and no document is delivered.
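    Rewrite conditions are applied as a regular expression search, so whether a pattern really means “begins with” depends on a leading ^. The sketch below uses Python's re module as a stand-in for the rewrite engine's regex library (similar syntax for these simple patterns, but not the same implementation). The second sample user-agent and the blocked helper are hypothetical and exist only to show the difference.

```python
import re

# Without an anchor the pattern matches "Java" anywhere in the string.
UNANCHORED = re.compile(r"Java.*")
# With ^ the user-agent has to start with "Java".
ANCHORED = re.compile(r"^Java")

SAMPLES = [
    "Java/1.6.0_04",                        # from the list in this post
    "Mozilla/5.0 (compatible; JavaFetch)",  # hypothetical user-agent
]


def blocked(pattern: re.Pattern, user_agent: str) -> bool:
    """Mimic a rewrite condition: True when the pattern is found."""
    return pattern.search(user_agent) is not None


if __name__ == "__main__":
    for ua in SAMPLES:
        print(ua, "| unanchored:", blocked(UNANCHORED, ua),
              "| anchored:", blocked(ANCHORED, ua))
```

    Both patterns catch the default Java user-agents, but only the anchored one leaves alone a visitor that merely has “Java” somewhere in the middle of its name.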

    IIS7 URL Rewrite web.config

    
    <rule name="no-java-bots" stopProcessing="true">
        <match url="(.*)" />
        <conditions>
            <add input="{HTTP_USER_AGENT}" pattern="^Java/.*" />
        </conditions>
        <action type="AbortRequest" />
    </rule>
    

    Why block Java bots?

    Bots with a well-defined purpose will typically identify themselves with a unique name. These Java user-agents are either not interested in identifying their purpose or not ready to publish their name and take ownership of the crawling activities. Both cases are a waste of bandwidth. Test your new application on someone else’s website. Play with your shady crawler on someone else’s website. Come back when you are willing to identify yourself.

  • Blacklisting via Ionic’s Isapi Rewrite Filter

    In IIS, banning IP addresses from accessing a website is fairly easy. I rarely do this, however, because I prefer to use a combination of an IP address and a user agent string to identify bad bots that are likely scraping my content or attempting to harvest email addresses.

    I try to avoid blocking on an IP address alone at all costs. IP addresses can be forged and changed, so I prefer to rely on an IP address and user agent string combination to identify the culprit that I want to exile. This approach is not foolproof, but I find it to be much more reliable.

    Scalability is also an issue. Using an ISAPI filter to process requests for every website on the server from a single file sure makes life easy. The Microsoft IIS configuration console is a mouse-click nightmare on a server with a couple hundred websites.

    I use Ionic’s Isapi Rewrite Filter to change the URL structure of websites to be more search engine friendly. This filter uses the PCRE library, and the use of regular expressions is always a huge plus. The rewriting rules are maintained inside one .ini file, so tweaks and updates are a breeze.

    Here is a rule for Ionic’s filter that will let you block access to every site on your server based upon an IP address and user agent string match. In this particular case, I am blocking an email address harvester with IP 24.132.226.94 and user agent Java/1.6.0-oem.


    RewriteCond %{REMOTE_ADDR} ^24\.132\.226\.94$
    RewriteCond %{HTTP_USER_AGENT} Java/1\.6\.0-oem
    RewriteRule ^/(.*)$ /$1 [F]

    The two conditions use server variables to test the visitor’s IP address and user agent string against regular expressions, and both must match for the rule to fire. The final line is the rewrite rule, which matches any file on any website. The [F] flag tells Ionic’s filter to return an HTTP status code of 403 Forbidden.
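    The logic of those three lines can be sketched in Python as follows. This is an illustration of the matching only, not how the filter is implemented; the response_status function is a hypothetical name, and anchoring the patterns with ^ and $ is my own assumption so that a longer address cannot match by accident.

```python
import re

# Patterns modelled on the two conditions above (anchors are an assumption).
IP_PATTERN = re.compile(r"^24\.132\.226\.94$")
UA_PATTERN = re.compile(r"^Java/1\.6\.0-oem")


def response_status(remote_addr: str, user_agent: str) -> int:
    """Return 403 only when BOTH the IP and the user agent match."""
    if IP_PATTERN.search(remote_addr) and UA_PATTERN.search(user_agent):
        return 403  # what the [F] flag sends back
    return 200      # request is handled normally


if __name__ == "__main__":
    print(response_status("24.132.226.94", "Java/1.6.0-oem"))  # both match
    print(response_status("24.132.226.94", "Mozilla/5.0"))     # IP only
    print(response_status("10.0.0.1", "Java/1.6.0-oem"))       # agent only
```

    A human visitor who later inherits the same IP address is not shut out, because a browser's user agent will not match the second condition.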

    Regular expressions provide the capability to block a range of IP addresses and partial user agent matches. If I wanted to match on any version of this Java-based robot, I could expand the second condition to something like this:


    RewriteCond %{HTTP_USER_AGENT} Java/\d\.\S*

    Similarly, wildcard matches on IP addresses can be used to block ranges of IPs instead of a single address.
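    Here is a rough Python sketch of both ideas: a version-agnostic user agent pattern (with the dot after the digit escaped) and a pattern that covers a block of addresses. The 24.132.226.x range is purely an illustrative assumption, not a range I am recommending anyone block, and the helper names are hypothetical. The ipaddress variant is included only as a cross-check on the regular expression; the rewrite filter itself works with patterns.

```python
import ipaddress
import re

# Any Java version: "Java/", one digit, a literal dot, then the rest.
ANY_JAVA_VERSION = re.compile(r"^Java/\d\.\S*")

# Hypothetical range: every address in 24.132.226.x.
RANGE_PATTERN = re.compile(r"^24\.132\.226\.\d{1,3}$")
RANGE_NETWORK = ipaddress.ip_network("24.132.226.0/24")


def is_java_agent(user_agent: str) -> bool:
    """True for default Java user agents of any version."""
    return ANY_JAVA_VERSION.search(user_agent) is not None


def in_range_regex(addr: str) -> bool:
    """Range check the way a rewrite condition would do it."""
    return RANGE_PATTERN.search(addr) is not None


def in_range_cidr(addr: str) -> bool:
    """Same range expressed as a CIDR block, as a cross-check."""
    return ipaddress.ip_address(addr) in RANGE_NETWORK


if __name__ == "__main__":
    for ua in ("Java/1.4.1_04", "Java/1.6.0-oem", "Mozilla/5.0"):
        print(ua, is_java_agent(ua))
    for addr in ("24.132.226.94", "24.132.227.94"):
        print(addr, in_range_regex(addr), in_range_cidr(addr))
```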

    The Microsoft vs *NIX server debate will never die. I use both every day, and I find that the biggest advantage that the open source server environment has over Microsoft is the interface. Using the Ionic’s ISAPI filter allows me to control the URL structure and blacklist for all of my websites easily and efficiently.

    I see this method of blocking IPs or blacklisting bots based on IP address and user agent as a great way to simulate an .htaccess approach to the same problem on a Microsoft server.

    UPDATE:
    As of May 2009, I am using these rules to block these Java bots. I know earlier in this post I favored an IP address and user-agent combination, but my IP address list grew to more than 100 entries before I abandoned that method. There are no useful Java bots. Useful bots have useful names.


    RewriteCond %{HTTP_USER_AGENT} ^Java
    RewriteRule ^/(.*)$ /$1 [F]