Force a www Subdomain Using An HTAccess File

August 1st, 2012

There is a well-known issue involving the indexing of websites in search engines like Google. The URL http://imperium.ca/ is different from the URL http://www.imperium.ca/. We expect them to take us to the same site because they usually do. Unfortunately, they could be set up to serve completely different sites, so search engines like Google need to track them separately.

As a result, Google may split your page rank across these multiple domains and cause half to be listed as supplemental results[1]. If the pages are deemed to be duplicate content, they could even be dropped from the search engine results altogether. The solution is to use a consistent form so that the search engines never track the second domain. A 301 (permanent) redirect from one form to the other forces a consistent domain and solves the problem.

You can find lots of 301 redirect solutions (like these examples from webconfs.com or these ones from this Google group) on the web that use code or htaccess files. Unfortunately, these examples cannot be generalized, because each solution is built for a specific domain. HTAccess files are built with the flexibility of a full regular expression engine. By taking advantage of that, we can build a script that manages the format of your URL so that the same file can be used for any domain.

HTAccess
HTAccess files are a great way to configure your Apache web server on a per-directory basis. An htaccess file in your root directory will be applied to all subdirectories, but placing one in a subdirectory can override configurations from parent htaccess files. It’s incredibly easy to maintain. With the mod_rewrite module enabled, it’s incredibly powerful too. But, as I learned, it is also incredibly unintuitive to build or test. I’ll walk you through the steps that I took to redirect all website traffic to adhere to a consistent format.

There are a number of syntax items and terms that you need to understand first. I’ve defined the ones you will use, but be careful. Some terms can mean something different in a different context.

Directives
RewriteCond specifies a condition that must be met for the script to run the subsequent RewriteRule. It requires an input (the string to test) and a regular expression, and allows optional flags.

RewriteRule does the work of rewriting the URL into a different format. It requires a regular expression that is matched against the requested path and a substitution that describes the new URL, and allows optional flags.
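As a rough illustration of how the two directives pair up, here is a minimal sketch. The host name example.com and the page names old-page and new-page are placeholders I have made up for the example, and it assumes the htaccess file sits in the document root.

```apache
# General shape of the two directives:
#   RewriteCond  TestString  CondPattern   [flags]
#   RewriteRule  Pattern     Substitution  [flags]
RewriteEngine On

# Condition: only act when the requested host name is example.com
# (NC makes the comparison case-insensitive).
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
# Rule: send /old-page to /new-page with a permanent (301) redirect.
# L stops processing further rules once this one has matched.
RewriteRule ^old-page$ /new-page [R=301,L]
```

A RewriteCond only applies to the RewriteRule that immediately follows it, so every rule needs its own set of conditions.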

Regular Expression Syntax
A set of square brackets [ ] defines a list of characters, any one of which the regular expression can match.

A set of parentheses ( ) defines a group of rules that apply to a section of an input string. These groups have one other important use: they allow us to capture the input that matches the regular expression inside them and store it in a variable that can be called later.

A pattern that starts with the ! symbol is negated: it matches only strings that do not satisfy the regular expression that follows it.

The ^ symbol denotes the start of a string, unless it is the first character inside square brackets [ ]. There it instead serves the same purpose as the ! character, but relative to the contents of the square brackets: [^/.] matches any single character that is not a slash or a dot.

The $ symbol denotes the end of a string.

The . symbol represents the wildcard. It will match any character.

The \ symbol escapes special characters. For example, \. would match a literal dot (.) in a string rather than acting as the usual wildcard.

The * symbol will match 0 or more of the previous character or group.

The + symbol will match 1 or more of the previous character or group.

Server Variables
A % symbol followed by a name in curly braces {} identifies a server variable. For example, %{HTTP_HOST} holds the host name from the request (the URL stripped of its protocol and request URI), while %{REQUEST_URI} holds only the path that follows the host name.
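To tie the syntax items together, here is an annotated pair of conditions. Treat them as fragments for reading practice rather than a complete script: a RewriteCond only takes effect once a RewriteRule follows it.

```apache
# %{HTTP_HOST}  the server variable being tested (the requested host name)
# ^             the match must begin at the start of the host name
# ( ... )       a capture group; whatever it matches is stored for later use
# [^/.]+        one or more characters that are neither a slash nor a dot
# \.            a literal dot (the backslash removes the wildcard meaning)
# $             the match must finish at the end of the host name
RewriteCond %{HTTP_HOST} ^([^/.]+)\.([^/.]+)$

# A leading ! inverts the test: this condition passes only when the
# host name is NOT the empty string.
RewriteCond %{HTTP_HOST} !^$
```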

Enable Apache’s mod_rewrite module
The first thing your htaccess file needs to do is turn on the rewrite engine that the mod_rewrite module provides. As long as the module is loaded and your server is configured to allow you to override these options, all you need to do is add these lines.

Options +FollowSymlinks
RewriteEngine On
RewriteBase /
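If you are not sure whether mod_rewrite is loaded on a given server, one option is to wrap the lines in an IfModule block. This is only a sketch of that idea: it assumes you would rather have the rules skipped silently than have every request fail with a server error when the module is missing.

```apache
# Only apply the rewrite settings when mod_rewrite is available.
<IfModule mod_rewrite.c>
    Options +FollowSymLinks
    RewriteEngine On
    RewriteBase /
</IfModule>
```

The trade-off is that a missing module goes unnoticed, so the redirects described in the rest of this article would simply not happen. Any rewrite rules you add would need to go inside the same block.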

Force This URL to Have a Subdomain
We want to make sure that a subdomain is always in the URL. But, if some other form of subdomain is already added, we don’t want to overwrite it. So, these rules should capture all host names that are not empty and have only a single dot (.) in them, then redirect these URLs to a counterpart that has the ‘www’ subdomain specified.

# don't capture any requests where the HTTP_HOST is empty
# but do capture requests that have exactly 1 dot (note the escaped \.)
# then permanently (301) redirect these requests to a properly formatted URL.
RewriteCond %{HTTP_HOST} !^$
RewriteCond %{HTTP_HOST} ^([^/.]+)\.([^/.]+)$ [NC]
RewriteRule ^(.*)$ http://www.%1.%2%{REQUEST_URI} [L,R=301]

The %1 and %2 you see in the last line are what distinguish this code from most other htaccess files you will find. In the previous RewriteCond we capture the input that matches the regular expressions in the parentheses. We can then output these in our RewriteRule using the % symbol. The number identifies which captured group we want to use (a third set of parentheses in the RewriteCond regular expression could be used by adding %3).
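To make the behaviour concrete, here is a trace of a few sample host names through the conditions. Only imperium.ca comes from this article; the other host names are illustrative. The sketch writes the dot as \. so that it matches a literal dot only, and uses R=301 so that the redirect is permanent.

```apache
# Host name           Second RewriteCond                Outcome
# imperium.ca         matches: %1=imperium, %2=ca       redirect to http://www.imperium.ca/...
# www.imperium.ca     no match (two dots)               served unchanged
# blog.imperium.ca    no match (two dots)               served unchanged
# localhost           no match (no dot)                 served unchanged
RewriteCond %{HTTP_HOST} !^$
RewriteCond %{HTTP_HOST} ^([^/.]+)\.([^/.]+)$ [NC]
RewriteRule ^(.*)$ http://www.%1.%2%{REQUEST_URI} [L,R=301]
```

One consequence worth knowing: a domain such as example.co.uk contains two dots even without a subdomain, so this pattern leaves it alone and it would need a condition of its own.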

Force a trailing slash
The absence of a trailing slash, compared to the presence of one, can cause the same problems as the absence (or presence) of a subdomain. Obviously, we wouldn’t want to force a URL to a physical file to have a slash at the end, but we want a uniform way to write our URLs.

# don't capture any requests that are actual files
# but do capture all other requests that do not end in a slash
# then permanently (301) redirect these requests to instead have a trailing slash
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_URI} ^.+[^/]$
RewriteRule ^(.+)$ $1/ [L,R=301]

In the last line, you will notice a $1 in the RewriteRule. This is a lot like the %1 and %2 in the previous code segment that forced a subdomain, but the difference is where the variable is pulled from. Using a $ rather than a % tells the rule to retrieve the variable from the RewriteRule’s own regular expression rather than the RewriteCond regular expression. Just like with the %, the variable is retrieved from the content that matches the parentheses.
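Both kinds of back-reference can appear in the same rule. The following sketch is a variation on the subdomain redirect that builds the target from %1 and %2 (captured by the RewriteCond) and $1 (captured by the RewriteRule). It assumes the htaccess file sits in the document root, where the path seen by RewriteRule has no leading slash.

```apache
# %1 and %2 come from the parentheses in the RewriteCond pattern.
RewriteCond %{HTTP_HOST} ^([^/.]+)\.([^/.]+)$ [NC]
# $1 comes from the parentheses in the RewriteRule pattern, which here
# capture the whole requested path (without its leading slash).
RewriteRule ^(.*)$ http://www.%1.%2/$1 [L,R=301]
```

Under that assumption, a request for http://imperium.ca/about would be sent to http://www.imperium.ca/about.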

The Downside
The only issue I have found with this so far is that a mod_rewrite redirect will cause posted data to be lost. Browsers typically follow the redirect with a plain GET request, so the post data doesn’t get reposted. The htaccess file just isn’t equipped with the tools to manage it.

I’ve had to go through all of my forms and ensure that the action attribute always points to a properly formatted URL. It is an unfortunate issue because it means that you can’t just drop the file into any web project and assume your issues are over. Furthermore, it can be a difficult issue to find if you are not aware of this problem or have not thoroughly tested your code after implementing this script.

Luckily, if you are implementing this before you start your project, it won’t be much of a problem. Typos in the action attribute can be caught during your usual coding/testing cycles.
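If fixing every form is not practical, one possible workaround is to exempt POST requests from the redirect. This is a sketch of that idea rather than something I have described above, and it trades one problem for another: a form posted to the non-www address is processed there instead of being redirected.

```apache
# Skip the redirect for POST requests so submitted form data is not dropped.
RewriteCond %{REQUEST_METHOD} !^POST$
RewriteCond %{HTTP_HOST} !^$
# Keep the capturing condition last so that %1 and %2 refer to it.
RewriteCond %{HTTP_HOST} ^([^/.]+)\.([^/.]+)$ [NC]
RewriteRule ^(.*)$ http://www.%1.%2%{REQUEST_URI} [L,R=301]
```

Search engines crawl with GET requests, so the canonical URLs they see are unaffected by this exemption.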

The Upside
I’ve already touched on a few advantages in this article:

Your URLs are canonical so search engines like Google know what to index. This will maintain or improve your page rank.
The script is built with flexible regular expressions so that you can put this file into any project without needing to change any code.
In addition to this, we gain a consistency that we can inherently count on. If you’ve ever tried to pull the SERVER_NAME server variable from your code, you would know that this returns whatever your URL specifies. A check against this variable to see what page/domain a user is requesting often forces you to compare against numerous different possible URLs to ensure you have covered all cases. Using this script, you can count on your URLs consistently being in the same expected format.

Additional Resources
While researching this solution, I bookmarked the links that best aided me. If you need to know more about htaccess files so you can customize yours to be tailored to your specific problem, then I recommend reviewing these sources.

Apache Documentation
Data Koncepts
Ranking Labs
mod_rewrite, a beginner’s guide by Neil Crosby
Matt Cutts: Gadgets, Google, and SEO