There’s no reason not to use mod_deflate

I’ve been trying to convince one of our larger customers to install mod_deflate on their server for about a month. They have had concerns about compatibility with older browsers and the possibility that it will affect the PageRank, but I have finally put enough pressure on to have them let me try it. They have very few users with old browsers (and really, if somebody is running using that archaic of a browser, how likely is it that they are going to buy something on your site), and convinced them that there should be no SEO consequences with the change (if anything, the search engines will respect your site more for using less bandwidth and having a knowledgeable administrator).

Early this morning, I got it installed (took all of about 15 minutes) and it’s running great. It’s compressing HTML and Javascript files to about 20% of their original size, which equates to some significant bandwidth savings, and quicker page-load times. About 60% of the total bandwidth used on this site is for HTML and JavaScript files (the other 40% is images, movies, and a few other odds and ends). Overall, it looks like about a 30-40% drop in total bandwidth usage, which is very significant. I’ve heard of no problems with browser compatibility either, so everybody is happy.

Overall, I’d say that there is no good reason not to use mod_deflate on your site. Especially if you ever get charged for bandwidth overages.

Here are some useful resources for installing and gathering statistics on mod_deflate

Awstats – Detailed Web-Based statistics package
Perl-based mod_deflate statistics utility
Apache’s mod_deflate documentation
Firefox plugin to view the HTTP Response headers
(and lots of other useful stuff)

And here’s a sample Apache configuration section that I picked up from somewhere. I just save this in /etc/httpd/conf.d/deflate.conf and restart apache, then you are good to go (requires that mod_deflate is already compiled and installed. Actual file location may vary, depending on your OS. This works for Red-Hat derivatives)

### Enable mod_deflate to compress output
# Insert filter
SetOutputFilter DEFLATE

# Netscape 4.x has some problems...
BrowserMatch ^Mozilla/4 gzip-only-text/html

# Netscape 4.06-4.08 have some more problems
BrowserMatch ^Mozilla/4\.0[678] no-gzip

# MSIE masquerades as Netscape, but it is fine
BrowserMatch bMSIE !no-gzip !gzip-only-text/html

## Don't compress for IE5.0
BrowserMatch "MSIE 5.0" no-gzip

# Don't compress images
SetEnvIfNoCase Request_URI .(?:gif|jpe?g|png|swf)$ no-gzip dont-vary

# Make sure proxies don't deliver the wrong content
Header append Vary User-Agent env=!dont-vary

## Log some stuff for mod_deflate stats
DeflateFilterNote Input instream
DeflateFilterNote Output outstream
DeflateFilterNote Ratio ratio

LogFormat '"%r" %{outstream}n/%{instream}n (%{ratio}n%%)' deflate
## end mod_deflate stats

#### END mod_deflate configuration

Defeating CAPTCHAs

A recent slashdot article about how Spammers may now be using humans to defeat captchas caught my attention.

Here’s how it would work: Spammers currently use scripts that make automated posts on forums, wikis, blogs, and virtually any other place where user submitted comments may appear on a website. Their posts include links to their “spamvertised” sites where they sell their junk. They benefit both by “advertising” to people who view their automated post, and by trying to trick search engines into generating more links to their site.

Many sites and programs now include CAPTCHA’s which display an image that is supposed to be difficult for a machine to read. The website confirms that the user enters the correct CAPTCHA solution before saving their post.

This article suggests that spammers are now sending these CAPTCHA images to a real human who will input the solution and send it back to the website, therefore allowing the spammer to post their links.

I’ve actually considered this possibility for a while, and it’s not very difficult at all. To prove the concept, I created a simple web service that a spammer’s automated script could post the image to. The service waits while a human types in the result, and then returns that result to the spammers script, which would use it to submit the spam.

If a human completes one of these every 3 seconds, then they could do about 1200 per hour. If you are paying somebody a couple bucks an hour, then it works out to about 0.17 cents (17/100 of a cent) per message. I’m not sure what the going rate for spam is, but this seems pretty reasonable. Twenty bucks would get you 12,000 links to your site.

The concept is incredibly simple — it took me about and hour to write. Try it out here:

http://www.brandonchecketts.com/capdef/

This raises some interesting concerns and questions:

  • The “appeal” of spam is that it has virtually no cost. Since hiring a human introduces a cost, does that mean it won’t get used.
  • Many CAPTCHAS are supposedly easily defeated by computer programs anyway

Some ideas on how to “really” make sure a human is hitting your site:

  • Introduce a minimum time delay between pages. (Ie: a human couldn’t fill out this form in 1 second like a script does)
  • Have some page element (that doesn’t look like it does anything) that “validates” that it has been downloaded. Since scripts will usually just fetch the HTML content, and not the graphics, make one graphic on your page that is really a script (that returns an image). If that image hasn’t been downloaded, than it’s not likely a human visiting.
  • Load your captcha graphic with javascript. Many spammers scripts aren’t able to successfully run javascript.
  • If you use a common captcha-generation program, change the default file name, or form field name.
  • Spammer’s scripts are written to affect the most sites possible. If you make some change on your site, so that it’s not the same as everybody else’s. then automated scripts are less likely to work on your site.

So much for good database design…

I’m installing and modifying WordPress MU (Multi-user) for a client and am amazed at the poor database design. For each blog you set up, it generates 8 new tables for that blog, which have an identical design to the same 8 tables for every other blog it creates. This is extremely poor design according to any database design standards.   Despite the poor design, they do have a good reason for doing it.  Quoted from http://mu.wordpress.org/faq/

Does it scale? (Also: The way you do your databases and tables doesn’t scale!)

WordPress MU creates tables for each blog, which is the system we found worked best for plugin compatibility and scaling after lots of testing and trial and error. This takes advantage of existing OS-level and MySQL query caches and also makes it infinitely easier to segment user data, which is what all services that grow beyond a single box eventually have to do. We’re practical folks, so we’ll use whatever works best, and for the 400k and counting on WordPress.com, MU has been a champ.

The main reason for doing this is that it makes compatibility with existing WordPress plugins much easier. I guess the real source of the problem was poor planning and foresight during the development of the original WordPress application.  They claim it works well, so even though I cringe every time I see it, I guess I’ll just have to live with it and complain.

rrdtool is cool

I’ve spent the last few days working on creating some graphs for my website at http://www.gamepricewatcher.com. I’ve used RRD databases as part of several applications like MRTG and Cacti. I’ve done some modifications to those apps, but never dug too much into the rrdtool specific code because it seemed fairly complicated.

When trying to get some graphs on my site before, I initially tried using Cacti. I had it successfully working, but adding new Games and Servers was extremely cumbersome to do with Cacti. I figured this would be a good time to dig in and learn how to use rrdtool myself to create the graphs. That way I can generate them directly and not have to work with an application that wasn’t designed to do what I wanted.

Fortunately, there are some good documentation and tutorials on how to use rrdtool I found the tutorial from Alex van den Bogaerdt to be the best starting place. It does a good job of explaining the basics. From there, I wrote the data-collection part of my code in Perl and used the RRDTool::OO module to create and update the databases.

To display the graphs, I tried installing the rrdtool php module, but there wasn’t one available for my distribution (CentOS). I eventually got it installed, but found that it is poorly documented, and decided to ditch the module in favor of calling the binary directly.

Now, I have pretty graphs like this on my website:

World of Warcraft Gold Prices
I have been collecting game-wide data for about a year now. However, with the addition of the RRD’s to my site, I’m now tracking server-specific prices in those too. That will make it so that I can eventually provide cross-server comparisons, and compare a specific server’s prices to the game’s average.