Thursday, May 27, 2010

Apache Log Reporter (apache, lighttpd, thttpd)

If you support, design for, or own a web server, you are always interested in the users and abusers of that server. You need to keep an eye on how many clients hit your site, how much bandwidth they use, and which pages are most popular. Tools like Webalizer are excellent for this task, but you might want a smaller report to email to a Blackberry or other device. How about a text-based report that is less than 2.5 kilobytes and summarizes your web logs for the last few days?




What does it do?

This Perl script summarizes your web logs from Apache, Lighttpd, thttpd, or any other web server that writes its logs in the standard Apache format. The output is designed to be mailed to web admins and anyone else concerned with web operations, and it is friendly to the small screens of devices like the Blackberry.
The report includes:
  • Top 10 requesters
  • Top 10 by volume downloaded
  • Top 10 URLs requested
  • Top 10 URLs per host
  • Number of requests per status class
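
Each "Top 10" list comes down to counting into a hash and sorting the keys by count. Here is a minimal sketch of that idiom. The helper name top_n and the sample addresses are made up for illustration and are not part of the script. The sketch also trims the list when fewer than N entries exist.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return up to $n keys of a count hash, highest count first.
# Ties are broken alphabetically so the order is repeatable.
sub top_n {
    my ($counts, $n) = @_;
    my @sorted = sort { $counts->{$b} <=> $counts->{$a} or $a cmp $b }
                 keys %$counts;
    $n = @sorted if $n > @sorted;    # fewer than $n entries available
    return @sorted[0 .. $n - 1];
}

# Hypothetical sample data: one client address per request.
my %requests;
$requests{$_}++ for qw(10.0.0.1 10.0.0.2 10.0.0.1 10.0.0.3 10.0.0.1 10.0.0.2);

# Print the two busiest clients in the same layout the report uses.
printf "  %-15s %12d requests\n", $_, $requests{$_} for top_n(\%requests, 2);
```

The same helper would work for bytes per host or hits per URL, since only the hash being counted changes.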


What does the report look like?

Let's take a look at the output of the report so you can see what it is about before you get into the simple install.
From: root@YOUR_HOST.com (your name)
To: root@YOUR_HOST.com
Date: Mon, 10 Jan 2010 10:20:30 -0600 (EDT)
Subject: your_hostname.domain.net web report

Analysis of log records between:
Mon Jan 01 10:20:30 2010 and
Wed Jan 07 20:30:40 2010

Top 10 requesters:
  135.43.211.12         19,482 requests (free-stuuf.grepper.com)
  10.3.54.2              7,902 requests (someguy.fredom.com)
  134.235.27               704 requests (noone.tester.net)
  54.23.124.45             546 requests (who.what.org)
  145.34.2.3               508 requests (telus.nt.net)
  78.45.3.25               288 requests (dhcp.freeisp.com)
  66.249.67.20             248 requests (crawl-66-249-67-20.googlebot.com)
  34.23.56.3               146 requests (tgresde.domain.info)
  90.24.170.184             92 requests (ATuileries-153-1-99-194.w90-24.abo.wanadoo.fr)
  194.140.147.8             86 requests (mx.superinfo.com)

Top 10 by volume downloaded:
  135.43.211.12        102,607,182 bytes (free-stuuf.grepper.com)
  10.3.54.2             61,172,188 bytes (someguy.fredom.com)
  54.23.124.45          11,699,030 bytes (who.what.org)
  34.213.34.21           4,105,354 bytes (greg.desrtr.net)
  66.249.67.20             793,926 bytes (crawl-66-249-67-20.googlebot.com)
  194.140.147.8            760,164 bytes (mx.superinfo.com)
  88.177.248.14            536,782 bytes (def92-12-88-177-248-14.fbx.prexad.net)
  128.30.52.53             528,950 bytes (lovejay.w3.org)
  90.24.170.194            523,726 bytes (ATuileries-123-1-99-194.w90-24.abo.wanadoo.fr)
  88.167.12.47             507,112 bytes (per87-1-88-167-12-47.fbx.prexad.net)

Top 10 URLs requested:
         9,086 /favicon.ico
         4,902 /your.css
         3,656 /some_pic.jpg
         3,618 /another_pic.jpg
         2,996 /happy.jpg
         2,794 /
         1,040 /big_file.html
           694 /frdfg.html
           338 /wonder.html
           332 /grted.html

Top 10 URLs per host:
  2214 10.3.54.2       /favicon.ico (someguy.fredom.com)
  1186 10.3.54.2       /your.css (someguy.fredom.com)
  1124 54.23.124.45    /some_pic.jpg (who.what.org)
  1116 54.23.124.45    /another_pic.jpg (who.what.org)
  1094 54.23.124.45    /happy.jpg (who.what.org)
   994 128.30.52.53    /happy.jpg (lovejay.w3.org)
   958 194.140.147.8   /wonder.html (mx.superinfo.com)
   932 194.140.147.8   / (mx.superinfo.com)
   910 88.167.12.47    /wonder2.html (per87-1-88-167-12-47.fbx.prexad.net)
   724 88.177.248.14   / (def92-12-88-177-248-14.fbx.prexad.net)

Number of requests per status class:
 200           528,678
 300             1,408
 400               140
 500                10




If the output above looks like something you can use, then let's get started setting it up for your environment. It takes three steps and about five minutes of your time.





Starting the Install

Step 1: Get the script and look at the options. Below you can download calomel_web_report.pl as a file or browse the same script in a scrollable text window. Both are provided so you can easily review the Perl script.
You can download calomel_web_report.pl here by doing a "save as" or by clicking on the link and choosing download. Before using the script, take a look at it below, or download it and review the options.
Calomel.org web_report.pl
#!/usr/bin/perl
#
#######################################################
###  Calomel.org  calomel_web_report.pl  BEGIN
#######################################################

use Time::Local;

my $logdir = '/var/log/web_server';

opendir D,$logdir or die "Could not open $logdir ($!)";
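# Match archived logs only (access.log plus a suffix); the live access.log
# is appended explicitly in the foreach loop below, so it is not read twice.
@logfiles = sort grep /^access\.log./, readdir D;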
closedir D;

# Just use the 6 most recently archived log files.
shift @logfiles while @logfiles > 6;

my (%host, %url, %status, %urlsperhost);
my ($mintime,$maxtime) = (10_000_000_000, 0);
my %mon = qw/Jan 0 Feb 1 Mar 2 Apr 3 May  4 Jun  5
             Jul 6 Aug 7 Sep 8 Oct 9 Nov 10 Dec 11/;

foreach my $f (@logfiles,'access.log'){
  # The live access.log is read from this hard-coded directory; edit it to match your server.
  $logdir = '/var/log/lighttpd' if $f eq 'access.log';
  open F,"$logdir/$f" or die "Could not open $logdir/$f ($!)";
  while(<F>){
    my ($host, $ident_user, $auth_user, $day,$mon,$year, $hour,$min,$sec,
    $time_zone, $method, $url, $protocol, $status,
    $bytes, $referer, $agent) =
    /                 # regexp begins
    ^               # beginning-of-string anchor
    (\S+)           # assigned to $host
    \               # literal space
    (\S+)           # assigned to $ident_user
    \               # literal space
    (\S+)           # assigned to $auth_user
    \               # literal space
    \[              # literal left bracket
    (\d\d)          # assigned to $day
    \/              # literal solidus
    ([A-Z][a-z]{2}) # assigned to $mon
    \/              # literal solidus
    (\d{4})         # assigned to $year
    :               # literal colon
    (\d\d)          # assigned to $hour
    :               # literal colon
    (\d\d)          # assigned to $min
    :               # literal colon
    (\d\d)          # assigned to $sec
    \               # literal space
    ([^\]]+)        # assigned to $time_zone
    \]\ "           # literal string '] "'
    (\S+)           # assigned to $method
    \               # literal space
    (.+?)           # assigned to $url
    \               # literal space
    (\S+)           # assigned to $protocol
    "\              # literal string '" '
    (\S+)           # assigned to $status
    \               # literal space
    (\S+)           # assigned to $bytes
    \               # literal space
    "([^"]+)"       # assigned to $referer
    \               # literal space
    "([^"]+)"       # assigned to $agent
    $               # end-of-string anchor
    /x              # regexp ends, with x modifier
    or next;

    $host eq '::1' and next; # Ignore Apache generated requests from localhost.

    $bytes =~ /^\d+$/ or $bytes = 0;

    $host{$host}++;
    $bytesperhost{$host} += $bytes;
    $url{$url}++;
    $status_class = int($status/100) . '00';
    $status{$status_class}++;
    $urlsperhost{"$host $url"}++;

    # Parse the $time_zone variable.
    my $tz = 0;
    my ($tzs,$tzh,$tzm) = $time_zone =~ /([\-+ ])(\d\d)(\d\d)/;
    if(defined $tzs){
      $tzs = $tzs eq '-' ? 1 : -1;
      $tz = $tzs * (3600*$tzh + 60*$tzm);
    }

    my $time = timegm($sec,$min,$hour,$day,$mon{$mon},$year-1900) + $tz;
    $mintime = $time if $time < $mintime;
    $maxtime = $time if $time > $maxtime;
  }
  close F;
}

my $start = localtime $mintime;
my $end   = localtime $maxtime;

print "Analysis of log records between:\n$start and\n$end\n\n";

my %dns;

my @toprequestors = (sort { $host{$b} <=> $host{$a} } keys %host)[0..9];
print "Top 10 requesters:\n";
foreach my $host (@toprequestors){
  my $name = dns($host);
  printf "  %-15s %12s requests$name\n",$host,add_commas($host{$host});
}

print "\n";

my @topvolume =
(sort { $bytesperhost{$b} <=> $bytesperhost{$a} } keys %bytesperhost)[0..9];
print "Top 10 by volume downloaded:\n";
foreach my $host (@topvolume){
  my $name = dns($host);
  printf "  %-15s %16s bytes$name\n",$host,add_commas($bytesperhost{$host});
}

print "\n";

my @topurls = (sort { $url{$b} <=> $url{$a} } keys %url)[0..9];
print "Top 10 URLs requested:\n";
foreach my $url (@topurls){
  printf "  %12s $url\n",add_commas($url{$url});
}

print "\n";

my @topurlsperhost =
(sort { $urlsperhost{$b} <=> $urlsperhost{$a} } keys %urlsperhost)[0..9];
print "Top 10 URLs per host:\n";
foreach my $hosturl (@topurlsperhost){
  my ($host,$url) = split " ",$hosturl,2; # limit of 2 keeps URLs that contain spaces intact
  my $name = dns($host);
  printf "  %4d %-15s $url$name\n",$urlsperhost{$hosturl},$host;
}

print "\n";

print "Number of requests per status class:\n";
foreach my $class (sort {$a <=> $b} keys %status){
  printf "%4d  %16s\n",$class,add_commas($status{$class});
}

sub dns{
  my $ip = shift;
  return $dns{$ip} if defined $dns{$ip} && $dns{$ip};
  my $lookup = `/usr/sbin/host $ip 2>/dev/null`;
  my $name;
  if($lookup =~ /NXDOMAIN/
  or $lookup =~ /SERVFAIL/
  or $lookup =~ /timed out/
  ){
    $name = '';
  }
  else{
    $name = (split ' ',$lookup)[-1];
    $name =~ s/\.$//;
    $name = " ($name)";
  }
  $dns{$ip} = $name if $name;
  $name;
}

sub add_commas{
  # Add commas to a number string (e.g. 1357924683 => 1,357,924,683)
  my $num = reverse shift;
  $num =~ s/(...)/$1,/g;
  chop $num if $num =~ /,$/;
  $num = reverse $num;
}
#######################################################
###  Calomel.org  calomel_web_report.pl  END
#######################################################
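
The time zone handling inside the loop can be looked at on its own. The sketch below is a rough standalone illustration, not the author's code. The helper name log_time_to_epoch is hypothetical, and it assumes the timestamp has already been pulled out of the square brackets of a log line. It turns the local time plus offset into UTC epoch seconds, which is what the script compares to find the first and last record.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::Local qw(timegm);

my %mon = (Jan => 0, Feb => 1, Mar => 2, Apr => 3, May => 4,  Jun => 5,
           Jul => 6, Aug => 7, Sep => 8, Oct => 9, Nov => 10, Dec => 11);

# Convert an Apache timestamp such as "10/Jan/2010:10:20:30 -0600"
# to seconds since the epoch (UTC). Returns undef if it does not match.
sub log_time_to_epoch {
    my ($stamp) = @_;
    my ($day, $mon, $year, $hour, $min, $sec, $sign, $tzh, $tzm) =
        $stamp =~ m{^(\d\d)/([A-Z][a-z]{2})/(\d{4}):(\d\d):(\d\d):(\d\d) ([-+])(\d\d)(\d\d)$}
        or return;
    return unless exists $mon{$mon};

    my $offset = 3600 * $tzh + 60 * $tzm;
    $offset = -$offset if $sign eq '-';

    # Logged wall-clock time = UTC + offset, so UTC = logged time - offset.
    return timegm($sec, $min, $hour, $day, $mon{$mon}, $year) - $offset;
}

# The same instant written in two different zones gives the same epoch value.
my $a = log_time_to_epoch('10/Jan/2010:10:20:30 -0600');
my $b = log_time_to_epoch('10/Jan/2010:16:20:30 +0000');
print "epoch: $a\n";
print "match: ", ($a == $b ? "yes" : "no"), "\n";
```

Normalizing to UTC first means records from servers logging in different zones can be compared directly, and localtime is only applied once when the report header is printed.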


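Step 2: The only option at the top of the script tells it where to find your log directory. In our example the archived logs are in the /var/log/web_server directory, which contains all of the access.log files we are looking for. Note that the script also reads the live access.log from /var/log/lighttpd, a path set inside the foreach loop, so change that path as well if your current log lives elsewhere. This is the ninth line from the top of the script that you are looking for: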
my $logdir = '/var/log/web_server';
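
Before scheduling the report, it can help to confirm which archived files a given directory would contribute. The following is a small sketch under stated assumptions. The helper pick_logfiles is hypothetical, rotated logs are assumed to be named access.log plus a suffix (for example access.log.0), and the limit of 6 mirrors the script. It lists only rotated files, since the script appends the live access.log separately.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the archived log names in $dir: "access.log" followed by a suffix,
# sorted by name and trimmed to the last $keep entries.
sub pick_logfiles {
    my ($dir, $keep) = @_;
    opendir my $dh, $dir or die "Could not open $dir ($!)";
    my @files = sort grep { /^access\.log./ } readdir $dh;
    closedir $dh;
    shift @files while @files > $keep;
    return @files;
}

# Use the directory given on the command line, or the article's example path.
my $logdir = @ARGV ? $ARGV[0] : '/var/log/web_server';

if (-d $logdir) {
    print "$_\n" for pick_logfiles($logdir, 6);
}
else {
    print "$logdir is not a directory\n";
}
```

If the list is empty or shows unexpected names, adjust $logdir or the file name pattern before running the full report.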


Step 3: Now that you have the script and have edited the $logdir directive to tell it where to look for the logs, it is time to set up a cron job to run it. You may find that running it once in the morning and once before the end of the working day is most beneficial. This is an example cron job that runs the calomel_web_report.pl script from the /tools directory at 8am and 5pm every day and mails the output to root.
#minute (0-59)
#|   hour (0-23)
#|   |    day of the month (1-31)
#|   |    |   month of the year (1-12 or Jan-Dec)
#|   |    |   |   day of the week (0-6 with 0=Sun or Sun-Sat)
#|   |    |   |   |   commands
#|   |    |   |   |   |
#### Calomel.org Web Report (cron job)
00   8,17 *   *   *   /tools/calomel_web_report.pl | mail -s "`hostname` web report" root




In Conclusion

That's it. Now you can receive summarized web log reports by email, even on portable devices. There is also nothing wrong with looking at this report on the desktop, as it gives an informed view of web server access over the last few days.