visitors

Language: en

Version: OCTOBER 2004 (mandriva - 01/05/08)

Section: 1 (User commands)

NAME

visitors - process a web log file for visitor statistics

SYNOPSIS

visitors [OPTION]... [--file=FILE]

DESCRIPTION

visitors processes a web log file, trying very hard to identify each single "person" as accurately as possible. This is typically achieved either by use of an identifying cookie in the log file, or via the IP Address/Name & Browser ID combination.

visitors assumes that the logs being sent are already sorted into oldest->most_recent date/time order.
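
Since the ordering is assumed rather than enforced, it can be worth checking a log before feeding it in. The following is a minimal sketch of such a check, not part of visitors itself. It assumes the timestamp field used by Apache's common and combined log formats; the function name first_out_of_order is hypothetical.

```python
import re
from datetime import datetime

# Timestamp field as written by Apache's common/combined log formats,
# e.g. [22/May/2004:10:15:00 +1000]. This format is an assumption.
TIMESTAMP = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]")


def first_out_of_order(lines):
    """Return the 1-based number of the first line that is older than
    the previous timestamped line, or None if the input is sorted."""
    previous = None
    for number, line in enumerate(lines, start=1):
        match = TIMESTAMP.search(line)
        if match is None:
            continue  # skip lines without a recognisable timestamp
        current = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
        if previous is not None and current < previous:
            return number
        previous = current
    return None
```

Called on an open log file, for example first_out_of_order(open("test.log")), it reports where the ordering first breaks, if anywhere.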

A Berkeley Database is used to maintain visitor details between runs, as well as to reduce the memory footprint within a run. All files are processed in the order specified.

visitors works best (currently only!) when the Apache module "mod_usertrack" is enabled and a cookie entry is stored in the resulting log file.

The numbers produced are not perfect. There is an element of error, but hopefully these numbers are more accurate than those from alternative methods.

There is one major trick to be aware of: the separation of old vs new visitors. A new visitor will only be converted to an old visitor in a new run of the program. This means that if you wish to compute new vs old over a month, you must send an entire month's worth of logs to visitors in a single run. Doing multiple runs will not achieve the desired result in this instance.

TERMINOLOGY

Visitor

A visitor is a unique entity that accesses a web site. A single visitor may be several persons (think of a home PC shared by Mum, Dad, etc.), or none (a search engine spider).

Visit

A visit is, roughly, the session during which a visitor accesses a web site in a single sitting. The standard, which is also used herein, is that a period of no activity of greater than 30 minutes marks the start of a new visit. Additionally, a visit is only counted by page accesses, i.e. images and the like are ignored for this calculation.
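
The timeout rule can be sketched as follows. This is an illustration of the definition above, not the program's actual code; the function name count_visits is hypothetical, and the input is assumed to be one visitor's page-access times in seconds, already sorted.

```python
def count_visits(page_times, timeout=1800):
    """Count visits for one visitor.

    page_times: page-access times in seconds, oldest first.
    timeout: a gap strictly greater than this starts a new visit
             (1800 seconds = 30 minutes, the documented default).
    """
    visits = 0
    last = None
    for t in page_times:
        if last is None or t - last > timeout:
            visits += 1  # first access, or the gap exceeded the timeout
        last = t
    return visits
```

Note that the gap is measured from the previous page access, not from the start of the visit, so a long sitting with steady activity still counts as one visit.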

Page

A page is determined herein by what it is not, rather than by what it is. The intent is to capture items that provide "primary" information to visitors to the site, e.g. PDFs, search engine results, possibly even archives (zip, tar.gz etc). The default setting is to ignore images (gif, jpeg, png), stylesheets (css) and Javascript (js).
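
A "what it is not" filter of this kind might look like the sketch below. The extension set mirrors only the defaults listed above; matching on the file extension of the URL path, and the names IGNORED_EXTENSIONS and is_page, are assumptions made for illustration rather than a description of how visitors implements its filter.

```python
import posixpath
from urllib.parse import urlsplit

# Defaults taken from the text above; anything else counts as a page.
IGNORED_EXTENSIONS = {".gif", ".jpeg", ".png", ".css", ".js"}


def is_page(url):
    """Return True unless the URL path ends in an ignored extension."""
    path = urlsplit(url).path  # drop any query string or fragment
    extension = posixpath.splitext(path)[1].lower()
    return extension not in IGNORED_EXTENSIONS
```

Under this sketch, /docs/report.pdf and / are pages, while /logo.PNG?v=2 is not.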

New Visitor

A new visitor is a visitor that cannot be identified as having visited this site before, be it via cookie or via IPAddress/BrowserID combined. On the first ever run, with a fresh database, every visitor will be (probably incorrectly) labelled as a new visitor.

Old Visitor

An old visitor is the inverse of a new visitor. The count is in fact derived by taking the total number of visitors and subtracting the new visitors.

LOGIC

The gist is to identify "person A" as distinct from "person B", and so on, as accurately as possible. There are several methods currently in use for doing so from the raw logs alone, without using external trackers:

Identify via IP Address

Is lousy for dealing with corporate gateways, where multiple users all come from the same IP address. It goes the other way when dealing with dispersed proxy servers, such as with AOL: a single user appears to be multiple users. An additional problem is with, for example, dial-up users, whose IP address may change on every new connection to a web site over time.

Identify via IP Address AND Browser ID

Has similar problems to IP address alone, but does help distinguish similar users behind the same IP address.

Identify via Cookie

The big problem here is that a lot of "visitors" (~25% by my reckoning/testing) don't accept cookies. With mod_usertrack, this means that every hit gets a brand new cookie logged, so a single user with cookies disabled can significantly and artificially inflate the number of visitors counted. I've seen a well-known (and quite expensive!) commercial web log analysis product mis-call ~2000 visitors as 40,000 visitors by not accounting for this quite elementary issue.

Visitors

Gets around some of these problems by combining the latter two methods. Where possible, use the identifying cookie; failing that, fall back to IPAddress/BrowserID combined.
o  Visitors are ONLY counted by page accesses, not every hit.
o  Non-page accesses (eg Images, CSS) are used to verify the correctness or otherwise of a cookie.
o  A visit is deemed over after a default period of 30 minutes between page accesses has expired.
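
The cookie-first, fall-back-second rule can be sketched as below. This is a simplified illustration only: the function name visitor_key and the tuple layout are hypothetical, and the sketch leaves out the step where non-page accesses are used to verify that a cookie is genuine.

```python
def visitor_key(cookie, ip_address, browser_id):
    """Pick the identity used to tell one visitor from another.

    Prefer the identifying cookie; if there is none, fall back to the
    IP address and browser ID combined.
    """
    if cookie:
        return ("cookie", cookie)
    return ("ip+browser", ip_address, browser_id)
```

Tagging each key with its kind keeps a cookie value from ever colliding with an IP address string.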

OPTIONS

-c --cache=SIZE
Set the size of the memory cache to use. Value is in MB. Default is 20MB.
-d --database=FILE
Change the default database file to use to store stateful data.
-f --file=FILE
Web log file to process. STDIN is used if this is not set.
-F --filter=VALUE
Modify the filter to eliminate what is not a page
-h --help
Help data. Very brief.
-s --simple
Provide a very simple, single line of results. All the other results can be derived from this line.
The column order is: Days Visitors NewVisitors OldVisitors Visits NewVisits OldVisits Pages NewPages OldPages
These headers will be displayed with a single "-v"
-t --timeout=VALUE
Change the visitor timeout from its default of 30 minutes. VALUE is in seconds (1800 default).
-v --verbose
Verboseness of a run. More v's will increase the level of verbosity, up to a maximum of 5.
-V --version
Display the version information and exit
--cleanup=DAYS
This option helps stop the visitors database from growing ad infinitum. All records that have not been seen in the supplied number of days will be deleted. It is strongly urged that a backup copy of the database be made before using this option. The starting day is rounded out to the end of the current day, which helps avoid timing issues with small delays between query and cleanup.
--query
This option is the flag that moves visitors into query mode, in which existing data can be examined for possibly useful information, or to enable the cleanup of defunct information.
--unseen-since=DAYS
One of the --query options. Displays the number of visitors per day not seen in this many days. Set to 365, it will display visitors who have not been seen in over a year. Very useful for analysis prior to a "--cleanup" run.
--access-delta
One of the --query options. Displays information regarding the delta in days between the first visit and the most recent visit. Useful for calculating numbers of long term visitors and their activity levels.
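
For scripting around "--simple", the documented column order can be turned into named fields. The sketch below assumes the line consists of ten whitespace-separated integers; the exact output format is not specified above, so treat that as an assumption. The names parse_simple and is_consistent are hypothetical, and the numbers in the usage lines are made up.

```python
# Column order as documented for -s / --simple.
COLUMNS = ("Days", "Visitors", "NewVisitors", "OldVisitors", "Visits",
           "NewVisits", "OldVisits", "Pages", "NewPages", "OldPages")


def parse_simple(line):
    """Map one line of --simple output onto the documented column names."""
    values = [int(field) for field in line.split()]
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(values)}")
    return dict(zip(COLUMNS, values))


def is_consistent(row):
    """Old visitors are described as total visitors minus new visitors."""
    return row["OldVisitors"] == row["Visitors"] - row["NewVisitors"]


# Usage with made-up numbers (not real output):
row = parse_simple("31 1000 400 600 1500 500 1000 9000 3000 6000")
print(row["Visitors"], is_consistent(row))
```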

RESULTS

As mentioned earlier, the goal is to display the number of visitors to a site. With this central piece of information we can track other useful statistics: repeat vs new visitors, visits, and page accesses (again repeat vs new).
Days
The number of days from the first log entry to the last. This will always be a minimum of 1.
Visitors
The number of uniquely identified "persons" who visited this site in the given period.
New Visitors
Similar to Visitors, but specifically identifies those we have not seen prior to this run.
Old Visitors
Inverse of New Visitors. Those we have seen before.
Visits

EXAMPLES

A typical run, using a database in /tmp/ (/tmp/c.db) and a log file in the current directory (test.log):
    visitors -d /tmp/c.db -f test.log

Wanting to find how many visitors have not been seen in over a year?
    visitors -d /dev/shm/2004.db --query --unseen-since=366
        Date         Nbr Unseen Visitors
        22-May-2004                 1234
        21-May-2004                 5678
        20-May-2004                 9012
        19-May-2004                 3456
        ...
        ================================
        Total:                    987654

This means that there have been 987654 recorded visitors who have not visited the site in over a year.

Wanting to see additional information about visitors, particularly across long term data?
    visitors -d /dev/shm/2004.db --query --access-delta
         Days         Nbr      Avg     Avg
        Delta    Visitors   Visits   Pages
            0      543210      1.0     5.0
            1        1234      2.1    15.1
            2        5678      2.2    16.2
            3        9012      2.3    17.3
            4        3456      2.4    18.4
        ...

A delta of zero means the visitor first and last visited on the same day. They may have visited multiple times, or not.

FILES

/usr/local/var/visitors/visitors.db
The database file for retaining state information

BUGS

o  Multiple browsers used by the same person on a single machine (probably) register as multiple visitors, when there should actually be only one.
o  Cookie blockers mixed with regular users from behind a common IPAddress/BrowserID may get incorrectly identified. This is a timing problem.
o  Assumes that the logs being sent are already sorted into oldest->most_recent date/time order.
o  Will only accept a single file via the command line. This is deemed more feature than bug.

AUTHOR

Steve McInerney <spm@stedee.id.au>