visitors
Version: OCTOBER 2004 (mandriva - 01/05/08)
Section: 1 (User commands)
NAME
visitors - process a web log file for visitor statistics
SYNOPSIS
visitors [OPTION]... [--file=FILE]
DESCRIPTION
visitors processes a web log file, trying very hard to identify each individual "person" as accurately as possible. This is typically achieved either through an identifying cookie in the log file or through the combination of IP address/name and browser ID.
It assumes that the logs it receives are already sorted in oldest-to-most-recent date/time order.
A Berkeley database is used to maintain visitor details between runs, as well as to reduce the memory footprint within a run. All files are processed in the order specified.
visitors works best (currently only!) when the Apache module "mod_usertrack" is enabled and a cookie entry is stored in the resulting log file.
The numbers produced are not perfect. There is an element of error, but hopefully these numbers are more accurate than those from alternative methods.
There is one major trick to be aware of: the separation of old and new visitors. A new visitor is only converted to an old visitor in a new run of the program. This means that if you wish to compute new versus old visitors over a month, you must send an entire month's worth of logs to visitors in a single run. Multiple runs will not achieve the desired result in this instance.
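Because input must be in chronological order and a whole period has to go through a single run, rotated log files usually need to be combined into one sorted stream first. The sketch below is one possible way to do that, not something visitors provides. It assumes Apache-style timestamps in square brackets (for example [10/Oct/2004:13:55:36 +1000]) and that each input file is already sorted; the function names and the script name are hypothetical.

```python
import heapq
import re
import sys
from datetime import datetime

# Assumption: each line carries an Apache-style timestamp in square brackets.
_TS = re.compile(r"\[(\d{2}/[A-Za-z]{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]")


def log_time(line):
    """Return the timestamp of a log line as a timezone-aware datetime."""
    match = _TS.search(line)
    if match is None:
        raise ValueError(f"no timestamp found in line: {line!r}")
    return datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")


def merge_sorted_logs(*streams):
    """Lazily merge several individually sorted line streams, oldest first."""
    return heapq.merge(*streams, key=log_time)


if __name__ == "__main__":
    # Hypothetical usage:
    #   python merge_logs.py access.log.2 access.log.1 | visitors -d /tmp/c.db
    files = [open(path, encoding="utf-8", errors="replace") for path in sys.argv[1:]]
    try:
        sys.stdout.writelines(merge_sorted_logs(*files))
    finally:
        for handle in files:
            handle.close()
```

The merged stream is written to standard output, which matches the documented behaviour that visitors reads STDIN when --file is not set.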
TERMINOLOGY
Visitor
A visitor is a unique entity that accesses a web site. A single visitor may be several persons (think of a home PC shared by Mum, Dad, etc.), or none (a search engine spider).
Visit
A visit is, roughly, the session during which a visitor accesses a web site in a single sitting. The standard, also used herein, is that a period of inactivity greater than 30 minutes starts a new visit. Additionally, a visit is only counted by page accesses, i.e. images and the like are ignored for this calculation.
Page
A page is determined herein by what it is not, rather than what it is. The intent is to capture items that deliver "primary" information to visitors to the site, e.g. PDFs, search engine results, possibly even archives (zip, tar.gz, etc.). The default setting is to ignore images (gif, jpeg, png), stylesheets (css) and JavaScript (js).
New Visitor
A new visitor is a visitor that cannot be identified as having visited this site before, whether via cookie or via the IP address/browser ID combination. On the first ever run, with a fresh database, every visitor will be (probably incorrectly) labelled as a new visitor.
Old Visitor
An old visitor is the inverse of a new visitor. The count is in fact derived by taking the total number of visitors and subtracting the new visitors.
LOGIC
The gist is to identify "person A" as distinct from "person B", and so on, as accurately as possible. Several methods are currently in use for doing this from the raw logs, without external trackers:
Identify via IP Address
This is lousy for dealing with corporate gateways, where multiple users all come from the same IP address. It fails the other way with dispersed proxy servers, such as AOL's, where a single user appears to be multiple users. An additional problem is, for example, dial-up users, whose IP address may change on every new connection to a web site over time.
Identify via IP Address AND Browser ID
This has similar problems to using just the IP address, but does help distinguish similar users behind the same IP address.
Identify via Cookie
The big problem here is that a lot of "visitors" (~25% by my reckoning/testing) don't accept cookies. With mod_usertrack, this means that every hit gets a brand new cookie logged, so a single user with cookies disabled can significantly and artificially inflate the number of real visitors. I've seen a well-known, quite expensive, commercial web log analysis product miscount ~2000 visitors as 40,000 visitors by making an incorrect assumption about this quite elementary issue.
Visitors
visitors gets around some of these problems by combining the latter two methods. Where possible, it uses the identifying cookie; failing that, it falls back to the IP address/browser ID combination.
- Visitors are ONLY counted by page accesses, not by every hit.
- Non-page accesses (e.g. images, CSS) are used to verify the correctness or otherwise of a cookie.
- A visit is deemed over after a default period of 30 minutes between page accesses has expired.
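The rules above can be sketched roughly as follows. This is an illustration only, not the program's actual implementation: the Hit record, its field names and the suffix list (including ".jpg") are assumptions made for the example, and the step that uses non-page accesses to verify a cookie is left out.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: default non-page suffixes, loosely following the TERMINOLOGY section.
NON_PAGE_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js")
VISIT_TIMEOUT = 1800  # seconds; the documented default of 30 minutes


@dataclass
class Hit:
    """One log entry (hypothetical structure for this sketch)."""
    time: float                   # seconds since the epoch
    ip: str
    browser: str
    path: str
    cookie: Optional[str] = None  # None when no identifying cookie was logged


def is_page(path):
    """A page is anything not filtered out as an image, stylesheet or script."""
    return not path.lower().split("?", 1)[0].endswith(NON_PAGE_SUFFIXES)


def visitor_key(hit):
    """Prefer the identifying cookie; otherwise fall back to IP plus browser ID."""
    if hit.cookie:
        return ("cookie", hit.cookie)
    return ("ip+browser", hit.ip, hit.browser)


def count_visitors_and_visits(hits, timeout=VISIT_TIMEOUT):
    """Count visitors and visits from hits sorted oldest first, pages only."""
    last_page_time = {}
    visits = 0
    for hit in hits:
        if not is_page(hit.path):
            continue  # images, CSS and the like never start or extend a visit
        key = visitor_key(hit)
        previous = last_page_time.get(key)
        if previous is None or hit.time - previous > timeout:
            visits += 1  # first sighting, or idle for longer than the timeout
        last_page_time[key] = hit.time
    return len(last_page_time), visits
```

Keeping only the time of the last page access per visitor is enough to apply the timeout rule, which is also why the input has to be sorted oldest first.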
OPTIONS
- -c --cache=SIZE
- Set the size of the memory cache to use. The value is in MB. The default is 20 MB.
- -d --database=FILE
- Change the default database file to use to store stateful data.
- -f --file=FILE
- Web log file to process. STDIN is used if this is not set.
- -F --filter=VALUE
- Modify the filter to eliminate what is not a page
- -h --help
- Help data. Very brief.
- -s --simple
- Provide a very simple, one-line result. All the other results can be derived from it.
- The column order is: Days Visitors NewVisitors OldVisitors Visits NewVisits OldVisits Pages NewPages OldPages
- These headers will be displayed with a single "-v"
- -t --timeout=VALUE
- Change the visitor timeout from its default of 30 minutes. VALUE is in seconds (1800 default).
- -v --verbose
- Verboseness of a run. More v's will increase the level of verbosity, up to a maximum of 5.
- -V --version
- Display the version information and exit
- --cleanup=DAYS
- This option helps stop the visitors database from growing ad infinitum. All records that have not been seen in the supplied number of days will be deleted. Making a backup copy of the database before using this option is strongly recommended. The starting day is rounded out to the end of the current day, which helps avoid timing issues with small delays between query and cleanup.
- --query
- This flag moves visitors into query mode, in which existing data can be examined for possibly useful information, or to enable the cleanup of defunct information.
- --unseen-since=DAYS
- One of the --query options. Displays the number of visitors per day not seen in this many days. Set to 365, it will display visitors who have not been seen in over a year. Very useful for analysis prior to a "--cleanup" run.
- --access-delta
- One of the --query options. Displays information about the delta in days between the first visit and the most recent visit. Useful for calculating numbers of long-term visitors and their activity levels.
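The one-line --simple result lends itself to scripting. A minimal sketch of parsing it, assuming the ten values are whitespace-separated integers in the documented column order; the function names and the derived ratios are illustrative and not part of visitors.

```python
# Column order as documented for --simple.
SIMPLE_COLUMNS = (
    "Days", "Visitors", "NewVisitors", "OldVisitors",
    "Visits", "NewVisits", "OldVisits",
    "Pages", "NewPages", "OldPages",
)


def parse_simple(line):
    """Map one line of --simple output onto the documented column names."""
    values = line.split()
    if len(values) != len(SIMPLE_COLUMNS):
        raise ValueError(
            f"expected {len(SIMPLE_COLUMNS)} columns, got {len(values)}"
        )
    return dict(zip(SIMPLE_COLUMNS, (int(v) for v in values)))


def _ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def derived_stats(row):
    """Derive a few ratios from a parsed --simple row."""
    return {
        "visits_per_visitor": _ratio(row["Visits"], row["Visitors"]),
        "pages_per_visit": _ratio(row["Pages"], row["Visits"]),
        "new_visitor_share": _ratio(row["NewVisitors"], row["Visitors"]),
    }
```

If the header line produced by a single "-v" is present, it would need to be skipped before calling parse_simple.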
RESULTS
As mentioned earlier, the goal is to display the number of visitors to a site. With this central piece of information we can track other useful statistics: repeat versus new visitors, visits, and page visitations (repeat versus new).
- Days
- The number of days from the first log entry to the last. This will always be a minimum of 1.
- Visitors
- The number of uniquely identified "persons" who visited this site in the given period.
- New Visitors
- Similar to Visitors, but specifically identifies those we have not seen prior to this run.
- Old Visitors
- Inverse of New Visitors. Those we have seen before.
- Visits
EXAMPLES
A typical run, using a database in /tmp/ (/tmp/c.db) and a log file in the current directory (test.log):
visitors -d /tmp/c.db -f test.log
Wanting to find how many visitors have not been seen in over a year?
visitors -d /dev/shm/2004.db --query --unseen-since=366
Date           Nbr Unseen Visitors
22-May-2004    1234
21-May-2004    5678
20-May-2004    9012
19-May-2004    3456
...
================================
Total: 987654
This means that there have been 987654 recorded visitors who have not visited the site in over a year.
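To use this report in a script, for example when deciding on a value for --cleanup, the per-day rows can be parsed. A sketch, assuming the layout shown above (a date in DD-Mon-YYYY form followed by a count, then a "Total:" line); parse_unseen_report is a hypothetical helper, not part of visitors.

```python
import re
from datetime import datetime

# Assumption: rows look like "22-May-2004 1234", the total like "Total: 987654".
_ROW = re.compile(r"^(\d{1,2}-[A-Za-z]{3}-\d{4})\s+(\d+)$")
_TOTAL = re.compile(r"^Total:\s*(\d+)$")


def parse_unseen_report(text):
    """Return ({date: unseen visitor count}, total or None) from the report text."""
    rows = {}
    total = None
    for raw in text.splitlines():
        line = raw.strip()
        row = _ROW.match(line)
        if row:
            day = datetime.strptime(row.group(1), "%d-%b-%Y").date()
            rows[day] = int(row.group(2))
            continue
        found = _TOTAL.match(line)
        if found:
            total = int(found.group(1))
        # Header, separator and any other lines are ignored.
    return rows, total
```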
Wanting to see additional information about visitors, particularly across long-term data?
visitors -d /dev/shm/2004.db --query --access-delta
Days     Nbr        Avg      Avg
Delta    Visitors   Visits   Pages
0        543210     1.0      5.0
1        1234       2.1      15.1
2        5678       2.2      16.2
3        9012       2.3      17.3
4        3456       2.4      18.4
...
A delta of zero means that the visitor's first and last visits were on the same day. They may or may not have visited multiple times.
FILES
/usr/local/var/visitors/visitors.db
- The database file for retaining state information
BUGS
- Multiple browsers used by the same person on a single machine (probably) register as multiple visitors, when they should actually count as one.
- Cookie blockers mixed with regular users behind a common IP address/browser ID may be incorrectly identified. This is a timing problem.
- Assumes that the logs being sent are already sorted in oldest-to-most-recent date/time order.
- Will only accept a single file via the command line. This is deemed more feature than bug.
AUTHOR
Steve McInerney <spm@stedee.id.au>