ISSEL Log Companion Whitepaper
by Björn Svensson, Technical Director, ISSEL
Log files are pre-processed either to simplify later
processing or to achieve something that cannot be handled
by the Web Traffic Analysis software which will analyse the
data. This can include:
- ABCE compliance
- Load Balanced servers
- Log Cleansing
- Adding and transforming information
- Transforming log files
- Automation
- Bug fixing
- Customisation
ABCE Compliance
ABC Electronic (ABCE) audits web site traffic. This is particularly
important for advertisers and investors so that traffic to
different sites can be correctly compared. To achieve this,
ABCE applies standards for what can be counted and how it
is counted. More information about these rules and web site
auditing can be found at ABC Electronic's web site: http://www.abce.org.uk.
Some of the rules that ABCE requires are not handled by most
common Web Traffic analysis tools. For example, ABCE does
not allow traffic generated by robots indexing the site to
be counted for auditing purposes. To enable fair comparison
between different sites ABCE keeps a regularly updated list
of robots which must be excluded. The list is available from
the ABCE site and comprises around 300 different robot User-Agents.
Most analysis tools cannot handle this volume of exclusions
whereas the Log Companion can handle it with ease. Indeed,
the exclusion list can be downloaded from the ABCE site and
used directly with the Log Companion.
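As a rough illustration of exclusion by User-Agent list, the following Python sketch filters NCSA combined-format lines against a list of robot User-Agent strings. The function names, the one-entry-per-line list format and the case-insensitive substring matching are assumptions made for illustration; they do not describe the Log Companion's internals or the actual format of the ABCE list.

```python
import re

# Assumed NCSA combined format: the User-Agent is the last quoted field.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')

def load_robot_list(text):
    """Parse an exclusion list with one User-Agent substring per line (assumed format)."""
    return [line.strip().lower() for line in text.splitlines()
            if line.strip() and not line.startswith("#")]

def is_robot(log_line, robots):
    """Return True if the line's User-Agent contains any listed robot substring."""
    match = UA_PATTERN.search(log_line)
    if not match:
        return False
    agent = match.group(1).lower()
    return any(robot in agent for robot in robots)

def exclude_robots(lines, robots):
    """Return only the lines that were not produced by a listed robot."""
    return [line for line in lines if not is_robot(line, robots)]
```

A list of a few hundred entries is checked with a simple loop here; a real tool would probably pre-compile the list for speed.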
ABCE also enforces other special requirements which the Log
Companion can handle:
- exclude requests where the User-Agent field is empty
- exclude all requests which are not GET or POST
- exclude requests with error and re-direction Status codes
- exclude all requests which are not directly requested
by the user
The last entry is part of the core of ABCE compliance. ABCE
counts Page Impressions which are defined as a unique page
presented to the viewer. For each click the user makes on
the site, one and only one Page Impression may be counted.
It does not matter how many files were used to
create the page. Furthermore, frame sets, style sheets, JavaScript
files etc. which are merely part of the construction of
the page must not be counted. Once again this creates a long
list of exclusions which the Log Companion is ideally suited
to handle. ABCE does not allow counting of general Hits since
this is completely dependent on the proportion of graphics
on the site.
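The exclusion rules listed above might be combined into a single test per request, as in this Python sketch. The `Request` record, the short list of non-page file extensions and the choice to count only 2xx status codes are illustrative assumptions, not ABCE's published rule set.

```python
from dataclasses import dataclass
import posixpath

# Illustrative list of file types that only help to build a page.
NON_PAGE_EXTENSIONS = {".gif", ".jpg", ".png", ".css", ".js"}

@dataclass
class Request:
    method: str
    path: str
    status: int
    user_agent: str

def is_page_impression(req):
    """Apply the exclusion rules from the text to one parsed request."""
    if not req.user_agent or req.user_agent == "-":
        return False          # empty User-Agent field
    if req.method not in ("GET", "POST"):
        return False          # only GET and POST may be counted
    if not 200 <= req.status < 300:
        return False          # error and redirection status codes
    extension = posixpath.splitext(req.path)[1].lower()
    return extension not in NON_PAGE_EXTENSIONS  # page components are not pages
```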
Load Balanced Servers
Most Web Traffic Analysis tools analyse log files strictly
one-by-one, creating the statistics in a linear fashion. This
can cause particular problems when handling Load Balanced
web servers. Load balanced servers are used to increase the
performance of web sites by spreading the load of the traffic
across several machines, each typically having its own copy
of the web site. This setup is completely transparent to
the visitor to the site, who does not notice that each page
viewed may come from a different physical machine.
However, each of the machines also creates a separate set
of log files which must all be analysed together to generate
correct statistics. This is especially important for Visit
calculations where many requests can be part of the same Visit.
In this scenario the requests are spread out between different
servers and it is not until you look at all the log files
together that you get the complete picture.
Since most Web Traffic analysis tools read the log files
one-by-one they cannot track Visits that span several different
log files. To solve this problem the Log Companion can merge
the separate log files, one for each load balanced server,
into one file where every request is organised in order of
time stamps. This file can then be processed by the analysis
tool which receives the log data from all the servers in one
file to allow it to correctly calculate the Visits spanning
all the servers.
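The merge described above can be sketched in Python with a standard k-way merge. This assumes NCSA-style lines with a bracketed time stamp and that each server's log is already in time order; the function names are illustrative and this is not the Log Companion's own implementation.

```python
import heapq
from datetime import datetime

def timestamp(line):
    """Parse the [dd/Mon/yyyy:HH:MM:SS +zzzz] field of an NCSA log line."""
    start = line.index("[") + 1
    end = line.index("]", start)
    return datetime.strptime(line[start:end], "%d/%b/%Y:%H:%M:%S %z")

def merge_logs(*logs):
    """Merge several time-ordered iterables of log lines into one ordered stream."""
    return heapq.merge(*logs, key=timestamp)
```

Because `heapq.merge` is lazy, open file objects can be passed in directly, so the individual logs never need to fit in memory at once.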
Note: This function is at the moment only available
on request from ISSEL but will be rolled into a future release
of the Log Companion.
Log File Management
Popular web sites typically produce huge volumes of log files.
Multiple web sites compound the problem. The Log
Companion simplifies the management of huge log file volumes
by:
- Quickly processing vast log files. Log file size is typically
reduced 5 - 10 times when unnecessary data is removed, which
speeds up the loading of those logs into your
analysis tool correspondingly.
- Archiving and compressing processed log files, which
typically reduces log file size 10 - 20 times, thereby saving
valuable disk space.
- Writing processed log files to any location and optionally
renaming them, thereby simplifying any further handling by
your analysis tool.
Log Cleansing
Surprising though it may seem, in real life log files often
have a lot of "rubbish" stored in them. This is
data or characters that have no relation to log file data
whatsoever and can sometimes cause unexpected problems for
the analysis tools, not to mention odd entries in the reports
produced. The Log Companion can be used to clean up log files
and has a number of built-in features:
- option to exclude Control characters
- checks that each request has all the fields as defined
in the log file header or by the log file format
- checks that each request has basic characteristics which
must always be present
The Log Companion has the option to exclude any request line
which has an embedded Control character. This is a non-printing
ASCII character (codes 0 to 31 and 127) with no visible representation.
The Log Companion also checks that each line has all the
entries which are supposed to be there. For example, for IIS
files it checks that each line has all the fields defined
in the header. For other formats it checks the fields based
on which log format has been selected.
In addition the Log Companion checks for some basic characteristics
for each line. For example, the URL file path must always
begin with a "/" character. For NCSA files it requires
that there is always a field for the protocol version "http/1.x"
etc.
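The three checks above could look roughly like the following Python sketch for space-delimited IIS (W3C extended) lines. The `#Fields:` directive and the `cs-uri-stem` field name come from that format; the function names and the decision to drop, rather than repair, bad lines are assumptions.

```python
import re

# ASCII control characters: codes 0-31 and 127.
CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f]")

def is_clean(line, field_names):
    """Check one space-delimited request line against the three rules in the text."""
    if CONTROL_CHARS.search(line):
        return False                      # embedded control character
    fields = line.split(" ")
    if len(fields) != len(field_names):
        return False                      # missing or extra fields
    path = fields[field_names.index("cs-uri-stem")]
    return path.startswith("/")           # URL path must begin with "/"

def cleanse(lines, header):
    """Keep only the clean lines; header is a W3C-style '#Fields:' directive."""
    field_names = header[len("#Fields:"):].split()
    return [line for line in lines if is_clean(line.rstrip("\r\n"), field_names)]
```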
Adding and Transforming Information
The Log Companion can add some information to the log files
to make analysis easier:
- ABCE Users
- web server name
In addition to adding information the Log Companion can also
transform some of the information in the log files:
- port number
- cookies
- query strings
- source fields
ABCE defines a User as every unique combination of Visitor
IP-Address and User-Agent. Whilst some analysis tools can
create this information out of the data in the log files, doing
so sometimes prevents full analysis of the data. To simplify this
process the Log Companion can insert this combined value into
the User field in the log file, which is not normally used.
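A minimal Python sketch of this transformation for W3C extended lines follows. The field names `c-ip`, `cs(User-Agent)` and `cs-username` are standard for that format, but the underscore separator and the function name are assumptions; the text does not say how the Log Companion joins the two values.

```python
def add_abce_user(line, field_names):
    """Write '<ip>_<user-agent>' into the cs-username field of a W3C log line."""
    fields = line.split(" ")
    ip = fields[field_names.index("c-ip")]
    agent = fields[field_names.index("cs(User-Agent)")]
    # The "_" separator is an arbitrary choice for this sketch.
    fields[field_names.index("cs-username")] = f"{ip}_{agent}"
    return " ".join(fields)
```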
The Log Companion can also add proper Web Server names to
the log files. This is important for four main reasons:
- Many log files do not include any information about which
Web Server the data actually belongs to, not even in the
log file name. This is typically the case with Apache and
NCSA log files.
- Some log files do include this information but in a format
which is not very useful to present in Web Traffic reports,
for example the Web Server IP addresses. This is typical
for IIS4 log files.
- The correct Web Server name may be present but not in
a field recognised by the Web Traffic Analysis tool.
- Some Web Traffic Analysis tools can produce automated
reports based on the Virtual Server information but this
only makes sense if the proper Web Server names are available
from the log files.
The Log Companion can either add any user defined text string
as the Web Server name or in the case of IIS5 log files, copy
the Host Header field to the Server IP Field. Normally the
server information is written to an existing field in the
log file but in the case of IIS4 and IIS5 the field will also
be added if it is missing.
The Log Companion can also change the Port Number in the
log files. This is especially useful if you have several Virtual
Servers which use different ports but are to be analysed as
one server. For example you may have one Virtual Server using
the standard port number 80 and then a secure server for the
same web site using port 443. In this case some analysis tools
treat them as separate servers. The Log Companion can make
them appear as the same server for the analysis tool by changing
the Virtual Servers to use the same port, for example port
80 in this case.
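For W3C extended lines, the port change might be sketched in Python as below. The `s-port` field name belongs to that format; the `port_map` parameter and the function name are hypothetical.

```python
def unify_ports(line, field_names, port_map):
    """Replace the s-port value according to port_map, e.g. {"443": "80"}."""
    fields = line.split(" ")
    index = field_names.index("s-port")
    # Ports not listed in port_map are left unchanged.
    fields[index] = port_map.get(fields[index], fields[index])
    return " ".join(fields)
```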
The Log Companion has two special features for handling cookies
in an effective way:
- Extract the persistent cookie from the cookie field.
- Convert cookies to parameters.
The first feature can extract a particular persistent cookie
value from the cookie field. This is the value of the cookie
that represents a unique user coming to the site. After the
transformation only the persistent value will be left in the
cookie field and the rest of the cookie string is discarded.
This makes it easier for analysis tools to use the cookie value
for Visits calculations.
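The extraction might be sketched in Python as follows, assuming the cookie field holds `name=value` pairs separated by semicolons and that "-" marks an empty field. The function name and the cookie name passed in are hypothetical.

```python
def extract_persistent_cookie(cookie_field, name):
    """Reduce an 'a=1; b=2' cookie string to the single named cookie, or '-' if absent."""
    for pair in cookie_field.split(";"):
        key, _, value = pair.strip().partition("=")
        if key == name:
            return f"{key}={value}"
    return "-"
```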
Converting cookies to parameters is useful to optimise the
reporting functions of the analysis tool as many have better
functions for reporting on parameters than on cookies. This
is also handy if the previous function is used (persistent
cookies) since it can be used to save selected cookies as
parameters which would otherwise have been discarded.
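One way to picture the conversion is the Python sketch below, which appends selected cookies to the request's query string. The semicolon-separated cookie format and the function name are assumptions.

```python
from urllib.parse import parse_qsl, urlencode

def cookies_to_parameters(query, cookie_field, names):
    """Append the named cookies to a query string as extra parameters."""
    params = parse_qsl(query, keep_blank_values=True)
    for pair in cookie_field.split(";"):
        key, _, value = pair.strip().partition("=")
        if key in names:
            params.append((key, value))
    return urlencode(params)
```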
The Log Companion can modify the query strings in the log
files. This option allows you to specify which query parameters
you want to keep with the Log Companion discarding all other
parameters. The function is available both for the query parameters
requested and also separately for the source query parameters.
Discarding parameters can have major benefits for simplifying
reporting and also saving space in the reporting system databases.
On web sites with dynamic content where the pages are defined
by the query string parameters this is particularly useful.
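Keeping only chosen query parameters can be sketched in a few lines of Python using the standard library. The function name and the example parameter names are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode

def keep_parameters(query, keep):
    """Return the query string with every parameter not named in keep removed."""
    pairs = parse_qsl(query, keep_blank_values=True)
    return urlencode([(k, v) for k, v in pairs if k in keep])
```

For example, `keep_parameters("id=7&session=abc123&page=2", {"id", "page"})` drops the session identifier, so every visitor's request for the same page produces the same URL in the reports. Note that `urlencode` re-encodes the values, so unusual escaping in the original log may be normalised.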
Automation
The Log Companion produces Batch files which can be scheduled
to run at specific times using the standard Windows NT scheduling
functions, for example by using the AT command.
The Log Companion can also be used to control Marketwave HitList,
either so that HitList executes once all log file pre-processing
is completed, or so that HitList executes directly after each
log file is processed, thereby ensuring that only one log file
is loaded into HitList at a time. When many large log files are
to be loaded at once, this prevents HitList running out of system
resources on smaller machines.
Customisation
Web Sites can be designed in a multitude of unique ways and
by using Web Server programming functions it is also possible
to control, to some extent, what gets written to the log files.
This can result in log files being produced which do not follow
the established standards, causing problems when it is time
to report on the traffic to the Web Site.
The Log Companion can be extensively customised to handle special
requirements so that the log files are processed in the best
way for Web Traffic analysis. Here are just a few examples
of what can be done:
- Re-format the log file to fit in with standards so that
the Analysis tool can properly understand the file
- Move information between different fields to where they
are best processed by the Analysis tool
- Add missing information based on defined rules
- Remove information or whole request lines which do not
comply with defined rules, like the ABCE rules
- Transform fields to better suit the Analysis tools
Anything that can be described programmatically can be achieved.
Here are some examples of customisations completed:
- Copy custom user registration information from URL parameters
to Cookies
- Remove unique User ID information from the URL and put
it in the cookie field. This also makes the URL file
paths less unique and therefore easier to analyse
- Transform log files from proprietary formats to standardised
formats
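As an illustration of the second customisation above, this Python sketch moves a user ID out of the URL path and into the cookie field. The URL layout (`/u<digits>/...`), the `userid` cookie name and the function name are entirely hypothetical; a real customisation would follow the site's own URL scheme.

```python
import re

# Hypothetical URL layout: the user ID is the first path segment, e.g. /u12345/page.html
USER_ID_SEGMENT = re.compile(r"^/u(\d+)(/.*)$")

def move_user_id(path, cookie_field):
    """Strip the user ID from the path and append it to the cookie field."""
    match = USER_ID_SEGMENT.match(path)
    if not match:
        return path, cookie_field         # nothing to move
    user_id, rest = match.groups()
    cookie = f"userid={user_id}"
    if cookie_field and cookie_field != "-":
        cookie = f"{cookie_field}; {cookie}"
    return rest, cookie
```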