Types of Hit List Filters: How They Work and When to Use Them
There are two major types of filtering in Hit List. The
first are "exclusions" that keep certain data or types of
data out of your Hit List database completely. The second
are report-level or element-level "filters" that merely suppress
information from being seen or calculated in a given
report, but don't strip it out of the database.
With this latter type, you can change the filter later, or
use a different one in another report or element, to display
information that was previously suppressed.
1) Exclusions and keeping specific data (or types
of data) out of your Hit List database completely.
The Options/Updates tab.
Keeping certain information or types of information out
of your database is done under Options/Updates. If you
go to this tab, you will see the five slots under "Update
the database with all requests except those". These are
used to keep certain data from being added to your database
from the logfiles, as follows:
"With these URLs (*.GIF, for example)." - This
setting enables you to exclude certain URLs or filetypes
from being loaded into your database when you don't
want them counted as activity on your website.
The most typical example is graphics files, e.g.
GIFs, JPEGs, etc. - usually you won't care about how
many times someone sees the "decorations" most websites
use to make them attractive to the visitor. What you
DO care about is whether they went to the purchase
page, downloaded a demo of your software, etc. and
those "important" URLs are the hits you want to count.
Allowing Hit List to record all the hits to GIFs,
etc. on your site when they are otherwise unimportant
will make your Requests seem artificially high and
is not an accurate reflection of your "important"
traffic.
Remember that this setting isn't limited to
stripping out GIFs, etc. You might want to exclude
all requests to the "Kevin" directory, that is, URLs that
include /kevin/ in their path (e.g. /kevin/kevin.htm).
In that case, you would add */kevin/* here in addition
to the likely GIF and JPEG exclusions, and the "kevin"
URLs would not be recorded in the Hit List database
either.
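As a rough illustration of how wildcard exclusion patterns like these behave, the following Python sketch matches request URLs against a pattern list. It is not Hit List's actual matching code. The function name is_excluded_url and the case-insensitive comparison are assumptions made for the example.

```python
from fnmatch import fnmatchcase

def is_excluded_url(url, patterns):
    """Return True if the URL matches any wildcard exclusion pattern.

    Assumption: matching ignores case, so *.GIF also catches logo.gif.
    """
    lowered = url.lower()
    return any(fnmatchcase(lowered, p.lower()) for p in patterns)

# Patterns as they might be typed into the "With these URLs" slot.
exclusions = ["*.GIF", "*.JPG", "*/kevin/*"]

for url in ["/images/logo.gif", "/kevin/kevin.htm", "/purchase.htm"]:
    print(url, "excluded" if is_excluded_url(url, exclusions) else "kept")
```

Only /purchase.htm survives, which is the kind of "important" URL the setting is meant to leave in the database.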
"Except (*/ads/*, for example)" - As discussed
immediately above, you usually will exclude graphics
files (GIF, etc.) from being recorded in the Hit List
database. However, if you are running ads on your website,
these are in many cases graphics that link to
another site.
A typical scenario:
"Kevin.com" website is paid $$ for an ad banner
leading to Marketwave.com, the ad banner having
a large GIF link at the top of Kevin.com's home
page that leads to Marketwave.com. Thus if visitors
to Kevin.com click on that GIF, they'll be sent
to Marketwave.com. And Kevin.com's webmaster set
up the website so that this Ad banner GIF is located
in the /ads/ directory, not in the directory where
all the other HTML, GIF, JPEG, etc. files are located.
This makes the ad GIF separated from those other
files and easier to maintain/update when necessary.
In the above example, the typical Hit List user
at Kevin.com would exclude all GIFs, etc from being
recorded EXCEPT the Marketwave Ad GIF for which
Marketwave is paying Kevin.com $$$. Marketwave.com
obviously wants to know how well the ad banner is
doing (if not well, Marketwave.com will likely take
their business elsewhere). So Hit List will count
this and other ad banner GIFs that are stored in
the /ads/ directory, but not any other graphics
or GIFs elsewhere on Kevin.com.
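The interplay between the exclusion list and the "Except" list in the Kevin.com scenario can be sketched in Python as follows. This is only an illustration of the rule described above, not Hit List's own code. The function names and the case-insensitive matching are assumptions.

```python
from fnmatch import fnmatchcase

def matches_any(url, patterns):
    """True if the URL matches at least one wildcard pattern (case ignored)."""
    lowered = url.lower()
    return any(fnmatchcase(lowered, p.lower()) for p in patterns)

def should_record(url, exclude, exceptions):
    """Record a request unless it is excluded and no exception rescues it."""
    if matches_any(url, exceptions):
        return True  # the "Except" slot wins over the exclusion slot
    return not matches_any(url, exclude)

exclude = ["*.gif", "*.jpg"]
exceptions = ["*/ads/*"]

print(should_record("/ads/marketwave.gif", exclude, exceptions))  # ad banner
print(should_record("/images/logo.gif", exclude, exceptions))     # decoration
print(should_record("/index.htm", exclude, exceptions))           # normal page
```

The ad banner and the normal page are recorded, while the decorative GIF is not.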
"From the following IP Addresses" - This setting
enables you to exclude requests and visits from your
own employees, webmasters, etc. by entering their
IPs (e.g. 123.123.123.123) or a range of IPs (e.g. 123.123.123*).
Thus, you won't artificially "inflate" your website's
activity from your own webmasters or employees who may
visit it as necessary (your webmasters will likely be
working on the site all the time, generating requests
and visits due to their job). If your webserver is doing
the reverse DNS for you (instead of Hit List), the IPs
may already be resolved in the logs. In that case you
need to enter the internal "site names" in this setting
as they appear in the logs (e.g. kevino.marketwave.com
instead of his IP).
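A minimal Python sketch of this kind of visitor exclusion is shown here. It assumes the visitor field holds whatever appears in the log, either a raw IP or an already-resolved site name, and that a trailing * covers a range. The function name is hypothetical and the matching rules are an approximation, not Hit List's implementation.

```python
from fnmatch import fnmatchcase

def is_excluded_visitor(visitor, patterns):
    """True if the logged visitor (IP or resolved site name) is excluded."""
    lowered = visitor.lower()
    return any(fnmatchcase(lowered, p.lower()) for p in patterns)

# One IP range plus one resolved site name, as either may appear in the logs.
internal = ["123.123.123*", "kevino.marketwave.com"]

for visitor in ["123.123.123.45", "kevino.marketwave.com", "123.123.124.1"]:
    print(visitor, "excluded" if is_excluded_visitor(visitor, internal) else "kept")
```

If the logs contain resolved names, an IP-only pattern list would match nothing, which is why the name form has to be entered in that case.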
"Containing these user-agent strings" - This
setting lets you exclude data from certain types of
web browsers that hit your website. It can also be used
to exclude data from spiders or robots
that catalog your website from time to time. You may
notice requests from certain search engine robots or
spiders in the "most popular browsers" element. You
can then examine the logs to get the exact string seen
in the logs, and enter it in this setting to exclude
requests from these spiders/robots (they really aren't
"visitors" to your site). Remember that if you do this
AFTER building a Hit List database from your logs, you
will likely need to rebuild your database to erase all
these requests and visits from these browsers.
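The reason a rebuild is needed can be seen in a small Python sketch: an exclusion only acts while log records are being loaded, so records loaded earlier stay until the database is built again. The list-based "database", the function names and the robot name ExampleBot are all hypothetical, and the case-insensitive substring test is an assumption.

```python
def is_excluded_agent(user_agent, excluded_strings):
    """True if the user-agent contains any excluded string (case ignored)."""
    ua = user_agent.lower()
    return any(s.lower() in ua for s in excluded_strings)

def load(log_records, excluded_strings):
    """Build a fresh 'database' (here just a list) from raw log records."""
    return [r for r in log_records
            if not is_excluded_agent(r["agent"], excluded_strings)]

log = [
    {"url": "/index.htm", "agent": "Mozilla/4.0 (compatible; MSIE 4.01)"},
    {"url": "/index.htm", "agent": "ExampleBot/1.0"},  # hypothetical robot
]

database = load(log, [])             # built before the exclusion was entered
rebuilt = load(log, ["examplebot"])  # rebuilt from the same logs afterwards

print(len(database), len(rebuilt))
```

The first database still holds the robot's request. Only the rebuilt one is free of it.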
"With these HTTP Status Codes" - This setting
allows you to exclude certain types of HTTP Status Codes
from being recorded in the Hit List database. Thus,
if you don't want to count the 404's (file not found)
people may encounter when visiting your site, exclude
them here. Often, however, a webmaster WANTS this
information in order to know which pages on your site
are "old" and thus need to be updated or changed with
new information.
1a) QuickList Extreme Optimizations (Hit List
Enterprise v4 and Hit List Live v4 users only).
Hit List Enterprise v4 and Hit List Live v4 users
have another extremely flexible optimization possibility
under Options/Updates: the QuickList Extreme Optimizations
button in the upper right of this tab. These optimizations
give you even more control over the data you store
in the Hit List database, as explained below:
Hit List's internal QuickList technology allows it
to very efficiently summarize similar information.
That's why a report that shows the Most Popular
Pages runs in the same amount of time whether it's
based on logs that were 50MB or logs that were 500MB.
The key issue is the "amount" of uniqueness because,
like all compression or summarization systems, QuickList
becomes less efficient when it has a large amount
of unique information to process.
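The general principle, that a summary grows with the number of unique items rather than with the size of the logs, can be illustrated with a plain Python counter. QuickList's internals are not described here, so this is only an analogy under that assumption. The log contents are made up for the example.

```python
from collections import Counter

def summarize(requests):
    """Collapse identical requests into (request -> count) pairs."""
    return Counter(requests)

small_log = ["/index.htm", "/demo.htm"] * 50       # 100 requests
large_log = ["/index.htm", "/demo.htm"] * 500      # 1000 requests
unique_log = [f"/search?q={i}" for i in range(100)]  # 100 requests, all unique

# Number of summary entries needed for each log.
print(len(summarize(small_log)), len(summarize(large_log)), len(summarize(unique_log)))
```

The ten-times-larger log needs no more summary entries than the small one, while the log full of unique query strings needs one entry per request.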
For very large sites, usually those that generate
logs in excess of 300MB per day, it may not make sense
to have Hit List store all the information it otherwise
can. In particular, very large sites may find that
the majority of their database is filled with IP addresses,
cookies and query strings that aren't useful. Such a
site can almost assume that every possible IP
on the Internet has visited it, or may find that the marginal
value of this information, compared to storing very
long term trends and having fast reporting, isn't
sufficient to warrant the storage space. As noted,
on very popular sites, the majority of database space
is probably taken up by unique IP addresses from visitors
that may have visited you from AOL, MSN, etc. where
tracking such visitors' IP addresses may not be very
important. In addition to the space issue, Hit List,
like all database applications, slows down as the
overall size of the database grows. Therefore, storing
"excessive" data that isn't useful not only takes
up space but slows down reporting on essential data.
Therefore, Marketwave suggests that very large sites
carefully consider what data they will use to make
decisions (not just data that is nice to look at)
and what data isn't required. Specifically, we suggest
that large sites disable storage of IP addresses,
Cookies (if you have them), possibly Application Arguments
(queries) and consider telling Hit List to store Sources
(referrers) as Domain only and ignore the "full" referring
URL. Additionally, if you use multiple web servers
that act as one, you should probably disable storage
of Virtual Server Information. If you don't use HTTP
Status codes for anything (to check for errors, for
example), they can also be disabled.
You may also consider using the Parsing tab of the
main Options dialog box to disable parsing of
Application Arguments (queries), Sources (referrers), or both.
Tip: While turning off some of these options may
limit your analysis capabilities, keep in mind that
it's easy to create secondary databases with small
samples of your logs that contain all the information
you'd like. This statistical sampling method, as opposed
to storing all data, all the time, is strongly recommended
for large sites.
QuickList Extreme Settings, one by one:
Application Arguments (queries)
For sites with CGI or ISAPI/NSAPI applications
that get passed query parameters, you may find
that a large portion of your database is filled
with these strings. If you do not make use of
this information, you may find your databases
remain much smaller without this data.
Bytes
This switch has no significant effect in the
current version of Hit List and is reserved for
future use.
User-Agents
Although there are only a very small number of
major web browsers, each specific build and version
(not to mention language and proxy) modifies the
user-agent string somewhat, potentially creating
lots of unique data. In most cases, however, this
optimization is not required and would not save
much storage space.
IP addresses
For most sites, this is where you'll get the
most data savings. In fact, turning off this one
checkbox will usually allow your databases to
grow much, much longer than otherwise possible.
Turning this off will make it impossible for Hit
List to estimate the number of unique visitors
based on IP addresses, but Hit List will still
calculate the number of Visits correctly. Also,
if your site uses persistent cookies rather than
IP addresses to identify Visitors, you can simply
use the Number of Unique Cookies element instead
of Number of Visitors to calculate the number
of Visitors.
HTTP Status Codes
The QuickList summarization system takes the
HTTP Response Code (200, 404, etc.) into account
when summarizing. Therefore, QuickList is somewhat
less efficient when faced with lots of possible
HTTP codes. If you don't use this information,
turning it off will save some space but, for most
sites, will not be a huge gain.
Virtual Server Information
This is a very important optimization for sites
that use a round-robin or other similar approach
to balance web load among multiple servers. Since
QuickList normally takes the IP address of the
web server into account when summarizing, a site
with 10 web servers would be forced to store approximately
10 times as much data as a site with just one
server. Therefore, if you never plan to differentiate
between these machines, disabling the storage
of Virtual Server Information will produce a massive
data savings.
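The "approximately 10 times" figure follows from simple counting, as this Python sketch suggests. It assumes, for illustration only, that each summary entry is keyed by the web server's IP plus the URL, and that every page is served by every machine in a hypothetical ten-server pool. The real QuickList key layout is not documented here.

```python
from collections import Counter
from itertools import product

urls = [f"/page{i}.htm" for i in range(100)]
servers = [f"10.0.0.{n}" for n in range(1, 11)]  # hypothetical round-robin pool

# Round-robin balancing: over time every page is served by every server.
requests = list(product(servers, urls))

with_server = Counter((server, url) for server, url in requests)
without_server = Counter(url for _, url in requests)

print(len(with_server), len(without_server))
```

Keeping the server in the key yields ten entries per page. Dropping it brings the count back to one entry per page.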
Cookies
If your site collects persistent cookies, you'll
probably want to keep them for later DataLink
or other uses. However, if your site is using
session cookies, from which you gain no long-term
information, you should probably tell Hit List
not to bother storing them. If you disable Cookie
storage, Hit List will not be able to use Cookies
when computing Visits but this could be a small
price to pay.
Only store this cookie
In some cases a site may issue more than one
cookie such as a session and persistent cookie.
Or, when using Hit List Live's TCP/IP Data Collector,
a site may be sent cookies that it didn't ask
for. In either case, you can tell Hit List which
cookie to store by entering the cookie's name
here. If the same cookie is identified by more
than one name, you can enter multiple cookie names
separated by commas. For example, if your site
is running Microsoft IIS 4.0 and using ASP pages,
you may see cookies that are named ASPSESSION.
If you have your own cookies that are named MyCookie,
you can tell Hit List to just store yours by entering
MyCookie in the field. If it may also be called
MyCookie2, you could enter either MyCookie,MyCookie2
or MyCookie*.
Important: Cookies may be logged as name/value
pairs or just as values. That is, you may see
something like ASPSESSION=1234567890 or just 1234567890
in the cookie field in your logs. Hit List stores
the cookie differently depending on what, if anything,
is entered in the Only store this cookie field.
If nothing is entered, Hit List will always store,
verbatim, what is in the logs or encountered while
packet sniffing. If the cookie is ASPSESSION=1234567890,
then you will see ASPSESSION=1234567890 in the
Hit List database. If, on the other hand, you
told Hit List to only store a specific cookie,
only the value, not the name, will be stored.
In the example above, if you entered ASPSESSION
into the Only store this cookie field, Hit List
would store just 1234567890, not ASPSESSION=1234567890.
This second form is generally more useful when
using DataLink to correlate web traffic with other
information.
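The storage rule for the Only store this cookie field can be sketched in Python as follows. This is a simplified model that handles one name=value pair per log entry. The function name stored_cookie is hypothetical, and two behaviors are assumptions not stated above: names are compared case-sensitively, and a bare value with no name is skipped when a name filter is set.

```python
from fnmatch import fnmatchcase

def stored_cookie(logged, only_store=""):
    """Return what would be stored for one logged cookie, or None if skipped.

    logged:     the cookie field as logged, e.g. "ASPSESSION=1234567890"
    only_store: comma-separated cookie names, wildcards allowed
    """
    if not only_store:
        return logged  # nothing entered: store the field verbatim
    name, sep, value = logged.partition("=")
    if not sep:
        return None  # assumption: a bare value has no name to match
    patterns = [p.strip() for p in only_store.split(",")]
    if any(fnmatchcase(name, p) for p in patterns):
        return value  # name matched: store only the value
    return None

print(stored_cookie("ASPSESSION=1234567890"))
print(stored_cookie("ASPSESSION=1234567890", "ASPSESSION"))
print(stored_cookie("MyCookie2=abc", "MyCookie,MyCookie2"))
print(stored_cookie("ASPSESSION=1234567890", "MyCookie*"))
```

The first call keeps the full name/value pair, the second and third keep only the value, and the last one stores nothing because the name does not match.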
Sources
Although IP addresses generally take the most
space in a database, storing the full referring
URL can also be a very large storage drain, especially
for sites frequently found by search engines that
pass long and complex query information. While
this information is very valuable, it's an excellent
candidate for the statistical sampling idea mentioned
above.
Very often the most interesting information is
which sites (Yahoo, Excite, News.COM, etc.) referred
visitors to your site, not the exact URL that
they came from. Therefore, Hit List offers you
the option of storing the entire referring URL
(including query) or just the much less unique
domain name. We strongly suggest using the Just
the Domain Name option for large sites.
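To see why the domain-only option is so much less unique, here is a short Python sketch that reduces full referring URLs to their host names. The URLs and the function name referrer_domain are made up for the example, and this is not how Hit List itself parses referrers.

```python
from urllib.parse import urlsplit

def referrer_domain(referrer):
    """Reduce a full referring URL to its host name ("" if there is none)."""
    return urlsplit(referrer).hostname or ""

referrers = [
    "http://search.example.com/find?q=web+log+analysis&page=3",
    "http://search.example.com/find?q=hit+list+filters",
    "http://search.example.com/find?q=log+reports&lang=en",
]

full_urls = set(referrers)
domains = {referrer_domain(r) for r in referrers}

print(len(full_urls), len(domains))
print(domains)
```

Three distinct full URLs collapse to a single domain entry, which is the saving the Just the Domain Name option is after.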
2) Report and Element-level Filters.
Every report and/or report element in Hit List can have
one or more Filters. In Hit List 3.x versions and earlier,
element-level filters ALWAYS overrode report-level filters.
However, now in Hit List 4.x, element level filters can
either override OR combine with report-level filters to
provide ultimate flexibility in viewing your data.
To set up filters at either the report or element level,
highlight a Hit List report icon (e.g. Complete Analysis)
and hit Design. To set a report-level filter, go to the
Filter tab and adjust as seen in the examples below. To
set an element-level filter, highlight one of the elements
seen in the Outline tab (e.g. "total number of requests")
and hit Properties. Now go to its Filter tab and adjust
as seen in the examples below. After setting up either
filter (or both, as desired), hit OK and then run the report
to see the new results. It's often beneficial to run a
report BEFORE setting any filters, so you have baseline
numbers to compare against when you set a filter and
re-run the report.
Some General Filtering tips:
1) Remember, Filters don't "strip" your database
of information, just suppress it.
Remember that Filters, either under the Filter tab
of a report when in Design mode or within the Properties
of an element like "total number of requests" only
prevent the data from being seen (and force appropriate
calculations from Hit List) in the report. Thus if
you set up a filter like:
URLs Not Equal To /jeff.htm
you won't see the URL /jeff.htm show up in the report,
but /jeff.htm is still recorded in the database and
will "come back" if you remove the filter. In other
words, filters don't "strip out" or "remove" information
from the Hit List database, they merely suppress it.
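The difference between suppressing and stripping can be shown with a tiny Python sketch. The list standing in for the database and the function run_report are hypothetical. The point is only that the filter acts on what the report shows, not on what is stored.

```python
database = ["/index.htm", "/jeff.htm", "/demo.htm", "/jeff.htm"]

def run_report(db, url_not_equal_to=None):
    """Return the URLs a report would show; the database itself is untouched."""
    return [url for url in db if url != url_not_equal_to]

filtered = run_report(database, url_not_equal_to="/jeff.htm")
unfiltered = run_report(database)  # filter removed again

print(filtered)
print(unfiltered)
```

With the filter, /jeff.htm is missing from the report. Without it, both /jeff.htm requests "come back", because they were never removed from the database.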
2) Detail data usually required.
Nearly all types of filters, especially those like
"URLs within the visit" require Detail data in addition
to Summary. This setting is set/changed under Options/Updates
(Store) and if you by chance have it set to Summary
Data Only, your Filter in most cases won't work.
3) Summary vs. Detail Filtering.
If you are trying to filter with Summary-only data,
what you have to do is use "like elements" for "like
filters". For example, if you want to filter for an
IP of 123.123.123.123, you can filter for this IP
in elements like "most common visitors" or "most popular
visitor's Countries" because they are IP-based elements
(look at their Properties in Design mode, under the
Definition tab where it notes "standard report").
Conversely, if you want to filter for a URL of /kevin.htm,
you can use elements like "most popular pages", "most
popular URLs", etc. What you can't do is
put the URL filter for /kevin.htm into an IP-based
element (e.g. "most common visitors") without Detail
data.
With both Summary and Detail data, you can filter
an entire report (e.g. Complete Analysis) on URLs, Visitor
IPs, Site Names, referrers, etc. without limitation.
For more information on the differences between Summary
and Detail data, check either the Hit List Help "Summary
vs. Detail" topic or our Summary
vs. Detail document.
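One way to picture the "like elements for like filters" rule is the Python sketch here. It assumes, purely for illustration, that Summary data is a set of per-dimension totals with the link between IP and URL thrown away, while Detail data keeps each request as a row. The actual Hit List storage format may differ.

```python
from collections import Counter

# Detail data: one row per request, linking visitor IP and URL.
detail = [
    ("123.123.123.123", "/kevin.htm"),
    ("123.123.123.123", "/index.htm"),
    ("45.45.45.45", "/kevin.htm"),
]

# Summary data (assumed shape): separate totals per dimension.
visitors_summary = Counter(ip for ip, _ in detail)
pages_summary = Counter(url for _, url in detail)

# An IP filter on an IP-based element works from the summary alone.
print(visitors_summary["123.123.123.123"])

# A URL filter on an IP-based element needs the rows that link the two.
kevin_visitors = Counter(ip for ip, url in detail if url == "/kevin.htm")
print(kevin_visitors)
```

Nothing in visitors_summary says which URLs each visitor requested, so the cross-dimension question can only be answered from the detail rows.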
4) When in doubt, look at the exact URL.
It cannot be emphasized strongly enough that many
problems come from users not being familiar with the
contents of their logfiles. If you don't know what
the precise URL looks like that you want to filter
on, look at one of the logs in a text editor and find
out. It will really save you time in the long run
rather than guessing.
5) Hit List "object types" are another kind of Filter.
Remember that the Object Types discussed under Database
Manager can be used as a type of Filter. For reference,
look under the Filter tab in any report after clicking
Design. See all those checkboxes? If you uncheck one
or more, the corresponding objects (e.g. Pages, Applications,
etc.) won't show up in the report when run. This can
be a quicker way to get filtering done than memorizing
a bunch of URLs, if applicable to your site.
6) Element vs. Report level filtering.
Remember that now in Hit List v4 you can use element
level and report level filters together or apart from
one another. Thus make sure that any element level
filters (set in an element's Properties under the
Filter tab, when in Design mode on a report) are set
to "a combination of the report and element level
filters" if that is the desired outcome. A quick way
to see if you have any element level filters is to
open a report in Design mode and look for little black
arrows under the Outline tab in the various elements
- if you see one, it could indicate a filter has been
set (or that something else is unique about that element,
but it usually indicates a filter).
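The three ways an element can treat the two filter levels can be sketched in Python as follows. The mode names, the row layout and the function apply_filters are hypothetical, and the sketch assumes that "a combination" means a row must pass both filters.

```python
def apply_filters(rows, report_filter, element_filter, mode):
    """Filter rows for one element. mode: 'report', 'element' or 'combination'."""
    if mode == "report":
        tests = [report_filter]
    elif mode == "element":
        tests = [element_filter]
    elif mode == "combination":
        tests = [report_filter, element_filter]  # assumption: both must pass
    else:
        raise ValueError(f"unknown mode: {mode}")
    return [row for row in rows if all(test(row) for test in tests)]

rows = [
    {"site": "proxy.microsoft.com", "status": 404},
    {"site": "proxy.microsoft.com", "status": 200},
    {"site": "dialup.example.net", "status": 404},
]

def report_filter(row):
    return row["site"].endswith(".microsoft.com")

def element_filter(row):
    return row["status"] == 404

for mode in ("report", "element", "combination"):
    print(mode, len(apply_filters(rows, report_filter, element_filter, mode)))
```

Each filter alone keeps two of the three rows. Only the combination narrows the element to the single Microsoft request that returned a 404.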
Examples:
Report-Level Filters
For instance, if you would like to see a "Complete
Analysis" report that details only usage by those
who work at Microsoft, highlight Complete Analysis,
click Design, click on the Filter tab, and choose:
"Visitor Site Names" in the Filter Name drop-down,
choose "Equal to" for Comparison and *.Microsoft.com
for Value.
Click OK and run the report. You will now notice
that this filter has been applied to every element
in the report. (So if you look at Most Common Visitors,
it should only show employees of Microsoft.)
Element Level Filtering
Element level filtering is often used to isolate
or compare similar information within the same report.
For a built-in example of element-level filtering,
go into Design mode for the "Technical Analysis"
Report in the "Technical" folder. You'll notice
that the "Bad Links" section has element-level filters
specified on the three Table elements in this section
(note the small black arrows on "Bad Links by Source
Site", "Bad Links by Source URL" and "Bad Requests
by Browser").
Go into the Properties for one of these elements.
You will see under its Filter tab that it has been
set to use "Status Codes" "Equal to" "404", indicating
the file looked for wasn't found on your site and
ultimately a possible problem with your website's
links. At the top of the element you'll notice that
it says "a combination of both filters" - you can
adjust this to "use the report level filters" or
"use these element level filters" as desired.