
Types of Hit List Filters, How they work and When to use them.

There are two major types of filtering in Hit List. The first are "exclusions", which keep certain data or types of data out of your Hit List database completely. The second are report-level or element-level "filters", which merely suppress information from being seen or calculated in a given report but don't strip it out of the database. With this latter type, information that was suppressed in one report or element can still be displayed in a different report or element.

1) Exclusions and keeping specific data (or types of data) out of your Hit List database completely.


The Options/Updates tab.

Keeping certain information or types of information out of your database is done under Options/Updates. If you go to this tab, you will see the five slots under "Update the database with all requests except those". These are used to keep certain data from being added to your database from the logfiles, as follows:

  • "With these URLs (*.GIF, for example)." - This setting enables you to exclude certain URLs or filetypes from being loaded into your database when you don't want them counted as activity on your website.

    The most typical example is graphics files, ie GIFs, JPEGs, etc. - usually you won't care about how many times someone sees the "decorations" most websites use to make them attractive to the visitor. What you DO care about is whether they went to the purchase page, downloaded a demo of your software, etc. and those "important" URLs are the hits you want to count. Allowing Hit List to record all the hits to GIFs, etc. on your site when they are otherwise unimportant will make your Requests seem artificially high and is not an accurate reflection of your "important" traffic.

    Remember that this setting isn't limited to strictly stripping out GIFs, etc. You might want to exclude all requests to the "Kevin" directory, URLs that would include /kevin/ in their path (ie /kevin/kevin.htm). In that case, you would add */kevin/* here in addition to the likely GIF and JPEG exclusions and the "kevin" URLs would not be recorded in the Hit List database either.

  • "Except (*/ads/*, for example)" - As discussed immediately above, you usually will exclude graphics files (GIF, etc.) from being recorded in the Hit List database. However, if you are running Ads on your website, these in many cases are in graphic format, leading to another site.

    A typical scenario:

    "Kevin.com" website is paid $$ for an ad banner leading to Marketwave.com, the ad banner having a large GIF link at the top of Kevin.com's home page that leads to Marketwave.com. Thus if visitors to Kevin.com click on that GIF, they'll be sent to Marketwave.com. And Kevin.com's webmaster set up the website so that this Ad banner GIF is located in the /ads/ directory, not in the directory where all the other HTML, GIF, JPEG, etc. files are located. This makes the ad GIF separated from those other files and easier to maintain/update when necessary.

    In the above example, the typical Hit List user at Kevin.com would exclude all GIFs, etc from being recorded EXCEPT the Marketwave Ad GIF for which Marketwave is paying Kevin.com $$$. Marketwave.com obviously wants to know how well the ad banner is doing (if not well, Marketwave.com will likely take their business elsewhere). So Hit List will count this and other ad banner GIFs that are stored in the /ads/ directory, but not any other graphics or GIFs elsewhere on Kevin.com.

  • "From the following IP Addresses" - This setting enables you to exclude requests and visits from your own employees, webmasters, etc. by entering in their IPs (ie 123.123.123.123) or range of IPs (ie 123.123.123*). Thus, you won't artificially "inflate" your website's activity from your own webmasters or employees who may visit it as necessary (your webmasters will likely be working on the site all the time, generating requests and visits due to their job). If your webserver is doing the reverse DNS for you (instead of Hit List), the IPs may already be resolved in the logs, meaning in this setting you would need to enter the internal "site names" as they appear in the logs (ie kevino.marketwave.com instead of his IP).

  • "Containing these user-agent strings" - This setting lets you exclude data from certain types of web browsers that hit your website. Or, another use of this setting is to exclude data from spiders or robots that catalog your website from time to time. You may notice requests from certain search engine robots or spiders in the "most popular browsers" element. You can then examine the logs to get the exact string seen in the logs, and enter it in this setting to exclude requests from these spiders/robots (they really aren't "visitors" to your site). Remember that if you do this AFTER building a Hit List database from your logs, you will likely need to rebuild your database to erase all these requests and visits from these browsers.

  • "With these HTTP Status Codes" - This setting allows you to exclude certain HTTP Status Codes from being recorded in the Hit List database. Thus, if you don't want to count the 404's (file not found) people may encounter when visiting your site, exclude them here. Often, however, a webmaster WANTS this information, to learn which pages on the site are "old" and thus need to be updated or changed with new information.
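The five slots above can be pictured as a single yes/no decision made for each request as it is read from the logfile. The following is a minimal Python sketch of that decision, not Hit List's actual code. The function names, the example pattern values and the "ExampleBot" robot name are all assumptions made up for illustration, as is the choice to ignore upper/lower case when matching.

```python
from fnmatch import fnmatchcase

# Hypothetical values for the five slots; these are examples only.
EXCLUDE_URLS = ["*.gif", "*.jpg", "*/kevin/*"]   # "With these URLs"
EXCEPT_URLS = ["*/ads/*"]                        # "Except"
EXCLUDE_IPS = ["123.123.123.*"]                  # "From the following IP Addresses"
EXCLUDE_AGENTS = ["ExampleBot"]                  # "Containing these user-agent strings"
EXCLUDE_STATUS = {404}                           # "With these HTTP Status Codes"


def matches_any(value, patterns):
    """Return True if value matches any wildcard pattern (case ignored here)."""
    value = value.lower()
    return any(fnmatchcase(value, p.lower()) for p in patterns)


def should_store(url, ip, user_agent, status):
    """Decide whether one logged request would be added to the database."""
    # An excluded URL is kept out unless the "Except" slot rescues it.
    if matches_any(url, EXCLUDE_URLS) and not matches_any(url, EXCEPT_URLS):
        return False
    if matches_any(ip, EXCLUDE_IPS):
        return False
    # User-agent exclusion is a "contains" test rather than a wildcard test.
    if any(s.lower() in user_agent.lower() for s in EXCLUDE_AGENTS):
        return False
    if status in EXCLUDE_STATUS:
        return False
    return True


print(should_store("/images/logo.gif", "10.0.0.1", "Mozilla/4.0", 200))    # False
print(should_store("/ads/marketwave.gif", "10.0.0.1", "Mozilla/4.0", 200)) # True
```

In this sketch the ad banner GIF in /ads/ is still stored even though all other GIFs are excluded, which is the Kevin.com scenario described above.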


    1a) QuickList Extreme Optimizations (Hit List Enterprise v4 and Hit List Live v4 users only).

    Hit List Enterprise v4 and Hit List Live v4 users have another extremely flexible optimization possibility under Options/Updates: the QuickList Extreme Optimizations button in the upper right of this tab. These optimizations give you even more control over the data you store in the Hit List database, as explained below:

    Hit List's internal QuickList technology allows it to very efficiently summarize similar information. That's why running a report that shows the Most Popular Pages runs in the same amount of time whether it's based on logs that were 50MB or logs that were 500MB. The key issue is the "amount" of uniqueness because, like all compression or summarization systems, QuickList becomes less efficient when it has a large amount of unique information to process.
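To see why uniqueness matters more than raw log size, here is a small Python sketch. It is only an analogy: this document does not describe how QuickList works internally, so a plain counter stands in for "summarization", and the URLs are made up.

```python
from collections import Counter


def summarize(urls):
    """Collapse repeated URLs into one entry each, with a request count."""
    return Counter(urls)


# Two logs with the same two pages; the second is ten times larger.
small_log = ["/index.htm", "/demo.htm"] * 50
large_log = ["/index.htm", "/demo.htm"] * 500

# A short log in which every request carries a different query string.
unique_log = [f"/search?q={i}" for i in range(100)]

print(len(summarize(small_log)))   # 2
print(len(summarize(large_log)))   # 2
print(len(summarize(unique_log)))  # 100
```

The summary of the larger log is no bigger than the summary of the smaller one, while the short log full of unique query strings produces 100 entries.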

    For very large sites, usually those that generate logs in excess of 300MB per day, it may not make sense to have Hit List store all the information it otherwise can. In particular, very large sites may find that the majority of their database is filled with IP addresses, cookies and query strings that aren't useful because the site can almost assume that every possible IP on the Internet has visited the site or that the marginal value of this information, compared to storing very long term trends and having fast reporting, isn't sufficient to warrant the storage space. As noted, on very popular sites, the majority of database space is probably taken up by unique IP addresses from visitors that may have visited you from AOL, MSN, etc. where tracking such visitors' IP addresses may not be very important. In addition to taking up space, Hit List, like all database applications, slows down as the overall size of the database grows. Therefore, storing "excessive" data that isn't useful not only takes up space but slows down reporting on essential data.

    Therefore, Marketwave suggests that very large sites carefully consider what data they will use to make decisions (not just data that is nice to look at) and what data isn't required. Specifically, we suggest that large sites disable storage of IP addresses, Cookies (if you have them), possibly Application Arguments (queries) and consider telling Hit List to store Sources (referrers) as Domain only and ignore the "full" referring URL. Additionally, if you use multiple web servers that act as one, you should probably disable storage of Virtual Server Information. If you don't use HTTP Status codes for anything (to check for errors, for example), they can also be disabled.

    You may also consider using the Parsing tab of the main Options dialog box to disable parsing of either Application Arguments (queries) and/or Sources (referrer).

    Tip: While turning off some of these options may limit your analysis capabilities, keep in mind that it's easy to create secondary databases with small samples of your logs that contain all the information you'd like. This statistical sampling method, as opposed to storing all data, all the time, is strongly recommended for large sites.
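One way to choose the logs for such a secondary database is sketched below in Python. This document does not say how the sample should be picked, so taking whole daily logfiles at a fixed interval is purely an assumption; the function name and the logfile names are hypothetical as well.

```python
def pick_sample_logs(log_files, every=7):
    """Return every `every`-th logfile from the sorted list, starting with the first."""
    if every < 1:
        raise ValueError("every must be at least 1")
    return sorted(log_files)[::every]


# Hypothetical daily logfile names for a four-week period.
daily_logs = [f"ex9901{day:02d}.log" for day in range(1, 29)]

sample = pick_sample_logs(daily_logs, every=7)
print(sample)  # ['ex990101.log', 'ex990108.log', 'ex990115.log', 'ex990122.log']
```

Only the logfiles in the sample would then be loaded into the secondary database, with all the storage options left switched on.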

    QuickList Extreme Settings, one by one:


  • Application Arguments (queries)

    For sites with CGI or ISAPI/NSAPI applications that get passed query parameters, you may find that a large portion of your database is filled with these strings. If you do not make use of this information, you may find your databases remain much smaller without this data.

  • Bytes

    This switch has no significant effect in the current version of Hit List and is reserved for future use.

  • User-Agents

    Although there are only a very small number of major web browsers, each specific build and version (not to mention language and proxy) modifies the user-agent string somewhat, potentially creating lots of unique data. In most cases, however, this optimization is not required and would not save much storage space.

  • IP addresses

    For most sites, this is where you'll get the most data savings. In fact, turning off this one checkbox will usually allow your databases to grow much, much longer than otherwise possible. Turning this off will make it impossible for Hit List to estimate the number of unique visitors based on IP addresses but Hit List will still calculate the number of Visits correctly. Also, if your site uses persistent cookies rather than IP addresses to identify Visitors, you can simply use the Number of Unique Cookies element instead of Number of Visitors to calculate the number of Visitors.

  • HTTP Status Codes

    The QuickList summarization system takes the HTTP Response Code (200, 404, etc.) into account when summarizing. Therefore, QuickList is somewhat less efficient when faced with lots of possible HTTP codes. If you don't use this information, turning it off will save some space but, for most sites, will not be a huge gain.

  • Virtual Server Information

    This is a very important optimization for sites that use a round-robin or other similar approach to balance web load among multiple servers. Since QuickList normally takes the IP address of the web server into account when summarizing, a site with 10 web servers would be forced to store approximately 10 times as much data as a site with just one server. Therefore, if you never plan to differentiate between these machines, disabling the storage of Virtual Server Information will produce a massive data savings.

  • Cookies

    If your site collects persistent cookies, you'll probably want to keep them for later DataLink or other uses. However, if your site is using session cookies, from which you gain no long-term information, you should probably tell Hit List not to bother storing them. If you disable Cookie storage, Hit List will not be able to use Cookies when computing Visits but this could be a small price to pay.

  • Only store this cookie

    In some cases a site may issue more than one cookie such as a session and persistent cookie. Or, when using Hit List Live's TCP/IP Data Collector, a site may be sent cookies that it didn't ask for. In either case, you can tell Hit List which cookie to store by entering the cookie's name here. If the same cookie is identified by more than one name, you can enter multiple cookie names separated by commas. For example, if your site is running Microsoft IIS 4.0 and using ASP pages, you may see cookies that are named ASPSESSION. If you have your own cookies that are named MyCookie, you can tell Hit List to just store yours by entering MyCookie in the field. If it may also be called MyCookie2, you could enter either MyCookie,MyCookie2 or MyCookie*.

    Important: Cookies may be logged as name/value pairs or just as values. That is, you may see something like ASPSESSION=1234567890 or just 1234567890 in the cookie field in your logs. Hit List stores the cookie differently depending on what, if anything, is entered in the Only store this cookie field. If nothing is entered, Hit List will always store, verbatim, what is in the logs or encountered while packet sniffing. If the cookie is ASPSESSION=1234567890, then you will see ASPSESSION=1234567890 in the Hit List database. If, on the other hand, you told Hit List to only store a specific cookie, only the value, not the name, will be stored. In the example above, if you entered ASPSESSION into the Only store this cookie field, Hit List would store just 1234567890, not ASPSESSION=1234567890. This second form is generally more useful when using DataLink to correlate web traffic with other information.

  • Sources

    Although IP addresses generally take the most space in a database, storing the full referring URL can also be a very large storage drain, especially for sites frequently found by search engines that pass long and complex query information. While this information is very valuable, it's an excellent candidate for the statistical sampling idea mentioned above.

    Very often the most interesting information is which sites (Yahoo, Excite, News.COM, etc.) referred visitors to your site, not the exact URL that they came from. Therefore, Hit List offers you the option of storing the entire referring URL (including query) or just the much less unique domain name. We strongly suggest using the Just the Domain Name option for large sites.
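The "Only store this cookie" and "Just the Domain Name" behaviors described in the settings above can be sketched in Python as follows. This is an illustration under stated assumptions, not Hit List's implementation: the function names are hypothetical, cookie names are compared case-sensitively, several cookies in one log field are assumed to be separated by semicolons, and a cookie logged as a bare value is simply skipped when a name has been entered (this document does not cover that case).

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def stored_cookie(logged, only_store=""):
    """Return what would be stored for a logged cookie field, or None if nothing."""
    if not only_store.strip():
        return logged  # nothing entered: the field is stored verbatim
    patterns = [p.strip() for p in only_store.split(",") if p.strip()]
    for pair in logged.split(";"):
        name, sep, value = pair.strip().partition("=")
        # Only name=value pairs can be matched by name (assumption, see above).
        if sep and any(fnmatchcase(name, p) for p in patterns):
            return value  # a named cookie is stored as the value only
    return None


def stored_source(referrer, domain_only=True):
    """Return the full referring URL or just its domain name."""
    if not domain_only:
        return referrer
    return urlsplit(referrer).hostname or referrer


print(stored_cookie("ASPSESSION=1234567890"))                # ASPSESSION=1234567890
print(stored_cookie("ASPSESSION=1234567890", "ASPSESSION"))  # 1234567890
print(stored_source("http://www.yahoo.com/search?p=hit+list"))  # www.yahoo.com
```

Storing only the cookie value and only the referring domain both reduce the amount of unique text kept per request, which is the point of these two settings.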


2) Report and Element-level Filters.

    Every report and/or report element in Hit List can have one or more Filters. In Hit List 3.x versions and earlier, element-level filters ALWAYS overrode report-level filters. However, now in Hit List 4.x, element level filters can either override OR combine with report-level filters to provide ultimate flexibility in viewing your data.

    To set up filters either at the report or element level, highlight a Hit List report icon (ie Complete Analysis) and hit Design. To set a report-level filter, go to the Filter tab and adjust as seen in the examples below. To set an element-level filter, highlight one of the elements seen in the Outline tab (ie "total number of requests") and hit Properties. Now go to its Filter tab and adjust as seen in the examples below. After setting up either filter (or both as desired) hit OK and then run the report to see the new results. It's often beneficial to run a report BEFORE setting any filters, so you have a baseline of numbers for comparison to when you set a filter and re-run the report.


    Some General Filtering tips:

    1) Remember, Filters don't "strip" your database of information, just suppress it.

    Remember that Filters, either under the Filter tab of a report when in Design mode or within the Properties of an element like "total number of requests" only prevent the data from being seen (and force appropriate calculations from Hit List) in the report. Thus if you set up a filter like:

    URLs Not Equal To /jeff.htm

    you won't see the URL /jeff.htm show up in the report, but /jeff.htm is still recorded in the database and will "come back" if you remove the filter. In other words, filters don't "strip out" or "remove" information from the Hit List database, they merely suppress it.

    2) Detail data usually required.

    Nearly all types of filters, especially those like "URLs within the visit" require Detail data in addition to Summary. This setting is set/changed under Options/Updates (Store) and if you by chance have it set to Summary Data Only, your Filter in most cases won't work.

    3) Summary vs. Detail Filtering.

    If you are trying to filter with Summary-only data, what you have to do is use "like elements" for "like filters". For example, if you want to filter for an IP of 123.123.123.123, you can filter for this IP in elements like "most common visitors" or "most popular visitor's Countries" because they are IP-based elements (look at their Properties in Design mode, under the Definition tab where it notes "standard report"). Conversely, if you want to filter for a URL of /kevin.htm, you can use elements like "most popular pages", "most popular URLs", etc. What you can't do is to put the URL filter for /kevin.htm into the IP-based element (ie "most common visitors") without Detail data.

    With Summary and Detail data, you can filter an entire report (ie Complete Analysis) on URLs, Visitor IPs, Site Names, referrers, etc. without limitation.

    For more information on the differences between Summary and Detail data, check either the Hit List Help "Summary vs. Detail" topic or our Summary vs. Detail document.

    4) When in doubt, look at the exact URL.

    It cannot be emphasized strongly enough that many problems come from users not being familiar with the contents of their logfiles. If you don't know what the precise URL looks like that you want to filter on, look at one of the logs in a text editor and find out. It will really save you time in the long run rather than guessing.

    5) Hit List "object types" are another kind of Filter.

    Remember that the Object Types discussed under Database Manager can be used as a type of Filter. For reference, look under the Filter tab in any report after clicking Design. See all those checkboxes? If you uncheck one or more, the corresponding objects (ie Pages, Applications, etc.) won't show up in the report when run. This can be a quicker way to get filtering done than memorizing a bunch of URLs, if applicable to your site.

    6) Element vs. Report level filtering.

    Remember that now in Hit List v4 you can use element level and report level filters together or apart from one another. Thus make sure that any element level filters (set in an element's Properties under the Filter tab, when in Design mode on a report) are set to "a combination of the report and element level filters" if that is the desired outcome. A quick way to see if you have any element level filters is to open a report in Design mode and look for little black arrows under the Outline tab in the various elements - if you see one, it could indicate a filter has been set (or that something else is unique about that element, but it usually indicates a filter).
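Tip 1 above (filters suppress, they don't strip) can be illustrated with a short Python sketch. The list standing in for the database, the function name and its parameter are all hypothetical; the only point is that the filter is applied when the report runs and the stored data is never touched.

```python
from fnmatch import fnmatchcase

# A stand-in for requests already stored in the database (made-up data).
database = [
    {"url": "/jeff.htm"},
    {"url": "/kevin.htm"},
    {"url": "/jeff.htm"},
]


def run_report(rows, url_not_equal_to=None):
    """Count requests per URL, leaving out URLs that match the filter value."""
    counts = {}
    for row in rows:
        if url_not_equal_to and fnmatchcase(row["url"], url_not_equal_to):
            continue  # suppressed in this report only
        counts[row["url"]] = counts.get(row["url"], 0) + 1
    return counts


filtered = run_report(database, url_not_equal_to="/jeff.htm")
unfiltered = run_report(database)

print(filtered)       # {'/kevin.htm': 1}
print(unfiltered)     # {'/jeff.htm': 2, '/kevin.htm': 1}
print(len(database))  # 3
```

Running the report again without the filter brings /jeff.htm back, because all three requests are still in the stored data. An exclusion under Options/Updates, by contrast, would have kept those requests from being stored in the first place.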

    Examples:

    Report-Level Filters

    For instance, if you would like to see a "Complete Analysis" report that details only usage by those who work at Microsoft, highlight Complete Analysis, click Design, click on the Filter tab, and choose:

    "Visitor Site Names" in the Filter Name drop-down, choose "Equal to" for Comparison and *.Microsoft.com for Value.


    Click OK and run the report. You will now notice that this filter has been applied to every element in the report. (So if you look at Most Common Visitors, it should only show employees of Microsoft.)

    Element Level Filtering

    Element level filtering is often used to isolate or compare similar information within the same report.

    For a built-in example of element-level filtering, go into Design mode for the "Technical Analysis" Report in the "Technical" folder. You'll notice that the "Bad Links" section has element-level filters specified on the three Table elements in this section (note the small black arrows on "Bad Links by Source Site", "Bad Links by Source URL" and "Bad Requests by Browser").

    Go into the Properties for one of these elements. You will see under its Filter tab that it has been set to use "Status Codes" "Equal to" "404", indicating the file looked for wasn't found on your site and ultimately a possible problem with your website's links. At the top of the element you'll notice that it says "a combination of both filters" - you can adjust this to "use the report level filters" or "use these element level filters" as desired.
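The three choices for an element (report-level filters only, element-level filters only, or a combination of both) can be sketched in Python as below. This is a rough model with hypothetical names and made-up data. It also assumes that "a combination" means a request must pass both filters, which this document does not spell out.

```python
def element_rows(rows, report_filter, element_filter, mode):
    """Return the rows an element would use under one of three filter modes."""
    if mode == "report":
        active = [report_filter]
    elif mode == "element":
        active = [element_filter]
    elif mode == "both":
        active = [report_filter, element_filter]  # assumed to mean "pass both"
    else:
        raise ValueError(f"unknown mode: {mode}")
    return [row for row in rows if all(f(row) for f in active)]


# Made-up requests: visitor site name and HTTP status code.
rows = [
    {"site": "pc1.microsoft.com", "status": 200},
    {"site": "pc2.microsoft.com", "status": 404},
    {"site": "dialup.example.net", "status": 404},
]


def from_microsoft(row):
    """Report-level filter: Visitor Site Names Equal to *.Microsoft.com."""
    return row["site"].lower().endswith(".microsoft.com")


def not_found(row):
    """Element-level filter: Status Codes Equal to 404."""
    return row["status"] == 404


print(len(element_rows(rows, from_microsoft, not_found, "report")))   # 2
print(len(element_rows(rows, from_microsoft, not_found, "element")))  # 2
print(len(element_rows(rows, from_microsoft, not_found, "both")))     # 1
```

With the combined setting, the element shows only the bad links hit by Microsoft visitors, while the element-only setting shows bad links for everyone regardless of the report-level filter.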


    p +44-(0)870-166-2435, f +44-(0)870-054-8795, e info@issel.co.uk
    © 1996-2004 Intranet Software Solutions (Europe) Limited. All rights reserved.