What's a Unique Visitor?



Dan Grossman
03-15-2006, 11:46 AM
What do you consider a "unique" visitor? Do you know how popular software (web log analyzers, 3rd party stats services) determine this?

I'm taking a look at how accurate various means of logging stats for websites are. In a test I ran yesterday (looking at the data today), using a cookie to identify each visitor resulted in 944 "unique" visits to one of my sites from only 592 IP addresses. Clearly something is making this very inaccurate, whether people are simply not accepting cookies or are deleting them within a 24-hour period.

Do other programs/services simply count the distinct IP addresses during a day and call that unique visits? What about people surfing behind proxies, such as some Chinese traffic, some AOL traffic, some MSN traffic?

Any ideas on the most accurate way to go about this?

Chris
03-15-2006, 11:48 AM
There is no 100% accurate way.

I would look at a unique IP & user agent combination within a 24-hour period and consider that one unique visitor.
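
If you've already got a raw hits log, that definition works out to roughly this kind of query (table and column names are made up for illustration, and it assumes an open mysql_connect() connection):

<?php
// Count "uniques" as distinct IP + user agent pairs per day.
// raw_hits, ip, useragent and hit_time are hypothetical names.
$sql = "SELECT DATE(hit_time) AS day,
               COUNT(DISTINCT CONCAT(ip, '|', useragent)) AS uniques
        FROM raw_hits
        GROUP BY DATE(hit_time)";
$result = mysql_query($sql);
while ($row = mysql_fetch_assoc($result)) {
    echo $row['day'] . ': ' . $row['uniques'] . " uniques\n";
}
?>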

Dan Grossman
03-15-2006, 12:00 PM
It's a shame the cookie test didn't turn out as I'd hoped.

I've got two tables for testing some code I'm writing for gathering stats, a table for unique visits with lots of information about the visitor, and a smaller table for pageviews.

If the cookies were accurate enough, I could've saved a lookup in the uniques table before writing another row; now I need to see if the IP/useragent is already in there for that day. Converting the timestamp to a date will add some processing time to the query -- minuscule, but it may add up when dealing with billions of hits per day.
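
For anyone curious, the lookup I mean is something along these lines (placeholder table/column names). Comparing the raw timestamp against a date range should at least avoid converting the timestamp on every row:

<?php
// Has this IP + user agent already been counted as a unique today?
// "uniques", "ip", "useragent" and "visit_time" are placeholder names.
$ip = mysql_real_escape_string($_SERVER['REMOTE_ADDR']);
$ua = mysql_real_escape_string($_SERVER['HTTP_USER_AGENT']);
$sql = "SELECT 1 FROM uniques
        WHERE ip = '$ip'
          AND useragent = '$ua'
          AND visit_time >= CURDATE()
          AND visit_time <  CURDATE() + INTERVAL 1 DAY
        LIMIT 1";
$already_counted = mysql_num_rows(mysql_query($sql)) > 0;
?>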

Thanks for the input.

Dan Grossman
03-15-2006, 12:14 PM
I just had an idea.

The problem is only in the browsers not accepting cookies or deleting them. Anyone that does accept the 24 hour cookie tells me they've been to the site already that day without me having to query the database.

So I only need to query to check for uniqueness when the cookie isn't present. That'll cut down the queries some while keeping the accuracy, I think?

Chris
03-15-2006, 02:52 PM
Why are you running a check for every user? Thought this was log analysis... meaning you have a log and then you analyze it later.

Dan Grossman
03-15-2006, 02:57 PM
I'm both building and analyzing the log. Part of this has to do with my web stats site, w3counter.com.

Chris
03-15-2006, 04:31 PM
But you don't need to analyze the log in realtime as it is being made, right? That's what it sounded like you were doing.



So I only need to query to check for uniqueness when the cookie isn't present. That'll cut down the queries some while keeping the accuracy, I think?

Dan Grossman
03-15-2006, 04:37 PM
But you don't need to analyze the log in realtime as it is being made, right? That's what it sounded like you were doing.

If I only want to log unique visits to one of the tables, I do need to find out if the user is already in the log at the time of the visit.

Certainly you could deduce the number of uniques by analyzing a log of every hit, but logging every piece of data for every hit for thousands of sites may not be technically possible with a single server. The queries are larger, and take more time to execute, and hard drives are only so big.

The same level of reporting could be accomplished by only logging the full details (ip, time, hostname, useragent, screen res, color depth, referrer, page accessed) once per unique visitor, and only doing a smaller write on each pageview (ip, time, page accessed).
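
To make that concrete, the two tables would be roughly this. It's just a sketch -- the real columns and types are still in flux, and site_id is only there because each counter belongs to a different site:

<?php
// Rough sketch of the two tables; names and types are illustrative only.
mysql_query("CREATE TABLE uniques (
    site_id     INT UNSIGNED NOT NULL,
    ip          VARCHAR(15)  NOT NULL,
    visit_time  DATETIME     NOT NULL,
    hostname    VARCHAR(255),
    useragent   VARCHAR(255),
    screen_res  VARCHAR(12),
    color_depth TINYINT UNSIGNED,
    referrer    VARCHAR(255),
    page        VARCHAR(255),
    KEY site_day (site_id, visit_time)
)");

mysql_query("CREATE TABLE hits (
    site_id  INT UNSIGNED NOT NULL,
    ip       VARCHAR(15)  NOT NULL,
    hit_time DATETIME     NOT NULL,
    page     VARCHAR(255) NOT NULL,
    KEY site_time (site_id, hit_time)
)");
?>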

Cookies would've been a convenient way to know if it was someone's first visit to the site within some period of time, since they'd let me avoid doing a database query (looking at the log) to find out. But the number of people not accepting cookies is too high, so now my plan is to still use the cookies, and only do that lookup when the cookie isn't present.

Chris
03-15-2006, 05:01 PM
I don't think you're going to be doing yourself any favors on the resource side, though, by doing comparisons on every page view.

It's one larger query versus thousands of smaller queries.

Dan Grossman
03-15-2006, 05:34 PM
I don't think you're going to be doing yourself any favors on the resource side, though, by doing comparisons on every page view.

It's one larger query versus thousands of smaller queries.

Here's the deal. I intend to finish a long-planned complete rewrite of W3Counter to be more competitive with new services like Google Analytics.

I want to provide at least this set of stats to start with:



- Summary

- Visits & Pageviews
-- Daily Visitors (Any Date Range)
-- Visitors by Hour (Any Date Range)
-- Geographic Location (Any Date Range)
-- Recent Activity (Paged)
---- Links to paths of individual visitors

- Pages
-- Popular Pages
-- Entry Pages
-- Exit Pages
-- Visitor Paths

- Marketing
-- Referrers
-- Search Engines
-- Keywords

- Systems
-- Web Browsers
-- JavaScript
-- Operating Systems
-- Screen Resolutions
-- Color Depth


These are the basics you'd get from Analytics minus conversion tracking, which I don't want to provide.

In order to provide the detailed info, I need to log every hit. I need to know who, when, and where they are. That's all I need for every hit to figure out entry pages, exit pages, popular pages, and full visitor paths. The full data about who -- referrer, browser, system, etc. -- only needs to be stored on the first visit for each session.
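
To illustrate, entry pages can be pulled out of the hits table after the fact with a query roughly like this. Names are made up again, visits are approximated by IP and day here (the real code would key on the visitor/session id), and exit pages work the same way with MAX():

<?php
// Entry pages: for each visitor (IP) and day, find the earliest hit,
// then count which pages those first hits landed on.
// (Per-site filtering omitted to keep the sketch short.)
$sql = "SELECT h.page, COUNT(*) AS entries
        FROM hits h
        JOIN (SELECT ip, DATE(hit_time) AS day, MIN(hit_time) AS first_hit
              FROM hits
              GROUP BY ip, DATE(hit_time)) firsts
          ON h.ip = firsts.ip AND h.hit_time = firsts.first_hit
        GROUP BY h.page
        ORDER BY entries DESC";
$result = mysql_query($sql);
?>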

The basics for the stats are two tables per website. There are more, but that's as much as I need to talk about to see what you or anyone else thinks: a uniques table with the full data, and a hits table with the ip/time/url.

So, my first thought was to use a cookie. On the first visit to the site, when there's no cookie detected, set one and do some processing:
- Use get_browser() to figure out their browser and os
- Run my functions for parsing the referring URL to find if it's a search engine and extract the keywords
- Write the row to the uniques table with the ip, time, visitorid (cookie), countrycode (geoip module installed for apache), browser, os, screenres, color depth, javascript enabled true/false, webpage they visited, referring url, search engine, search phrase, and some other data.
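
The referrer parsing is roughly this sort of thing -- a stripped-down sketch; the real mapping of engines to their query parameters is much longer:

<?php
// Pull a search engine name and search phrase out of a referring URL.
// Only a few engines are shown; the real list is much longer.
function parse_referrer($referrer) {
    $engines = array('google' => 'q', 'yahoo' => 'p', 'msn' => 'q');
    $parts = @parse_url($referrer);
    if (!is_array($parts) || empty($parts['host']) || empty($parts['query'])) {
        return array(null, null);
    }
    parse_str($parts['query'], $params);
    foreach ($engines as $engine => $param) {
        if (strpos($parts['host'], $engine) !== false && !empty($params[$param])) {
            return array($engine, strtolower($params[$param]));
        }
    }
    return array(null, null);  // not a recognized search engine
}

$referrer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
list($search_engine, $search_phrase) = parse_referrer($referrer);
?>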

I did make a design decision there to do the parsing of the user agent and referring URL at that time instead of in some batch job later for a few reasons:
1) The processing time tested to be low enough that I think it can be done here without putting too much load on the CPU
2) I don't have to either parse the results at report time or do batch processing. One method leads to very high CPU usage parsing thousands of lines to produce a report on the fly; the other leads to delayed reporting instead of live reporting.

The first problem I ran into, mentioned above, was that I was getting a larger number of "unique" visits than IP addresses -- much larger. It turns out a high percentage of people aren't picking up the cookie for whatever reason. So now the new algorithm is:

- If there's no cookie
-- Generate a visitorid and set the cookie
-- Check the uniques table if a row already exists for the same day with the same IP and browser string
-- Parse the data and write the row to the uniques table only if a row doesn't already exist for that day
-- Write a row to the hits table (the quick 3 column query with no processing)
- If there is a cookie, just write a row to the hits table
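
Put together, the logging script looks roughly like this. It's a simplified sketch: the cookie name, table names, and the GEOIP_COUNTRY_CODE variable (from the Apache geoip module) are placeholders, and the actual INSERT of the full uniques row is left as a comment.

<?php
// Simplified sketch of the logging flow described above.
$ip   = $_SERVER['REMOTE_ADDR'];
$ua   = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$page = $_SERVER['REQUEST_URI'];

if (empty($_COOKIE['w3c_visitor'])) {
    // No cookie: generate a visitor id and set the 24-hour cookie.
    $visitor_id = md5(uniqid($ip, true));
    setcookie('w3c_visitor', $visitor_id, time() + 86400);

    // Only touch the uniques table if this IP + browser string
    // hasn't already been logged today.
    $sql = sprintf("SELECT 1 FROM uniques
                    WHERE ip = '%s' AND useragent = '%s'
                      AND visit_time >= CURDATE()
                    LIMIT 1",
                   mysql_real_escape_string($ip),
                   mysql_real_escape_string($ua));
    if (mysql_num_rows(mysql_query($sql)) == 0) {
        $browser = get_browser(null, true);  // needs browscap.ini configured
        $country = isset($_SERVER['GEOIP_COUNTRY_CODE'])
                 ? $_SERVER['GEOIP_COUNTRY_CODE'] : '';
        // ... parse the referring URL for search engine/keywords and
        //     INSERT the full row into the uniques table here ...
    }
}

// Every hit, cookie or not, gets the small write to the hits table.
$sql = sprintf("INSERT INTO hits (ip, hit_time, page) VALUES ('%s', NOW(), '%s')",
               mysql_real_escape_string($ip),
               mysql_real_escape_string($page));
mysql_query($sql);
?>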

Dan Grossman
03-15-2006, 09:12 PM
I'm sure nobody's interested, but I changed my mind again and decided on a slightly different method:

- The cookie lasts 30 days instead of 24 hours, to allow repeat-visitor tracking
- Individual 'visits' are identified by sessions instead of cookies, so 'visits' and 'visitor paths' are defined by browser sessions instead of an arbitrary time period
- This means an increase in the number of unique entries recorded (one per session instead of one per day for each unique person)
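
In code terms the change is small -- again just a sketch, with made-up cookie and session names:

<?php
// The 30-day cookie identifies the person across visits;
// the PHP session identifies the individual visit.
session_start();

if (empty($_COOKIE['w3c_visitor'])) {
    // New visitor (or the cookie was blocked/deleted).
    $visitor_id = md5(uniqid($_SERVER['REMOTE_ADDR'], true));
    setcookie('w3c_visitor', $visitor_id, time() + 30 * 86400);  // 30 days
} else {
    // Repeat visitor within the last 30 days.
    $visitor_id = $_COOKIE['w3c_visitor'];
}

if (empty($_SESSION['visit_logged'])) {
    // New browser session = new visit: log the full uniques row once.
    $_SESSION['visit_logged'] = true;
    // ... write the detailed row to the uniques table here ...
}
// ... every request still gets the small hits-table write ...
?>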

I'm pretty excited about doing this project. I love dealing with lots of data. I have about 6000 active users right now and my plan is to collect data from all their counters for both the current and new logging while developing and testing, and to eventually release the new version fully populated with at least a month of data for every user.

The existing code on their sites collects enough info to do everything but JavaScript support detection; I'll just have to pass it into two sets of code at once. Running both at once will also double the load, which should let me anticipate any capacity problems before actually changing over to the new code.

I'm also following Google's precedent and switching from generating my graphs (and there will be many more of them) with the GD library on the server to Flash-based graphs. They're much more attractive and put the burden of rendering on the client instead of the server, which greatly increases the capacity of the server for logging since the CPU isn't being bogged down rendering images.

------

One last important decision I haven't made; perhaps some of you have opinions. What is the value of historic 3rd-party statistics? Are the general trends you'd get from only unique/visit/pageview/reload counts (and no other details) for dates more than a month in the past something you'd want from a web stats service, or are log-analyzer-based stats sufficient?

Right now my scheme involves only the two tables I've mentioned: one for unique hits with lots of info about the visitor, and a small table for each hit. In order to avoid running out of space, these logs have to be capped at some size. I might choose to offer different sizes for different prices, either in terms of actual number of rows (100,000 pageviews) or days (30 days) of data. Either way, older data ceases to exist.

Either I can increment a third table on every pageview -- updating uniques, pageviews, and reloads counters for each date -- or I can batch-update everyone's tables at the end of the day (or when they log in to their stats, if that happens first, to preserve realtime reporting).

I'm thinking this extra step might be necessary, at the cost of hard drive space, as these records have to be held for a very long time.
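
The per-pageview version of that increment would be something like this (made-up table name, assuming a UNIQUE KEY on site_id + day so the insert-or-update works; $site_id, $is_unique and $is_reload would come from the logging code):

<?php
// One summary row per site per date, bumped on every pageview.
$sql = sprintf("INSERT INTO daily_totals (site_id, day, uniques, pageviews, reloads)
                VALUES (%d, CURDATE(), %d, 1, %d)
                ON DUPLICATE KEY UPDATE
                    uniques   = uniques + %d,
                    pageviews = pageviews + 1,
                    reloads   = reloads + %d",
               $site_id,
               $is_unique ? 1 : 0, $is_reload ? 1 : 0,
               $is_unique ? 1 : 0, $is_reload ? 1 : 0);
mysql_query($sql);
?>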

KLB
03-15-2006, 09:37 PM
Required reading for anyone who wants to analyze their server logs and try to figure out the "uniques" their site gets: "Why web usage statistics are (worse than) meaningless" (http://www.goldmark.org/netrants/webstats/). Although this document is very old, it still pertains to web stats today.

Basically in order to know exactly how many visits one's site gets and how many people visit one's site, one needs a stats package with a server log version of the Heisenberg compensator used in transporters on the U.S.S. Enterprise.

Dan Grossman
03-15-2006, 09:44 PM
Basically in order to know exactly how many visits one's site gets and how many people visit one's site, one needs a stats package with a server log version of the Heisenberg compensator used in transporters on the U.S.S. Enterprise.

I know that no stats are truly accurate; I simply want to be as accurate as my competitors, or failing that, as accurate as I can be with the hardware available to me (for the time being, w3counter.com gets a P4 3.0 with hyperthreading and a gig of RAM, which may easily be upgraded to 2GB without moving to a new server entirely).

Since I have the browser available to me, unlike log-based analyzers, I think browser sessions are the best identifier of actual visits I can provide, cookies the best way of identifying returning visitors over multiple days, and IP/useragent pairs the best identifier of unique visitors within a day. If you have other suggestions I'd love to hear them -- I'm still designing things, and this is when decisions are easy to change.