Statistics: Tracking and Storing

Knowing your visitors is one of the most important parts of building a website, because it informs both the layout and the design. One of the best things you can do is research your visitors constantly. This article outlines several ways of storing visitor data, how to register page requests, and how to work out how many unique visitors a site gets.

Storing the hits

Before you even start to write the system you will need to decide how you will store the page requests. Will you store each request separately in a raw format, or will you parse the request then and there and store the information as a tally? Storing the requests “raw” means dumping as much or as little information as you want straight into the database, with each piece of information (such as the IP address, the page requested, and the browser used) stored in a separate field.

The benefits of using a raw storage method include:

  • very fast to register the request,
  • more ways of sorting the data when viewing it, such as finding detailed information about requests between any two times/dates,
  • more detailed data representation systems can be written at a later date, and
  • no information is lost; with parsing, something as small as a change in the format of the browser's user-agent signature can make you lose information.

On the other hand the parse-on-request method benefits from:

  • typically using less storage space, and
  • faster data representation.
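
To make the difference concrete, here is a minimal sketch of both storage methods using Python and an in-memory SQLite database. The table names, column names, and function names are all made up for illustration, and a real system would likely store more fields than these.

```python
import sqlite3
from datetime import datetime, timezone


def open_db():
    """Create an in-memory database with one table per storage method."""
    db = sqlite3.connect(":memory:")
    # Raw storage: one row per request, one field per piece of information.
    db.execute(
        "CREATE TABLE raw_hits "
        "(requested_at TEXT, ip TEXT, page TEXT, user_agent TEXT)"
    )
    # Parse-on-request storage: one counter per page.
    db.execute(
        "CREATE TABLE page_tally (page TEXT PRIMARY KEY, hits INTEGER NOT NULL)"
    )
    return db


def store_raw(db, ip, page, user_agent, when=None):
    """Dump the request into the database as-is."""
    when = when or datetime.now(timezone.utc)
    db.execute(
        "INSERT INTO raw_hits VALUES (?, ?, ?, ?)",
        (when.isoformat(), ip, page, user_agent),
    )


def store_tally(db, page):
    """Bump the counter for this page, creating it on the first hit."""
    cursor = db.execute(
        "UPDATE page_tally SET hits = hits + 1 WHERE page = ?", (page,)
    )
    if cursor.rowcount == 0:
        db.execute("INSERT INTO page_tally (page, hits) VALUES (?, 1)", (page,))
```

The raw table grows by one row per request but can still answer questions nobody thought of when it was written. The tally table stays small but only knows what it was told to count.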

Registering requests

Once you have decided how you will store the information, it’s time to populate the database. There are really only three ways of doing this in a stand-alone system: registering a request purely on the server, using a client-side script that sends information back to the server, or a mixture of both.

Using a server-side script is the safest and fastest way of registering a request. If the requested page already uses a database and the page requests are stored in that same database, you save time by not having to connect to another one or initialize a separate script. One downside of this method is that you have limited information about the visitor: mainly the HTTP header and the time of the request. This is normally all you need, however. Another problem you will face is fake visits from search engine bots and other crawlers/spiders. Most bots are well behaved and identify themselves as bots in the HTTP request, but a few don’t. There is no way to filter those out unless you also use something on the client side.
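
As a rough sketch of what a server-side script has to work with, the Python below builds a hit record from nothing but the HTTP headers, the visitor's address, and the time of the request. The function names and the short list of bot markers are assumptions for illustration; the list is nowhere near complete, and it only catches bots that announce themselves.

```python
# Illustrative substrings only; well-behaved bots usually include
# something like these in their User-Agent header.
BOT_MARKERS = ("bot", "crawler", "spider")


def looks_like_bot(user_agent):
    """Return True if the User-Agent announces itself as a bot."""
    ua = (user_agent or "").lower()
    return any(marker in ua for marker in BOT_MARKERS)


def build_hit(headers, remote_addr, path, now):
    """Collect everything the server knows about one request.

    `headers` is a plain dict of HTTP request headers, `remote_addr`
    is the visitor's IP address, and `now` is the time of the request.
    """
    user_agent = headers.get("User-Agent", "")
    return {
        "time": now,
        "ip": remote_addr,
        "page": path,
        "user_agent": user_agent,
        "referrer": headers.get("Referer", ""),
        "is_bot": looks_like_bot(user_agent),
    }
```

Flagging a bot rather than discarding the request keeps the decision reversible, which fits well with the raw storage method.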

Using a language such as JavaScript on the client side allows you to retrieve more data about the visitor. Information such as screen/browser resolution and Macromedia Flash support is not sent to the server when the page is requested. Bots also normally ignore these scripts, so you will not get fake requests from them. However, if the visitor has JavaScript disabled then no information will be sent back to the server, so the request will not be registered. This method will also register a hit whenever a visitor presses the back button in their browser. On the plus side, the client-side script will always register a hit even if the page was cached locally, something that a server-side script cannot enforce or track.
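
Whatever the client-side script collects still has to arrive at the server somehow. One common approach is for the script to request a small tracking URL with the data in the query string. The Python below is a sketch of the receiving end only; the `/hit` path and the `page` and `screen` parameter names are invented for this example.

```python
from urllib.parse import parse_qs, urlsplit


def parse_beacon(url):
    """Extract client-side data from a tracking URL.

    Assumes the (hypothetical) client script requests something like
    /hit?page=/index.html&screen=1024x768
    """
    query = parse_qs(urlsplit(url).query)

    def first(key):
        # parse_qs returns a list per key; take the first value if any.
        return query.get(key, [""])[0]

    width, _, height = first("screen").partition("x")
    return {
        "page": first("page"),
        "screen_width": int(width) if width.isdigit() else None,
        "screen_height": int(height) if height.isdigit() else None,
    }
```

Because anyone can request that URL with any values, the parsed data should be treated as untrusted input before it goes into the database.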

It is possible to get the best of both worlds. By using both server-side and client-side scripts you can work out what percentage of your visitors have JavaScript enabled and, for those that do, get information about their settings as well. The only real problem with doing this is that you will be roughly doubling the tracking load on your server, which may slow it down.
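
The percentage itself is simple arithmetic once both counts are available. The Python sketch below assumes you have already counted, for the same period and with bots excluded, how many requests were registered by the server-side script and how many by the client-side script; the function name is made up.

```python
def javascript_share(server_hits, client_hits):
    """Estimate the percentage of requests made with JavaScript enabled.

    Returns None when there is nothing to compare against. The result is
    capped at 100 because client-side scripts also fire on cached pages
    and back-button visits that the server never sees.
    """
    if server_hits <= 0:
        return None
    return 100.0 * min(client_hits, server_hits) / server_hits
```

For example, 200 requests seen by the server and 180 reported by the client-side script would suggest that about 90% of requests came from browsers with JavaScript enabled. Treat the figure as an estimate rather than an exact measurement.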

Working out who’s unique

One vital piece of information that is not directly included in page requests is how many unique visitors you have and how many pages they viewed while they were on your site. The only two ways of working this out are sorting the data by IP address or sending a cookie to the visitor’s browser.

Tracking by IP is extremely easy to do, and the address is normally already stored when you keep the data in a raw format, so no extra storage space is required. Two things to watch out for, however, are groups of visitors that share IP addresses, such as people on education campuses or in business offices, and people with dynamic IPs, which is almost everyone connected to the internet.
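
As a small sketch of how this works on raw data, the Python below groups stored hits by IP address to get both the number of unique visitors and the number of pages each one viewed. The function name and the `(ip, page)` tuple format are assumptions for the example.

```python
from collections import defaultdict


def pages_per_ip(hits):
    """Count page views per IP address.

    `hits` is an iterable of (ip, page) pairs taken from the raw data.
    The number of keys in the result is the unique-visitor estimate.
    """
    counts = defaultdict(int)
    for ip, _page in hits:
        counts[ip] += 1
    return dict(counts)
```

The caveats above still apply: a whole office behind one address counts as a single visitor, and one person whose address changes counts as several.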

Using cookies is the best way, as it almost guarantees that each unique visitor will get a different session ID, even when they share their IP address with other machines. Cookies are also enabled in almost every browser by default, and because they do not directly pose a security risk they are seldom deactivated. However, as with tracking by IP, there is no way of finding out whether a single machine is being used by multiple people.
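
Here is one way the cookie approach might look in Python, using only the standard library. The cookie name `visitor_id`, the one-year lifetime, and the function name are arbitrary choices for this sketch. It reuses the ID if the browser sent one and otherwise generates a new random one.

```python
import secrets
from http.cookies import SimpleCookie

COOKIE_NAME = "visitor_id"  # hypothetical name chosen for this example
ONE_YEAR = 60 * 60 * 24 * 365  # cookie lifetime in seconds


def visitor_id_from_request(cookie_header):
    """Return (visitor_id, set_cookie_value).

    `cookie_header` is the raw value of the request's Cookie header.
    `set_cookie_value` is None for returning visitors, or the value to
    send in a Set-Cookie response header for new ones.
    """
    incoming = SimpleCookie()
    incoming.load(cookie_header or "")
    if COOKIE_NAME in incoming:
        return incoming[COOKIE_NAME].value, None

    # New visitor: generate a random ID that is hard to guess.
    visitor_id = secrets.token_hex(16)
    outgoing = SimpleCookie()
    outgoing[COOKIE_NAME] = visitor_id
    outgoing[COOKIE_NAME]["path"] = "/"
    outgoing[COOKIE_NAME]["max-age"] = ONE_YEAR
    return visitor_id, outgoing[COOKIE_NAME].OutputString()
```

Storing the returned ID alongside each hit lets you group requests per visitor in the same way as grouping by IP, without the shared-address problem.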

Another way of tracking

One other way of tracking visitors is by parsing the web-server’s page request log. Using this method means you do not have to worry about entering information into a database yourself, and no extra load is added to the server while pages are being served. There are a few major drawbacks to this technique that you have to keep in mind, however:

  • you need access to the logs themselves (not normally available on shared servers),
  • not a lot of information is stored in them anyway,
  • they are slow to parse, as they are normally large text files with one request per line, and
  • server logs are regularly compressed and archived and not all may be available at any given time.

If you can, it is highly recommended that you stay away from using server logs to get traffic information for your sites, as they were not made for the job.
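
If logs are all you have, the parsing itself might look something like the Python sketch below. It assumes the log uses the widely used Common Log Format; your server may be configured differently, in which case the pattern will need adjusting. The function and pattern names are made up for the example.

```python
import re

# Matches one line in Common Log Format, for example:
# 203.0.113.5 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<page>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)


def parse_log_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_LINE.match(line)
    return match.groupdict() if match else None
```

Note how little comes out: an address, a time, a page, and a status. Anything like the browser used or the referring page is only there if the server was configured to log it.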

Recap

The four main points raised in this article were:

  • storing information in a raw format provides more data and can be easily expanded upon at a later date,
  • page requests can be tracked by using a server-side script, a client-side script, or a mixture of both, each method having its own advantages and disadvantages,
  • using cookies to work out unique visitors will provide more accurate results than tracking by IP address, and
  • it is possible to track visitors from a server log but it isn’t very helpful or efficient.
