Research Data Analysis of Web Traffic

I want to share some research by a group of Estonian students on web traffic analysis. Their case study is about mining frequent user access patterns from web log files. The primary objective was to discover the most frequent browsing patterns by analyzing the browsing sessions recorded in the logs.

Because they worked with raw logs from a public school website, some preliminary work was needed first.

Data preparation

Data preparation involved two steps:

Data cleaning: only GET and POST requests were taken into consideration; OPTIONS requests were excluded, along with some other refinements.
Session grouping: requests made by a user during one visit were gathered into sessions, using a threshold of t = 30 min.
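
The two preparation steps can be sketched roughly as follows. This is only an illustration, not the students' code: the record layout (user, timestamp in seconds, method, page), the function name build_sessions, and the rule that a gap longer than 30 minutes starts a new session are all assumptions made for the example.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60      # the t = 30 min threshold from the text
KEPT_METHODS = {"GET", "POST"}     # OPTIONS and anything else is dropped


def build_sessions(requests, gap=SESSION_GAP_SECONDS):
    """Group (user, timestamp, method, page) records into per-visit page lists.

    The record layout is hypothetical; a real log would need parsing first.
    """
    by_user = defaultdict(list)
    for user, timestamp, method, page in requests:
        if method in KEPT_METHODS:             # data cleaning
            by_user[user].append((timestamp, page))

    sessions = []
    for hits in by_user.values():
        hits.sort()                            # chronological order per user
        current = [hits[0][1]]
        for (prev_ts, _), (ts, page) in zip(hits, hits[1:]):
            if ts - prev_ts > gap:             # assumed rule: long pause = new visit
                sessions.append(current)
                current = []
            current.append(page)
        sessions.append(current)
    return sessions


log = [
    ("u1", 0, "GET", "/a"),
    ("u1", 60, "OPTIONS", "/x"),
    ("u1", 120, "POST", "/b"),
    ("u1", 120 + 1801, "GET", "/c"),
    ("u2", 10, "GET", "/a"),
]
print(build_sessions(log))
```

Here the OPTIONS request is discarded, and the pause of more than 30 minutes before "/c" splits the first user's requests into two sessions.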

Objective

The most important question was whether there are groups of pages that users tend to visit together.
To answer it, they identified the frequent “item sets”: sets of pages that most often appear together in visitors’ sessions. The data mining measure called “support” (the fraction of sessions that contain a given item set) was used to decide whether an item set is frequent.
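
To make “support” concrete, here is a minimal sketch that counts item sets by brute force. It is not Apriori or whichever algorithm the students actually used, and the names frequent_itemsets, min_support and max_size are invented for the example.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(sessions, min_support, max_size=3):
    """Return {itemset: support} for item sets of up to max_size pages.

    Support is the fraction of sessions containing every page of the set.
    Brute-force enumeration: fine for a demo, too slow for long sessions.
    """
    n = len(sessions)
    if n == 0:
        return {}
    counts = Counter()
    for session in sessions:
        pages = sorted(set(session))           # order and repeats are ignored
        for size in range(1, max_size + 1):
            for itemset in combinations(pages, size):
                counts[itemset] += 1
    return {s: c / n for s, c in counts.items() if c / n >= min_support}


sessions = [["A", "B", "C"], ["A", "B"], ["A", "C"], ["B", "D"]]
for itemset, support in frequent_itemsets(sessions, min_support=0.5).items():
    print(itemset, support)
```

With these four toy sessions and a minimum support of 0.5, the pairs (A, B) and (A, C) are frequent, while (B, C) appears in only one session and is dropped.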

Identifying such item sets is not enough, though. Simply listing all frequent item sets says little about how users actually use the website, for two reasons:

frequent item sets do not capture the order in which pages were visited
information about repeated visits to the same page is lost when item sets are constructed.

Frequent sequential patterns

Such sequences can be found by searching for runs of clicks that frequently follow one another. For example, consider pages A, B, C and D; a click sequence might be A-B-C-B-D-C-B-A… The analysis also took into account the amount of time spent on each page, derived from the log files as the interval between two consecutive clicks, with a slight adjustment for the HTTP request response time. For each frequent sequence, they found all sessions that contain it and then calculated the average time spent viewing each page in that sequence. The greatest amount of time, however, turned out to be spent in frequent sequences whose pages were visited only for navigation and not for content. This finding might help to fix that shortcoming of the site.
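
The idea can be sketched as below, under several assumptions that are not from the study: sessions are lists of (page, timestamp in seconds) pairs, a “sequence” means consecutive clicks, the response-time adjustment is a single constant, and the average is taken over every occurrence of the sequence. The function names are hypothetical.

```python
from collections import defaultdict


def frequent_sequences(sessions, length, min_support):
    """Support of each run of `length` consecutive pages, counted once per session."""
    n = len(sessions)
    counts = defaultdict(int)
    for session in sessions:
        pages = [page for page, _ in session]
        runs = {tuple(pages[i:i + length]) for i in range(len(pages) - length + 1)}
        for run in runs:
            counts[run] += 1
    return {run: c / n for run, c in counts.items() if c / n >= min_support}


def average_dwell(sessions, sequence, response_time=0.0):
    """Average seconds spent on each page of `sequence`.

    Dwell time is the gap to the next click minus an assumed constant
    response time, so an occurrence at the very end of a session is skipped
    (its last page has no following click to measure against).
    """
    k = len(sequence)
    totals = [0.0] * k
    occurrences = 0
    for session in sessions:
        pages = [page for page, _ in session]
        times = [ts for _, ts in session]
        for i in range(len(session) - k):
            if tuple(pages[i:i + k]) == tuple(sequence):
                occurrences += 1
                for j in range(k):
                    gap = times[i + j + 1] - times[i + j] - response_time
                    totals[j] += max(0.0, gap)
    return [total / occurrences for total in totals] if occurrences else None


sessions = [
    [("A", 0), ("B", 10), ("C", 40), ("D", 45)],
    [("A", 0), ("B", 20), ("C", 30)],
    [("B", 0), ("D", 5)],
]
print(frequent_sequences(sessions, length=2, min_support=0.5))
print(average_dwell(sessions, ("A", "B")))
```

A page with high support but a short average dwell time in many sequences is a candidate navigation-only page, which is the kind of signal described above.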

Path traversal pattern mining

From here, methods were used to mine only those reference sequences (patterns) that contain forward navigations. With back references eliminated from the sessions, these patterns illustrate what people are actually looking for (e.g. destination pages).
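
One well-known way to strip back references is to split each session into “maximal forward references”. Whether the students used exactly this technique is an assumption; the sketch below only shows the general idea, and the function name is hypothetical.

```python
def maximal_forward_references(session):
    """Split one session into forward-only paths.

    A revisit of a page already on the current path counts as a backward
    move: the path built so far is emitted (if the previous move was
    forward) and then cut back to the revisited page.
    """
    path = []
    results = []
    moving_forward = False
    for page in session:
        if page in path:                        # backward reference
            if moving_forward:
                results.append(list(path))
            path = path[: path.index(page) + 1]
            moving_forward = False
        else:                                   # forward reference
            path.append(page)
            moving_forward = True
    if moving_forward:                          # session ended on a forward move
        results.append(list(path))
    return results


clicks = ["A", "B", "C", "B", "D", "C", "B", "A"]
print(maximal_forward_references(clicks))
```

For the click sequence A-B-C-B-D-C-B-A this yields the two forward paths A-B-C and A-B-D-C; the trailing returns to B and A add nothing new. Frequent patterns can then be mined from these paths instead of the raw sessions.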

User specified pages

One may also isolate the patterns that contain the most relevant pages (those of interest for the analysis). The output then contains the more relevant patterns, with higher support. This method provides a simple way to analyze individual use cases separately and thoroughly.
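
A simple way to picture this step is a filter over already-mined patterns. The sketch assumes patterns are stored as a mapping from page tuples to support values; the function name, the require_all switch and the example pages are all made up for illustration.

```python
def filter_patterns(patterns, pages_of_interest, require_all=False):
    """Keep patterns that touch the chosen pages, highest support first.

    patterns: {tuple_of_pages: support}. With require_all=True a pattern
    must contain every chosen page; otherwise one match is enough.
    """
    wanted = set(pages_of_interest)
    kept = {}
    for pattern, support in patterns.items():
        present = wanted & set(pattern)
        matches = present == wanted if require_all else bool(present)
        if matches:
            kept[pattern] = support
    return dict(sorted(kept.items(), key=lambda item: item[1], reverse=True))


patterns = {
    ("/home", "/news"): 0.4,
    ("/home", "/timetable", "/contact"): 0.2,
    ("/home", "/timetable"): 0.3,
    ("/news", "/contact"): 0.1,
}
print(filter_patterns(patterns, ["/timetable"]))
```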

Changes in patterns through time

The data set was split into two parts: August data and September data. The sequential patterns found in each were compared at the same relative support. The results revealed a change in the page content that people generally reach. The change was slight, but it showed some measure of seasonal behavior among users.
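
Such a comparison could look roughly like this. The sketch assumes each period has already been mined into a mapping from pattern to relative support; the function name, the min_change parameter and the sample numbers are illustrative only and are not results from the study.

```python
def compare_periods(support_a, support_b, min_change=0.0):
    """Return {pattern: support_b - support_a}, largest absolute change first.

    A pattern missing from one period is treated as support 0.0 here. In
    practice it may simply have fallen below the mining threshold, so such
    deltas are only an approximation.
    """
    changes = {}
    for pattern in set(support_a) | set(support_b):
        delta = support_b.get(pattern, 0.0) - support_a.get(pattern, 0.0)
        if abs(delta) >= min_change:
            changes[pattern] = delta
    return dict(sorted(changes.items(), key=lambda item: abs(item[1]), reverse=True))


# Made-up supports for two periods, purely to show the mechanics.
period_a = {("A", "B"): 0.5, ("A", "C"): 0.25}
period_b = {("A", "B"): 0.25, ("B", "D"): 0.125}
for pattern, delta in compare_periods(period_a, period_b).items():
    print(pattern, delta)
```

Because both inputs are relative supports, the two periods can contain different numbers of sessions and still be compared directly.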

Conclusion

The research shows a practical example of how traditional frequent pattern mining algorithms can be useful in a web analytics context and in understanding website users’ needs. Sequential patterns along with time spent on pages helped to identify hop-pages and interesting page sequences.

Tracking how patterns change through time might also be used to compare user interests across different time periods.