Data Mining

Advertising on the Web and Data mining

The challenge of effective web advertisement primarily involves placing relevant ads on user requested web pages. Those ads must be relevant to a page receiver, that is relevant to the page context and/or directly to the user. What algorithms are being used for this? What trends are there now in business intelligence and data mining for […]

The challenge of effective web advertisement primarily involves placing relevant ads on user requested web pages. Those ads must be relevant to a page receiver, that is relevant to the page context and/or directly to the user. What algorithms are being used for this? What trends are there now in business intelligence and data mining for digital advertisement solutions?

Web Ad Placement

There are some vital factors for web ad placement that we state below:

  1. Positioning. The position of the ad on a page greatly influences its clickability. The click probability decreases exponentially as the position/sequence increases.
  2. Similarity to the user’s search queries. The ad attractiveness correlates with the query terms.
  3. Specialized media (topic oriented magazines/ blogs with a high ads click-through rate). Statistics prove that users do prefer a web page relevant ad rather than unrelated messy stuff.
  4. User relevant. The web media use the advantage of having some information about the user. This enables them to choose the ‘right’ ad for ‘the user’. Statistically and cumulatively one may determine a visitor’s interest for certain things, for example:
    • search queries (correlates with point 2)
    • social nets and related groups
    • email content exchange  (being used extensively by Google – ethicalness is not quite clear)
    • bookmarks and backlinks

Of course, lots of privacy issues get involved here; that we will leave for future discussions.

Search queries dependent ads

Much of the effectiveness of search advertising comes from the “adwords” model of matching search queries to advertisements. Search ads are placed among the results of a search query. Advertisers bid for the right to have their ad shown in response to certain queries, but they pay only if the ad is clicked on. The particular ads to be shown are selected by a complex process, which we consider in this post. It embraces search terms that the advertiser has bid for, the amount of their bid, the statistical probability that the ad will be clicked on, and the total budget the advertiser has offered for the service.

On-Line Algorithms

What algorithms are to be applied in placing ads in modern web business? The algorithms for getting ads, that are relevant to search queries, must be of the on-line nature. Strictly, on-line algorithms are those where data input is fed into the algorithm, without having the entire input available from the start. It very much looks like signal processing (DSP), where the input stream is not evident from the start. Input data stream might also be so fast, that the processing latency is of much higher importance compared to static presented data input. But this is not of our present scope.

‘On-line’ here refers to the nature of the algorithm, and should not be confused with ‘on-line’ meaning ‘on the Internet’

The main characteristic under our scope is that the on-line algorithm cannot access the data in any order (unlike the off-line algorithm). Even when it cannot grasp all the data, the algorithm must make some decisions. The list of online algorithms you might look at are here.
Using off-line algorithms for ad placement, we just observe search queries for a time period, and consider the bids advertisers made on search terms, as well as their advertising budgets for the time period. Thus the algorithm can then assign ads to the queries in a way that maximizes both the revenue to the search engine and the number of impressions that each advertiser gets. But since users who issue queries can’t wait for the aggregate results, on-line algorithms enter in here.

With on-line algorithms we may use past data for computing (like click-through rate), yet we cannot assume having certain queries or events in the future.

Greedy Algorithms

Since we can hardly calculate the maximum benefit with non-full-scope input data, why not be ‘greedy’ with every step of process, that is to calculate maximum profit at each step hoping to get optimum benefit in the total (global). This is the definition of greedy algorithm. Many on-line algorithms are of the greedy algorithm type. These algorithms make their decision in response to each input element by maximizing some function of the input data and the past information (like clicking behavior or the ad’s budget remaining). These greedy strategies in general do not always produce an optimal solution. So we consider next term.

Competitive ratio

Competitive ratio is the minimum rate of the online algorithm compared to the optimum off-line algorithms for the same case/problem over all possible inputs (especially the worst case). c = efficiency(online)/efficiency(offline) in the worst case. Through some calculation one may prove that generally for the greedy algorithm in ads problem the competitive ratio is not less than 1/2.

Balance Algorithm

But there are other than greedy algorithms, giving better a competitive ratio. Besides the maximum present revenue balance, the balance algorithm considers also the greatest remaining budget of an advertiser in order to decide which ad to show. Its ultimate competitive ratio (simplified adwords model) will go close to 0.63 as the number of bidders and queries grow. However it does not work optimally in some extreme cases.

So the Generalized Balance algorithm takes into consideration two more things:

  • each advertiser’s budget should be used, not just the greatest remaining budget
  • the benefit of higher bid values (product of the bid and the click-through rate)


In this example we compare the simplified situation for off-line and greedy on-line algorithms in ad placement.

Advertiser ‘A’  has bid 10 cents on the search term “cookie”. Advertiser B has bid 20 cents on both the terms “cookie” and “fruit”. Both have monthly budgets of $100. We have no past statistics on either of these terms. We presuppose that only one ad is  to be to displayed with one query. The obvious thing to do is to display B’s ad, because they bid more (greedy approach).

However, suppose there will be lots of search queries this month for “fruit” but very few for “cookie.” Then A will never spend its $100 budget, while B will spend its full budget even if we give the query to A.

The worst case that can happen is that 500 “fruit” queries arrive, followed by 500 “cookie” queries (see the figure below, at right).  The greedy algorithm will assign the first 500 ads to B, earning $100, and then has no ad for the next 500, earning nothing. The competitive ratio in this case is equal 2/3.

Specifically, if there will be at least 500 more queries for either “fruit” or “cookie,” then there is no harm, and potentially a benefit, in giving the query to A. It will still be possible for B to spend its entire budget, while we are increasing the amount of A’s budget that will be spent. This argument makes sense both from the point of view of the search engine, which wants to maximize total revenue, and from the point of view of both A and B advertisers, who presumably want to get all the impressions that their budgets allow.
If we could know the future, then we would know how many more “fruit” queries and how many more “cookie” queries were going to arrive this month. If that number is below 500, then we want to give the query to B to maximize revenue, but if it is 500 or more, then we want to give it to A. Since we don’t know the future, an on-line algorithm cannot always perform as well as an off-line algorithm.


These basic theses and the simplified example have shown the challenge and also an elementary approach (greedy, on-line) to ad placement related to search queries. In following post we will consider more advanced algorithms for data mining applied to web advertising.

Leave a Reply

Your email address will not be published.

This site uses Akismet to reduce spam. Learn how your comment data is processed.