
April 10, 2007

First example

A couple of days ago I posted about this in general terms, giving two examples. I got some feedback: just enough to say thanks to Stéphane Hamel and Philippe Reynet, and to realize that I need to flesh out the idea a bit more. So, here I go:

1) My goal is to model non-linear trends for a special kind of time series generated from web data. To illustrate, imagine a blog for which I have daily statistics. I assume that readership can be modeled the usual way (i.e. a time series is made up of a trend, cycles and random errors). I want to test the idea that the trend component is not linear or simply concave/convex, as time series models generally assume (we take differences until the series is stable, i.e. stationary, before applying procedures such as Box-Jenkins or others). More specifically, I want to explore the feasibility and usefulness of extracting s-shaped trends. One immediate application would be the ability to forecast the plateau that a series approaches. (I'll expand on this in future posts.)
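To make "s-shaped trend" and "plateau" concrete, here is a minimal sketch using a logistic curve. The logistic form is only one possible s-shape and is my assumption for illustration, not a commitment to a particular model; the function name and all parameter values are hypothetical.

```python
import math

def logistic_trend(t, plateau, growth_rate, midpoint):
    """S-shaped (logistic) trend: starts near 0 and levels off at `plateau`."""
    return plateau / (1.0 + math.exp(-growth_rate * (t - midpoint)))

# Hypothetical parameters, chosen for illustration only.
PLATEAU, GROWTH_RATE, MIDPOINT = 200.0, 0.1, 50.0

# Trend values for periods 1..200: slow start, fast middle, flat end.
trend = [logistic_trend(t, PLATEAU, GROWTH_RATE, MIDPOINT) for t in range(1, 201)]
print(f"period 50: {trend[49]:.1f}, period 200: {trend[-1]:.1f}")
```

If such a curve could be fitted to real data, the estimated `plateau` parameter would be the forecast of the level that the series approaches.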

2) Let me take you by the hand, step by step, to examine a "simple" example of such a time series. Figure 1 below shows the raw data, where we can readily see several outliers occurring at periods 24 to 28. Imagine, for illustrative purposes, that each point represents the number of unique daily visitors to a blog and that a hit storm occurred at periods 24-28 after a post made it to Digg or some other high-traffic referrer. I've uploaded the raw data (and procedures to estimate trend lines) here.

Figure 1: raw data


For the moment I will simply try to fit a linear trend line. Visual inspection will lead us to think that the trend line should stay close to the 175-190 range, i.e. slicing through the (seemingly) flat baseline. However, the hit storm will complicate matters.

My question in the first post was: what would you suggest to automate the process of dealing with outliers? Let me explain a bit further.

Figure 2 shows three estimates of the trend line. The first one (in red) is the regression line that you'll get with an OLS procedure. The slope is negative, which would be interpreted as a sign that the blog draws fewer and fewer visitors as time passes.

Figure 2: Three estimations: OLS (biased), M-reg and OLS with 2 intercepts (OK)


This result is suspect, and in this case the reason is fairly obvious: the OLS trend is created by the hit storm. Had the storm occurred in period 60 or so, the trend would have been positive. (OLS is slicing through all data points. For those more technically inclined, OLS assumes that errors are randomly distributed, which is very obviously NOT the case here. Simply put, OLS regression yields an inappropriate signal because the data is affected by a significant anomaly.)
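The sign flip is easy to reproduce. The sketch below is not the uploaded data: it builds a synthetic series with hypothetical numbers (80 periods, a gently rising baseline, a five-day storm that adds 570 visits) and fits a plain OLS line with the storm placed early and then late.

```python
import math

def make_series(storm_start, n_periods=80):
    """Synthetic daily visitors: gentle upward baseline plus a 5-day hit storm.

    All numbers are hypothetical; they only mimic the shape of figure 1.
    """
    periods = list(range(1, n_periods + 1))
    visits = []
    for t in periods:
        y = 175.0 + 0.1 * t + 3.0 * math.sin(1.7 * t)  # baseline + small wiggle
        if storm_start <= t < storm_start + 5:
            y += 570.0  # the storm pushes visits into the 750 range
        visits.append(y)
    return periods, visits

def ols_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope

# Same baseline (true slope +0.1), storm before vs. after the middle of the series.
_, early_slope = ols_line(*make_series(storm_start=24))
_, late_slope = ols_line(*make_series(storm_start=60))
print(f"storm at 24-28: slope = {early_slope:+.3f}")
print(f"storm at 60-64: slope = {late_slope:+.3f}")
```

In this toy series the baseline rises in both cases, yet the fitted slope takes the sign of whichever half of the series the storm falls in.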

There are (at least) two ways to deal with this situation. One can use robust regression (Huber 1981: Robust Statistics, Wiley), a general procedure in which observations are weighted according to how well they fit the regression line. Or one can model the hit storm explicitly. Robust regression (M-regression) is general because we do not have to pre-specify which data points are outliers: the iterative weighting takes care of that, which permits automated cleaning, so to speak. An explicitly specified OLS regression requires the analyst to add a dummy variable indicating that an observation belongs to an outlying group. That is feasible in this case, but impractical in others that I will introduce in future posts.
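For readers who want to see the iterative weighting at work, here is a minimal sketch of M-regression using Huber weights and iteratively reweighted least squares. It is not the procedure I uploaded. The synthetic series, the tuning constant k = 1.345, and the MAD-based scale estimate are common textbook choices that I am assuming here, and all function names are hypothetical.

```python
import math
from statistics import median

def make_series(storm_start=24, n_periods=80):
    """Synthetic visitors (hypothetical numbers): rising baseline + 5-day storm."""
    periods = list(range(1, n_periods + 1))
    visits = []
    for t in periods:
        y = 175.0 + 0.1 * t + 3.0 * math.sin(1.7 * t)
        if storm_start <= t < storm_start + 5:
            y += 570.0
        visits.append(y)
    return periods, visits

def weighted_line(x, y, w):
    """Weighted least squares fit of y = intercept + slope * x."""
    w_sum = sum(w)
    x_mean = sum(wi * xi for wi, xi in zip(w, x)) / w_sum
    y_mean = sum(wi * yi for wi, yi in zip(w, y)) / w_sum
    sxy = sum(wi * (xi - x_mean) * (yi - y_mean) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_mean) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope

def huber_line(x, y, k=1.345, max_iter=200, tol=1e-10):
    """Huber M-regression by iteratively reweighted least squares."""
    weights = [1.0] * len(x)  # the first pass is plain OLS
    intercept, slope = weighted_line(x, y, weights)
    for _ in range(max_iter):
        resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
        centre = median(resid)
        # Robust scale: median absolute deviation, rescaled for normal errors.
        scale = median([abs(r - centre) for r in resid]) / 0.6745
        if scale == 0.0:
            break
        # Huber weights: 1 for small residuals, shrinking as residuals grow.
        weights = [1.0 if r == 0.0 else min(1.0, k * scale / abs(r)) for r in resid]
        new_intercept, new_slope = weighted_line(x, y, weights)
        converged = (abs(new_slope - slope) < tol
                     and abs(new_intercept - intercept) < tol)
        intercept, slope = new_intercept, new_slope
        if converged:
            break
    return intercept, slope, weights

periods, visits = make_series()
intercept, slope, weights = huber_line(periods, visits)
print(f"robust slope = {slope:+.3f}")
print("weights for periods 24-28:", [round(w, 3) for w in weights[23:28]])
```

Nobody tells the procedure where the storm is. The storm periods simply end up with weights close to zero, which is the "automated cleaning" referred to above.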

The grey line (OLS with 2 intercepts + 1 trend parameter) and the green line (M-regression with one intercept and one trend parameter) are (1) practically identical, (2) intuitively better fits than the OLS line, as they nicely follow the baseline, and (3) positive rather than negative in slope.
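The two-intercept specification can be sketched as follows, again on a synthetic series with hypothetical numbers rather than the uploaded data. With a single dummy variable, the shared slope is the OLS slope obtained after centring each group (storm, baseline) on its own means, so no matrix algebra is needed. The function names are mine.

```python
import math

def make_series(storm_start=24, n_periods=80):
    """Synthetic visitors (hypothetical numbers): rising baseline + 5-day storm."""
    periods = list(range(1, n_periods + 1))
    visits = []
    for t in periods:
        y = 175.0 + 0.1 * t + 3.0 * math.sin(1.7 * t)
        if storm_start <= t < storm_start + 5:
            y += 570.0
        visits.append(y)
    return periods, visits

def ols_two_intercepts(x, y, in_storm):
    """OLS for y = a + d * storm_dummy + b * x (two intercepts, one slope)."""
    groups = {False: [], True: []}
    for xi, yi, flag in zip(x, y, in_storm):
        groups[flag].append((xi, yi))
    means = {}
    sxy = sxx = 0.0
    for flag, pts in groups.items():
        x_mean = sum(px for px, _ in pts) / len(pts)
        y_mean = sum(py for _, py in pts) / len(pts)
        means[flag] = (x_mean, y_mean)
        # Pool the within-group cross-products to get the shared slope.
        sxy += sum((px - x_mean) * (py - y_mean) for px, py in pts)
        sxx += sum((px - x_mean) ** 2 for px, _ in pts)
    slope = sxy / sxx
    base_intercept = means[False][1] - slope * means[False][0]
    storm_intercept = means[True][1] - slope * means[True][0]
    return base_intercept, storm_intercept, slope

periods, visits = make_series()
in_storm = [24 <= t <= 28 for t in periods]  # the analyst must supply this flag
base_a, storm_a, slope = ols_two_intercepts(periods, visits, in_storm)
print(f"slope = {slope:+.3f}, storm shift = {storm_a - base_a:.1f}")
```

The `in_storm` flag is the catch: the analyst has to know in advance which periods belong to the storm, which is exactly what the robust approach avoids.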

Differences between the OLS and the OLS-2/robust regressions may seem trivial, but let me assure you that they are not. First, the essential conclusion is the opposite (the trend is negative if we rely on OLS, positive if we rely on robust regression). Second, the robust regression highlights the fact that the outliers are not meaningfully related to the bulk of the data (obviously!), which makes us focus on the relevant data range, displayed in figure 3. You may not believe it at first, but figure 3 shows the exact same data as figure 2, except that the outlying values are truncated.

Figure 3: A closer look at the data


Looking at the data with a proper range shows how inadequate the OLS estimate really was. In figure 3, we display the hit counts in the 150-190 range, ignoring the hit storm (which pushed hits into the 750 range). The red line (OLS trend estimate) does not track the data at all. The M-reg and OLS-2 lines are still indistinguishably close and track the data "nicely". In other words, using an appropriate technique matters enormously. (Arguably more so when I move to non-linear trends, where the impact of random shocks is more severe.)

And notice what happens in the vicinity of the storm. What looked like a flat line in figure 1 or 2 shows a typical cycling pattern, with a steep climb prior to the storm, and a trough following the storm.

So, for now, my belief is that a robust non-linear regression model would do the job. My belief is fragile, however, as I have not tested it against other artefacts that I know do occur in the time series I've examined. As of today I would bet that robust estimates will do well with random outliers, but I am not so sure in the case of steps. So I ask again: if you are familiar with smart procedures that can spot patterns in time series and that could be used in conjunction with non-linear regression models, let me know.
