Research: www series

March 14, 2008

Recession?

The noise is almost deafening -- recession is all but upon us. Look at the trends.google.com chart below. The "buzz" (number of search queries) is way up. The first jitter occurred about a year ago. The news headline of the day said that according to economists, a recession was unlikely. Click on the chart to access the (most recent) chart and mull over the morphing newscape.

Recession

And then, consider this: actual data on the US GDP, available here from the Bureau of Economic Analysis. For convenience I have computed the average annual GDP growth rate in constant USD (2000). (Warning: the time scales of the two charts are different. The BEA data covers the 1947-2007 time period.)

Gdp19472007_2

Is the buzz picking up early signs of a weakening economy? (such as the credit crunch). Is the buzz feeding on itself to ultimately yield a self-fulfilling prophecy? (by affecting consumer/investor confidence). Could the buzz be just that -- buzz?

February 03, 2008

Jamming the signal

The signal emanating from the digital sphere can be closely related to the "real" public opinion. Based on data on the French presidential race of 2007, a visibility index designed by Swammer correlates reasonably well with the results of opinion polls and with the actual result of the electoral race. (read further down in the www series category to find relevant posts).

The situation might be different elsewhere. While there is considerable buzz surrounding Ron Paul, a Republican candidate, his bid fails to translate into significant numbers if we consider tracking polls.

Visibility data as captured by Swammer places Paul a distant fourth, with roughly half of the leader's score (McCain). If the ranking matches tracking polls, the distance is much smaller -- tracking polls give McCain a 5-to-1 lead, compared to a 2-to-1 lead in visibility. More striking, perhaps, is Google Trends, which shows Ron Paul ahead of McCain in the number of search queries.

Why?

Understanding when/how/why Internet-initiated buzz becomes commonplace opinions may have to do with the breadth of the buzz, more than it relates to its intensity.

January 25, 2008

On the accuracy of opinion polls

A talk today by a sociologist, on the French elections (Spring 2007). One interesting comment -- French pollsters massage their results in ways that are not totally clear. Pr Durand referred to the 2002 "debacle" where polls put Chirac (a conservative) and Jospin (a socialist) clearly in the lead. One thesis is that "reverse strategic voting" happened (i.e. supporters of front runners felt free to "send a message" and voted for someone else, that someone being Le Pen, a candidate from the far right). Another thesis is that polls were just plain wrong. Pr. Durand made comments to the effect that some pollsters had raw data suggesting that Le Pen was ahead of Jospin close to election time, but "calibrated" (or fudged?) the data, thinking that it was just not a reasonable hypothesis.

As we are in the process of comparing webtrends with opinion polls, it is important to assess the accuracy of opinion polls. So, back in the office, I finished assembling the data available from this site for analysis.

Opinion polls are a national pastime in France. The dataset has no fewer than 105 polls published by 7 leading organizations, starting in October 2006 up to April the 20th (election day being April 22nd).

Polls included voting intentions for most leading candidates. I'll report only on the top 2 (Royal and Sarkozy).

The figure below shows their score over time.

Rs1

I prefer to work on the ratio of scores, in order to isolate the effect of other candidates (i.e. both Royal's and Sarkozy's shares are going down because of the strong showing by Bayrou. In the end these do not matter as only the first and the runner-up make it to the final electoral round). So here is the ratio of Royal's score to Sarkozy's, over time. I've also added a moving average (5-period centered) for convenience.
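To make the two transformations concrete, here is a minimal sketch in Python. The function names and the numbers are made up for illustration; they are not the actual poll data, and the handling of the series ends (left empty) is an assumption.

```python
def score_ratio(royal, sarkozy):
    """Element-wise ratio of Royal's score to Sarkozy's score."""
    return [r / s for r, s in zip(royal, sarkozy)]


def centered_moving_average(values, window=5):
    """Centered moving average (odd window); ends without a full window are None."""
    half = window // 2
    out = [None] * len(values)
    for i in range(half, len(values) - half):
        out[i] = sum(values[i - half:i + half + 1]) / window
    return out


# Made-up voting intentions (percent), NOT the actual poll data.
royal = [30.0, 29.0, 31.0, 28.0, 27.0, 26.0, 25.0]
sarkozy = [30.0, 31.0, 31.0, 32.0, 30.0, 29.0, 28.0]

ratio = score_ratio(royal, sarkozy)
smoothed = centered_moving_average(ratio, window=5)
```

With a 5-period centered window, the first two and last two points have no smoothed value, which is why a smoothed line typically starts and ends slightly inside the raw series.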

Rs2

The raw series is not smooth at all. The jerky motion is not random -- there is a systematic pollster impact. Now, we know that there IS a true voting intention, measured at various points in time by various pollsters using different calibration approaches. So the goal here is to try to isolate pollster bias to recover the "true" voting intent, which we will eventually relate to web data.

So I've set up a quick and dirty model using my favorite functional form (a + b(T^d/(T^d+g))), replacing "a" with pollster-specific parameters.
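As a sketch of what such a model looks like, assuming the functional form reads a + b * T^d / (T^d + g): the curve starts at a and levels off at a + b, and giving each pollster its own "a" shifts the whole curve up or down by a constant. The pollster labels and parameter values below are hypothetical, not estimates from the poll data.

```python
def s_trend(t, a, b, d, g):
    """S-shaped trend a + b * t^d / (t^d + g), for t >= 0 and g > 0."""
    td = t ** d
    return a + b * td / (td + g)


def pollster_trend(t, pollster, intercepts, b, d, g):
    """Same curve, with the intercept 'a' looked up per pollster."""
    return s_trend(t, intercepts[pollster], b, d, g)


# Hypothetical parameter values and pollster labels, for illustration only.
intercepts = {"pollster_A": 0.95, "pollster_B": 1.03}
b, d, g = -0.10, 2.0, 4.0

early = pollster_trend(1, "pollster_A", intercepts, b, d, g)
late = pollster_trend(1000, "pollster_A", intercepts, b, d, g)  # near a + b
gap = (pollster_trend(10, "pollster_B", intercepts, b, d, g)
       - pollster_trend(10, "pollster_A", intercepts, b, d, g))
```

In this parameterization the gap between two pollsters is the same at every period, which is what "systematic" bias means here: a constant offset rather than a different trend.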

First, results show surprisingly large discrepancies between pollsters. Systematic differences range from -.06 to +.08 (now, this is for the shares ratio. Translated back into voting intentions, these differences look like 49% 51% on the one hand, and 45% 55% on the other. These differences would be dismissed as falling close to the margin of error. Inappropriate in the case of systematic runs, but we do not want to become technical here, so let's move on). It is these systematic differences that produce the jagged yellow prediction line. (see graph below)
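The translation from a shares ratio back to a two-way split is simple arithmetic: if r is the Royal/Sarkozy ratio, Royal's two-way share is r / (1 + r). A small sketch follows; the baseline ratio of 0.88 is an assumption (the post does not state it), picked because it roughly reproduces the two splits quoted above.

```python
def shares_from_ratio(ratio):
    """Convert a Royal/Sarkozy ratio into two-way shares summing to 100."""
    royal = 100.0 * ratio / (1.0 + ratio)
    return royal, 100.0 - royal


# Assumed baseline ratio (not stated in the post), chosen for illustration.
baseline = 0.88
low = shares_from_ratio(baseline - 0.06)   # pollster with the lowest offset
high = shares_from_ratio(baseline + 0.08)  # pollster with the highest offset
print(round(low[0]), round(low[1]))    # 45 55
print(round(high[0]), round(high[1]))  # 49 51
```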

Second, the model predicted results that are very close to the actual outcome. In other words, averaging polls ended up being a reasonably accurate predictor. And moreover, the non linear model reaches its plateau at the very second time period. No surprise. (now, this model *fits* the data. I am not saying that the outcome was known from the beginning. More work to do. But at first glance, people in the know probably knew the outcome in February, barring the unexpected.)

Rs3

To summarize:

1) French pollsters use calibration rules yielding systematic bias
2) Their results can be used to generate what appears to be an accurate forecast
3) The "data generating process" appeared to be stable in 2007.

October 29, 2007

The validity of digital trends

At ECIG, Jean-François Trinquecoste asked whether digital trends were valid proxies for public opinion. The argument is that the blogger population is probably not a representative cross-section of the public at large. Good point. As I wrote earlier, this question is one of those I want to look at in the near future.

For now:

1) Digital trends are apparently good predictors of end-point behaviors. Swammer, who provided the data for my analysis, has a blurb here showing that their visibility index outperformed most conventional opinion polls on the French presidential race.

2) But as we look at the series over time, it is clear that visibility scores computed by Swammer differ from opinion polls. See for instance this site where Le Monde, a French Daily, charts opinion polls over time. Whereas opinion polls showed a relatively stable public opinion (Royal and Sarkozy being neck and neck, with "maybe" a slight drop in support to Royal over time, depending on which surveys you look at), visibility scores tell a different story.

Data on IFOP polls as provided by Le Monde look like this:

Ifop

Visibility scores as provided by Swammer look like this:

Visibility

Both charts look rather similar, but they tell an entirely different story -- opinion polls suggest that Sarkozy has gathered momentum during the campaign whereas visibility scores suggest just the opposite.

If we look at absolute visibility scores over time, the story takes a new meaning as we watch both candidates becoming more and more visible in the digital sphere, both achieving important absolute gains dwarfing relative differences. 

T

Now, what can we make out of this?

1) Digital trends could converge towards electoral behaviors merely because uncertainty diminishes i.e. trends would be meaningless random walks, but this is unlikely given the signal to noise ratio. (if digital trends were to randomly approach true behavior over time as uncertainty over the outcome melts away, we would observe diminishing variance around the outcome rather than convergence from a distant initial value.)

2) A more involved question is whether digital trends are a cause or an effect. My implicit assumption was that the digital sphere captures public opinion. But it might be argued that the process works the other way around - public opinion is shaped by influencers operating inside the digital sphere.

The digital-creates-opinions interpretation could be supported by two lines of argument. First, it is a natural extension of the dominant view on the relationship that exists between mass media and public opinion. Second it could rely on the widely held view that the digital sphere is largely the work of a social elite (influencers).

The opinion-creates-digital view would argue that blogs are the direct, unmediated expression of public opinion. And even admitting that the digital sphere is the work of a social elite, it could well be that this work obeys the universal law of supply and demand. Bloggers (and mass media) want readers and will therefore write what the market wants to hear.

I wish to raise these questions in light of a suggestion that was made to correlate digital trends with the results of tracking opinion polls. As we can argue (convincingly?) on both sides of the causal relationship, correlations will not be useful indicators as they will not answer the validity question. We'll have to construct structural models to separate the (potential) effect that opinion polls have on blogs and vice-versa. What we have learned so far is that opinion polls and digital trends are not interchangeable measures. How they interact and how they capture actual opinions / intentions / behaviors remain open questions.

Thanks to Jean François for vehemently raising questions, and to Philippe Reynet for finding opinion polls data literally within minutes.

October 22, 2007

ECIG - presentation

I've presented "Digital Trends" at ECIG 2007. Very well received. Slides are available here.

Several interesting questions and comments. One very respected participant expressed concern as to the validity of relying on the digital sphere as a proxy for public opinion.

1) Good point that must be investigated. As I indicated in my talk, my modest goal was to ascertain the apparent reliability of certain time series extracted from the digital sphere using the (non trivial) procedure developed by Swammer. Preliminary results are very encouraging. Next steps will explore validity issues (in simple terms, time series seem to track well. That doesn't mean that we know what they track)

2) Swammer shows here that web series are a good predictor of actual voting behavior at the time of the election.

June 16, 2007

First draft

Here is the first draft of a paper on the modelling of time series extracted from the digital sphere. The basic idea is to try to see if such series have a strong signal to noise ratio, and if they have, what methods can be appropriate to capture the trend signal.

In a nutshell -- yes, the data has a strong signal component. It requires robust parametric models. Non-parametric methods such as SSA (a very powerful methodology that is becoming increasingly popular) do a great job at extracting rich trend data, but can be thrown out of whack if there are runs of extreme values (such as a hit storm).

Next steps are: (1) moving from predictions to forecasts (predictions are made within the range of actual observations -- we try to fit the data as best as we can; forecasts are made outside the range); (2) designing an approach that will blend parametric models (extremely useful to quickly scan loads of series) and non-parametric (extremely useful to draw rich accounts of trend data).

A revised version of the paper is available here. This paper has been presented at ECIG 2007.

(I would like to extend my gratitude to Stephane Muller, Jerome Coutard and Isabelle Dornic who gave me access to excellent data)

May 08, 2007

Another example

(if you come across this post by accident -- this post is about a time series that I am not at liberty to discuss freely. You may want to click on the category link to get a feel for what is happening)

Below are the two main series. There is one very obvious outlier at period 200.

Rs1_4 

I've computed the ratio of one series over the other (R/S). I've also fitted a robust estimate of a non-linear trend of the form y = a + b(T^delta/(T+gamma)). A robust estimator is probably better in this case because there are several outliers (9 obvious instances). There is also a noticeable wave pattern occurring between periods 145-209.

Rs2

The third graph shows the reciprocal series (S/R).

Rs3_2   

April 16, 2007

Examples of random walk

Below, three charts to show what a random walk process looks like. These graphs were generated in Excel by the cumulative addition of a random uniform value in the -1 to +1 interval. The graphs show three successive updates (I did not spend hours to cherry pick unusual charts). Each graph shows the resulting value over 32,000 completely random realizations. Trends and cycles are illusory.

Random1

Random2

Random3 
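For readers who would rather reproduce this outside Excel, here is a minimal Python sketch of the same process: a running sum of independent uniform draws between -1 and +1. The seeds are arbitrary and only there to make the runs repeatable.

```python
import random


def random_walk(n_steps, seed=None):
    """Cumulative sum of independent uniform draws on [-1, 1]."""
    rng = random.Random(seed)
    position = 0.0
    path = []
    for _ in range(n_steps):
        position += rng.uniform(-1.0, 1.0)
        path.append(position)
    return path


# Three independent walks of 32,000 steps each; seeds are arbitrary.
walks = [random_walk(32000, seed=s) for s in (1, 2, 3)]
for w in walks:
    print(f"final={w[-1]:.1f}  min={min(w):.1f}  max={max(w):.1f}")
```

Plotting any of the three paths should show apparent trends and cycles even though every step is pure noise.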

April 12, 2007

fitting a non-linear trend on first example

I've fitted a non-linear trend on the data described in the first example, to give a better idea of what I am doing. I've used a robust estimation procedure and a 4-parameter non-linear trend function (the green line). The good news is that the trend line tracks the data very well and yields a "credible" conclusion (the series is stable, i.e. not trending up or downwards. A shift occurred around period 7 which propelled the trend to a higher plateau by period 22 -- the hit storm is entirely discounted because it lies so far out of the "usual" data range.)

The bad news is that the estimation procedure is sensitive to seed values given to parameters (i.e. the loss function is not well behaved). While the essential conclusion (that the series has reached a plateau at 172 or so) remains unchanged across different solutions, the S that you can see between periods 7-22 can be replaced by a concave curve (i.e. an extremely brief S, too short to be seen) starting at the 157 level.

S15

To summarize -- a robust non-linear trend line appears to be a viable solution to provide a compact analysis of a time series such as the one described above, compared to OLS (linear) and Robust linear estimates.

I'll provide similar charts for a few other prototypical series over the next few days.

(Click on the category - see link below - for an unbroken thread)

April 10, 2007

First example

I posted a couple of days ago, in general terms, giving two examples. I've got some feedback. Just enough to say thanks to Stéphane Hamel and Philippe Reynet, and to realize that I need to flesh out the idea a bit more. So, here I go:

1) My goal is to model non linear trends for a special kind of time series generated from web data. To illustrate, imagine a blog for which I have daily statistics. I assume that readership can be modeled the usual way (i.e. time series are made out of trend, cycles and random errors). I want to test the idea that the trend component is not linear or concave/convex, as is generally assumed by time series models (we take differences until the data is stable before applying procedures such as Box and Jenkins or other). More specifically, I want to explore the feasibility/usefulness of extracting s-shaped trends. One immediate application would be the ability to forecast the plateau that a series approaches. (I'll expand on this in future posts).

2) Let me take you by the hand, step by step, to examine a "simple" example of such time series. Figure 1 below shows the raw data, where we can readily see several outliers occurring at periods 24 to 28. Imagine, for illustrative purposes, that each point represents the number of unique daily visitors to a blog and that a hit storm occurred at periods 24-28 after a post made it to Digg or some other high traffic referrer. I've uploaded the raw data (and procedures to estimate trend lines) here.

Figure 1: raw data

S1_2 

For the moment I will simply try to fit a linear trend line. Visual inspection will lead us to think that the trend line should stay close to the 175-190 range, i.e. slicing through the (seemingly) flat baseline. However, the hit storm will complicate matters.

My question in the first post was: what would you suggest to automate the process of dealing with outliers? Let me explain a bit further.

Figure 2 shows three estimates of the trend line. The first one (in red) shows the regression line that you'll get with an OLS procedure. The slope is negative, which would be interpreted as a sign that the blog draws fewer and fewer visitors as time passes.

Figure 2: Three estimations: OLS (biased), M-reg and OLS with 2 intercepts (OK)

S12

This result is suspect, but in this case, fairly obvious -- the OLS trend is created by the hit storm. Had the storm occurred in period 60 or so, the trend would have been positive. (OLS is slicing through all data points. For those more technically inclined, OLS assumes that errors are randomly distributed, which is NOT the case here, very obviously. Simply put, OLS regression yields an inappropriate signal because the data is affected by a significant anomaly).

There are (at least) two ways in which one can deal with this situation: one can use robust regression (Huber 1981: Robust Statistics, Wiley), a general procedure where observations are weighted in proportion to their fit to the regression line. Or we can model the hit storm explicitly. Robust regression (M-regression) is general because we do not have to pre-specify which data points are outliers. The iterative weighting takes care of that, which permits automated cleaning, so to speak. A specified OLS regression requires that the analyst add a dummy variable to indicate that an observation belongs to an outlying group. Feasible in this case, but impractical in others I will introduce in future posts.
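To illustrate the iterative weighting, here is a minimal sketch of a Huber-type M-regression for a straight line, fitted by iteratively reweighted least squares. It is not the procedure used to produce the figures: the tuning constant (1.345), the scale estimate, the fixed number of iterations and the synthetic data are all assumptions made for the example.

```python
def weighted_line(x, y, w):
    """Weighted least-squares fit of y = intercept + slope * x."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def huber_line(x, y, k=1.345, n_iter=50):
    """Huber M-regression via iteratively reweighted least squares."""
    w = [1.0] * len(x)
    for _ in range(n_iter):
        a, b = weighted_line(x, y, w)
        resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
        # Robust scale: (upper) median absolute residual, rescaled.
        scale = sorted(abs(r) for r in resid)[len(resid) // 2] / 0.6745
        scale = max(scale, 1e-9)
        # Huber weights: 1 inside the threshold, shrinking outside it.
        w = [1.0 if abs(r) <= k * scale else k * scale / abs(r) for r in resid]
    return weighted_line(x, y, w)


# Synthetic series (NOT the post's data): gentle upward baseline with a
# deterministic wiggle, plus a "hit storm" at periods 24-28.
x = list(range(1, 101))
y = [175.0 + 0.05 * t + (2.0 if t % 2 else -2.0) for t in x]
for t in range(24, 29):
    y[t - 1] = 750.0

ols_intercept, ols_slope = weighted_line(x, y, [1.0] * len(x))
rob_intercept, rob_slope = huber_line(x, y)
```

On this synthetic series the plain OLS slope comes out negative because the storm sits early in the sample, while the reweighted fit recovers a small positive slope close to the baseline, without being told which points are outliers.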

The grey (OLS with 2 intercepts + 1 trend parameter) and green (M-regression with one intercept and one trend parameter) lines are (1) practically identical, (2) intuitively better fits than the OLS line as they follow the baseline nicely and (3) positive rather than negative.

Differences between OLS and OLS-2/robust regressions may seem trivial, but let me assure you that they are not. First, the essential conclusion is the opposite (the trend is negative if we rely on OLS, positive if we rely on robust regression). Second, the robust regression highlights the fact that the outliers are not meaningfully related to the bulk of the data (obviously!) which makes us focus on the relevant data range, displayed in figure 3. You may not believe it at first, but figure 3 shows the exact same data as figure 2, although it truncates outlying values.

Figure 3: A closer look at the data

S13

Looking at the data with a proper range shows how inadequate the OLS estimate really was. In figure 3, we display the hit counts in the 150-190 range, ignoring the hit storm (which pushed hits into the 750 range). The red line (OLS trend estimate) does not track the data at all. The M-reg and OLS-2 are still indistinguishably close and track the data "nicely". In other words, using an appropriate technique matters enormously. (arguably more so when I move to non-linear trends, where the impact of random shocks is more severe).

And notice what happens in the vicinity of the storm. What looked like a flat line in figure 1 or 2 shows a typical cycling pattern, with a steep climb prior to the storm, and a trough following the storm.

So, for now, my belief is that a robust non-linear regression model would do the job. My belief is fragile however, as I have not tested it against other artefacts that I know do occur in the time series I've examined. As of today I would bet that robust estimates will do well with random outliers, but am not so sure in the case of steps -- so I ask again: if you are familiar with smart procedures that can spot patterns in time series and that could be used in conjunction with non-linear regression models, let me know.

April 05, 2007

Cleaning up time series

Here is a chart depicting a time series. Imagine the number of hits on a webpage, the volume traded on the stock exchange for a given stock, or a metric of the kind.

My immediate goal is to isolate trend data for further analysis. In order to do this properly, I must take into account the "shock" (say, a hit storm) that occurs at periods 24-28. I can then interpolate or simply ignore the interval.

The funny thing is that I am not aware of tools or techniques that would do this automatically. I am not an expert in the field, but not uneducated either. (familiar with most procedures found in TSP).

The chart to which I refer above is the simplest example I found in a sample of about 10 I just examined. A little more involved case is found here. In this second example, there is evidence that the series moves from a first state (A), stable and low intensity, to a second state (B) trending upwards, perturbed by two impulses (starting at period 49 and 60), where an impulse differs from a shock in that a shock adds a constant to the trend, whereas the impulse shifts data (say, hits that would have occurred on period 54 are moved forward to period 50, as if a promotional event had displaced demand).
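The distinction between a shock and an impulse can be sketched with two toy transformations. The helper names, the flat baseline and the amounts are invented for the example; only the period numbers echo the ones mentioned above.

```python
def add_shock(series, start, end, amount):
    """Add a constant to every period in [start, end] (1-based, inclusive)."""
    out = list(series)
    for t in range(start, end + 1):
        out[t - 1] += amount
    return out


def add_impulse(series, source, target, amount):
    """Move `amount` hits from period `source` to period `target` (1-based)."""
    out = list(series)
    out[source - 1] -= amount
    out[target - 1] += amount
    return out


baseline = [100.0] * 60  # flat, made-up series
shocked = add_shock(baseline, 24, 28, 500.0)
shifted = add_impulse(baseline, source=54, target=50, amount=40.0)
```

The shock raises the series total (extra hits appear), whereas the impulse leaves the total unchanged and only displaces hits in time, which is one reason the two may need different treatment when extracting the trend.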

If anyone out there is aware of "smart" procedures that would highlight events in order to extract trend data, I would appreciate pointers. (just in case -- a regression will not do, because the resulting trend data would have spread disturbances on a prediction line, and the predicted data would become useless anyway precisely because I want to run a special form of non-linear model on these inputs).

If anyone is interested in one way or another, please let me know as well. I'll put relevant posts under the "project: mapping time series" category.

(an example is detailed here. Click here for an index of all posts on this thread)
