SEOmoz.org provide some great resources on search engine optimisation (“SEO”). Recently, they performed a really interesting analysis comparing actual site traffic for 25 sites that volunteered their data against a range of competitive intelligence metrics from sources such as Google PageRank, Technorati Rank, Alexa Rank and SEOmoz.org’s very own Page Strength Tool. The stated goal of the project is described in this quote from their page:
This project’s primary objective is to determine the relative levels of accuracy for external metrics (from sites like Technorati, Alexa, Compete, etc.) in comparison to actual visitor traffic data provided by analytics programs. 25 unique sites, all in the search & website marketing niche, generously contributed data to this project. Through the statistics provided, we can also get a closer look at how the blog ecosphere in the search marketing space receives and sends traffic
Now, I’m not yet an expert on SEO, but I do know a few things about data analysis. Whereas their results indicate that none of the measures are particularly useful, I have three points to add:
1 Significance of correlation coefficients
A correlation coefficient does not need to be 0.9 or 0.95 to be significant, contrary to what the original analysis states:
Technorati links is actually an almost usable option at this point, though any scientific analysis would tell you that correlations below 90-95% shouldn’t be used.
Roughly speaking, a correlation coefficient of about 0.7 (70%) means the metric explains approximately half the variability in the observed variable (actual page visits), because the proportion of variance explained is the square of the correlation coefficient (0.7 squared is 0.49). Whether or not this is “significant” depends on the amount of data used to measure the correlation. There are some very specific tests for measures of significance for correlation coefficients – I have summarised the results of one of the standard tests here:
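For readers who want to check this kind of significance themselves, here is a minimal sketch of one standard test: the t-test for a Pearson correlation against the hypothesis of zero correlation. I am not claiming it reproduces the summary above; the function names are my own, and the critical value 2.069 is the two-tailed 5% value of Student's t with 23 degrees of freedom, taken from standard tables.

```python
from math import sqrt


def t_statistic(r, n):
    """t statistic for testing 'true correlation is zero', with n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1 - r * r))


def critical_r(t_crit, n):
    """Smallest |r| that counts as significant, given the critical t value for n - 2 df."""
    df = n - 2
    return t_crit / sqrt(t_crit * t_crit + df)


N_SITES = 25          # number of sites in the SEOmoz data set
T_CRIT_5PCT = 2.069   # two-tailed 5% critical t for 23 df (from standard tables)

print("critical |r| at 5%:", round(critical_r(T_CRIT_5PCT, N_SITES), 3))
print("t statistic for r = 0.7:", round(t_statistic(0.7, N_SITES), 2))
print("variance explained by r = 0.7:", round(0.7 ** 2, 2))
```

With 25 sites the threshold works out to roughly 0.40, so a correlation far below 0.9 can still be statistically significant under this test.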
Beyond the technical statistical tests though, I would imagine that there is a great deal of value in estimating a large part of the practical popularity of a website (and presumably page visits is a sensible measure of this) through freely available “competitive intelligence metrics”. On the other hand, if you are looking for a near-exact replica of actual visits, then a much higher correlation coefficient is required.
2 Extending analysis to multiple regression rather than single correlations
OK, this does take the analysis beyond the original stated goal, but it is interesting to see how good a model of actual site popularity we can develop based on freely available “competitive intelligence metrics”. But first, it is useful to consider the correlation matrix between all variables (the “dependent variable” and all independent variables). In an ideal regression model, the independent variables will be uncorrelated with each other. On the other hand, if these metrics are any good, we would expect them to be strongly correlated with each other.
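A correlation matrix of this kind is easy to build outside Excel too. The sketch below is only an illustration: the column names and numbers are made up and are not the SEOmoz data, and `pearson` and `correlation_matrix` are names I have chosen for this example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def correlation_matrix(columns):
    """Nested dict of pairwise correlations between all named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up numbers purely for illustration; NOT the SEOmoz data.
toy = {
    "visits": [120, 340, 560, 200, 800],
    "metric_a": [10, 30, 55, 18, 82],
    "metric_b": [5, 4, 3, 5, 1],
}

matrix = correlation_matrix(toy)
for name, row in matrix.items():
    print(name, [round(value, 2) for value in row.values()])
```

The first row (or column) shows how each metric relates to the dependent variable; the remaining entries show how the metrics relate to each other, which is where multicollinearity would show up.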
As can be seen from the table above, there are several strong correlations between the independent variables. This can lead to problems with “multicollinearity” for multiple regression techniques, but since I am trying to keep this post non-technical, I’ll leave that alone for now. It is also interesting that while all the large (loosely defined here as greater than 70% or less than -70%) correlations are positive, there are many negative correlations as well. Thus, some measures appear to be using different information or approaches to provide the metrics. Most interesting to me is that TR Rank and TR Link have a correlation coefficient of -50%. This will be a hint to our multiple regression results…
I decided to use only very basic tools for the analysis so interested readers can perform the same analysis on their own with only MS Excel (generally a fairly weak statistics platform even with the Data Analysis add-in activated). My aim was to find a model that explained more of the Average Visits than Technorati Links by combining several variables together. I had to exclude Compete Rank and Ranking Rank due to the limitations of Excel’s regression tools. I judged a model to be “good” if it had a high adjusted R-squared, and significant and sensible estimates for the individual variables as well. The results of a “good” model (although not necessarily the best since I did fairly quick and dirty model selection) are given below:
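For readers who would rather use a script than Excel, the following is a rough sketch of an ordinary least-squares fit via the normal equations, in plain Python. It is not the procedure Excel uses internally and it reports only the coefficients, not p-values; `fit_ols` and the toy data are hypothetical and chosen so that the correct answer is known in advance.

```python
def fit_ols(rows, y):
    """Least-squares coefficients [intercept, b1, ..., bk] via the normal equations."""
    X = [[1.0] + list(row) for row in rows]  # prepend a column of ones for the intercept
    p = len(X[0])
    # Build the augmented system [X'X | X'y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(p)]
        + [sum(x[i] * yi for x, yi in zip(X, y))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("predictors are perfectly collinear")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


# Toy data generated exactly from y = 1 + 2*x1 - 3*x2 (illustration only).
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6)]
y = [1 + 2 * x1 - 3 * x2 for x1, x2 in rows]

coefficients = fit_ols(rows, y)
print([round(c, 6) for c in coefficients])
```

The normal equations become numerically unstable when predictors are strongly correlated, so for real data a statistics package is the safer choice; the sketch is only meant to show what a multiple regression fit computes.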
The model has a “Multiple R” (which is intuitively analogous to the normal Pearson correlation coefficient) of 89%, and the model explains 80% of the variability in Average Visits. Other measures of goodness of fit include a high Adjusted R-squared (relative to other models fitted) of 71%, an F-statistic for overall model significance of 9.5, which gives a significance level or p-value of 0.00008, and low p-values for most independent variables included in the model. The intercept itself is not significant, but we leave it in to improve the overall fit of the model. Similarly, while the significance level for Alexa Page Views is relatively high at 17%, it does add to the overall model in terms of fitting the data well.
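These goodness-of-fit figures are linked by standard formulas, sketched below. One loud caveat: the number of independent variables k is not stated in this post, so k = 7 in the example is purely an assumption for illustration, as are the function names.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


def f_statistic(r2, n, k):
    """Overall F statistic of the regression, with (k, n - k - 1) degrees of freedom."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))


multiple_r = 0.89        # "Multiple R" reported above
r2 = multiple_r ** 2     # R-squared is the square of Multiple R
n = 25                   # number of sites
k = 7                    # ASSUMPTION: number of predictors, not stated in the post

print("R-squared:", round(r2, 3))
print("Adjusted R-squared:", round(adjusted_r2(r2, n, k), 3))
print("F statistic:", round(f_statistic(r2, n, k), 2))
```

Under that assumed k the formulas land close to the rounded figures reported above, but this is only a consistency check, not a reconstruction of the actual model.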
Again, very interestingly but not surprisingly by now, many of the coefficients are negative. This implies that, at least once we adjust for the other variables, these measures are associated with lower rather than higher Average Visits. This suggests that more analysis and more data are needed to understand the dynamics here properly!
3 Quality and quantity of data
This leads me to my final comment. While it is great to have data for even 25 websites, that is nowhere near enough to analyse this problem. This isn’t because 25 sites is small in relation to the total number of websites on the ‘net, but rather because of the spread of sites across the different types of websites and the potential to fit the model too closely to the exact data provided rather than to some underlying reality. Again, this is a difficult area to discuss correctly and thoroughly without becoming very technical, so I’ll leave that well alone too.
This analysis and presentation of results is very light for something this interesting. There is an enormous amount more that could be done with time, energy, more data, and, for my part, a better understanding of how each of these competitive intelligence metrics is intended to work. I’d welcome any comments on what analysis would be desired (time-series? Non-linear models? More detailed regression? Rank correlation?) and whether there is any chance of getting more data. I’d be very happy to dig deeper and post the results here and/or directly on SEOmoz.org.