Additional Analysis of SEOmoz web popularity data

SEOmoz.org provide some great resources on search engine optimisation (“SEO”). Recently, they performed a really interesting analysis comparing actual site traffic for 25 sites that volunteered their data against a range of competitive intelligence metrics from sources such as Google PageRank, Technorati Rank, Alexa Rank and SEOmoz.org’s very own Page Strength Tool. The stated goal of the project is described in this quote from their page:

This project’s primary objective is to determine the relative levels of accuracy for external metrics (from sites like Technorati, Alexa, Compete, etc.) in comparison to actual visitor traffic data provided by analytics programs. 25 unique sites, all in the search & website marketing niche, generously contributed data to this project. Through the statistics provided, we can also get a closer look at how the blog ecosphere in the search marketing space receives and sends traffic

You can find the commentary on their updated analysis and also the original article (updated too, I understand).

Now, I’m not yet an expert on SEO, but I do know a few things about data analysis. Whereas their results indicate that none of the measures are particularly useful, I have three points to add:

1 Significance of correlation coefficients

A correlation coefficient does not need to be 0.9 or 0.95 to be significant, contrary to what the article says:

Technorati links is actually an almost usable option at this point, though any scientific analysis would tell you that correlations below 90-95% shouldn’t be used.

Roughly speaking, a correlation coefficient of about 0.7 (70%) or more means the metric explains approximately half or more of the variability in the observed variable (actual page visits), because the share of variability explained is the square of the coefficient (0.7 squared is 0.49). Whether or not this is “significant” depends on the amount of data used to measure the correlation. There are some very specific tests of significance for correlation coefficients – I have summarised the results of one of the standard tests here:

SEOmoz data Correlation Significance Table
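For readers who want to see how such a test works, here is a minimal sketch in Python. It assumes the common t-test for a Pearson correlation, which may not be the exact test behind the table above. The critical value 2.069 is the usual two-tailed 5% t-table entry for 23 degrees of freedom (25 sites minus 2), and the function names are my own.

```python
import math

# Two-tailed 5% critical value of Student's t with 23 degrees of freedom,
# taken from a standard t-table (n = 25 sites, so df = n - 2 = 23).
T_CRIT_DF23 = 2.069


def t_statistic(r, n):
    """t statistic for testing whether a Pearson correlation r differs from zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


def min_significant_r(t_crit, n):
    """Smallest |r| that reaches the critical t value for a sample of size n."""
    df = n - 2
    return t_crit / math.sqrt(df + t_crit ** 2)


n = 25
for r in (0.3, 0.5, 0.7, 0.9):
    t = t_statistic(r, n)
    verdict = "yes" if abs(t) > T_CRIT_DF23 else "no"
    print(f"r = {r:.1f}  r^2 = {r * r:.2f}  t = {t:.2f}  significant at 5%: {verdict}")

print(f"smallest significant |r| for n = {n}: {min_significant_r(T_CRIT_DF23, n):.3f}")
```

Under these assumptions, a correlation well below 0.9 can still be statistically significant with 25 sites. How much of the variability it explains is a separate question, and that is what the r^2 column shows.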

Beyond the technical statistical tests though, I would imagine that there is a great deal of value in estimating a large part of the practical popularity of a website (and presumably page visits is a sensible measure of this) through freely available “competitive intelligence metrics”. On the other hand, if you are looking for a near-exact replica of actual visits, then a much higher correlation coefficient is required.

2 Extending analysis to multiple regression rather than single correlations

OK, this does take the analysis beyond the original stated goal, but it is interesting to see how good a model of actual site popularity we can develop based on freely available “competitive intelligence metrics”. But first, it is useful to consider the correlation matrix between all variables (the “dependent variable” and all independent variables). In an ideal regression model, the independent variables will be uncorrelated with each other. On the other hand, if these metrics are any good, we would expect them to be strongly correlated with each other.
SEOmoz data Correlation Matrix
As can be seen from the table above, there are several strong correlations between the independent variables. This can lead to problems with “multicollinearity” for multiple regression techniques, but since I am trying to keep this post non-technical, I’ll leave that alone for now. It is also interesting that while all the large (loosely defined here as greater than 70% or less than -70%) correlations are positive, there are many negative correlations as well. Thus, some measures appear to be using different information or approaches to produce their metrics. Most interesting to me is that TR Rank and TR Link have a correlation coefficient of -50%. This will be a hint to our multiple regression results…
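A correlation matrix like this is easy to build without Excel. The sketch below uses only the Python standard library. The numbers and the column names (`avg_visits`, `links`, `rank`) are made up purely for illustration and are not the SEOmoz data.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(columns):
    """Nested dict of pairwise correlations for a dict of named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical data for six imaginary sites (NOT the SEOmoz figures).
data = {
    "avg_visits": [1200, 300, 4500, 800, 2500, 150],
    "links": [400, 90, 1500, 310, 700, 60],
    "rank": [5000, 40000, 800, 9000, 2000, 90000],  # lower rank = more popular
}

matrix = correlation_matrix(data)
names = list(data)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        r = matrix[a][b]
        flag = "  <- large" if abs(r) > 0.7 else ""
        print(f"{a:>10} vs {b:<10} r = {r:+.2f}{flag}")
```

One thing the made-up `rank` column illustrates: a metric where a lower number means a more popular site will tend to correlate negatively with count-style metrics, so a negative sign on its own is not necessarily a sign of disagreement.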
I decided to use only very basic tools for the analysis so interested readers can perform the same analysis on their own with only MS Excel (generally a fairly weak statistics platform even with the Data Analysis add-in activated). My aim was to find a model that explained more of the Average Visits than Technorati Links alone by combining several variables together. I had to exclude Compete Rank and Ranking Rank due to the limitations of Excel’s regression tools. I judged a model to be “good” if it had a high adjusted R-squared and significant, sensible estimates for the individual variables. The results of a “good” model (although not necessarily the best since I did fairly quick and dirty model selection) are given below:

SEOmoz data Multiple Regression Results

SEOmoz data Multiple Regression Results Summary
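For anyone who would rather script this kind of fit than run it in Excel, here is a rough sketch of ordinary least squares in plain Python, solving the normal equations by Gaussian elimination. This is not the model reported above: the data are invented so that the true coefficients are known (2, 3 and -1), and the function names are my own.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_ols(rows, y):
    """Least-squares coefficients [intercept, b1, b2, ...] from the normal equations."""
    X = [[1.0] + list(r) for r in rows]  # prepend a column of ones for the intercept
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)


def r_squared(rows, y, coef):
    """Share of the variability in y explained by the fitted model."""
    pred = [coef[0] + sum(c * v for c, v in zip(coef[1:], r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, pred))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1 - ss_res / ss_tot


# Invented predictors; y is built exactly as 2 + 3*x1 - 1*x2 so the fit can be checked.
rows = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8), (6, 4)]
y = [2 + 3 * x1 - 1 * x2 for x1, x2 in rows]

coef = fit_ols(rows, y)
print("coefficients:", [round(c, 6) for c in coef])
print("R-squared:", round(r_squared(rows, y, coef), 6))
```

Solving the normal equations directly is fine for a toy example, but it becomes numerically fragile when the predictors are strongly correlated with each other, so a proper statistics package is the safer choice for real data.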

The model has a “Multiple R” (which is intuitively analogous to the normal Pearson correlation coefficient) of 89%, and the model explains 80% of the variability in Average Visits. Other measures of goodness of fit include a high Adjusted R-squared (relative to other models fitted) of 71%, an F-statistic for overall model significance of 9.5, which gives a significance level or p-value of 0.00008, and low p-values for most independent variables included in the model. The intercept itself is not significant, but we leave it in to improve the overall fit of the model. Similarly, while the significance level for Alexa Page Views is relatively high at 17%, it does add to the overall model in terms of fitting the data well.
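The relationship between these goodness-of-fit numbers can be written down directly. The sketch below uses the textbook formulas for a regression with an intercept, n observations and k independent variables. The text above does not state how many variables ended up in the model, so the values of k tried here are only illustrative guesses.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R-squared: penalises R-squared for the number of predictors k."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


def f_statistic(r2, n, k):
    """F-statistic for overall significance of a model with k predictors."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))


multiple_r = 0.89        # "Multiple R" quoted above
r2 = multiple_r ** 2     # R-squared is the square of Multiple R
n = 25                   # number of sites in the data set

# k is NOT given in the text; these values are assumptions for illustration only.
for k in (5, 6, 7, 8):
    print(f"k = {k}: adjusted R^2 = {adjusted_r_squared(r2, n, k):.3f}, "
          f"F = {f_statistic(r2, n, k):.2f}")
```

With only 25 observations, every extra variable costs a noticeable amount of adjusted R-squared, which is one reason to be cautious about adding metrics just because they nudge the raw fit upwards.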

SEOmoz data Multiple Regression fitted model

Again, very interestingly, though not surprisingly by now, many of the coefficients are negative. This implies that, at least after adjusting for the other variables, these measures are associated with lower rather than higher Average Visits. This suggests more analysis and more data are needed to understand the dynamics here properly!

3 Quality and quantity of data
This leads me to my final comment. While it is great to have even this much data, 25 websites is not really anywhere near enough to analyse this problem. This isn’t because 25 sites is small in relation to the total number of websites on the ‘net, but rather because of the spread of sites across the different types of websites and the potential to fit the model too closely to the exact data provided rather than to some underlying reality. Again, this is a difficult area to discuss correctly and thoroughly without becoming very technical so I’ll leave that well alone too.

Final comments

This analysis and presentation of results is very light for something this interesting. There is an enormous amount more that could be done with time, energy, more data, and, for my part, a better understanding of how each of these competitive intelligence metrics is intended to work. I’d welcome any comments on what analysis would be desired (time-series? Non-linear models? More detailed regression? Rank correlation?) and whether there is any chance of getting more data. I’d be very happy to dig deeper and post the results here and/or directly on SEOmoz.org.

2 thoughts on “Additional Analysis of SEOmoz web popularity data”

  1. Hi David,

    Thanks for the post and the detailed results. I’ve been through most of this at uni, but it’s great to see it applied in practice. This analysis is all well and good and interesting, but surely you don’t expect people to collect all the data from the various free sources of web ratings to plug into a model to estimate their web traffic? It seems like a lot of hard work when they could just use Google Analytics (http://www.google.com/analytics) to get accurate traffic data.

  2. Thanks Simon.

    You make an excellent point that probably wasn’t very clear in the post – the analysis wasn’t intended to replace traffic analysis tools or logs for your own website. That really would be overengineering the solution completely.

    However, to get a feel for which of the providers of web analytics were saying something useful, and which were saying something useful that was different from all the others, you need to apply a multivariate regression approach. In a multivariate approach, each variable only contributes to the total regression what it can add in terms of new information. That’s quite an interesting result in itself.

    But where the tool may have some practical value is in understanding a competing or complementary site. I hadn’t thought about this while I was doing the work, but it can provide a better, bigger, more transparent window into the success of another site than just looking at a single metric. Of course, another excellent source for this is the SEOmoz pagestrength tool.
