Wiki Analytics
Are there algorithmic ways of determining the health of a Wiki?
There are likely a number of different patterns of healthy Wikis and, more importantly, healthy Wiki-based communities. If we can identify and visualize these patterns, we can apply these analytics to:
- Understand the patterns of interactions in a healthy community
- Aid the community to use the Wiki more effectively
- Encourage developers to facilitate these patterns in the tool itself
Philosophy
The end goal is not to come up with some single index indicating health or effectiveness, but to identify patterns. Communities can derive their own meaning from these patterns and act appropriately.
You can make your metrics arbitrarily intricate, and at some point, that may be valuable. As a starting point, try to identify simple metrics: ones whose data acquisition and computation requirements are simple. The simpler these requirements, the more likely people are to examine the data.
Analytics
Page Names
Hypothesis: Good page names are one indicator of healthy Wikis. The better the names, the more likely people will link to those pages, both intentionally and accidentally.
How do you measure the "goodness" of a page name?
- Number of characters
- Number of words/tokens
- Number of non-alphanumeric characters
The hypothesis for all of the above is that smaller is better.
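The three measures above are cheap to compute from a plain list of page names. Here is a minimal sketch, assuming names are tokenized on whitespace; the function names `page_name_metrics` and `summarize` are made up for illustration, not part of any Wiki API.

```python
def page_name_metrics(name: str) -> dict:
    """Return the three simple measures listed above for one page name."""
    stripped = name.strip()
    return {
        "chars": len(stripped),
        # Assumption: tokens are whitespace-separated words.
        "tokens": len(stripped.split()),
        # Characters that are neither alphanumeric nor whitespace.
        "non_alnum": sum(
            1 for c in stripped if not c.isalnum() and not c.isspace()
        ),
    }


def summarize(names):
    """Average each metric over a list of page names."""
    rows = [page_name_metrics(n) for n in names]
    if not rows:
        return {}
    return {key: sum(r[key] for r in rows) / len(rows) for key in rows[0]}
```

Comparing the averages from `summarize` across workspaces would be one way to test the "smaller is better" hypothesis against other signs of health.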
Other potential analytics:
- Variation in normalized link names. Some Wikis normalize page names. For example, "Matt Liggett" and "matt liggett" might point to the same page. Other forms of normalization include treating non-alphanumeric characters as white space. Studying the variation in the text actually used to link to pages would demonstrate the effectiveness of the normalization algorithms.
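To study this variation, one could group the raw link text by the normalized name it resolves to. The sketch below assumes one possible normalization rule (lowercase, with runs of non-alphanumeric characters treated as a single space); real Wikis differ, so `normalize` is a stand-in, not any particular Wiki's algorithm.

```python
import re
from collections import defaultdict


def normalize(link_text: str) -> str:
    """Hypothetical rule: lowercase, non-alphanumerics become spaces.

    Assumption: only ASCII letters and digits count as alphanumeric.
    """
    lowered = link_text.lower()
    spaced = re.sub(r"[^a-z0-9]+", " ", lowered)
    return spaced.strip()


def link_text_variants(link_texts):
    """Group the raw link texts by the normalized name they resolve to."""
    variants = defaultdict(set)
    for text in link_texts:
        variants[normalize(text)].add(text)
    return dict(variants)
```

Pages with many distinct variants are the ones where normalization is doing real work; pages with one variant would have been linked correctly without it.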
Link (Graph) Analysis
- Number of Islands/Orphans. If no pages are linked to anything else, then every page is an island of one, and you are probably not using the Wiki in a useful way. Islands consisting of several pages ("components" in graph theory) indicate some level of interconnectedness.
- Number of Blocks/Peninsulas. Blocks are pages connected to only one other page. If you break that link, the page becomes an orphan, or island.
- Level/pattern of interconnectedness of clusters.
- Diameter (the longest of the shortest paths between any two pages in the graph). A large diameter might be an unhealthy indicator. Diameter can be computed in polynomial time by running a breadth-first search from every page, although this gets expensive on very large Wikis. (Finding the longest simple path, by contrast, is NP-hard.)
- Number of links to and from a page. If you graph pages (x-axis) against links (y-axis), ordered from largest to smallest, you may be able to derive interesting usage patterns. For example, the second derivative of the curve might indicate the linking behavior variance in a community.
- How often are external links used to link inside a Wiki?
The hypothesis for an island is that fewer large islands are better than many small islands. One way to verify this would be to cross-relate this data with page name analysis (see above). In other words, do larger islands have better page names?
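Several of the graph measures above can be sketched with nothing more than breadth-first search. This sketch assumes links are treated as undirected and self-links are ignored, which is one reading of "connected", not the only one. Diameter is taken here as the longest shortest path within an island. All function names are illustrative.

```python
from collections import deque


def build_undirected(pages, links):
    """Adjacency sets over all pages, ignoring direction and self-links."""
    adj = {p: set() for p in pages}
    for src, dst in links:
        if src == dst:
            continue
        adj.setdefault(src, set()).add(dst)
        adj.setdefault(dst, set()).add(src)
    return adj


def islands(adj):
    """Return the connected components as a list of sets of pages."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        component, queue = {start}, deque([start])
        while queue:
            page = queue.popleft()
            for neighbor in adj[page]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components


def peninsulas(adj):
    """Pages connected to exactly one other page."""
    return {p for p, nbrs in adj.items() if len(nbrs) == 1}


def diameter(adj, component):
    """Longest shortest path within one island, via BFS from every page."""
    longest = 0
    for start in component:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            page = queue.popleft()
            for neighbor in adj[page]:
                if neighbor not in dist:
                    dist[neighbor] = dist[page] + 1
                    queue.append(neighbor)
        longest = max(longest, max(dist.values()))
    return longest
```

Orphans are simply the islands of size one. The all-pairs search in `diameter` is quadratic in the size of an island, which is fine for a workspace of a few thousand pages but may need sampling on much larger Wikis.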
What constitutes a "link"?
- Forward link
- Backlink
- Internal link (LinkAsYouThink)
- Links to non-existent pages (incipient links)
- External link
- Transclusion
- Tags
Page Content Analysis
- Page size
- Number of sections
- Word Cloud
- Fernanda Viegas and Martin Wattenberg (of History Flow fame) did this analysis on Wikipedia and presented it at Wikimania 2006. Unfortunately, they haven't published it yet because of concerns about privacy.
- Show evolution over time using slider. See U.S. Presidential Speech Clouds for an example.
As with page name analysis, you could also cross-relate this data with the graph analytics.
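Page size and the word counts behind a word cloud are both simple to compute once you have the page text. This is a rough sketch, assuming size is measured in characters and words are runs of ASCII letters, digits, and apostrophes; it does no stop-word filtering, and it ignores Wiki markup.

```python
import re
from collections import Counter


def page_size(text: str) -> int:
    """Page size in characters (an assumption; bytes would also work)."""
    return len(text)


def word_frequencies(text: str, top: int = 20):
    """Most common words in a page, as (word, count) pairs."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(words).most_common(top)
```

Running `word_frequencies` on successive revisions of a page would give the raw data for the kind of over-time view mentioned above.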
Time Analysis
Some of the most interesting analysis will come from adding a time axis. This will allow us to understand how content evolves -- how it is refactored (or not), how conflict is resolved, and in general, what the patterns of interaction look like. The best work to date on this is IBM's history flow.
Other things to study:
- Stubs evolving into fleshed out pages
Tags
As discussed in the section on graph analytics, one way to analyze tags is to treat them as links. Another way to study them is to treat them as page names.
Tag-specific analysis:
- Emergent namespacing of tags
Usage Patterns
- How does the ratio of edits to accesses reflect the health of a Wiki?
- How often are orphan pages accessed/edited?
- Most wanted pages.
- Number of editors
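Of the usage patterns above, "most wanted pages" can be derived from link data alone, provided the link list includes links to pages that do not exist yet. A minimal sketch, assuming "wanted" means "linked to but not yet created"; `most_wanted` is a hypothetical helper name.

```python
from collections import Counter


def most_wanted(existing_pages, links, top=10):
    """Rank non-existent link targets by how many links point at them."""
    existing = set(existing_pages)
    wanted = Counter(dst for _src, dst in links if dst not in existing)
    return wanted.most_common(top)
```

The edit and access questions need revision history and server logs, which are harder to come by.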
Datasets
The easiest kinds of data to get from Wikis are content-related: page names, page content, etc. Link data is somewhat harder and is dependent on the Wiki implementation. Access data is the hardest; other than revision history, it is generally not available from the software itself.
The emergence of various Wiki APIs is making it much easier to get data. Convergence among these APIs would make it even easier. One important outcome of this analytics discussion is that it generates new use cases for the types of data one might want to extract via an API.
Wikithon (February 2007)
Eugene Eric Kim and Matthew O'Connor worked on the Wiki Analytics problem at the February 7, 2007 Wikithon held at the Socialtext offices in Palo Alto, California. In addition to the philosophy outlined above, our goal was to have something complete and presentable at the end of the day. We chose to do page name analysis and graph analysis. At the end of the day, we won the coveted prize for best use of the REST API (along with Most wanted pages). (The two winning teams were also the only two valid contestants, but we won't mention that.)
Our process:
- Avoid parsing pages for now
- Decouple data acquisition/data munging
- Get datasets
- intermediate formats
- what data will we get?
- write the number crunchers
- formalize ideas before implementing
- visualizations
Our results:
Datasets
We decided to constrain our datasets to Socialtext content. We studied the 266 public Socialtext Wikis and corp, Socialtext's internal corporate Wiki. We were not able to get a list of public workspaces from the API, so Matthew extracted it from the internal database (the advantage of pairing with a Socialtext employee). All other information came directly from the API.
Socialtext's implementation of its API server limited the types of data we were able to retrieve. Inclusions and link WAFL are not considered links by the API server. We were not able to extract incipient links due to a bug in the API server.
To save time, we decided not to do analysis that would require parsing page content. This meant not including external links in our study.
We generated two datasets for each workspace:
- List of all existing pages
- List of all forward, internal links. We could easily derive backlinks from this data.
The list of all existing pages was simply plain text, with one page name per line. The link list consisted of a source and a destination page, tab-delimited on each line. If we had been able to extract more data, we could have included a third column indicating link type (e.g. internal, external, inclusion, etc.).
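Reading these two formats and deriving backlinks takes only a few lines. This is a sketch of one way to do it, not the code we wrote at the Wikithon; the function names are illustrative, and it assumes the two-column link format described above.

```python
from collections import defaultdict


def parse_pages(text: str):
    """One page name per line; blank lines are skipped."""
    return [line for line in text.splitlines() if line.strip()]


def parse_links(text: str):
    """One 'source<TAB>destination' pair per line."""
    links = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first tab only, since page names may contain spaces.
        source, destination = line.split("\t", 1)
        links.append((source, destination))
    return links


def backlinks(links):
    """Invert forward links: destination -> set of pages linking to it."""
    incoming = defaultdict(set)
    for source, destination in links:
        incoming[destination].add(source)
    return dict(incoming)
```

If a third link-type column were added, `parse_links` would need to split on every tab and carry the extra field along.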

Presenting Wiki Analytics demo at the Wikithon