The DNA of an Internet Forum

In my job I constantly find myself confronted with huge volumes of data that I need to quickly familiarize myself with. Occasionally I come across internet forums. Quickly getting an idea of the content of such a forum can be challenging. There may be tens of thousands of messages spanning hundreds of subjects. Sometimes the messages are in Russian. No, in Russian slang. No, in Russian slang in Cyrillic script mixed with some Cyrillic that is actually phonetic English with an odd accent. You get the picture? I have seen seasoned, native-speaking translators staring at the messages in utter confusion.

Besides the subject there are other questions to be answered. Who are the members of the forum? Who is discussing what subject? Are there any key members that often jointly discuss certain things? A reading guide of some sort would be nice.

Circos

Recently I came across a visualization tool for genome sequences called Circos. This tool is used by bioinformatics scientists to visualize genomic data and compare particular sequences. It can show vast quantities of data in a way that eases visual discovery of patterns and relations. Technically, it is similar to GraphViz, which is another favourite tool of mine. Just generate a set of simple text files containing the data that you wish to show. Then run the tool, and it outputs a nice SVG graphic for you.

Seeing this tool I started thinking about visualizing the DNA of an internet forum. After toying with it for a day and a half the following picture appeared on my screen:

No, this is not a Hubble Space Telescope image of a distant galaxy. This is what the structure of a fairly large internet forum can look like.

Structure

But what are we seeing here exactly? Let us zoom in a bit more to see its structure:

The image is constructed around a circle that has been divided into two halves. The left half shows the names of some forum members; the right half shows the titles of sub-forums that are each dedicated to a particular subject. In case you were wondering: indeed, these are not real names and titles. I used some dummy data here.

The ticks on both halves count messages. Each forum message is shown twice: once on the left side, grouped by the member who posted it, and once more on the right, grouped by the sub-forum in which it appears. Within a single group representing a forum member, the messages are sorted by the time at which they were posted. Within a sub-forum, the messages are ordered by the threads in which they appear.
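The layout described above can be sketched in a few lines of Python. This is only a rough illustration, not the script I actually used: the Message fields, the dummy data, the "m_" and "s_" segment prefixes and the grey colour are all made up for the example, and ordering by time inside a thread is an extra assumption. The output lines follow what I understand to be the Circos karyotype format (chr - ID LABEL START END COLOR).

```python
from collections import Counter, namedtuple

# Hypothetical record layout; a real forum dump will look different.
Message = namedtuple("Message", "id member subforum thread posted_at")

# Dummy data, invented for illustration only.
MESSAGES = [
    Message(1, "alice", "hardware", "t1", 100),
    Message(2, "bob", "hardware", "t1", 110),
    Message(3, "alice", "software", "t2", 120),
    Message(4, "bob", "hardware", "t3", 130),
    Message(5, "alice", "hardware", "t1", 140),
]


def layout(messages):
    """Give every message one slot on its member segment and one on its sub-forum segment."""
    member_pos, subforum_pos = {}, {}
    next_member_slot, next_subforum_slot = Counter(), Counter()

    # Left half: within a member, messages are ordered by posting time.
    for m in sorted(messages, key=lambda m: m.posted_at):
        member_pos[m.id] = (m.member, next_member_slot[m.member])
        next_member_slot[m.member] += 1

    # Right half: within a sub-forum, messages are ordered by thread
    # (and, as an assumption, by time inside each thread).
    for m in sorted(messages, key=lambda m: (m.thread, m.posted_at)):
        subforum_pos[m.id] = (m.subforum, next_subforum_slot[m.subforum])
        next_subforum_slot[m.subforum] += 1

    return member_pos, subforum_pos


def karyotype_lines(messages):
    """One segment per member and per sub-forum, sized by its message count."""
    member_sizes = Counter(m.member for m in messages)
    subforum_sizes = Counter(m.subforum for m in messages)
    lines = []
    for prefix, sizes in (("m", member_sizes), ("s", subforum_sizes)):
        for name in sorted(sizes):
            lines.append(f"chr - {prefix}_{name} {name} 0 {sizes[name]} grey")
    return lines


if __name__ == "__main__":
    print("\n".join(karyotype_lines(MESSAGES)))
```

Each message ends up with two coordinates, one per half of the circle, which is all that is needed to draw it twice.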

The Magic

The magic happens when we use colored lines to connect the same message on both sides. This yields colorful "beams" that show the focus of a particular member on specific subjects. We can also see when focus shifts over time. A detail of these beams, slightly rotated, is shown below:
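For the curious, generating such links could look roughly like the sketch below. It assumes every message already has a slot on a member segment and a slot on a sub-forum segment; the segment ids, slot numbers and colours are hypothetical, and colouring each line by member is my assumption for the example. Each output line follows what I understand to be the Circos link format, with both ends on one line followed by options.

```python
# Hypothetical slots: message id -> (segment id, slot on that segment).
MEMBER_POS = {1: ("m_alice", 0), 2: ("m_bob", 0), 3: ("m_alice", 1)}
SUBFORUM_POS = {1: ("s_hardware", 0), 2: ("s_hardware", 1), 3: ("s_software", 0)}

# Hypothetical colour per member segment.
MEMBER_COLOR = {"m_alice": "red", "m_bob": "blue"}


def link_lines(member_pos, subforum_pos, member_color):
    """Connect the two appearances of every message with one coloured link."""
    lines = []
    for msg_id in sorted(member_pos):
        m_seg, m_slot = member_pos[msg_id]
        s_seg, s_slot = subforum_pos[msg_id]
        # Each message occupies one unit on both segments.
        lines.append(
            f"{m_seg} {m_slot} {m_slot + 1} "
            f"{s_seg} {s_slot} {s_slot + 1} "
            f"color={member_color[m_seg]}"
        )
    return lines


if __name__ == "__main__":
    print("\n".join(link_lines(MEMBER_POS, SUBFORUM_POS, MEMBER_COLOR)))
```

Because neighbouring slots on a member segment are neighbours in time, links that start next to each other and end in the same sub-forum bundle into the beams seen in the picture.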

I will immediately admit that I show this detail just because it's so pretty. Up close the line patterns look like the rings of Saturn. Now imagine what happens when animating this image by using a sliding time window to filter the data set. Unfortunately I managed to resist the urge to actually generate an animation. Just use your imagination!

Tile Track

Circos features various types of data tracks that can be added as rings on either the inner or outer side of the circle. I used a so-called tile track to add some more detail to the visualization. You can see it below in between the ticks track and the label track:

The tile track shows colored tiles stacked on top of one another. The colors match the colors of the members on the left half. Each tile covers a single thread and shows which of the members have posted in that thread. When multiple members have posted in the same thread, the tiles stack up. This shows which of the members frequently "meet" in a thread and which threads contain discussions among the displayed members. Depending on how you choose which members to include in the graphic, these may be interesting threads from which to start exploring the forum.
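A minimal sketch of how the tile data could be produced is given below. The Post fields, the dummy data, the "s_" prefix and the colours are made up for the example. The idea is that every member who posted in a thread gets one tile spanning that thread's range on the sub-forum segment; as far as I understand the tile track, Circos then places overlapping tiles in separate layers, which produces the stacking.

```python
from collections import namedtuple

# Hypothetical record: slot is the message's position on its sub-forum segment.
Post = namedtuple("Post", "member subforum thread slot")

# Dummy data, invented for illustration only.
POSTS = [
    Post("alice", "hardware", "t1", 0),
    Post("bob", "hardware", "t1", 1),
    Post("alice", "hardware", "t1", 2),
    Post("bob", "hardware", "t3", 3),
    Post("alice", "software", "t2", 0),
]
MEMBER_COLOR = {"alice": "red", "bob": "blue"}


def tile_lines(posts, member_color):
    """Emit one tile per (thread, member) pair, spanning the whole thread."""
    spans = {}    # (subforum, thread) -> (first slot, last slot)
    posters = {}  # (subforum, thread) -> set of members who posted there
    for p in posts:
        key = (p.subforum, p.thread)
        first, last = spans.get(key, (p.slot, p.slot))
        spans[key] = (min(first, p.slot), max(last, p.slot))
        posters.setdefault(key, set()).add(p.member)

    lines = []
    for key in sorted(spans):
        subforum, _thread = key
        first, last = spans[key]
        for member in sorted(posters[key]):
            # Same start and end for every member, so the tiles overlap.
            lines.append(f"s_{subforum} {first} {last + 1} color={member_color[member]}")
    return lines


if __name__ == "__main__":
    print("\n".join(tile_lines(POSTS, MEMBER_COLOR)))
```

A thread with many stacked tiles is one where several of the selected members took part.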

This post was originally not written in this exact form. It originates from personal notes that have been redacted in order to make them suitable for publication.