Oct 17, 2018

The Most Beautiful Social Network: the Structure of Communication at Automattic

In the previous two HR data analysis posts we dove into the tools we use to communicate as remote workers of a large-ish company at Automattic.

In How Communication Density Fuels Automattic, we looked into how our once-a-year company retreat, the Grand Meetup, impacts our work and how we communicate about it, and found that the meetup increases our productivity and cohesiveness.
We then turned our raw communication data into a network in Analysis of A Beautiful Storm: Internal Communication at Automattic, where we analyzed its influential nodes, counted the clusters we form within the company, and showed how Matt, our CEO, is the center of communication flows at Automattic.

Interaction network of Automattic employees. Source and more information: https://data.blog/2018/02/01/analysis-of-a-beautiful-storm-internal-communication-at-automattic/

At the end of the previous post, I promised that we would dive deeper into the above network:

A network analysis like this also gives us opportunities to figure out how to best inform everyone of big changes in a fast way, whether our communication structure is robust enough to account for people taking longer leaves, or what the characteristics of above-average communicators at our company are. These are all things we have looked into, stay tuned for subsequent posts on our HR analyses!

Social Network Analysis (SNA) basics

We at Automattic create and engage with content as part of our jobs. We mainly use P2s, our network of internal blogs, to communicate with each other. We post about everything from new projects to meetup activities we’re planning, or A/B test results. Everyone in the company is invited to join the conversation that follows any P2 post.

When we drew the network of Automatticians previously, each node in the network represented a person. And every time a person liked or commented on another’s post, a line was drawn between them.
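
As a rough illustration of how a log of likes and comments becomes a network, here is a minimal Python sketch. The log format, the names, the direction of each edge (from the person reacting to the author of the post), and the choice to ignore self-interactions are all assumptions made for the example, not a description of our actual pipeline.

```python
from collections import Counter

# Hypothetical interaction log: (person reacting, post author, kind of interaction).
interactions = [
    ("ana", "ben", "like"),
    ("ana", "ben", "comment"),
    ("cy", "ben", "like"),
    ("ben", "ana", "comment"),
    ("cy", "cy", "like"),  # self-interaction, ignored below (an assumption)
]


def build_edges(log):
    """Return {(source, target): weight}, counting interactions per directed pair."""
    edges = Counter()
    for actor, author, _kind in log:
        if actor != author:
            edges[(actor, author)] += 1
    return edges


edges = build_edges(interactions)
# Every person who appears at either end of an edge becomes a node.
nodes = sorted({person for pair in edges for person in pair})
```

Each key in `edges` is one line in the drawing, and its count is how often that interaction happened.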

This gave us a social network that is ripe for analysis! When it comes to social network analysis, we’re mainly interested in finding cliques by figuring out if certain nodes are more connected to each other than they are to others (clustering), and quantifying the importance of nodes in the network in different ways (network centralities).

Network centrality metrics

In the previous analysis, we looked at how to find cliques via clustering, and found seven distinct cliques within our Automattician-to-Automattician network. We also calculated each person’s PageRank within the network and found that Matt, our CEO, has the highest PageRank score.

PageRank is only one of the many metrics we can use to rank nodes in a network. Other straightforward approaches include:

In-degree centrality: The number of incoming connections a node has. For example, if 100 distinct people liked or commented on a person’s post, then their in-degree would be 100.
Weighted in-degree centrality: This, in our case, is the number of interactions a person receives. For example, if a person’s post receives 1,000 likes and comments, then their weighted in-degree is 1,000.
Weighted out-degree centrality: This is the number of times a person interacts with others within the company.
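
To make the three definitions concrete, here is a small Python sketch that computes them from a weighted edge list. The edge data and names are made up, and the function is only an illustration of the definitions above.

```python
from collections import defaultdict

# Hypothetical weighted edges: (source, target) -> number of likes and comments.
edges = {
    ("ana", "ben"): 2,
    ("cy", "ben"): 1,
    ("ben", "ana"): 1,
}


def degree_centralities(edges):
    """Return (in-degree, weighted in-degree, weighted out-degree) as dicts."""
    in_degree = defaultdict(int)
    weighted_in = defaultdict(int)
    weighted_out = defaultdict(int)
    for (source, target), weight in edges.items():
        in_degree[target] += 1         # one more distinct person reaching target
        weighted_in[target] += weight  # interactions received by target
        weighted_out[source] += weight # interactions made by source
    return dict(in_degree), dict(weighted_in), dict(weighted_out)


in_degree, weighted_in, weighted_out = degree_centralities(edges)
```

In this toy data, two distinct people interacted with ben three times in total, so his in-degree is 2 and his weighted in-degree is 3.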

Betweenness centrality

One of the biggest advantages of a network-driven approach is that it lets us look at data in a structure that makes different cliques visible. Information flows easily between the nodes within a clique, but two cliques might not know much about what is happening in each other, since information doesn’t flow so easily between cliques. There are always nodes in a network that act as bridges between groups that would otherwise struggle to exchange information. The metric that captures each node’s capability to act as a bridge is called betweenness centrality.

A node might have very few connections, but could hold a very important position within the network in terms of connecting cliques/sub-cliques that might otherwise be separated from each other. Sometimes it makes sense to target nodes that act as so-called “network bridges,” and this is when betweenness centrality comes into play.

A node’s betweenness centrality is the number of shortest paths that pass through the node when the shortest paths between all pairs of nodes in the network are calculated.

Betweenness centrality illustration. Source: https://www.ebi.ac.uk/training/online/course/network-analysis-protein-interaction-data-introduction/building-and-analysing-ppins-3
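
The definition can be sketched in plain Python. The version below uses Brandes’ algorithm on an undirected, unweighted toy graph; treating the network as undirected and unweighted is a simplifying assumption for the example, and the node names are made up.

```python
from collections import deque


def betweenness(graph):
    """Unnormalized betweenness for an undirected, unweighted graph.

    `graph` maps each node to the set of its neighbours.
    """
    score = {v: 0.0 for v in graph}
    for s in graph:
        # Breadth-first search from s, counting shortest paths (sigma)
        # and remembering each node's predecessors on those paths.
        order = []
        preds = {v: [] for v in graph}
        sigma = {v: 0 for v in graph}
        dist = {v: -1 for v in graph}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating each node's
        # share of the shortest paths that start at s.
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Every undirected pair was counted once from each endpoint.
    return {v: b / 2 for v, b in score.items()}


# Hypothetical toy network: two triangles joined through "bridge".
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "bridge"},
    "bridge": {"c", "d"},
    "d": {"bridge", "e", "f"},
    "e": {"d", "f"},
    "f": {"d", "e"},
}
scores = betweenness(graph)
```

Here "bridge" has only two connections, yet all nine shortest paths between the two triangles run through it, which gives it the highest score in the graph.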

Closeness centrality

While a node may connect different clusters (high betweenness), have many connections (high in-degree), or have fewer but important connections (high PageRank), in some cases its distance to other nodes, that is, its ability to be infected quickly and to infect others, matters more for its influence.

Nodes with high closeness tend to be tied to many important nodes in the network. This generally goes together with high betweenness, but the distance to other nodes also tells us something about a node’s position in the network that the other metrics miss. A person might have high degree centrality but low closeness, which would mean that they have many ties, but all of them within a single cluster at the periphery of the network.

Closeness centrality measures how many steps are needed to reach every other node from a given node; the fewer steps a node needs, the closer it is to the rest of the network.
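
One common way to turn this into a score is to divide the number of reachable nodes by the total number of steps needed to reach them, so that fewer steps give a higher value. The sketch below assumes that formulation and an undirected, unweighted toy graph; it is not necessarily the exact variant behind our plots.

```python
from collections import deque


def closeness(graph, source):
    """(number of reachable nodes) / (total steps to reach them), via BFS."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    total_steps = sum(dist.values())
    # An isolated node reaches nobody, so we score it as 0.0.
    return (len(dist) - 1) / total_steps if total_steps else 0.0


# Hypothetical toy network: a simple chain a - b - c - d - e.
graph = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
scores = {v: closeness(graph, v) for v in graph}
```

The middle of the chain, "c", needs the fewest steps to reach everyone and so has the highest closeness, while the two ends have the lowest.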

Distribution of network metrics

One of the properties of real-world networks is that their degree distribution and most of their centrality distributions have a long tail. This means that there will be a few nodes in the network with very high centralities, and the large majority of the nodes in the network will have very low centralities.

Sometimes the degree distribution of a network follows a power law; such networks are called scale-free. The 80/20 rule, also known as the Pareto principle, may apply in these networks: roughly 20% of the nodes are responsible for 80% of the analyzed values in the network.

An example power-law graph, being used to demonstrate ranking of popularity. To the right is the long tail, and to the left are the few that dominate (also known as the 80–20 rule). Source: https://en.wikipedia.org/wiki/Power_law
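
A quick way to check how concentrated a metric is: sort the values and see what share of the total the top 20% hold. The sketch below does that in Python; the numbers are invented for illustration and are not Automattic data.

```python
def top_share(values, top_fraction=0.2):
    """Share of the total held by the top `top_fraction` of the values."""
    ranked = sorted(values, reverse=True)
    # Always look at at least one value, even for very short lists.
    top_n = max(1, round(len(ranked) * top_fraction))
    total = sum(ranked)
    return sum(ranked[:top_n]) / total if total else 0.0


# Hypothetical weighted in-degrees for ten people (made-up numbers).
received = [400, 400, 50, 40, 30, 30, 20, 10, 10, 10]
share = top_share(received)
```

With these made-up numbers, the top two people out of ten receive 800 of the 1,000 interactions, which is exactly the 80/20 pattern. A perfectly even distribution would give the top 20% a share of only 0.2.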

SNA-powered change management

Change management deals with the question: how do we communicate changes within an organization most effectively?

Since betweenness centrality ranking gives us a list of nodes who connect different cliques together, we can use these people to help us make sure that the new piece of information reaches even the more isolated clusters within the network.

This provides a fast way to introduce change: betweenness centrality distributions usually have a long tail, with only a few people having high betweenness, so we only need to reach those few people and they can help carry the new information to the rest of the network.

Betweenness centrality distribution of people working at Automattic. Only very few people have high betweenness.

SNA-powered robustness management

Another thing we would like to ensure as an organization is that each person has access to, and knowledge of, what is happening at the company; in other words, that our information flow is robust. We would not want a few people to have access to a lot while the majority have access to little.

Fortunately, when we plotted the closeness centrality distribution of our communication network, we found that the distribution is normal.

Closeness centrality distribution of people working at Automattic.

As explained, closeness centrality is really important. If this network were mapping knowledge transfer, a beautiful normal closeness centrality distribution like ours would mean that we don’t have many people whose sabbatical, for example, would disrupt the company’s workflow. (At Automattic, we receive a three-month sabbatical every five years; this could be disruptive at a less connected or less open organization.)
Why? Because our closeness centrality, which describes how many steps it takes to reach every other person from a given person, follows a normal distribution, so there aren’t many outliers. For most people, there are others in the network whose closeness to the knowledge base is very similar to their own.
In this case, it means that we interact with and are aware of things at the company in a way that ensures that if one node stopped interacting, the rest of us would still have a similar level of awareness as before. We do not rely on a few nodes for the majority of our information.
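
One way to picture this kind of robustness is to remove a node and see how much the average closeness of everyone else changes. The Python sketch below compares two hypothetical toy networks, a hub-and-spoke network and a ring; the graphs, the names, and the closeness formula (reachable nodes divided by total steps, with isolated nodes scored as 0) are assumptions for illustration, not our actual analysis.

```python
from collections import deque
from statistics import mean


def closeness(graph, source):
    """(number of reachable nodes) / (total steps to reach them), via BFS."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    total_steps = sum(dist.values())
    # Caveat: only reachable nodes are counted, so this simple score can
    # flatter nodes in small, cut-off components. Isolated nodes get 0.0.
    return (len(dist) - 1) / total_steps if total_steps else 0.0


def without(graph, removed):
    """Copy of the graph with one node, and every tie to it, taken out."""
    return {v: nbrs - {removed} for v, nbrs in graph.items() if v != removed}


def mean_closeness(graph):
    return mean(closeness(graph, v) for v in graph)


# Hypothetical hub-and-spoke network: everyone talks only to "hub".
star = {"hub": {"a", "b", "c", "d"},
        "a": {"hub"}, "b": {"hub"}, "c": {"hub"}, "d": {"hub"}}
# Hypothetical ring: everyone talks to two neighbours.
ring = {"a": {"b", "e"}, "b": {"a", "c"}, "c": {"b", "d"},
        "d": {"c", "e"}, "e": {"d", "a"}}

star_after = mean_closeness(without(star, "hub"))
ring_after = mean_closeness(without(ring, "a"))
```

When "hub" goes on leave, the remaining people in the star can no longer reach each other at all, whereas the ring loses only a little closeness when any one person steps away.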

tl;dr

Transforming internal communication data into a social network gives a way to uncover hidden structures within how we work and communicate.
We are able to rank people in the network based on their ability to act as a bridge via betweenness centrality and this helps in change management.
- There are usually very few people that have high betweenness.
- These high-ranking people will be important to reach when trying to introduce a new piece of information into the network quickly.
Another measure of ease of information diffusion is closeness centrality, which helps with understanding one form of robustness.
- This measures the speed with which each node is “infected.”
- We want to make sure that our network is robust enough so that if a high-ranking person goes on sabbatical, there will be others with similar closeness to them.

