
An Antic Disposition


Social Network Analysis

Visualizing OASIS Technical Committees

2013/07/01 By Rob 1 Comment

oasis

So what do we have here?  This is a simple social network visualization of OASIS Technical Committees.  Each circle in this graph represents a single Technical Committee (TC).  The size of the circle is proportional to the number of members on the committee.  The lines between the committees have a weight that is proportional to the overlap in membership between the TCs.  In this case I used Dice’s coefficient as the metric, although any of the several set similarity metrics (Jaccard, etc.) would work here.  The color of each node represents its modularity class, that is, the community or sub-network within the graph to which it was assigned.  The resulting graph was then run through Gephi and its Force Atlas layout algorithm, which brings together the TCs that are more closely related by overlapping membership.  Click the image for a larger version.
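As a rough illustration of why the choice of similarity metric matters little here, the sketch below computes both Dice's coefficient and the Jaccard index for two committee rosters. The rosters and the names `tc_x` and `tc_y` are hypothetical, not taken from the OASIS data. The two measures are related by J = D / (2 - D), so they rank pairs of committees in the same order.

```python
def dice(a, b):
    """Sorensen-Dice coefficient: 2|A & B| / (|A| + |B|)."""
    if not a and not b:
        return 0.0
    return 2.0 * len(a & b) / (len(a) + len(b))


def jaccard(a, b):
    """Jaccard index: |A & B| / |A | B|."""
    union = a | b
    if not union:
        return 0.0
    return float(len(a & b)) / len(union)


# Hypothetical committee rosters, for illustration only.
tc_x = {"ann", "bob", "cho", "dee"}
tc_y = {"cho", "dee", "eli"}

d = dice(tc_x, tc_y)      # 2*2 / (4+3)
j = jaccard(tc_x, tc_y)   # 2 / 5
```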

(For those who are interested, the raw data for this is all publicly available on the OASIS website.  Scraping the web pages for the data, calculating the graph, and outputting a GEXF-format file for Gephi was accomplished in 133 lines of Python.)

Note one important fact:  the graph is built entirely from abstract quantities: the size of each committee and the overlaps in membership.  It has no knowledge of what the underlying technologies are, the companies and individuals involved, or of other items of semantic value that could describe the work of the committee.  The structure is essentially based on the interests and affiliations of individual committee members.  Where there is common interest it is assumed that there is commonality in the work of the TCs.

So how well does this match reality?  The image that follows (click for an enlarged version) is the same chart, but with each node labeled by the short name of the TC.  As you can see, the above approach does a fine job of bringing together related TCs.  This holds both at the fine-grained level, where the DITA TC and the DITA Adoption TC, or the SCA and SCA Assembly TCs, are adjacent, and at the broader level, where we see communities for content-related standards, for privacy/identity standards, legal/emergency, etc.

oasis-projects


Filed Under: OASIS, Social Network Analysis

Mapping the ASF, Part II

2013/05/06 By Rob 1 Comment

In my last post I showed you one view of the Apache Software Foundation: the relationship of projects as revealed by the overlapping membership of their Project Management Committees.  After I did that post it struck me that I could, with very small modifications to my script, look at the connections at the individual level instead of at the committee level.  Initially I attempted this with all Committers in the ASF.  This resulted in a graph with over 3000 nodes and over 2.6 million edges.  I’m still working on making sense of that graph.  It was very dense, and visualizing it as anything other than a giant blob has proven challenging.  So I scaled back the problem slightly and decided to look at the relationship between individual members of the many PMCs, a smaller graph with only 1577 nodes and 22,399 edges.

Here’s what I got:


As before I excluded the Apache Incubator, Labs and Attic, but looked at all other PMC members.  Each PMC member is a dot in this graph, with a line connecting two people who serve together on a PMC.  The layout and colors emphasize communities of strong interconnection.  An SVG version of the graph is here.
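The construction just described can be sketched as follows. This is an illustration under assumptions, not the script used for the post: the `pmcs` dictionary and its member names are hypothetical stand-ins for the real rosters, and `person_edges` is a name chosen here.

```python
from itertools import combinations

# Hypothetical PMC rosters, for illustration only.
pmcs = {
    "alpha": {"ann", "bob", "cho"},
    "beta": {"cho", "dee"},
    "gamma": {"eli", "fay"},
}


def person_edges(pmc_members):
    """Return the set of undirected person-to-person edges.

    Two people are linked if they sit together on at least one PMC,
    so every PMC contributes a clique over its own members.
    """
    edges = set()
    for members in pmc_members.values():
        # sorted() gives each pair a canonical order, so (a, b) and (b, a)
        # are never stored as two different edges.
        for a, b in combinations(sorted(members), 2):
            edges.add((a, b))
    return edges


edges = person_edges(pmcs)
people = set().union(*pmcs.values())
```

Because every PMC of size n contributes n(n-1)/2 edges, the edge count grows much faster than the node count, which is consistent with how dense the full committer graph turned out to be.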

Each PMC is a “clique”, a group that strongly interacts with itself.  But aside from a small number of exceptions, which you can see at the top of the graph, each PMC has one or more members who are also members of other PMCs.  In structural terms they are “between” the two communities and help connect them.  This could mean various things in social terms, from acting as a conduit of information to serving as a broker or even a gatekeeper.  The person who introduces you to new people at a party serves the same role as the person who tells the prisoner stories of the outside world.  The context is different, of course, but in either case, the structural position is one of importance.

A common way of quantifying the importance of the nodes that connect other nodes is via a metric called “betweenness centrality”, which you can think of as a measure of how many shortest paths between other nodes pass through that node.  If the shortest path is always going through you, then you have high betweenness and you’re helping connect the disparate parts of the organization.
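To make the definition concrete, here is a minimal pure-Python sketch of Brandes' algorithm for unnormalized betweenness on an undirected, unweighted graph. It is an illustration only; the figures in this post came from Gephi, and the small graph below (two hypothetical three-person PMCs sharing the member `m`) is made up.

```python
from collections import deque


def betweenness(adj):
    """Unnormalized betweenness centrality (Brandes' algorithm).

    adj maps each node to the set of its neighbours in an undirected,
    unweighted graph.
    """
    score = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        preds = {v: [] for v in adj}
        sigma[s] = 1
        dist[s] = 0
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating dependencies.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += float(sigma[v]) / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted once from each endpoint.
    return {v: c / 2.0 for v, c in score.items()}


# Two hypothetical PMCs, {a, b, m} and {m, c, d}, sharing the member m.
adj = {
    "a": {"b", "m"},
    "b": {"a", "m"},
    "m": {"a", "b", "c", "d"},
    "c": {"m", "d"},
    "d": {"m", "c"},
}
scores = betweenness(adj)
```

In this toy graph every shortest path between the two groups runs through `m`, so `m` gets all of the betweenness and the other four nodes get none.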

Let’s draw the graph again and show each node with a size proportionate to its betweenness.  You can see more clearly now the position of the high betweenness nodes and how they bridge sub-communities.

Now of course, the structural role doesn’t necessarily equate to the actual social role.  Someone could be inactive or lurking in multiple projects and not serve as the conduit of much of anything, though on paper they appear central.   But Apache participants might take a look at this larger version of the chart, where I have labeled the nodes, and see how well it matches reality in many ways.


Filed Under: Apache, Social Network Analysis

Mapping the Apache Software Foundation

2013/05/03 By Rob 2 Comments

So, what do we have here?   This is a graph of Apache projects and how they are related, by one definition of “related” in any case.  Click on the image for a larger PNG version, or here if you would like an SVG.

Each labeled circle (node) in the graph represents one project at Apache.  Or to be specific it represents the membership of a single Project Management Committee (PMC),  the leadership committee that each Apache project has.  The size of the node is proportionate to the size of the PMC.    You can see that the largest PMCs are Apache Axis (56 members),  Httpd (55 members), Subversion (42 members), WS (41 members) and Geronimo (also 41 members).

The edges between the PMC nodes represent the ties between the PMCs as revealed by overlapping membership.  So PMCs that have a larger number of members in common have a thicker line connecting them.  I used the Sørensen–Dice coefficient to express the overlap.  This is a simple calculation that looks at the overlap in membership of two sets, scaled by the size of the individual sets.  It varies from 0 to 1,  with 0 meaning no overlap at all and 1 meaning total overlap.    An example:  Look at the bottom of the graph at the thick line connecting Apache Flume and Sqoop.  The Flume PMC has 20 members and the Sqoop PMC has 13.  They have 6 members in common, so the Dice coefficient is (2*6)/(20+13) = 0.36.   The highest weight edge in the graph is that between Apache Httpd and the Apache Portable Runtime (APR), with a coefficient of 0.52.

(Observant Apache participants will note that the chart is missing some PMCs.  I omitted Apache Labs, Incubator and Attic since they are umbrella projects representing parts of a project lifecycle.  They don’t have a specific technical orientation and the commonality in membership would not mean anything.  I left out Comdev as well, for similar reasons.)

The color for each node was determined by a community-detection algorithm (modularity) which finds projects that have a high degree of interconnection.  This has brought out some of the larger trends within Apache, such as the grouping of cloud-related projects, big data related ones, content management, enterprise middleware, etc.  What is interesting is that this graph was created without knowing anything at all about the technology within each project.  The graph is based on PMC membership data only.  So individual volunteers, by their choice of which projects they work on, are the motive force behind these groupings.
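Modularity-based community detection searches for the partition of nodes that maximizes a score Q, which compares the edge weight inside each community with what would be expected if edges were placed at random. The sketch below does not perform that search (as far as I know, Gephi uses the Louvain method for it); it only computes Q for a partition you hand it, which is enough to see what is being optimized. The function name and the toy graph are assumptions made for illustration.

```python
def modularity(edges, community):
    """Newman modularity Q of a partition of a weighted, undirected graph.

    edges: dict mapping (u, v) pairs to a positive weight, each edge once.
    community: dict mapping every node to its community label.
    """
    total = float(sum(edges.values()))
    if total == 0:
        return 0.0
    inside = {}    # weight of edges with both ends in the same community
    strength = {}  # summed weighted degree of each community's nodes
    for (u, v), w in edges.items():
        cu, cv = community[u], community[v]
        strength[cu] = strength.get(cu, 0.0) + w
        strength[cv] = strength.get(cv, 0.0) + w
        if cu == cv:
            inside[cu] = inside.get(cu, 0.0) + w
    return sum(
        inside.get(c, 0.0) / total - (s / (2.0 * total)) ** 2
        for c, s in strength.items()
    )


# Toy graph: two triangles joined by a single edge, all weights 1.
toy_edges = {
    ("a", "b"): 1.0, ("a", "c"): 1.0, ("b", "c"): 1.0,
    ("d", "e"): 1.0, ("d", "f"): 1.0, ("e", "f"): 1.0,
    ("c", "d"): 1.0,
}
split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
lumped = {node: 0 for node in split}

q_split = modularity(toy_edges, split)
q_lumped = modularity(toy_edges, lumped)
```

Splitting the toy graph into its two triangles scores higher than putting everything in one community, which is the kind of preference that produces the colored groupings in the chart.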

Some other interesting facts:

  • The PMCs with connections to the most other PMCs are Commons (34), WS (32), DirectMemory (31), Aries (28) and Geronimo (28).
  • If you look instead at the total number of connections to other PMCs (subtly different from the above, since a PMC can have more than one member in another PMC), the top projects are: DirectMemory, Karaf, Servicemix, BVal and Geronimo.
  • Betweenness centrality looks at the importance of a node with respect to helping connect other nodes.  It looks at the shortest path between all pairs of nodes, and which specific nodes are most often the ones that are passed through on these shortest paths.  If we were looking at a graph of air traffic routes, the hub cities would be the ones with the highest centrality.  If we were looking at how to communicate an idea, influence opinion, or spread an infectious disease (all the same thing, really), these central nodes are the ones to look at.  The PMCs at Apache with the highest betweenness are: Commons, DirectMemory, WS, Httpd and Portals.
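The difference between the first two bullets can be sketched as follows, under one reading of them: the first counts how many other PMCs a PMC shares at least one member with, and the second sums the shared members across all of those PMCs. The rosters and the function name `connection_counts` are hypothetical.

```python
def connection_counts(pmc_members):
    """For each PMC return (linked_pmcs, shared_memberships).

    linked_pmcs: number of other PMCs sharing at least one member.
    shared_memberships: total shared members summed over those PMCs.
    """
    counts = {}
    for name, members in pmc_members.items():
        linked = 0
        shared = 0
        for other, other_members in pmc_members.items():
            if other == name:
                continue
            overlap = len(members & other_members)
            if overlap:
                linked += 1
                shared += overlap
        counts[name] = (linked, shared)
    return counts


# Hypothetical rosters, for illustration only.
pmcs = {
    "alpha": {"ann", "bob", "cho"},
    "beta": {"ann", "bob", "dee"},
    "gamma": {"cho", "eli"},
    "delta": {"fay"},
}
counts = connection_counts(pmcs)
```

Here `alpha` is linked to two other PMCs but has three shared memberships, because it has two members in common with `beta`. That gap is why the two rankings in the list can differ.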

So how did I do this?

The core data I got from scraping this page, which lists all Apache committers.  I did this in Python using BeautifulSoup, building up the PMC membership in a dictionary.  Then Python’s set operations made calculating the Dice coefficient a simple task:

    intersect = SetA.intersection(SetB)

    dice = (2.0*len(intersect)/(len(SetA)+len(SetB)))

The script then wrote out the graph data, including node size and edge weight, to a GEXF-format XML file, which I then processed using Gephi.  Here’s the data file I used if you want to play with the data yourself.
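For readers who want to try something similar, here is a minimal sketch of that last step using only the standard library. It is not the original script: the function name `build_gexf`, the element layout (which follows my understanding of the GEXF 1.2 draft, with node size carried in a `viz:size` element), and the placeholder rosters are all assumptions. The rosters are synthetic integers sized to match the Flume and Sqoop example above (20 and 13 members, 6 in common).

```python
import xml.etree.ElementTree as ET
from itertools import combinations

GEXF_NS = "http://www.gexf.net/1.2draft"
VIZ_NS = "http://www.gexf.net/1.2draft/viz"


def build_gexf(pmc_members):
    """Build a GEXF tree: one node per PMC, one Dice-weighted edge
    per pair of PMCs that share at least one member."""
    ET.register_namespace("", GEXF_NS)
    ET.register_namespace("viz", VIZ_NS)
    root = ET.Element("{%s}gexf" % GEXF_NS, {"version": "1.2"})
    graph = ET.SubElement(root, "{%s}graph" % GEXF_NS,
                          {"defaultedgetype": "undirected"})
    nodes = ET.SubElement(graph, "{%s}nodes" % GEXF_NS)
    edges = ET.SubElement(graph, "{%s}edges" % GEXF_NS)

    names = sorted(pmc_members)
    for name in names:
        node = ET.SubElement(nodes, "{%s}node" % GEXF_NS,
                             {"id": name, "label": name})
        # Node size is the number of PMC members.
        ET.SubElement(node, "{%s}size" % VIZ_NS,
                      {"value": str(len(pmc_members[name]))})

    edge_id = 0
    for a, b in combinations(names, 2):
        set_a, set_b = pmc_members[a], pmc_members[b]
        intersect = set_a.intersection(set_b)
        if not intersect:
            continue
        dice = 2.0 * len(intersect) / (len(set_a) + len(set_b))
        ET.SubElement(edges, "{%s}edge" % GEXF_NS,
                      {"id": str(edge_id), "source": a, "target": b,
                       "weight": "%.4f" % dice})
        edge_id += 1
    return ET.ElementTree(root)


# Synthetic placeholder rosters: 20 and 13 members with 6 in common.
tree = build_gexf({"flume": set(range(20)), "sqoop": set(range(14, 27))})
# To save it for Gephi (the file name is hypothetical):
# tree.write("pmcs.gexf", encoding="utf-8", xml_declaration=True)
```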

In Part II of this series, I’ll take a look at finer-grained data, at the social network graph of Apache Software Foundation participants at the individual level.


Filed Under: Apache, Social Network Analysis


Copyright © 2006-2023 Rob Weir · Site Policies

 
