{"id":2155,"date":"2013-05-03T11:01:59","date_gmt":"2013-05-03T15:01:59","guid":{"rendered":"http:\/\/2d823b65bb.nxcli.io\/?p=2155"},"modified":"2013-05-08T10:12:14","modified_gmt":"2013-05-08T14:12:14","slug":"mapping-apache","status":"publish","type":"post","link":"https:\/\/www.robweir.com\/blog\/2013\/05\/mapping-apache.html","title":{"rendered":"Mapping the Apache Software Foundation"},"content":{"rendered":"<p><a href=\"https:\/\/2d823b65bb.nxcli.io\/blog\/images\/apache-map-large.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter\" alt=\"\" src=\"https:\/\/2d823b65bb.nxcli.io\/blog\/images\/apache-map.png\" width=\"800\" height=\"800\" \/><\/a><\/p>\n<p>So, what do we have here?\u00a0\u00a0 This is a graph of Apache projects and how they are related, by one definition of &#8220;related&#8221; in any case.\u00a0 Click on the image for a larger PNG version, or<a href=\"https:\/\/2d823b65bb.nxcli.io\/blog\/images\/apache-map.svg\"> here if you would like an SVG<\/a>.<\/p>\n<p>Each labeled circle (node) in the graph represents one project at Apache.\u00a0 Or to be specific it represents the membership of a single Project Management Committee (PMC),\u00a0 the leadership committee that each Apache project has.\u00a0 The size of the node is proportionate to the size of the PMC.\u00a0\u00a0\u00a0 You can see that the largest PMCs are Apache Axis (56 members),\u00a0 Httpd (55 members), Subversion (42 members), WS (41 members) and Geronimo (also 41 members).<\/p>\n<p>The edges between the PMC nodes represent the ties between the PMCs as revealed by overlapping membership.\u00a0 So PMCs that have a larger number of members in common have a thicker line connecting them.\u00a0 I used the <a href=\"http:\/\/en.wikipedia.org\/wiki\/S%C3%B8rensen%E2%80%93Dice_coefficient\">S\u00f8rensen\u2013Dice coefficient<\/a> to express the overlap.\u00a0 This is a simple calculation that looks at the overlap in membership of two sets, scaled by the size of the 
individual sets.\u00a0 It varies from 0 to 1,\u00a0 with 0 meaning no overlap at all and 1 meaning total overlap. \u00a0\u00a0 An example:\u00a0 Look at the bottom of the graph at the thick line connecting Apache Flume and Sqoop.\u00a0 The Flume PMC has 20 members and the Sqoop PMC has 13.\u00a0 They have 6 members in common, so the Dice coefficient is (2*6)\/(20+13) = 0.36.\u00a0\u00a0 The highest-weight edge in the graph is the one between Apache Httpd and the Apache Portable Runtime (APR), with a coefficient of 0.52.<\/p>\n<p>(Observant Apache participants will note that the chart is missing some PMCs.\u00a0 I omitted Apache Labs, Incubator and Attic since they are umbrella projects representing parts of a project lifecycle.\u00a0 They don&#8217;t have a specific technical orientation, so the commonality in membership would not mean anything.\u00a0 I left out Comdev as well, for similar reasons.)<\/p>\n<p>The color for each node was determined by a community-detection algorithm (modularity) which finds projects that have a high degree of interconnection.\u00a0 This has brought out some of the larger trends within Apache, such as the grouping of cloud-related projects, big-data-related ones, content management,\u00a0 enterprise middleware, etc.\u00a0 What is interesting is that this graph was created without knowing anything at all about the technology within each project.\u00a0 The graph is based on PMC membership data only.\u00a0 So individual volunteers, by their choice of which projects to work on, are the motive force behind these groupings.<\/p>\n<p>Some other interesting facts:<\/p>\n<ul>\n<li>The PMCs with connections to the most other PMCs are Commons (34), WS (32), DirectMemory (31), Aries (28) and Geronimo (28).<\/li>\n<li>If you instead count the total number of connections to other PMCs (subtly different from the above, since a PMC can share more than one member with another PMC), the top projects are: DirectMemory, Karaf, Servicemix, BVal and 
Geronimo.<\/li>\n<li><a href=\"http:\/\/en.wikipedia.org\/wiki\/Betweenness_centrality\">Betweenness centrality<\/a> looks at the importance of a node with respect to helping connect other nodes.\u00a0 It looks at the shortest paths between all pairs of nodes, and at which specific nodes are most often passed through on these shortest paths.\u00a0 If we were looking at a graph of air traffic routes, the hub cities would be the ones with the highest centrality.\u00a0 If we were looking at how to communicate an idea, influence opinion, or spread an infectious\u00a0 disease (all the same thing, really), these central nodes are the ones to look at.\u00a0 The PMCs at Apache with the highest betweenness are: Commons, DirectMemory, WS, Httpd and Portals.<\/li>\n<\/ul>\n<p>So how did I do this?<\/p>\n<p>I got the core data by scraping the page that lists <a href=\"http:\/\/people.apache.org\/committer-index.html\">all Apache committers<\/a>.\u00a0 I did this in Python using BeautifulSoup, building up the PMC membership in a dictionary.\u00a0 Then Python&#8217;s set operations made calculating the Dice coefficient a simple task:<\/p>\n<div>\n<pre>    intersect = SetA.intersection(SetB)\r\n\r\n    dice = (2.0*len(intersect)\/(len(SetA)+len(SetB)))<\/pre>\n<\/div>\n<p>The script then wrote out the graph data, including node size and edge weight, to a GEXF-format XML file, which I then processed using <a href=\"https:\/\/gephi.org\/\">Gephi<\/a>.\u00a0 Here&#8217;s <a href=\"https:\/\/2d823b65bb.nxcli.io\/blog\/attachments\/apache.gexf\">the data file I used<\/a> if you want to play with the data yourself.<\/p>\n<p>In Part II of this series, I&#8217;ll take a look at finer-grained data, at <a href=\"https:\/\/2d823b65bb.nxcli.io\/blog\/2013\/05\/mapping-the-asf-part-ii.html\">the social network graph of Apache Software Foundation participants at the individual level<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>So, what do we have here?\u00a0\u00a0 
This is a graph of Apache projects and how they are related, by one definition of &#8220;related&#8221; in any case.\u00a0 Click on the image for a larger PNG version, or here if you would like an SVG. Each labeled circle (node) in the graph represents one project at Apache.\u00a0 [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_genesis_hide_title":false,"_genesis_hide_breadcrumbs":false,"_genesis_hide_singular_image":false,"_genesis_hide_footer_widgets":false,"_genesis_custom_body_class":"","_genesis_custom_post_class":"","_genesis_layout":"","footnotes":""},"categories":[211,213],"tags":[],"class_list":{"0":"post-2155","1":"post","2":"type-post","3":"status-publish","4":"format-standard","6":"category-apache","7":"category-social-network-analysis","8":"entry"},"_links":{"self":[{"href":"https:\/\/www.robweir.com\/blog\/wp-json\/wp\/v2\/posts\/2155","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.robweir.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.robweir.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.robweir.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.robweir.com\/blog\/wp-json\/wp\/v2\/comments?post=2155"}],"version-history":[{"count":6,"href":"https:\/\/www.robweir.com\/blog\/wp-json\/wp\/v2\/posts\/2155\/revisions"}],"predecessor-version":[{"id":2168,"href":"https:\/\/www.robweir.com\/blog\/wp-json\/wp\/v2\/posts\/2155\/revisions\/2168"}],"wp:attachment":[{"href":"https:\/\/www.robweir.com\/blog\/wp-json\/wp\/v2\/media?parent=2155"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.robweir.com\/blog\/wp-json\/wp\/v2\/categories?post=2155"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.robweir.com\/blog\/wp-json\/wp\/v2\/tags?post=2155"}],"curies":[{"name":"wp
","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}