{"id":2366,"date":"2014-02-07T10:06:55","date_gmt":"2014-02-07T15:06:55","guid":{"rendered":"http:\/\/2d823b65bb.nxcli.io\/?p=2366"},"modified":"2014-02-07T10:09:38","modified_gmt":"2014-02-07T15:09:38","slug":"the-words-democrats-and-republicans-use","status":"publish","type":"post","link":"https:\/\/www.robweir.com\/blog\/2014\/02\/the-words-democrats-and-republicans-use.html","title":{"rendered":"The Words Democrats and Republicans Use"},"content":{"rendered":"<p>It came to me after listening to the State of the Union Address:\u00a0\u00a0 Can we tell whether a speech was from a Democrat or a Republican President, purely based on metrics related to the words used?\u00a0 It makes sense that we could.\u00a0 After all, we can analyze emails and detect spam that way.\u00a0 Automatic text classification is a well known problem.\u00a0\u00a0 On the other hand, presidential speeches go back quite a bit.\u00a0 Is there a commonality of speeches of, a Democrat in 2014 with one from 1950?\u00a0 Only one way to find out&#8230;<\/p>\n<p>I decided to limit myself to State of the Union (SOTU) addresses, since they are readily available, and only those post WW II.\u00a0 There has been a significant shift in American politics since WW II so it made sense, for continuity, to look at Truman and later.\u00a0\u00a0 If I had included all of Roosevelt&#8217;s twelve (!) SOTU speeches it might have distorted the results, giving undue weight to individual stylistic factors. \u00a0 So I grabbed the <a href=\"http:\/\/stateoftheunion.onetwothree.net\/texts\/index.html\">71 post WWII addresses<\/a> and stuck them into a directory.\u00a0 I included only the annual addresses, not any exceptional ones, like G.W. 
Bush&#8217;s special SOTU in September 2001.<\/p>\n<p>I then used R&#8217;s text mining package, <a href=\"http:\/\/cran.r-project.org\/web\/packages\/tm\/index.html\">tm<\/a>, to load the files into a corpus, tokenize, and remove punctuation, stop words, etc.\u00a0 I then created a document-term matrix and removed any terms that occurred in fewer than half of the speeches.\u00a0 This left me with counts of 610 terms in 71 documents.<\/p>\n<p>Then came the fun part.\u00a0 I decided to use Pointwise Mutual Information (PMI), an information-theoretic measure of association from information retrieval, to look at the association between terms in the speeches and party affiliation.\u00a0 PMI measures the degree of association (or &#8220;collocation&#8221;) of two terms while also accounting for the prevalence of each term individually.\u00a0 Wikipedia <a href=\"http:\/\/en.wikipedia.org\/wiki\/Pointwise_mutual_information\">gives the formula<\/a>, which is pretty much what you would expect.\u00a0\u00a0 Calculate the log probability of the term in the collocation and subtract out the log probability of the term&#8217;s background rate.\u00a0 But instead of looking at the co-occurrence of two terms, I tried looking at the co-occurrence of terms with the party affiliation.\u00a0\u00a0\u00a0 For example, the PMI of &#8220;taxes&#8221; with the class Democrat would be:\u00a0 log p(&#8220;taxes&#8221;|Democrat) &#8211; log p(&#8220;taxes&#8221;).\u00a0 You can see <a href=\"https:\/\/2d823b65bb.nxcli.io\/blog\/attachments\/pmi\/sotu_pmi.R\">my full script<\/a> for the gory details.<\/p>\n<p>Here&#8217;s what I got, listing the 25 highest-PMI terms for Democrats and Republicans:<\/p>\n<p><img decoding=\"async\" alt=\"\" src=\"https:\/\/2d823b65bb.nxcli.io\/blog\/attachments\/pmi\/democrats.png\" \/><\/p>\n<p><img decoding=\"async\" alt=\"\" src=\"https:\/\/2d823b65bb.nxcli.io\/blog\/attachments\/pmi\/republicans.png\" \/><\/p>\n<p>So what does this all mean?\u00a0 First note the difference 
in scale.\u00a0 The top Republican terms had higher PMI than the top Democrat terms.\u00a0 In some sense it is a political Rorschach test.\u00a0 You&#8217;ll see what you want to see.\u00a0 But in fairness to both parties, I think this does accurately reflect their traditional priorities.<\/p>\n<p>From the analytic standpoint, the interesting thing I notice is how this compares to other approaches, like using classification trees.\u00a0 For example, if I fit a recursive partitioning classification tree to the original data, using rpart, I can classify the speeches with 86% accuracy by looking at the occurrences of only two terms:<\/p>\n<p><img decoding=\"async\" alt=\"\" src=\"https:\/\/2d823b65bb.nxcli.io\/blog\/attachments\/pmi\/rpart.png\" \/><\/p>\n<p>Not a lot of insight there.\u00a0 It essentially latched on to background noise, keying on two semantically useless words.\u00a0\u00a0 So I prefer the PMI-based results, since they appear to have more semantic weight.<\/p>\n<p>Next steps: I&#8217;d like to apply this approach back to speeches from 1860 through 1945.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>It came to me after listening to the State of the Union Address:\u00a0\u00a0 Can we tell whether a speech was from a Democrat or a Republican President, purely based on metrics related to the words used?\u00a0 It makes sense that we could.\u00a0 After all, we can analyze emails and detect spam that way.\u00a0 Automatic text 
[&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_genesis_hide_title":false,"_genesis_hide_breadcrumbs":false,"_genesis_hide_singular_image":false,"_genesis_hide_footer_widgets":false,"_genesis_custom_body_class":"","_genesis_custom_post_class":"","_genesis_layout":"","footnotes":""},"categories":[155,217],"tags":[],"class_list":{"0":"post-2366","1":"post","2":"type-post","3":"status-publish","4":"format-standard","6":"category-language","7":"category-r","8":"entry"},"_links":{"self":[{"href":"https:\/\/www.robweir.com\/blog\/wp-json\/wp\/v2\/posts\/2366","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.robweir.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.robweir.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.robweir.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.robweir.com\/blog\/wp-json\/wp\/v2\/comments?post=2366"}],"version-history":[{"count":16,"href":"https:\/\/www.robweir.com\/blog\/wp-json\/wp\/v2\/posts\/2366\/revisions"}],"predecessor-version":[{"id":2383,"href":"https:\/\/www.robweir.com\/blog\/wp-json\/wp\/v2\/posts\/2366\/revisions\/2383"}],"wp:attachment":[{"href":"https:\/\/www.robweir.com\/blog\/wp-json\/wp\/v2\/media?parent=2366"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.robweir.com\/blog\/wp-json\/wp\/v2\/categories?post=2366"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.robweir.com\/blog\/wp-json\/wp\/v2\/tags?post=2366"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}