{"id":1124,"date":"2010-07-29T16:57:39","date_gmt":"2010-07-29T20:57:39","guid":{"rendered":"http:\/\/2d823b65bb.nxcli.io\/?p=1124"},"modified":"2010-08-01T21:26:10","modified_gmt":"2010-08-02T01:26:10","slug":"odf-word-clouds","status":"publish","type":"post","link":"https:\/\/www.robweir.com\/blog\/2010\/07\/odf-word-clouds.html","title":{"rendered":"ODF 1.2 Word Clouds"},"content":{"rendered":"<p>I&#8217;ve been playing around today with a preview build of the ODF Java API <a href=\"http:\/\/odftoolkit.org\/projects\/odfdom\/pages\/Home\">ODFDOM<\/a> 0.9.\u00a0\u00a0 One of the capabilities we&#8217;re adding is a simple text extraction API.<\/p>\n<p>The idea is to have a very simple API, a single function call in fact, that will allow you to extract the plain text from an ODF document.\u00a0 So strip all formatting, all layout and just return the text.\u00a0 At first you might think this is rather useless, but further reflection shows that it has myriad uses, including accessibility, search indexing, collaborative filtering, and text analytics in general.<\/p>\n<p>Extracting text from ODF is pretty simple.\u00a0 There are a handful of special cases to watch out for.\u00a0 One example is a single word that has mixed styles, e.g.: <strong>ODF<\/strong><em>DOM<\/em>.\u00a0 In ODF this looks like:<br \/>\n<code><br \/>\n&lt;text:span text:style-name=\"style1\"&gt;ODF&lt;\/text:span&gt;<br \/>\n&lt;text:span text:style-name=\"style2\"&gt;DOM&lt;\/text:span&gt;<\/code><\/p>\n<p>We want text extraction to come out as &#8220;ODFDOM&#8221; not &#8220;ODF DOM&#8221; with a space.<\/p>\n<p>On the other hand, there are other examples of adjacent elements, like with footnote citations, where we need to insert a space to prevent two adjacent strings from being conflated.<\/p>\n<p>Overall, the build I used looks pretty good, and works the same across text, spreadsheets and presentations.<\/p>\n<p>So I was looking this afternoon for something I could use to demo this new 
capability.\u00a0 I thought of using Jonathan Feinberg&#8217;s\u00a0 excellent <a href=\"http:\/\/www.wordle.net\/\">Wordle applet<\/a> (which I wrote about a <a href=\"https:\/\/2d823b65bb.nxcli.io\/blog\/2008\/06\/beautiful-word-clouds.html\">while back<\/a>).\u00a0 This applet creates a word cloud based on the word frequency of the text you feed it.\u00a0 As a torture test I decided to feed it the text of\u00a0 ODF 1.2 Committee Draft 05, the version that is currently out for <a href=\"https:\/\/2d823b65bb.nxcli.io\/blog\/2010\/07\/odf12-public-review.html\">public review<\/a>.<\/p>\n<p>This is what I got for results.<\/p>\n<p>Part 1 is the annotated schema for ODF.\u00a0 As expected, the key words are those referring to XML markup concepts like &#8220;attribute&#8221; and &#8220;element&#8221;:<\/p>\n<p style=\"text-align: center;\"><a href=\"https:\/\/2d823b65bb.nxcli.io\/blog\/wp-content\/uploads\/2010\/07\/part1.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter\" src=\"https:\/\/2d823b65bb.nxcli.io\/blog\/wp-content\/uploads\/2010\/07\/part1-300x259.png\" alt=\"\" width=\"300\" height=\"259\" \/><\/a><\/p>\n<p>Part 2 is OpenFormula, the spreadsheet formula expression language.\u00a0 No XML in this part.\u00a0 In fact, this looks more like what I&#8217;d expect from an excerpt of a programming language specification, which is pretty much what OpenFormula is.<\/p>\n<p style=\"text-align: center;\"><a href=\"https:\/\/2d823b65bb.nxcli.io\/blog\/wp-content\/uploads\/2010\/07\/part2.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter\" src=\"https:\/\/2d823b65bb.nxcli.io\/blog\/wp-content\/uploads\/2010\/07\/part2-300x132.png\" alt=\"\" width=\"300\" height=\"132\" \/><\/a><\/p>\n<p>And Part 3 is the packaging specification.<\/p>\n<p style=\"text-align: center;\"><a href=\"https:\/\/2d823b65bb.nxcli.io\/blog\/wp-content\/uploads\/2010\/07\/part3.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter\" 
src=\"https:\/\/2d823b65bb.nxcli.io\/blog\/wp-content\/uploads\/2010\/07\/part3-300x145.png\" alt=\"\" width=\"300\" height=\"145\" \/><\/a><\/p>\n<p>In the end, text extraction is just the data preparation step.\u00a0 The real fun happens afterward,\u00a0 with the analysis and visualization techniques that can be applied to the text once extracted.<\/p>\n<p>If anyone is interested in trying out the text extraction module, please let me know.\u00a0\u00a0 We&#8217;re aiming for a release of ODFDOM 0.9 toward the end of August, but I can probably get you a preview, if you are interested in testing.\u00a0\u00a0 And let me know if you have any brilliant ideas for what to do with the extracted text.\u00a0 I&#8217;m always looking for good demo material.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I&#8217;ve been playing around today with a preview build of the ODF Java API ODFDOM 0.9.\u00a0\u00a0 One of the capabilities we&#8217;re adding is a simple text extraction API. The idea is to have a very simple API, a single function call in fact, that will allow you to extract the plain text from an ODF 
[&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_genesis_hide_title":false,"_genesis_hide_breadcrumbs":false,"_genesis_hide_singular_image":false,"_genesis_hide_footer_widgets":false,"_genesis_custom_body_class":"","_genesis_custom_post_class":"","_genesis_layout":"","footnotes":""},"categories":[9],"tags":[],"class_list":{"0":"post-1124","1":"post","2":"type-post","3":"status-publish","4":"format-standard","6":"category-odf","7":"entry"},"_links":{"self":[{"href":"https:\/\/www.robweir.com\/blog\/wp-json\/wp\/v2\/posts\/1124","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.robweir.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.robweir.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.robweir.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.robweir.com\/blog\/wp-json\/wp\/v2\/comments?post=1124"}],"version-history":[{"count":18,"href":"https:\/\/www.robweir.com\/blog\/wp-json\/wp\/v2\/posts\/1124\/revisions"}],"predecessor-version":[{"id":1135,"href":"https:\/\/www.robweir.com\/blog\/wp-json\/wp\/v2\/posts\/1124\/revisions\/1135"}],"wp:attachment":[{"href":"https:\/\/www.robweir.com\/blog\/wp-json\/wp\/v2\/media?parent=1124"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.robweir.com\/blog\/wp-json\/wp\/v2\/categories?post=1124"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.robweir.com\/blog\/wp-json\/wp\/v2\/tags?post=1124"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}