THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web

John Paul Cook

NOSQL - Extracting keywords from PowerPoint using PowerShell

Yesterday I mentioned my desire to transform PowerPoint slides from just data to actual information. I've made good progress using PowerShell, but I need PowerShell help with a problem that I hope is of some general interest. Originally I considered using full-text search in SQL Server, but realized it wouldn't do what I wanted, thus the NOSQL approach. I need to extract the keywords from a PowerPoint presentation.

On the File menu in PowerPoint 2007/2010, Save & Send has a Create Handouts option, which lets you create a Word document. What you get isn't searchable text, so a different approach is needed. Creating an intermediate PDF file is an easy way to convert PowerPoint to text. Go to the File menu in PowerPoint and select Save As, specifying PDF as the file type. Adobe Reader lets you open a PDF file and save the contents to a text file. In other words, in far less time and trouble than it took you to read this paragraph, I had my PowerPoint slides (raw data) converted into a simple text file.

I do not actually know PowerShell. If you are a developer, you of course realize that knowing how to use a tool is orthogonal to using it. I opened a PowerShell window and entered the following:

gc "antiinfectiveDrugsLecture.txt" |% {$_.split(" ")}

It parsed the text file into one word per line. The output still needs to be sorted case-sensitively, stripped of duplicates, and piped to a new text file. If you know how to do this, please post below. A nice-to-have feature would be a case-insensitive count of the number of times each word occurs, or perhaps a double sort with frequency as the first sort key and alphabetical order as the second. This would help in the quest to extract information from data because it would provide a ranking of keywords. Keep in mind that the PowerPoint slides are lecture notes: knowing how many times the instructor used a keyword such as nephrotoxicity would indicate its relative importance to overwhelmed nursing students such as myself.
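
One caveat with splitting on a single space is that doubled spaces produce empty entries and punctuation stays glued to the words. A minimal sketch of a more forgiving split, assuming a made-up sample line in place of the real lecture file (the sample text and variable names are illustrative only):

```powershell
# Hypothetical sample line; real input would come from Get-Content on the lecture text file.
$line = 'Aminoglycosides  (e.g., gentamicin) cause nephrotoxicity.'

# Split on runs of whitespace, strip non-word characters from both ends
# of each token, then drop any tokens that end up empty.
$tokens = $line -split '\s+' |
    ForEach-Object { $_ -replace '^\W+|\W+$', '' } |
    Where-Object { $_ -ne '' }

$tokens
```

Whether stripping punctuation is desirable depends on the material; drug names containing hyphens or digits in the middle of the word are left intact by this pattern, since it only trims the ends.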

Published Friday, June 17, 2011 11:08 AM by John Paul Cook


Comments

 

Ben Thul said:

Pipe your output to sort-object -CaseSensitive -Unique.

June 17, 2011 12:17 PM
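
As a rough illustration of what the -CaseSensitive switch changes, here is a small sketch using a made-up word list (the words and variable names are assumptions, not taken from the lecture file):

```powershell
# Hypothetical word list with one case variant and one exact duplicate.
$words = @('Penicillin', 'penicillin', 'Penicillin', 'amoxicillin')

# -Unique on its own compares case-insensitively, so both spellings collapse into one entry.
$insensitive = @($words | Sort-Object -Unique)

# Adding -CaseSensitive keeps 'Penicillin' and 'penicillin' as separate entries.
$sensitive = @($words | Sort-Object -CaseSensitive -Unique)

"Case-insensitive unique count: $($insensitive.Count)"
"Case-sensitive unique count:   $($sensitive.Count)"
```

The relative order of the upper- and lowercase spellings in the case-sensitive result depends on the current culture's sort rules, so it is best not to rely on it.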
 

Ben Thul said:

As for the counts, a simple hash will do.  Here's some sample code:

$a = @{}
$b = @(1, 1, 3, 4, 5, 3, 8, 7)
foreach ($i in $b) {
  $a[$i]++
}
$a.GetEnumerator() | sort -property Value -descending

June 17, 2011 12:23 PM
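
The same hash idea can be applied to words instead of numbers. The sketch below assumes a made-up word list standing in for the output of the split step, and lowercases each word so that case variants share one counter:

```powershell
# Hypothetical word list; in practice this would be the output of the split step.
$words = @('nephrotoxicity', 'Nephrotoxicity', 'ototoxicity', 'dose', 'Dose', 'nephrotoxicity')

$counts = @{}
foreach ($w in $words) {
    # Lowercase the key so that case variants are counted together.
    $counts[$w.ToLowerInvariant()]++
}

# Highest count first; ties are broken alphabetically by word.
$counts.GetEnumerator() |
    Sort-Object -Property @{ Expression = 'Value'; Descending = $true },
                          @{ Expression = 'Key'; Descending = $false }
```

Passing two hashtables to -Property is what gives the double sort: frequency as the first key, the word itself as the second.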
 

John Paul Cook said:

Thanks for the help. Here's my second iteration at working with PowerPoint files:

gc "antiinfectiveDrugsLecture.txt" |% {$_.split(" ")} | sort-object -CaseSensitive -Unique | out-File keywords.txt

June 17, 2011 1:03 PM
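
A possible next iteration would add the keyword ranking to that pipeline. This is only a sketch: Get-KeywordRanking is a hypothetical helper name (not a built-in cmdlet), and it uses Group-Object rather than a hand-built hash to do the counting:

```powershell
# Get-KeywordRanking is a hypothetical helper name, not a built-in cmdlet.
function Get-KeywordRanking {
    param(
        [Parameter(Mandatory = $true)]
        [string]$Path
    )

    Get-Content -Path $Path |
        ForEach-Object { $_ -split '\s+' } |          # one word per pipeline item
        Where-Object { $_ -ne '' } |                  # drop empty tokens
        ForEach-Object { $_.ToLowerInvariant() } |    # count case-insensitively
        Group-Object -NoElement |                     # one group per distinct word
        Sort-Object -Property @{ Expression = 'Count'; Descending = $true },
                              @{ Expression = 'Name'; Descending = $false } |
        Select-Object -Property Count, Name
}

# Example usage (keywordCounts.txt is an assumed output file name):
# Get-KeywordRanking -Path 'antiinfectiveDrugsLecture.txt' | Out-File keywordCounts.txt
```

The result lists the most frequent words first, with words of equal frequency in alphabetical order.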
 

Eric Humphrey said:

John,

I've got a post on working with PowerPoint from PowerShell if you'd like to get the text straight out of PP without exporting.

http://www.erichumphrey.com/2011/02/powerpointshell/

June 17, 2011 3:39 PM


About John Paul Cook

John Paul Cook is both a Registered Nurse and a Microsoft SQL Server MVP experienced in Microsoft SQL Server and Oracle database application design, development, and implementation. He has spoken at many conferences including Microsoft TechEd and the SQL PASS Summit. He has worked in the oil and gas, financial, manufacturing, and healthcare industries. Experienced in systems integration and workflow analysis, John is passionate about combining his IT experience with his nursing background to solve difficult problems in healthcare. He sees opportunities in using business intelligence and Big Data to satisfy healthcare meaningful use requirements and improve patient outcomes. John graduated from Vanderbilt University with a Master of Science in Nursing Informatics and is an active member of the Sigma Theta Tau nursing honor society. He is a contributing author to SQL Server MVP Deep Dives and SQL Server MVP Deep Dives Volume 2.
