Yesterday I mentioned my desire to transform PowerPoint slides from just data to actual information. I've made good progress using PowerShell, but I need PowerShell help with a problem that I hope is of some general interest: I need to extract the keywords from a PowerPoint presentation. Originally I considered using full-text search in SQL Server, but realized it wouldn't do what I wanted, thus the NoSQL approach.
On the File menu in PowerPoint 2007/2010, the Save & Send page has a Create Handouts option which allows you to create a Word document. What you get isn't searchable text, which means a different approach is needed. Creating an intermediate PDF file is an easy way to convert PowerPoint to text. Go to the File menu in PowerPoint and select Save As, specifying PDF as the file type. Adobe Reader lets you open a PDF file and save the contents to a text file. In other words, in far less time and trouble than it took you to read this paragraph, I had my PowerPoint slides (raw data) converted into a simple text file.
I do not actually know PowerShell. If you are a developer, you of course realize that knowing how to use a tool is orthogonal to using it. I opened a PowerShell window and entered the following:
gc "antiinfectiveDrugsLecture.txt" |% {$_.split(" ")}
It parsed the text file into one word per line. The output still needs to be sorted case-sensitively, stripped of duplicates, and piped to a new text file. If you know how to do this, please post below. A nice-to-have feature would be a case-insensitive count of the number of times each word occurs, or perhaps a double sort with frequency as the first sort key and the alphabetic order as the second. This would help in the quest for extracting information from data because it would provide a ranking of keywords. Keep in mind that the PowerPoint slides are lecture notes, so knowing how many times the instructor used a keyword such as nephrotoxicity would indicate its relative importance to overwhelmed nursing students such as myself.
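For anyone who wants a starting point, here is a rough sketch of how the sort, de-duplication, and frequency ranking described above might be done with built-in cmdlets. The function names (Get-Words, Get-UniqueWords, Get-WordCounts) and the output file names keywords.txt and keywordCounts.txt are made up for illustration; only the input file name comes from the command above. It splits on any whitespace rather than a single space and does not strip punctuation, so treat it as one possible approach rather than a finished answer.

```powershell
# Split lines into words on any run of whitespace and drop empty entries.
function Get-Words {
    param([string[]]$Lines)
    $Lines | ForEach-Object { $_ -split '\s+' } | Where-Object { $_ -ne '' }
}

# Case-sensitive sort, then remove adjacent duplicates.
# Get-Unique compares case-sensitively by default, so "Drug" and "drug" both survive.
function Get-UniqueWords {
    param([string[]]$Words)
    $Words | Sort-Object -CaseSensitive | Get-Unique
}

# Case-insensitive count per word (Group-Object ignores case by default),
# sorted by frequency descending, then alphabetically.
function Get-WordCounts {
    param([string[]]$Words)
    $Words | Group-Object |
        Sort-Object -Property @{Expression = 'Count'; Descending = $true},
                              @{Expression = 'Name'; Descending = $false} |
        Select-Object Count, Name
}

# File names other than the input file are assumptions for this example.
$inputFile = "antiinfectiveDrugsLecture.txt"
if (Test-Path $inputFile) {
    $words = Get-Words (Get-Content $inputFile)
    Get-UniqueWords $words | Set-Content "keywords.txt"
    Get-WordCounts $words |
        ForEach-Object { "{0}`t{1}" -f $_.Count, $_.Name } |
        Set-Content "keywordCounts.txt"
}
```

With this layout, keywordCounts.txt would hold one tab-separated count and word per line, most frequent first, so a term like nephrotoxicity rises toward the top if the instructor repeats it often. Each group is labeled with the casing of the first occurrence of the word.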