I'm still re-reading the "Fourth Paradigm" book by Microsoft Research, and one section continues to intrigue me. There's a part where the book explains database design and argues that the most important thing when you're designing large data sets is to find out the "Top Twenty Questions" the database has to answer. The quote is this:
"Most selections involving human choices follow a 'long tail,' or so-called 1/f distribution, it is clear that the relative information in the queries ranked by importance is logarithmic, so the gain realized by going from approximately 20 (2^4.5) to 100 (2^6.5) is quite modest."
I find this fascinating - it just doesn't seem to make "common" sense. Surely you have to ask a lot more questions than that to "get" the shape of the data? I researched the mathematical concept the author is describing (http://www.scholarpedia.org/article/1/f_noise), and I'll try some experiments here. I'll let you know what I uncover!
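As a first back-of-the-envelope check, here is a small Python sketch. It assumes the importance of the question at rank r is exactly 1/r, which is my own simplification of the "1/f distribution" in the quote and not something the book spells out. The function names harmonic and relative_gain are made up for this illustration. Under that assumption, the total importance covered by the top n questions is the harmonic sum 1 + 1/2 + ... + 1/n, which grows roughly like the logarithm of n.

```python
def harmonic(n):
    """Total weight of the top n questions, assuming rank r has weight 1/r."""
    return sum(1.0 / r for r in range(1, n + 1))


def relative_gain(small, large):
    """How much more total weight the top `large` questions cover than the top `small`."""
    return harmonic(large) / harmonic(small)


if __name__ == "__main__":
    # Assumed model: importance of a question is 1/rank (a simplification).
    for n in (20, 100):
        print(f"top {n:>3} questions cover weight {harmonic(n):.3f}")
    print(f"gain from 20 to 100 questions: {relative_gain(20, 100):.2f}x")
```

With these assumptions, the top 20 questions cover a weight of about 3.6 and the top 100 about 5.2, so asking five times as many questions buys only around 44% more coverage. That at least matches the spirit of the "quite modest" gain in the quote, though real query logs may not follow 1/r this neatly.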
Here's the link for the book if you want to read it: