Jamie Thomson

This is the blog of Jamie Thomson, a data mangler in London working for Dunnhumby

Sorting then formatting du output on Linux

This is my first blog post in eons and, don’t laugh, it’s about using Bash on Linux (nothing to do with the recent announcement of Bash on Windows; I’ve been using Bash on and off for about a year now). As I’m now a Hadoop monkey, the Linux command line is where I spend a lot of my time, and today I discovered awk and the cool stuff one can do with it.

My challenge was to get the size of a bunch of HDFS folders within a given folder, sort the results, then format the output to be human-readable (i.e. use K, M, G or T depending on whether the size is best measured in KB, MB, etc.). Hadoop has a command to get the size of a bunch of folders and format the numbers to be human-readable:

hadoop fs -du -h /path/to/folder

but once the output is formatted it can’t be sorted numerically (a plain numeric sort ignores the K/M/G/T suffix), so that was no good. That’s when I discovered what awk can do for you. Rather than try to explain it I’ll just put this here:

$ hdfs dfs -du -s /foo/bar/*tobedeleted | sort -r -k 1 -g | awk '{ suffix="KMGT"; for(i=0; $1>1024 && i < length(suffix); i++) $1/=1024; print int($1) substr(suffix, i, 1), $3; }'
28T /foo/bar/card_dim_h_tobedeleted
20T /foo/bar/transaction_item_fct_tobedeleted
2T /foo/bar/card_dim_h_new_tobedeleted
2T /foo/bar/hshd_loyalty_seg_tobedeleted
1T /foo/bar/prod_dim_h_tobedeleted
607G /foo/bar/promo_item_fct_tobedeleted
456G /foo/bar/card_dim_c_tobedeleted
340G /foo/bar/ch_contact_offer_alc_fct_tobedeleted
203G /foo/bar/prod_dim_h_new_tobedeleted
184G /foo/bar/card_dim_h_test_tobedeleted
166G /foo/bar/offer_dim_h_tobedeleted
115G /foo/bar/promo_dim_h_tobedeleted
87G /foo/bar/offer_tier_dtl_h_tobedeleted
84G /foo/bar/ch_contact_offer_dlv_fct_tobedeleted
50G /foo/bar/ch_contact_event_dlv_fct_tobedeleted

All sorted in descending order and nicely formatted to be human-readable. Cool stuff. I’m mainly putting this here so I can find it later when I need it but thought it might be interesting for others also.
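
For anyone who wants to see what the awk part is doing, here is the same logic spread over several lines. Treat it as a sketch rather than anything definitive: sample_du and humanize are names I made up, the sizes and /tmp/example paths are invented, and I'm assuming the three-column layout (size in bytes, disk space consumed, path) that makes $3 the path.

```shell
# Hypothetical stand-in for `hdfs dfs -du -s` output so this runs without a cluster.
# Assumed columns: size in bytes, disk space consumed, path (all values made up).
sample_du() {
  printf '%s\n' \
    '2048 6144 /tmp/example/b' \
    '5368709120 16106127360 /tmp/example/a' \
    '512 1536 /tmp/example/c'
}

# Same logic as the one-liner, one step per line.
humanize() {
  awk '{
    suffix = "KMGT"
    # Divide by 1024 until the value fits, counting the divisions in i.
    for (i = 0; $1 > 1024 && i < length(suffix); i++)
      $1 /= 1024
    # i indexes the suffix string: 1 is K, 2 is M, 3 is G, 4 is T.
    # With no division (i == 0) gawk and mawk return an empty suffix.
    print int($1) substr(suffix, i, 1), $3
  }'
}

# Sort on the raw byte counts first, then format.
sample_du | sort -r -k 1 -g | humanize
```

The order matters: the sort runs while the first column is still a plain number, and only then does awk swap it for the truncated, suffixed value. With gawk or mawk the sample should come out as 5G, 2K and 512, largest first.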


P.S. Yes, I deleted about 55TB of data today :)

Published Tuesday, April 19, 2016 11:38 PM by jamiet

awkmonger said:

Nice to see youngsters getting to grips with old technology ;)

May 3, 2016 9:09 AM

jamiet said:

First time for everything. And believe me, I can only dream of being a youngster :)

May 3, 2016 9:19 AM

am said:

Thanks for the post! Was helpful for my situation. I've also found the following helpful too:

du -sh ./* | sort -rh

Not sure if it will work in your environment though.

June 28, 2016 9:37 PM

