Friday, April 25, 2014

Counting unique file extensions recursively

Here's a quick example of Linux command line tools working together to do something a little more complex.

Question:
I wonder what types of files are in this directory structure and how many there are of each type?

Answer:
A simple way to approach that would be to count the unique file extensions.

Let's start with getting a list of files.
eric@eric-Precision-M6500:~/workspace/temp$ find . -type f  
./page/content/ji.png
./page/content/index.html
./page/content/conf.png
./page/deploy
...


Ok... that worked. We can count them.
eric@eric-Precision-M6500:~/workspace/temp$ find . -type f | wc -l
67


But we're interested in just the file extensions, so let's extract those. This regex matches a period followed by one or more non-period, non-slash characters at the end of the line, and prints whatever the parentheses captured.
eric@eric-Precision-M6500:~/workspace/temp$ find . -type f | perl -ne 'print $1 if m/\.([^.\/]+)$/'
png
html
png
sh
...
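

A quick aside on the regex, using a few made-up file names: a file with no extension produces nothing, a dotfile like .gitignore is reported as its own "extension", and only the last part of a double extension like .tar.gz is counted.
echo -e "./notes\n./.gitignore\n./archive.tar.gz" | perl -ne 'print $1 if m/\.([^.\/]+)$/'
gitignore
gz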


We can do nearly the same thing with sed (thanks, Chris!), with one caveat noted below.
eric@glamdring:~/Pictures$ find . -type f | sed -e 's/^.*\.//'
jpg
jpg
jpg
eric@glamdring:~/Pictures$
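

One difference from the Perl version: if a file has no extension at all, this sed command prints the whole path instead of skipping it. A stricter variant (a sketch I haven't run against this tree) only prints when a substitution actually happens:
find . -type f | sed -n 's|^.*\.\([^./]\{1,\}\)$|\1|p'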


We want to count the unique extensions, but uniq only collapses duplicate lines that are adjacent in its input, so we have to sort the extensions first.
eric@eric-Precision-M6500:~/workspace/temp$ find . -type f | perl -ne 'print $1 if m/\.([^.\/]+)$/' | sort | uniq -c 
      1 html
      1 md
      2 png
      9 sample
      1 sh
      2 txt
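

If you'd rather see the most common extensions first, a numeric reverse sort on the counts is a small optional tweak.
find . -type f | perl -ne 'print $1 if m/\.([^.\/]+)$/' | sort | uniq -c | sort -rn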



Let's put some commas in the output so we can import it into Calc as a CSV and make a pie chart.
eric@eric-Precision-M6500:~/workspace/temp$ find . -type f | perl -ne 'print $1 if m/\.([^.\/]+)$/' | sort | uniq -c | awk '{print $2 "," $1}'
html,1
md,1
png,2
sample,9
sh,1
txt,2
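

If the spreadsheet import wants a header row, awk's BEGIN block can add one (the "extension,count" label here is just a made-up header name).
find . -type f | perl -ne 'print $1 if m/\.([^.\/]+)$/' | sort | uniq -c | awk 'BEGIN {print "extension,count"} {print $2 "," $1}'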


That looks good. We'll redirect the output to a file and count the unique extensions in a more interesting directory: all of the DARPA XDATA open source code on this computer.
eric@eric-Precision-M6500:~/workspace/xdata/code$ find . -type f | perl -ne 'print $1 if m/\.([^.\/]+)$/' | sort | uniq -c | awk '{print $2 "," $1}' > extensions.csv


Extensions are a naive way to determine file type. The 'file' command is a little smarter: it combines several tests (filesystem, magic, and language) to determine a file's type. Here's an example of how to use it for our experiment.

eric@glamdring:~/workspace/randomproject$ find . -type f | awk 'system("file " $1)' | sed -e 's/^.*\://' | sort | uniq -c
    282  ASCII text
     18  ASCII text, with very long lines
      6  Bourne-Again shell script, ASCII text executable
     14  data
      2  empty
     45  GIF image data, version 89a, 16 x 10
      1  Git index, version 2, 606 entries
      1  Git index, version 2, 8 entries
      1  HTML document, ASCII text
     15  HTML document, Non-ISO extended-ASCII text
...


Here are some iterative improvements; the awk trick above breaks on paths with spaces, for one thing.

An awk-free version using the -exec option of find. 
find . -type f -exec file {} \; | sed 's/.*: //' | sort | uniq -c

A sed-free version of the awk-free version, using the non-verbose (-b) output from file:
find . -type f -exec file -b {} \; | sort | uniq -c
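
Two more variations that I haven't run on this particular tree: find's "+" terminator hands file a batch of names at once, so it runs far fewer times.
find . -type f -exec file -b {} + | sort | uniq -c

And file's --mime-type option collapses the many "ASCII text ..." variants into coarser buckets like text/plain.
find . -type f -exec file -b --mime-type {} + | sort | uniq -c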

If you have an idea for how to make this shorter or more effective, send me a note on social media and I'll include it.