Friday, April 25, 2014

Counting unique file extensions recursively

Here's a quick example of the Linux command line tools working together to do something more complex.

Question:
I wonder what types of files are in this directory structure and how many of each type there are?

Answer:
A simple way to approach that would be to count the unique file extensions.

Let's start with getting a list of files.
eric@eric-Precision-M6500:~/workspace/temp$ find . -type f  
./page/content/ji.png
./page/content/index.html
./page/content/conf.png
./page/deploy
...


Ok... that worked. We can count them.
eric@eric-Precision-M6500:~/workspace/temp$ find . -type f | wc -l
67


But we're interested in just the file extensions, so let's get those. This regex matches a period followed by any number of non-period non-slash characters until the end of line and prints whatever is in the parentheses.
eric@eric-Precision-M6500:~/workspace/temp$ find . -type f | perl -ne 'print "$1\n" if m/\.([^.\/]+)$/'
png
html
png
sh
...


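We can do almost the same thing with sed (thanks Chris!). One difference: the Perl version skips files that have no extension, while this sed expression prints the remainder of the path for them (./page/deploy becomes /page/deploy).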
eric@glamdring:~/Pictures$ find . -type f | sed -e 's/^.*\.//'
jpg
jpg
jpg
eric@glamdring:~/Pictures$


We want to count the unique extensions, but the uniq command only collapses duplicate lines that are next to each other in the input stream, so we have to sort the extensions first.
eric@eric-Precision-M6500:~/workspace/temp$ find . -type f | perl -ne 'print "$1\n" if m/\.([^.\/]+)$/' | sort | uniq -c
      1 html
      1 md
      2 png
      9 sample
      1 sh
      2 txt
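
To see why the sort matters, here is a tiny illustration on made-up input (three lines, with "png" appearing twice but not on adjacent lines):

```shell
# Without sort, the two "png" lines are not adjacent,
# so uniq reports three separate lines, each with a count of 1.
printf 'png\nhtml\npng\n' | uniq -c

# Sorting first puts the duplicates next to each other,
# so uniq reports two lines: html once and png twice.
printf 'png\nhtml\npng\n' | sort | uniq -c
```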



Let's put some commas in the output so we can import it into calc as CSV and make a pie chart.
eric@eric-Precision-M6500:~/workspace/temp$ find . -type f | perl -ne 'print "$1\n" if m/\.([^.\/]+)$/' | sort | uniq -c | awk '{print $2 "," $1}'
html,1
md,1
png,2
sample,9
sh,1
txt,2


That looks good. We'll redirect the output to a file and count unique extensions in a more interesting directory: all of the DARPA XDATA open source code on this computer.
eric@eric-Precision-M6500:~/workspace/xdata/code$ find . -type f | perl -ne 'print "$1\n" if m/\.([^.\/]+)$/' | sort | uniq -c | awk '{print $2 "," $1}' > extensions.csv
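
If you run this often, the pipeline can be wrapped in a shell function. This is only a sketch: the name count_extensions is made up for illustration, and it swaps the Perl one-liner for a sed expression that should behave the same way (files without an extension are skipped). Extensions containing spaces would confuse the final awk step.

```shell
# count_extensions [DIR]: print "extension,count" lines for files under DIR
# (defaults to the current directory). The function name is hypothetical.
count_extensions() {
    find "${1:-.}" -type f \
        | sed -n 's/.*\.\([^./][^./]*\)$/\1/p' \
        | sort \
        | uniq -c \
        | awk '{print $2 "," $1}'
}
```

Usage would look like: count_extensions ~/workspace/xdata/code > extensions.csv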





Extensions are a naive way to determine file type. The 'file' command is a little smarter: it combines several tests (filesystem, magic, and language) to determine file type. Here's an example of how to use it for our experiment.

eric@glamdring:~/workspace/randomproject$ find . -type f | awk 'system("file " $1)' | sed -e 's/^.*\://' | sort | uniq -c
    282  ASCII text
     18  ASCII text, with very long lines
      6  Bourne-Again shell script, ASCII text executable
     14  data
      2  empty
     45  GIF image data, version 89a, 16 x 10
      1  Git index, version 2, 606 entries
      1  Git index, version 2, 8 entries
      1  HTML document, ASCII text
     15  HTML document, Non-ISO extended-ASCII text
...


Here are some iterative improvements.

An awk-free version using the -exec option of find. 
find . -type f -exec file {} \; | sed 's/.*: //' | sort | uniq -c

A sed-free version of the awk-free version using the brief output mode (-b) of file, which leaves out the filename:
find . -type f -exec file -b {} \; | sort | uniq -c
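
One more possible improvement, offered as a suggestion rather than a benchmarked result: ending -exec with + instead of \; lets find hand many paths to each invocation of file, rather than starting one process per file. The function name count_file_types and the trailing sort -rn (largest counts first) are additions for illustration.

```shell
# count_file_types [DIR]: tally files under DIR by the description
# that file(1) reports. The function name is hypothetical.
count_file_types() {
    # "{} +" batches many paths into each call to file;
    # -b prints only the description, without the path.
    find "${1:-.}" -type f -exec file -b {} + \
        | sort \
        | uniq -c \
        | sort -rn
}
```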

If you have an idea for how to make this shorter or more effective, send me a note on social media and I'll include it.










Vagrant boilerplate

Not so long ago, integrating software meant doing configuration management at the virtual machine level as integration took place: taking snapshots and reverting if there was trouble. What you're left with is a clunky ~4 GB file that you have to push around and convert to your eventual deployment environment. This is inefficient and risky. There's all the cruft that tags along as the builds take place. There are the complexities of image conversion. Painful workarounds are sometimes needed to move it between environments (VMware, deleting network interfaces). Hostname problems (Oracle, thanks for that). Security implementation guidelines also change at a moment's notice or differ between environments. And what if you want to deploy it to bare metal after all?
Thankfully, some better tools are available. If you haven't figured out how to use Vagrant yet, go learn it right now. If you know what a VM is and can use the Linux command line, it will probably take you all of 15 minutes to master it.

Here's the website: http://www.vagrantup.com/

Basically, Vagrant has one configuration file where you specify a baseline OS image (a .box file) and some other parameters about how you want it to behave. I say basically because this file also kicks off and points to other config files that deploy your software. To start the whole process, type 'vagrant up'.

Log in to the resultant system ('vagrant ssh'), then test, develop, and capture the changes you care about in the deployment files. When you mess it up, instead of reverting, wipe the whole thing out of existence with another command: 'vagrant destroy'. Lather, rinse, repeat. Within a short amount of time, what you're left with are some finely tuned deployment scripts that can take any baseline OS image and get it to where you want with one command. Vagrant allows you to spin up multiple VMs at the same time, so you can emulate and test more complex environments in this manner.


Although Vagrant supports Chef and Puppet, my preference has been to use bash scripts to deploy software. Bash scripting is accessible to most people. When collaborating with groups of people from different organizations, it serves as a common language. Recently, I've taken my boilerplate Vagrant configurations and put them online.

It's best to separate parts of the deployment process. Don't write the commands that secure the system in the same file as the commands that deploy software components or data. Abstract it all. Then, when your deployment environment changes you only have to modify or switch out that one file. You can capture the security requirements for Site A and keep them separate from Site B. Want to deploy to a Site C? Build it out and you're only one command away from testing if everything works. If a security auditor asks how you configured your system, then send them the deployment file. If they have a critique (and they're good) they can make the recommended changes and send them back to you where you can just test them by running 'vagrant up' again.
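
As a rough sketch of that separation (not the layout of my boilerplate; the function name provision, the file names security-SITE.sh and deploy.sh, and the provision directory are all hypothetical), a small driver could pick the site-specific security script and then run the shared deployment script:

```shell
# provision SITE [DIR]: run the security script for SITE, then the shared
# deployment script. All file and directory names here are hypothetical.
provision() {
    local site=$1
    local dir=${2:-./provision}
    local security="$dir/security-$site.sh"

    # Refuse to deploy if nobody has written the rules for this site yet.
    if [ ! -f "$security" ]; then
        echo "no security script for site '$site'" >&2
        return 1
    fi

    # Only deploy if the security step succeeded.
    bash "$security" && bash "$dir/deploy.sh"
}
```

Switching from Site A to Site B then means calling provision with a different name, and the security script for each site is a single file you can hand to an auditor.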

My boilerplate includes simple snippets for how to create users, push around files, append to files, wget things to where you want them, and other useful things that people forget when doing things from scratch (like writing the .gitignore). It should be easy to go through and modify it to do what you want it to do.
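
To give a flavor of that kind of snippet, here is a minimal sketch of appending to a file. It is written for this post rather than copied from the boilerplate, and the name append_once is made up. It only adds the line if an identical line is not already there, which helps if a deployment script ever runs more than once on the same machine.

```shell
# append_once LINE FILE: append LINE to FILE unless FILE already contains
# an identical line. The function name is hypothetical.
append_once() {
    # -x matches whole lines only, -F treats LINE as a fixed string,
    # -q suppresses output; a missing FILE simply counts as "not found".
    grep -qxF -- "$1" "$2" 2>/dev/null || printf '%s\n' "$1" >> "$2"
}
```

For example, append_once 'export EDITOR=vim' ~/.bashrc can be called repeatedly without piling up duplicate lines.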

Here's a link:
https://github.com/darpa-xdata/vagrant-vm-boilerplate

It uses Ubuntu as the base OS. Here's a list of other base OSes you can use without having to roll your own; just modify the Vagrantfile.
http://www.vagrantbox.es/



Linux system administration

It's done just like this.


Monday, April 21, 2014

Weed superior firepower

The weather is warm and my in-laws are in town and spending time with the kids. This gave me a little time to take care of chores like de-weeding my front yard. The mulched area around the shrubs in front of my house has been invaded by weeds this spring.

Two minutes into picking them out with a small shovel on my hands and knees I got up and walked into the garage to design and build a better tool. ...it's funny that this is exactly how software engineering and data-research works. If you have a boring or repetitive task you should find or design and build a tool to make it easier. "Necessity" isn't the real mother of invention; it's laziness. Or, more aptly described, it's a natural outgrowth of intellectual curiosity. But I digress. We're talking about the scrap metal I melted together in my garage so I could dig in my yard, not metapsychology.

Here are the parts I used, and the end result. On the left is a 1.5" steel square tube, some 3/16" angle iron, and a short piece of reinforcing bar (rebar). The handle was cut from a piece of pine 2x4. The unstained handle was a quick prototype. I did a test-fit and then finished the welds while the head was on this handle (hence the burn marks and destruction on it). It caught fire several times while I was welding and grinding. I shouldn't have enjoyed that as much as I did. Every time I see something burning I think about this amazing Feynman description of fire. "The light and the heat coming out, that's the light and heat of the sun that went in!"


This square handle was easy to make and fit together. Everything was cut using just the table-saw fence, then finished off with a chisel to get to the areas that the circular blade couldn't reach. Slice it down the center and slam home a wedge, and this ended up being a solid way to put everything together. I think it's way better and more classy than using bolts to attach the handle. Retro. 50,000 years retro.

I wanted something small that I could stomp on with all of my body weight to sever or dig out stubborn roots if I had to. Here's a short video of how it works. I filmed this with my phone in one hand and the tool in the other; with the benefit of both hands the tool is really efficient.



Here's me lining up on a weed:
After a successful extraction, roots and all:

The welding on the tool is a little excessive. Part of the reason for undertaking this project was to teach my father-in-law how to weld, so we kept taking passes on the metal. It's interesting to see how others interpret verbal instruction or how they pick up new tools and methods. On his initial pass he waved it back and forth like a paint spray gun; I didn't expect him to do that. On the second try, I made sure to explain that you have to hold in the same spot until the metal starts to melt and then slowly lay a bead of weld. I didn't bother grinding things down very much afterward; I just spray painted it and hung it on the handle. That's the fun part of building garden tools: they don't have to be pretty.

A note about the steel. This is mild steel, and that's ok. I had some high carbon steel that would have been harder, but I'd prefer a digging tool to turn an edge rather than chip. I know this tool will be slammed onto rocks and concrete, but any damage done can easily be pounded (preferred method for those of us that own good anvils and hammers) or filed out (last resort) before resharpening with a file.

This build was inspired by Wranglerstar. If you like this kind of stuff check out his videos.