Monday, July 28, 2014

Creating users in a deployment script

The simple and lazy thing to do when creating a user in a deployment script is to throw a plain-text password in the script. Avoid this temptation.

Here's a better way, which generates a random password and stores it in /root/ of the provisioned machine in case you need it. The major problem this avoids is that the script can now be safely made public or stored on github without risk of exposing credentials.

# create user
sudo apt-get -y install makepasswd
# generate a random ten-character password (head -c reads a fixed number
# of bytes; head -n 1 on /dev/urandom truncates at the first newline byte
# and can produce an empty password)
PASSWORD=$(head -c 32 /dev/urandom | base64 | fold -w 10 | head -n 1)
# stash a copy in /root in case it's needed later
echo "$PASSWORD" | sudo tee /root/tangelo_password.txt
# hash the password so the clear text never lands in /etc/shadow
passhash=$(sudo makepasswd --clearfrom=/root/tangelo_password.txt --crypt-md5 | awk '{print $2}')
sudo useradd -m -s /bin/bash -p "$passhash" theusername
# end create user
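
If makepasswd isn't available, here's a rough equivalent sketched with openssl instead. Treat it as a sketch, not a drop-in: the -6 (SHA-512 crypt) flag only exists in newer openssl releases, so fall back to -1 (MD5 crypt, same strength as above) if yours lacks it.

# alternative sketch using openssl instead of makepasswd;
# -6 needs a newer openssl, use -1 if yours doesn't have it
PASSWORD=$(openssl rand -base64 12)
echo "$PASSWORD" | sudo tee /root/tangelo_password.txt
passhash=$(openssl passwd -6 "$PASSWORD")
sudo useradd -m -s /bin/bash -p "$passhash" theusername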

Saturday, July 19, 2014

Floating shelves

This weekend's project was floating shelves. My wife found inspiration for these on Pinterest. I made two sets of three, one set for the living room and one set for our bedroom. Here are pictures of the end results.




The construction was fairly straightforward table saw work. The edges were cut from normal pine 2x4s you can get at the big box stores for $2 each; I used two of them for this project. The top and bottom were made from 1/4 inch sanded plywood. I used a 4x4 ft sheet of plywood for all of these shelves and had some left over. The shelves vary in length but are all 6 inches deep from wall to edge.

I cut the 2x4s roughly to length on the table saw, then used my jointer on two sides of each piece to make them straight, then ripped them to size on the table saw using the jointed edges as reference. Here's a video from somebody else (The Wood Whisperer) on how and why to do that.

Before cutting the corners to fit together, I cut the rabbets for the plywood with two cuts on the table saw. The rabbets are the grooves that allow the plywood to sit flush with the top of the edge board.

I then mitered the corners to fit together at 45 degrees, cut the plywood to fit each shelf, and glued everything together. The plywood was glued everywhere it touched the edges, and the mitered corners were glued as well. Instead of clamps I used a pneumatic brad nailer and 5/8 inch 18 gauge brads on the top and bottom. I did not shoot any nails into the corners because I wanted to do the final bevel and routing after assembly.

After the glue set (an hour or so), I cut the edge bevel and ran the router along the bottom edge. The brad holes and any other gaps were filled with wood putty and then sanded. If I were going to stain these instead of painting them, I probably would have used clamps instead of brad nails to avoid the holes and putty. Here are pictures of how the plywood sits in the rabbets.





I used a micrometer to measure for brackets to hold the shelves to the wall. The brackets are just chunks of wood that fit exactly inside the shelves from top to bottom. I cut them with some slack left to right to make assembly easier. Once the shelves are installed, the brackets are completely hidden. Here is a picture of a bracket fitting inside a shelf.


I took the brackets, a pencil, and a stud finder into the house where the shelves would hang. I marked the stud locations on the brackets, then pre-drilled and countersunk holes in them. Since the brackets are hidden, I was free to use any size screw or bolt to secure them to the wall. Three-inch deck screws into the wall studs worked fine for me. Drywall expansion fasteners would work great too.




When I screwed the brackets to the wall, I made sure they were level.



Then it was just a matter of sliding each painted shelf onto its bracket, pre-drilling a hole, and sinking a screw to hold it on.






With the brackets solidly mounted to the wall, I think these shelves will break apart before they ever droop or fall down. I hate droopy floating shelves and like this system more than the floating shelves you can buy. If I really want a clean look, I might paint the screws, but for now they'll be hidden behind picture frames. The overall cost was less than $5 per shelf, and all six took only about three hours of total construction time.




Monday, July 7, 2014

curation is not preparation

I was on an email thread where somebody mentioned spending 80% of their time on data curation. I wrote this in response.

I think we're drifting into a semantic error: the original thread below mixes up data curation and data preparation. Data prep does take a lot of time, but curation is a different thing, so I'll try to describe what I mean. Curation is an act of protecting integrity for later use, analysis, or reproducibility, the way a museum curates. The tools for data preparation are awesome and very useful, but they go beyond what is just a curation activity, which I'd describe as something more fundamental.


data curation != data preparation

Provenance is always paramount when making decisions off of data. That's why trustworthy, persistent storage of the raw data is most important. Any transform should be not only documented but reproducible. The canonical elements of a curation system would be the raw data storage (in its original formats, as obtained), the transform code, and documentation.

Enabling discovery (locating, standardized access protocols, etc.) starts getting into something beyond curation. The connotation of curation implies preserving the scientific integrity of the data, like dinosaur bones or artifacts being curated in a museum. Some of them are on display (available via "standardized protocols"), but the rest are tucked away safely in a manner that doesn't taint later analysis of them. More often than not, the bones on display are actually semi-faithful reproductions of the artifacts rather than the originals. Same thing with data. The graph visualization (or whatever) of the data might not technically be the same data (different format, projection, normalization, transforms, indexes, geocoding, etc.), but it's a faithful reproduction that we put in the display case to educate others about the data. A fiberglass T-rex skull tells us a lot about the real thing, but it's not meant for nuanced scientific analysis.

All transforms of data, especially big data, carry an element of risk and loss of fidelity that would taint later analysis. We're all so bad at transforms that we avoid relying on them in cases where a life is at risk; processes like court proceedings and military intelligence analysis require citation of raw data sets. A geocoding API rarely assigns location with 100% accuracy (it's usually an exception when it does), sometimes when we normalize phone numbers there's an edge case the regular expressions don't account for, and things can go wrong in an unlimited number of ways (and have... I've caught citogenesis happening within military intelligence analysis several times). The only way to spot these problems later and preserve the integrity of the data is to store it in its most raw form.

If we wish to provide access for others to a projection they want to build off, the best way to do it is to share the raw data, the transform code, and a document showing the steps to get to the projection. In the email below this behavior of later analysts/scientists is noted (with disdain?). It shouldn't take long to look at previous transforms and reproduce the results; if it does, then those transforms weren't that reliable anyway. If those receiving the data just want to look at a plastic dinosaur skull to get an idea of its size and gape in wonder, then sharing data projections (raw data that has undergone a transform) is fine.
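
To make that hand-off concrete, here's a hypothetical sketch (the file names are invented): the raw dump travels untouched, and the recipient regenerates the projection from it rather than trusting a cached copy.

# hypothetical hand-off: raw data plus the exact transform that made the
# projection, so anyone downstream can reproduce (or audit) the result.
# strip phone numbers down to bare digits -- the kind of transform that
# looks trivial but hides edge cases, which is why the raw file travels too
tr -cd '0-9\n' < raw/phone_dump.txt > projection/phones_normalized.txt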

When providing curated data for research or analysis, I even make it a point to keep a "control copy" of the data in an inaccessible location. That way, if there is a significant finding, there is a reliable way to determine that it's not an artifact of the data becoming tainted by an inadvertent write operation.
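
A minimal sketch of that kind of integrity check, assuming a raw data store under /data/raw (the paths here are hypothetical): fingerprint everything at curation time, then verify the working copy against the manifest before trusting a finding.

# fingerprint every raw file at curation time
find /data/raw -type f -exec sha256sum {} + | sort -k 2 > /root/raw_manifest.txt
# later: confirm the working copy still matches the recorded fingerprints
sha256sum -c /root/raw_manifest.txt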

On the other end of the spectrum (which I see all the time) is the unmarked hard drive with random, duplicative junk dumped on it as an afterthought to whatever project the data originated from. Although "no" is never an answer when handed something like this, and useful analysis can certainly still be achieved, this is below the minimum. I imagine it's like being handed evidence for a murder case in a plastic shopping bag with the previous investigator's half-eaten burrito smeared all over it. You can sometimes make the case for a decision with it, but it's not easy, and it's a dirty job trying to make sense of the mess. This is probably closer to the norm when it comes to "sharing" curated data in government and industry. It's ugly.

The minimum set of things needed for useful data curation:
1. raw data store (just the facts, stored in a high integrity manner)
2. revision control for transform code (transform code, applications, etc...)
3. documentation (how to use transforms, why, provenance information)

Everything beyond this could certainly be useful (enabling easier transforms, discovery APIs, web services, file transfer protocol interfaces), but is beyond the minimum for curation. Without these three basic things, useful curation breaks.
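
As a sketch, a minimal store covering those three things (the names here are hypothetical) needs nothing fancier than a directory layout and a repository:

# a hypothetical minimal curation store: raw data, transforms, documentation
mkdir -p curated-dataset/raw curated-dataset/transforms curated-dataset/docs
cd curated-dataset && git init
# keep the raw store read-only once populated so nothing writes to it
chmod -R a-w raw/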