Stitch Fix ❤ UNIX

Simeon Willbanks
- Everywhere

I ❤ UNIX and using the command line; they help me solve problems at Stitch Fix. I’m not alone. Across the Data Science and Engineering teams, we’re constantly solving problems with UNIX and the command line.

Below, we’ve listed a few problems and their awesome command line solutions. If you know of a more efficient solution, please share in the comments. Otherwise, we’d love to hear how you or your team solve problems with UNIX and the command line.

Please enjoy!

Problems and Solutions

Files

Check Image Size

Dave Copeland (OS X / bash)

Problem

Our pickers in the warehouse were complaining that product images were too small (they need these images to help locate the right item for our customers). So, check all the images to see if they are the right size!

Solution

Created a script to run curl on the script’s argument to download the image, use ImageMagick to get its size and print that out in CSV. Piped my input CSV of images into xargs -n1 -P8 ./my_script.rb to basically run my script 8-way parallel to get the job done as fast as I could without setting my machine on fire.

Input
1,https://cdn.example.com/image_A.jpg
5,https://cdn.example.com/image_B.jpg
10,https://cdn.example.com/image_C.jpg
Command
cat my_input.csv | xargs -n1 -P8 ./check.rb > images_with_possible_issues.csv
  • xargs runs a command, feeding it STDIN as arguments to that command
    • echo "foo" | xargs ls is the same as "ls foo".
  • -n1 says to run the given command once for each line of input (normally xargs will run many lines at once, so if you had a file with 10 rows in it called "blah.csv" and do cat blah.csv | xargs curl, it would likely run curl once with all 10 rows of blah.csv given to curl, so -n1 runs it once per line of input).
  • -P8 says to parallelize it 8 ways.

So, I’ve got 8 instances of my script running at once. Obviously, there are diminishing returns on parallelism, but since curl‘ing images is mostly I/O bound this worked pretty well without compromising my machine.

Find Recently Changed Files

(OS X / bash)

Problem

Some file changed but I don’t know where. Sometimes this is “I don’t know where my web browser saved my file.” I want a list of the most recently changed file in a tree.

Solution

Find command as answered on Stack Overflow.

Input

cd to some directory

Command
# OS X find:
find . -type f -print0 | xargs -0 stat -f "%m %N" | sort -rn | head -1 | cut -f2- -d" "

# GNU find:
find . -type f -printf '%T@ %p\n' | sort -n | tail -1 | cut -f2- -d" "
  • The gnu find one does the heavy lifting inside the find command itself.
  • OS X find is a bit dumber so it just provides a list of files, gets modification times by calling stat on each one via xargs, then sort and clip.

Count Data Dump Number Of Lines

Deep Ganguli (OS X / bash)

Problem

A common question I have is: how many rows of data are in this file? The egregiously lazy method of obtaining an answer is to open a text editor, scroll to the bottom, and read off the last line number. This is inefficient.

Solution

Use wc.

Input

./foo.txt

i am
some lines
of data
four to be exact
Command
wc -l ./foo.txt
Output
4 ./foo.txt
  • The wc utility displays the number of lines, words, and bytes contained in each input file, or standard input.
  • The -l flag specifies that you want the number of (l)ines in the file!

Bulk Change Filenames

Greg Novak (OS X / bash)

Problem

I have a bunch of files and I want to change all of their names at once.

Solution

Use fnsed; a simple script which depends upon sed.

Input

Directory with files:

kitten-01.jpg
kitten-02.jpg
...
kitten-99.jpg
Command
fnsed s/kitten/stitchfix/ kitten*
Output

Now the directory contains:

stitchfix-01.jpg
stitchfix-02.jpg
...
stitchfix-99.jpg
Script
#!/bin/bash

if [ "$#" = "0" -o "$#" = "1" ]; then
    echo "Usage - fnsed <sed expression> <filename1> [filename2] ..."
    exit
fi

for oldfile in $* ; do
    # skip the first one b/c it's a sed expression
    if [ $oldfile != $1 ]; then
        newfile=`echo $oldfile | sed $1`
        if [ $oldfile != $newfile ]; then
	         mv $oldfile $newfile
        fi
    fi
done
  • fnsed is a shell script containing a simple loop over all the files

Quick File Copy To Remote Server

Greg Novak (OS X / bash)

Problem

I’m sick of typing:

scp -i ~/path/to/pem/file.pem some-file.txt ubuntu@ec2-11-22-33-44.amazon.com:/home/ubuntu/some/path

Or, worse, I want to scp something to a computer that I can’t reach directly (e.g. behind a firewall) so I have to do the copy in two steps. Yuck!

Solution

Put everything into the .ssh/config file:

Host shiny
  HostName ec2-11-22-33-44.amazon.com
  IdentityFile ~/path/to/pem/file.pem
  User ubuntu
Command

Now I can just type:

scp some-file shiny:path/to/dest

Yay!

Relative File Size Graph

Eli Bressert (OS X / zsh)

Problem

What are the relative sizes of files in a directory, in graphical form?

Solution

Use du and grep to get file sizes and generate a graph with spark.

Install

brew install spark

Command
du -k *.txt | grep -o '[0-9]*' | spark
Output
▂▁▁▂▂▂▂▃▅▄▄▃▄▃▃▃▃▄▄▅▅▄▅▅▅▄▅▄▄▆▄▆▅▄▅▅▅▅▅▅▅▅▅▆▅▅▆▆▅▅▄▄▅▅▅▅▆▅▆▅▆▅▅▅▅▆▅▆▆▅▆▆▆▅▅▅▅▆▅▆▆▅▆▆▆▅▅▇▅▇█▇▇

Directories

Directory File Summary

Greg Novak (OS X / bash)

Problem

Some directory contains a lot of files, and a lot of large files. For each directory, I want a summary of both the number of files and their sizes in human readable format (e.g. 37G instead of 37000000000).

Solution

find, du, sed, and wc commands in a bash loop.

Input

cd to some directory.

Command
for f in `find . -type d`;
  do bash -c "printf '%6s  %6s  %s' `du -s -h $f | sed s+./.*++g` `ls -l  $f | wc -l` $f";
  echo;
done
Output
  696K      13  ./sf/voodoo/voodoo/algorithm
  440K     111  ./sf/voodoo/voodoo/algorithm/config
  116K       4  ./sf/voodoo/voodoo/algorithm/features
   52K       4  ./sf/voodoo/voodoo/algorithm/predictors
  • find command gets list of all directories below the current one
  • The for loop loops over the directories
  • Inside the loop printf, sed, and wc massage output into the desired form

Sort Directories By Number Of Files

Greg Novak (OS X / bash)

Problem

Some directory contains a large number of files (they don’t take up a lot of disk space) and I want to find which one.

Solution

find command within a bash loop similar to the du command.

Input

cd to some directory

Command
find . -type d | while read -r dir;
  do printf "%d\t%s\n" `find "$dir" | wc -l` "$dir";
done | sort -n
Output
1	./IPython-notebook-extensions/.git/objects/info
1	./IPython-notebook-extensions/.git/refs/tags
...
1134	./sf/flinch
4273	./sf
5709	.
  • The first find gets all the directories below the current one
  • The while loop goes over each directory and finds all the files below it.
    • Not efficient, but I haven’t yet run into situations where it takes too long.
  • The final sort command puts output in a useful order.
    • The number of files is cumulative
    • In the above example, there are 4273 files in all directories below ./sf

Delete Directory With Large Number Of Files

Greg Novak (OS X / bash)

Problem

I have a directory containing a large number of files and I want to delete it, but rm * gives “Argument list too long” and refuses.

Solution

Use xargs command.

Input

cd to some directory

Command
# Replace ls with rm to delete
find . | xargs -n 100 ls
  • xargs will execute the given command on batches of 100 files at a time.
  • Note the replacement of the typical rm with ls

CSVs

Filter CSV File By Column Values

Simeon Willbanks (OS X / zsh)

Problem

We must filter a CSV file by specific column values.

Solution

awk the csv!

Input

./hours.csv

User ID,Hours Styling
1,15.90
2,17.43
3,15.01
4,18.20
5,15.55
6,16.33
Command
awk -F, '$2 ~ 15' ./hours.csv
Output
1,15.90
3,15.01
5,15.55
  • awk -F sets a “field separator”; for CSV files, this is a ‘,’
  • '$2 ~ 15' is the awk program
    • $2 is the second field which is “Hours Styling”
    • ~ is a regular expression operator, so we filter any lines with hours that match 15
Command
awk -F, '$2 == 15.01' ./hours.csv
Output
3,15.01
  • == is an equality operator, so we filter any lines with hours that match 15.01

CSV File Column Names And Indices

Nick Kridler (OS X, Linux, Unix / bash)

Problem

We don’t know a CSV file’s column names and indices. Once we know the CSV file column names and indices, we can easily extract data.

Solution

Use head to get the header and pipe it into awk.

Input

./file.csv

shirt,name,size,color,fit
1,blouse,L,Blue,Fitted
5,tank,M,Green,Loose
Command
head -n 1 ./file.csv | awk -F, '{for(i=1; i<=NF; i++) print i,$i}'
Output
1 shirt
2 name
3 size
4 color
5 fit
  • head -n 1 grabs the first line
  • awk -F sets a “field separator” and splits on commas
  • '{for(i=1; i<=NF; i++) print i,$i}' is the program which loops over the columns and prints the column index and name

Count The Occurrences Of A Value In A CSV File Column

Nick Kridler (OS X, Linux, Unix / bash)

Problem

We don’t know how many times a value appears in a CSV file column.

Solution

Now that we know how to find column indices, let’s count the occurrences of a value in a column using cut and uniq.

Input

./file.csv

shirt,name,size,color,fit
1,blouse,L,Blue,Fitted
5,tank,M,Green,Loose
7,blouse,S,White,Fitted
8,sweater,M,Brown,Loose
Command
tail -n +2  ./file.csv | cut -d, -f2 | sort | uniq -c
Output
2 blouse
1 sweater
1 tank
  • tail -n +2 gets all lines except the header
  • cut -d, -f2 grabs the 2nd column based on comma delimiters
  • sort sorts the column
  • uniq counts the distinct words

Systems

Display Resource Usage and Availability

Eli Bressert (OS X / zsh)

Problem

What is my system’s resource usage and availability?

Solution

Use htop; an interactive process viewer.

Install

brew install htop

Command
htop
Output
  1  [|||||||                                           10.5%]     Tasks: 187 total, 0 running
  2  [|                                                  0.6%]     Load average: 1.72 1.58 1.51
  3  [||||||||                                          13.2%]     Uptime: 2 days, 03:32:02
  4  [|                                                  0.7%]
  5  [||||                                               5.2%]
  6  [                                                   0.0%]
  7  [|||||                                              6.5%]
  8  [                                                   0.0%]
  Mem[|||||||||||||||||||||||||||||||||||        8369/16384MB]
  Swp[                                               0/1024MB]

  PID USER     PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
75871 simeon    31   0 2407M  2144     0 C  0.0  0.0  0:00.00 htop

Easily Manage Python Virtual Environments

Jeff Magnusson (OS X / bash)

Problem

Managing multiple Python virtual environments is tedious.

Solution

Install pyenv to your ~/.bashrc.

Script

PYTHON_VIRTUALENV_BASEPATH="$HOME/python/virtualenv"

function pyenv {
    if [ -z "$1" ]; then
        echo `ls $PYTHON_VIRTUALENV_BASEPATH`
    elif [ $1 == '--create' ]; then
        pushd $PYTHON_VIRTUALENV_BASEPATH; virtualenv --no-site-packages $2; popd;
    else
        source $PYTHON_VIRTUALENV_BASEPATH/$1/bin/activate;
    fi
}

pyenv is a light wrapper around Python’s virtualenv command. Executed with no arguments (pyenv), it returns a list of currently installed virtual environments.

Executed with a single argument, it attempts to activate the virtualenv passed as the argument (pyenv my_virtual_env).

Executed with --create flag, it creates the virtual environment passed in the second argument (pyenv --create my_virtual_env).

Productivity

Alias Everything

Eric Gravert (OS X, Linux / bash)

Problem

I hate typing.

Solution

I alias EVERYTHING.

Command
alias be="bundle exec"
alias brake="be rake"
alias bspec="be rspec"
alias clock='date "+DATE: %Y-%m-%d%nTIME: %r"'
Problem

Typing an alias is still too many keystrokes.

Solution

You can create custom key bindings in bash which can be used to execute commands and custom scripts, saving valuable keystrokes.

Command
bind -x '"\C-t"':clock
Input
^t
Output
DATE: 2015-03-06
TIME: 01:08:53 PM

Tab Completion

Eric Gravert (OS X, Linux / bash)

Problem

I still hate typing.

Solution

You can set the CDPATH variable to add tab completion to the cd bash command. For example, if you have a directory which holds your projects, you can add that directory to the CDPATH variable to get tab completion of the project directories from anywhere in your file system.

It is important to remember to add . to the beginning of the directory list or you will lose tab completion in the current directory.

Command

Export CDPATH in your bash rc file (~/.bash_profile on mac).

For exmple:

export CDPATH=.:$HOME/workspace:$GOPATH/src/github.com:$GOPATH/src/code.google.com/p
  • After sourcing your ~/.bash_profile, you will now be able to type cd at any time, and tab completion will include the matching subdirectories of each directory specified in CDPATH.

Command Completion Notification

Greg Novak (OS X / bash)

Problem

I have a long-running command, and I want to be notified with a pop-up dialog on screen when it finishes.

Solution

Use terminal-notifier to send User Notifications on Mac OS X from the command-line.

Install
brew install terminal-notifier
Command

./ding

#!/bin/bash

terminal-notifier -message "Done"
long-running-command && ding
  • terminal-notifier pops up messages via the OS X Notification Center

Bemuse Coworkers

John McDonnell (Darwin Kernel Version 14.1.0: Mon Dec 22 23:10:38 PST 2014; root:xnu-2782.10.72~2/RELEASE_X86_64 x86_64 / fish)

Problem

Colleague has left their laptop open and logged in, leaving their machine exposed to the whole world.

Solution

Cron is a common choice, but that’s a rookie move. It’s too obvious, and cron is so user friendly that it’s easy to find and fix.

Instead, use launchctl. They’ll never find it.

Command
cat <<END > $HOME/Library/LaunchAgents/com.system.critical.plist
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.system.critical</string>

  <key>ProgramArguments</key>
  <array>
    <string>osascript -e "set Volume 10"</string>
    <string>say "$USERNAME loves ponies"</string>
    <string>curl -o /tmp/pony.jpg http://www.adweek.com/files/adfreak/images/2/shetland-ponies-cardigans-2.jpg; </string>
    <string>open /tmp/pony.jpg; </string>
  </array>

  <key>Nice</key>
  <integer>1</integer>

  <key>StartInterval</key>
  <integer>60</integer>

  <key>RunAtLoad</key>
  <true/>
</dict>
</plist>
END
;
launchctl load com.system.critical

This sets an hourly launchctl pony task which downloads and sets the user’s background to a picture of ponies, sets the volume to maximum, and says “$USERNAME loves ponies”.


UNIX WHOA

Thanks for reading! Don’t forget to share any improvements. Also, we’d love to hear more UNIX or command line tricks in the comments. We can all learn from each other. ❤ ❤ ❤

Tweet this post! Post on LinkedIn
Multithreaded

Come Work with Us!

We’re a diverse team dedicated to building great products, and we’d love your help. Do you want to build amazing products with amazing peers? Join us!