ASA Sections on:

Statistical Computing
Statistical Graphics

Data expo ‘09

Command line tools

Dealing with this amount of data is definitely a challenge and we hope that this data expo will inspire you to learn more about dealing with large volumes of data. To make sure you don't get overwhelmed, this page describes some simple command line tools to sort, filter and tabulate.

All of these tools are available on a default install of Linux or Mac OS X. To use them on Windows, you will need to install Cygwin or a similar environment.

Sort: sort

Sort by the 10th column (flightnum):

sort -t, -k 10,10 2008.csv
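Note that the command above compares column 10 as text (so 1000 sorts before 335) and sorts the header line in with the data. A minimal sketch of a numeric variant that keeps the header on top might look like this; `sample.csv` and its rows are made up for illustration and stand in for 2008.csv:

```shell
# Made-up stand-in for 2008.csv: a header plus three rows, 10 columns only.
printf '%s\n' \
  'c1,c2,c3,c4,c5,c6,c7,c8,carrier,flightnum' \
  '2008,1,3,4,x,x,x,x,WN,335' \
  '2008,1,3,4,x,x,x,x,AA,1000' \
  '2008,1,3,4,x,x,x,x,UA,42' > sample.csv

# Copy the header through unchanged, then sort the remaining rows
# numerically (the trailing n) on column 10 only.
{ head -n 1 sample.csv; tail -n +2 sample.csv | sort -t, -k 10,10n; } > sorted.csv

cat sorted.csv
```

With the real file, the same pattern should apply by replacing `sample.csv` with `2008.csv`.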

Filter rows: awk

Remove the header row:

awk -F, 'NR != 1' 2008.csv

Show flights from Des Moines to Chicago O'Hare:

awk -F, '$17 == "DSM" && $18 == "ORD"' 2008.csv
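The same kind of condition can be combined with a row-number test to keep the header, or used to count matches rather than print them. The following is a sketch only; `sample.csv`, its placeholder column names and its rows are invented for illustration, with origin and destination placed in columns 17 and 18 as in the command above:

```shell
# Made-up stand-in for 2008.csv: a header plus four rows, 18 columns.
printf '%s\n' \
  'c1,c2,c3,c4,c5,c6,c7,c8,carrier,flightnum,c11,c12,c13,c14,c15,c16,origin,dest' \
  '2008,1,3,4,x,x,x,x,UA,42,x,x,x,x,x,x,DSM,ORD' \
  '2008,1,3,4,x,x,x,x,WN,335,x,x,x,x,x,x,DSM,MDW' \
  '2008,1,4,5,x,x,x,x,UA,42,x,x,x,x,x,x,DSM,ORD' \
  '2008,1,4,5,x,x,x,x,AA,1000,x,x,x,x,x,x,ORD,DSM' > sample.csv

# Keep the header line (NR == 1) together with the matching rows.
awk -F, 'NR == 1 || ($17 == "DSM" && $18 == "ORD")' sample.csv > dsm-ord.csv

# Count the matching rows instead of printing them.
# n + 0 makes awk print 0 rather than an empty line when nothing matches.
count=$(awk -F, '$17 == "DSM" && $18 == "ORD" { n++ } END { print n + 0 }' sample.csv)
echo "$count"
```

Keeping the header in the filtered file makes it easier to load into other tools later.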

Filter columns: cut

Select only columns 9 (carrier) and 10 (flight num):

cut -f9,10 -d, 2008.csv
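One thing to be aware of is that cut always writes fields in the order they appear in the file, whatever order is given to -f. If the columns need to be swapped, awk is one option. A small sketch, again using an invented `sample.csv` in place of 2008.csv:

```shell
# Made-up stand-in for 2008.csv: a header plus two rows, 10 columns only.
printf '%s\n' \
  'c1,c2,c3,c4,c5,c6,c7,c8,carrier,flightnum' \
  '2008,1,3,4,x,x,x,x,WN,335' \
  '2008,1,3,4,x,x,x,x,UA,42' > sample.csv

# cut ignores the order in the -f list: this still prints carrier, then flightnum.
cut -d, -f10,9 sample.csv

# awk prints fields in whatever order they are listed;
# OFS=, keeps the output comma separated.
awk -F, -v OFS=, '{ print $10, $9 }' sample.csv > flightnum-carrier.csv

cat flightnum-carrier.csv
```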

Putting it all together

Count the number of flights for each carrier and flight number combination and save the result to 2008-flights.csv:

cut -f9,10 -d, 2008.csv | sort | uniq -c > 2008-flights.csv
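The pipeline above also counts the header line, and uniq -c writes each count as a space-padded prefix rather than as a comma-separated column. One possible variant that skips the header, puts the most frequent flights first and writes the count as a third CSV column is sketched below; `sample.csv`, `sample-flights.csv` and the rows are made up for illustration:

```shell
# Made-up stand-in for 2008.csv: a header plus four rows, 10 columns only.
printf '%s\n' \
  'c1,c2,c3,c4,c5,c6,c7,c8,carrier,flightnum' \
  '2008,1,3,4,x,x,x,x,UA,42' \
  '2008,1,3,4,x,x,x,x,WN,335' \
  '2008,1,4,5,x,x,x,x,UA,42' \
  '2008,1,4,5,x,x,x,x,AA,1000' > sample.csv

# 1. tail drops the header line
# 2. cut keeps carrier and flight number
# 3. sort groups identical lines so uniq -c can count them
# 4. sort -rn puts the largest counts first
# 5. awk moves the count to the end: "2 UA,42" becomes "UA,42,2"
tail -n +2 sample.csv \
  | cut -d, -f9,10 \
  | sort \
  | uniq -c \
  | sort -rn \
  | awk '{ print $2 "," $1 }' > sample-flights.csv

cat sample-flights.csv
```

The awk step relies on the carrier and flight number containing no spaces, which is an assumption about the data.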