ASA Sections on:

Statistical Computing
Statistical Graphics

Data expo ‘09

Using a database

For moderate initial investment in time, and a large investment in space (>30 gigabytes), you can considerably speed up access to the data by loading it into a database. This page shows you how to do so for sqlite, an open-source sql database.

Create database

Creating a new database couldn't be simpler: just run the following on the command line to create a new database in the current directory.

sqlite3 ontime.sqlite3

Create table and import data

Next create a table with fields that match the csv files:

create table ontime (
  Year int,
  Month int,
  DayofMonth int,
  DayOfWeek int,
  DepTime  int,
  CRSDepTime int,
  ArrTime int,
  CRSArrTime int,
  UniqueCarrier varchar(5),
  FlightNum int,
  TailNum varchar(8),
  ActualElapsedTime int,
  CRSElapsedTime int,
  AirTime int,
  ArrDelay int,
  DepDelay int,
  Origin varchar(3),
  Dest varchar(3),
  Distance int,
  TaxiIn int,
  TaxiOut int,
  Cancelled int,
  CancellationCode varchar(1),
  Diverted varchar(1),
  CarrierDelay int,
  WeatherDelay int,
  NASDelay int,
  SecurityDelay int,
  LateAircraftDelay int

Then load the data with the .import directive:

.separator ,
.import 2008.csv ontime
.import 2007.csv ontime
.import 2006.csv ontime

Unfortunately this also imports the column headers, so remove them with this sql:

delete from ontime where typeof(year) == "text";

Adding indices

To speed up access to the data, you'll also want to add indices. The code below adds a few that I've found useful, but you'll want to think about your needs and add your own. (It's most efficient to create the indices after all the data has been loaded, so make sure you've done that first.)

create index year on ontime(year);
create index date on ontime(year, month, dayofmonth);
create index origin on ontime(origin);
create index dest on ontime(dest);

Sqlite and R

If you use R, be sure to check out RSQlite. This lets you easily get the data out of sqlite into R data frames:


ontime <- dbConnect("SQLite", dbname = "ontime.sqlite3")
from_db <- function(sql) {
  dbGetQuery(ontime, sql)

from_db("select count(*), tailnum from ontime group by tailnum")

tails <- from_db("select distinct tailnum from ontime")
© 2017. Email webmaster