Source code: My Test-Drive github repo

Summary

  • build a Postgres Docker image
  • start a new container
  • get data from College Scorecard web site
  • perform an ETL to get a subset of the data
  • install PostgresSql client on Windows
  • load data to server

Build a custom Docker image based on Postgres 11

Dockerfile and create user/db script

FROM postgres:11.2
COPY ./create-bruno-db.sh /docker-entrypoint-initdb.d/create-bruno-db.sh

create-bruno-db.sh

  • This file must be executable
  • If you edit it on Windows, make sure you convert Windows CR LF to Linux new line charater LF

    #!/bin/sh
    set -e
    
    psql -v ON_ERROR_STOP=1 --username "$POSTGRES_USER" --dbname "$POSTGRES_DB" <<-EOSQL
    CREATE USER bruno PASSWORD 'bruno';
    CREATE DATABASE bruno;
    GRANT ALL PRIVILEGES ON DATABASE bruno TO bruno;
    EOSQL

Create a table called college

create schema yin;

create table yin.college(
    id varchar(12) not null primary key,
    name varchar(128) not null,
    city varchar(36) not null,
    state char(2) not null,
    zip varchar(16) not null,
    region int,
    latitude numeric,
    longitude numeric,
    adm_rate numeric,
    sat_avg numeric,
    act_avg numeric,
    earnings numeric,
    cost numeric,
    enrollments numeric
);

Get data from US Government College Scorecard

Data source

I will only use a subset of the “Most Recent Data”. This is a big csv file with many columns. I did some research using Python Jupyter Notebook, came up with very simple Python script to extract the columns I am interested for the lab.

import numpy as np
import pandas as pd
import csv

selected_cols = ['UNITID','INSTNM','CITY', 'STABBR','ZIP','REGION', 'LATITUDE', 'LONGITUDE',
    'ADM_RATE', 'SAT_AVG', 'ACTCMMID', 'MD_EARN_WNE_P10', 'COSTT4_A','UGDS']
float_cols = ['LATITUDE', 'LONGITUDE','ADM_RATE', 'SAT_AVG', 'ACTCMMID', 'MD_EARN_WNE_P10', 'COSTT4_A','UGDS']

def export2csv(src_csv, out_csv):
    df = pd.read_csv(src_csv, low_memory=False)
    df2 = df[df.PREDDEG.isin(['2', '3'])]
    df2[float_cols] = df2[float_cols].apply(pd.to_numeric, errors='coerce')
    df2.to_csv(out_csv, columns=selected_cols, index=False)

if __name__ == '__main__':
    import sys
    export2csv(sys.argv[1], sys.argv[2])

Start a new container: this container will be the Postgres db server

docker run -e POSTGRES_PASSWORD=bruno \
-p 5432:5432 \
--name yin-postgres -d \
yin/postgres:11.2

Load data to Postgres server from developer Windows computer

Installation of PostgresSQL Windows client [and servver]

Simply download the right zip file that matches your OS architecture x86-64 or x86-32, and extract to a local directory.

Postgres Windows binary

Powershell sample script

# set up variables to run psql
$env:PGHOST = '192.168.0.48'
$env:PGDATABASE = 'bruno'
$env:PGUSER = 'bruno'
$env:PGPASSWORD = 'bruno'

# On Windows, you need to download Postgres for Windows, server and client can be found in a zip file
# https://www.enterprisedb.com/download-postgresql-binaries
$env:Path = "D:\tools\pgsql\bin;${env:Path}"

# On Linux: psql can be installed via package management. No manual changes needed to the PATH environment variable
# $env:PATH 

# USE COPY TO LOAD CSV
cat .\bruno.csv | psql -c 'COPY yin.college FROM STDIN CSV HEADER'

# to test, run psql and do some queries on yin.college