Source code: My Test-Drive github repo
Summary
- build a Postgres Docker image
- start a new container
- get data from College Scorecard web site
- perform an ETL to get a subset of the data
- install PostgresSql client on Windows
- load data to server
Build a custom Docker image based on Postgres 11
Dockerfile and create user/db script
FROM postgres:11.2
COPY ./create-bruno-db.sh /docker-entrypoint-initdb.d/create-bruno-db.sh
create-bruno-db.sh
- This file must be executable
If you edit it on Windows, make sure you convert Windows CR LF to Linux new line charater LF
#!/bin/sh set -e psql -v ON_ERROR_STOP=1 --username "$POSTGRES_USER" --dbname "$POSTGRES_DB" <<-EOSQL CREATE USER bruno PASSWORD 'bruno'; CREATE DATABASE bruno; GRANT ALL PRIVILEGES ON DATABASE bruno TO bruno; EOSQL
Create a table called college
create schema yin;
create table yin.college(
id varchar(12) not null primary key,
name varchar(128) not null,
city varchar(36) not null,
state char(2) not null,
zip varchar(16) not null,
region int,
latitude numeric,
longitude numeric,
adm_rate numeric,
sat_avg numeric,
act_avg numeric,
earnings numeric,
cost numeric,
enrollments numeric
);
Get data from US Government College Scorecard
Data source
I will only use a subset of the “Most Recent Data”. This is a big csv file with many columns. I did some research using Python Jupyter Notebook, came up with very simple Python script to extract the columns I am interested for the lab.
import numpy as np
import pandas as pd
import csv
selected_cols = ['UNITID','INSTNM','CITY', 'STABBR','ZIP','REGION', 'LATITUDE', 'LONGITUDE',
'ADM_RATE', 'SAT_AVG', 'ACTCMMID', 'MD_EARN_WNE_P10', 'COSTT4_A','UGDS']
float_cols = ['LATITUDE', 'LONGITUDE','ADM_RATE', 'SAT_AVG', 'ACTCMMID', 'MD_EARN_WNE_P10', 'COSTT4_A','UGDS']
def export2csv(src_csv, out_csv):
df = pd.read_csv(src_csv, low_memory=False)
df2 = df[df.PREDDEG.isin(['2', '3'])]
df2[float_cols] = df2[float_cols].apply(pd.to_numeric, errors='coerce')
df2.to_csv(out_csv, columns=selected_cols, index=False)
if __name__ == '__main__':
import sys
export2csv(sys.argv[1], sys.argv[2])
Start a new container: this container will be the Postgres db server
docker run -e POSTGRES_PASSWORD=bruno \
-p 5432:5432 \
--name yin-postgres -d \
yin/postgres:11.2
Load data to Postgres server from developer Windows computer
Installation of PostgresSQL Windows client [and servver]
Simply download the right zip file that matches your OS architecture x86-64 or x86-32, and extract to a local directory.
Powershell sample script
# set up variables to run psql
$env:PGHOST = '192.168.0.48'
$env:PGDATABASE = 'bruno'
$env:PGUSER = 'bruno'
$env:PGPASSWORD = 'bruno'
# On Windows, you need to download Postgres for Windows, server and client can be found in a zip file
# https://www.enterprisedb.com/download-postgresql-binaries
$env:Path = "D:\tools\pgsql\bin;${env:Path}"
# On Linux: psql can be installed via package management. No manual changes needed to the PATH environment variable
# $env:PATH
# USE COPY TO LOAD CSV
cat .\bruno.csv | psql -c 'COPY yin.college FROM STDIN CSV HEADER'
# to test, run psql and do some queries on yin.college