Loading Terabytes of Data From Postgres Into BigQuery

To load data into BigQuery we’re going to use the BigQuery CLI (bq), which is a very versatile tool. You can install it using these instructions. As we’re on Linux, we’ll use a bash script to do all the work. I assume the BigQuery CLI is already installed and authorized.
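
If you haven’t done that yet, the bq tool ships with the Google Cloud SDK, so a minimal setup sketch looks like this (my-project is a placeholder for your own project ID):

gcloud auth login
gcloud config set project my-project
# Sanity check: list the datasets bq can see
bq ls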

Let’s create bigquery-upload.sh and add the following function in order to upload a single day from a specific table:

#!/bin/bash
# Upload a single day of a table from Postgres to BigQuery.
# Arguments: table name, columns to select, day (YYYY-MM-DD).
function upload_day {
  table=$1
  sel=$2
  day=$3
  next_day=$(date -d "$day +1 day" +%Y-%m-%d)
  bq_suffix=$(date -d "$day" +%Y%m%d)
  echo "Uploading $table: $day..."
  # Export one day of rows from Postgres into a CSV file via \copy
  psql -c "\\copy (select $sel from $table where created_at >= '$day' and created_at < '$next_day') TO '$table-$day.csv' WITH CSV HEADER"
  gzip "$table-$day.csv"
  # Load the compressed CSV into a date-sharded table (<table><YYYYMMDD>).
  # Replace your-project and your_dataset with your own project ID and dataset.
  bq load --allow_quoted_newlines --project_id=your-project --replace \
    --source_format=CSV --autodetect --max_bad_records=100 \
    your_dataset.$table$bq_suffix "$table-$day.csv.gz"
  rm "$table-$day.csv.gz"
}

This function takes three arguments: the table name, the columns to select, and the day to upload. As you can see, it uses Postgres’s \copy to export one day of data to a CSV file and then compresses it with gzip. The BigQuery docs say that loading a compressed CSV is slower than loading an uncompressed one, but in practice uploading uncompressed data almost always turns out slower overall.
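
Because each day lands in its own date-sharded table (transactions20180301, transactions20180302, and so on), you can later query a range of days with a wildcard table. Here is a rough sketch in standard SQL, assuming the dataset is named your_dataset:

# Count rows per day across the first week of March 2018
bq query --use_legacy_sql=false \
  "select _TABLE_SUFFIX as day, count(*) as row_count
   from \`your_dataset.transactions*\`
   where _TABLE_SUFFIX between '20180301' and '20180307'
   group by day
   order by day"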

You can call this function by simply adding a line at the end of the script:

upload_day 'transactions' '*' '2018-03-01'
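
To backfill a longer period, you can loop over a range of days. A minimal sketch (the start and end dates below are placeholders):

start=2018-01-01
end=2018-04-01
day=$start
while [ "$day" != "$end" ]; do
  upload_day 'transactions' '*' "$day"
  day=$(date -d "$day +1 day" +%Y-%m-%d)
done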