Ultimate PostgreSQL Slug Function

What is a slug?!

If you’re reading this, you probably know, but a slug is a URL/SEO-friendly representation of a string. If, for example, you wrote an article titled:

The World’s “Best” Cafés!

You’d want a slug function to create a representation that looks a bit like:

the-worlds-best-cafes

Granted, browsers can probably handle the accented “e”, but it’s good to be over-conservative for backwards compatibility.

PostgreSQL slug function

The most comprehensive slug function I’ve seen for Postgres (and in general) is by ianks on GitHub. Slug functions are one of those things that never seem to have a definitive “best and final” version, but this one has been very reliable for me.
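To give a flavor of how such a function typically works, here is a minimal sketch (not ianks’s full version): fold accents with the unaccent extension, lowercase, drop apostrophes and quotes, and collapse everything else into hyphens.

CREATE EXTENSION IF NOT EXISTS unaccent;

CREATE OR REPLACE FUNCTION slugify(value TEXT)
RETURNS TEXT AS $$
  SELECT trim(BOTH '-' FROM
    regexp_replace(                  -- collapse remaining runs to hyphens
      regexp_replace(                -- drop apostrophes and quotes entirely
        lower(unaccent(value)),      -- fold accents, then lowercase
        '[''"“”‘’]+', '', 'g'),
      '[^a-z0-9]+', '-', 'g'));
$$ LANGUAGE SQL STRICT;

SELECT slugify('The World''s “Best” Cafés!');
-- the-worlds-best-cafes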

Quick and Dirty Address Matching with LibPostal

  • While it is not a 100% solution, using normalized addresses and full text search provides a relatively fast (less than 100 ms) approach to loose address matching.

libpostal has only two operations, “address normalization” and “address parsing”, which pgsql-postal exposes as the postal_normalize() and postal_parse() functions.

Normalization takes an address string and converts it to all the standard forms that “make sense”. For example:

SELECT unnest(postal_normalize('390 Greenwich St., New york, ny, 10013'));
390 greenwich saint new york ny 10013
390 greenwich saint new york new york 10013
390 greenwich street new york ny 10013
390 greenwich street new york new york 10013

Parsing breaks a string apart into address components and returns those components as JSONB:

SELECT jsonb_pretty(postal_parse('390 Greenwich St., New york, ny, 10013'));
{
    "city": "new york",
    "road": "greenwich st.",
    "state": "ny",
    "postcode": "10013",
    "house_number": "390"
}

We can use normalization to create a table of text-searchable address strings, and then use full text search to efficiently search that table for potential matches for new addresses.
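As a sketch of that idea (the addresses table and its address column here are hypothetical, and the 'simple' text search configuration is just one reasonable choice):

-- Store all normalized forms of each address as one tsvector
-- (assumes a hypothetical "addresses" table with an "address" column)
ALTER TABLE addresses ADD COLUMN normalized tsvector;

UPDATE addresses
SET normalized = to_tsvector('simple',
      array_to_string(postal_normalize(address), ' '));

-- Index it so candidate lookups stay fast
CREATE INDEX addresses_normalized_idx
  ON addresses USING GIN (normalized);

-- Search for candidates using one normalized form of an incoming address
SELECT address
FROM addresses
WHERE normalized @@ plainto_tsquery('simple',
      (postal_normalize('390 Greenwich St., New york, ny, 10013'))[1]);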

International Normalization

As we saw above, normalization takes raw address strings and turns them into “possible standard forms”, which are suitable for searching against. They aren’t necessarily the best forms: more regionally aware parsing can do a better job with standard North American parsing and formatting. Where libpostal shines is as a ready-to-run, fully international solution that doesn’t even need to be told what language it is working in.

For example, this address in Berlin:

SELECT unnest(postal_normalize('Potsdamer Straße 3, 10785 Berlin, Germany'));
potsdamer strasse 3 10785 berlin germany

Loading Terabytes of Data From Postgres Into BigQuery

To load data into BigQuery we’re going to use the BigQuery CLI (bq), which is a very versatile tool; Google’s documentation covers installation. Since we’re on Linux, we’ll use a bash script to perform all the work. I’ll assume the BigQuery CLI is installed and authorized.

Let’s create bigquery-upload.sh and add the following function in order to upload a single day from a specific table:

#!/bin/bash
function upload_day {
  table=$1
  sel=$2
  day=$3
  next_day=$(date -d "$day+1 days" +%Y-%m-%d)
  bq_suffix=$(date -d "$day" +%Y%m%d)
  echo "Uploading $table: $day..."
  # Export one day of rows from Postgres to a local CSV
  psql -c "\\copy (select $sel from $table where created_at >= '$day' and created_at < '$next_day') TO '$table-$day.csv' WITH CSV HEADER"
  gzip "$table-$day.csv"
  # Load the compressed CSV into a date-suffixed BigQuery table;
  # substitute your own project id and dataset name below
  bq load --allow_quoted_newlines --project_id=your-project-id --replace \
    --source_format=CSV --autodetect --max_bad_records=100 \
    your_dataset.$table$bq_suffix "$table-$day.csv.gz"
  rm "$table-$day.csv.gz"
}

This function takes three arguments: the table, the columns to select, and the date to upload. As you can see, it uses the \copy operation to export a CSV from Postgres and then compresses it. The BigQuery docs say that loading a compressed CSV is slower than loading an uncompressed one, but uploading uncompressed data almost always seems slower.

You can call this function by simply adding a line at the end of the script:

upload_day 'transactions' '*' '2018-03-01'
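To backfill a longer period, one option (a sketch, assuming GNU date as in the function above) is to loop over a range of days instead of adding one line per date:

# Upload every day in March 2018 (end date is exclusive)
day='2018-03-01'
while [ "$day" != '2018-04-01' ]; do
  upload_day 'transactions' '*' "$day"
  day=$(date -d "$day+1 days" +%Y-%m-%d)
done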