Quick and Dirty Address Matching with LibPostal

  • While it is not a 100% solution, combining normalized addresses with full text search gives a relatively fast (less than 100ms) approach to loose address matching.

libpostal has only two operations, “address normalization” and “address parsing”, which pgsql-postal exposes as the postal_normalize() and postal_parse() functions.

Normalization takes an address string and converts it to all the standard forms that “make sense”. For example:

SELECT unnest(postal_normalize('390 Greenwich St., New york, ny, 10013'));
390 greenwich saint new york ny 10013
390 greenwich saint new york new york 10013
390 greenwich street new york ny 10013
390 greenwich street new york new york 10013
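
The reason for producing several forms is that two differently written versions of the same address should end up sharing at least one of them. A minimal sketch of that check, assuming postal_normalize() returns a text array (as the unnest() call suggests) and using a second address string that is made up for illustration:

```sql
-- && is PostgreSQL's array overlap operator: it is true when the two
-- arrays have at least one element in common.
-- The second address string is a hypothetical variant, not from the source.
SELECT postal_normalize('390 Greenwich St., New york, ny, 10013')
    && postal_normalize('390 Greenwich Street, New York, New York 10013')
    AS shares_a_normalized_form;
```

An exact overlap test like this is stricter than the full text search approach; it only succeeds when one whole normalized string is identical on both sides.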

Parsing takes apart a string into address components and returns a JSONB of those components:

SELECT jsonb_pretty(postal_parse('390 Greenwich St., New york, ny, 10013'));
{
    "city": "new york",
    "road": "greenwich st.",
    "state": "ny",
    "postcode": "10013",
    "house_number": "390"
}
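
Because the result is ordinary JSONB, the components can be pulled out into columns with the standard ->> operator. A small sketch, assuming the keys shown in the output above (other addresses may produce different or missing keys, in which case ->> yields NULL):

```sql
-- Parse once in a subquery, then extract each component as text.
SELECT p ->> 'house_number' AS house_number,
       p ->> 'road'         AS road,
       p ->> 'city'         AS city,
       p ->> 'state'        AS state,
       p ->> 'postcode'     AS postcode
FROM (
    SELECT postal_parse('390 Greenwich St., New york, ny, 10013') AS p
) AS parsed;
```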

We can use normalization to create a table of text searchable address strings, and then use full text search to efficiently search that table for potential matches for new addresses.
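
One way this could look is sketched here. The table, column, and index names are hypothetical, and the use of the 'simple' text search configuration is an assumption made for this sketch (it lowercases tokens without stemming or dropping stop words, which seems a reasonable fit for addresses); it is not necessarily the setup used elsewhere in this post.

```sql
-- Hypothetical table of known addresses.
CREATE TABLE addresses (
    id      bigserial PRIMARY KEY,
    address text NOT NULL,
    ts      tsvector
);

-- Build one search document per row from all of its normalized forms.
UPDATE addresses
   SET ts = to_tsvector('simple',
                        array_to_string(postal_normalize(address), ' '));

-- A GIN index lets the @@ match operator avoid scanning the whole table.
CREATE INDEX addresses_ts_idx ON addresses USING gin (ts);

-- Find candidates for a new address: normalize it, then match each
-- normalized form against the indexed documents.
SELECT DISTINCT a.id, a.address
  FROM unnest(postal_normalize('390 Greenwich St., New york, ny, 10013')) AS n(form)
  JOIN addresses AS a
    ON a.ts @@ plainto_tsquery('simple', n.form);
```

plainto_tsquery() requires every word of a normalized form to appear in the stored document, so a row is a candidate if it matches any one form in full. DISTINCT collapses rows that match more than one form.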

International Normalization

As we saw above, normalization takes raw address strings and turns them into “possible standard forms”, which are suitable for searching against. They aren’t necessarily the best forms: more regionally aware parsers can do a better job of standard North American parsing and formatting. Where libpostal shines is in being a ready-to-run, fully international solution that doesn’t even need to be told what language it is working on.

For example, this address in Berlin:

SELECT unnest(postal_normalize('Potsdamer Straße 3, 10785 Berlin, Germany'));
potsdamer strasse 3 10785 berlin germany