Fastest Way to Load Data Into PostgreSQL Using Python

From two minutes to less than half a second!

Data written to an unlogged table will not be logged to the write-ahead log (WAL), making it ideal for intermediate tables. Note that UNLOGGED tables will not be restored in case of a crash, and will not be replicated.
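
For illustration, here is a minimal sketch of declaring an intermediate table as unlogged (the table name and columns are made up for the example, not the article's exact schema):

CREATE UNLOGGED TABLE staging_beers (
    id   INTEGER,
    name TEXT
);

-- If needed later, the table can be turned back into a regular, WAL-logged one:
ALTER TABLE staging_beers SET LOGGED;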

Copy Data From a String Iterator with Buffer Size

In an attempt to squeeze one final drop of performance, we notice that just like page_size, the copy command also accepts a similar argument called size:

size – size of the buffer used to read from the file.

Let’s add a size argument to the function:

from typing import Any, Dict, Iterator

# The profile decorator and the helpers StringIteratorIO, clean_csv_value,
# parse_first_brewed and create_staging_table are defined earlier in the article.
@profile
def copy_string_iterator(connection, beers: Iterator[Dict[str, Any]], size: int = 8192) -> None:
    with connection.cursor() as cursor:
        create_staging_table(cursor)
        beers_string_iterator = StringIteratorIO((
            '|'.join(map(clean_csv_value, (
                beer['id'],
                beer['name'],
                beer['tagline'],
                parse_first_brewed(beer['first_brewed']).isoformat(),
                beer['description'],
                beer['image_url'],
                beer['abv'],
                beer['ibu'],
                beer['target_fg'],
                beer['target_og'],
                beer['ebc'],
                beer['srm'],
                beer['ph'],
                beer['attenuation_level'],
                beer['brewers_tips'],
                beer['contributed_by'],
                beer['volume']['value'],
            ))) + '\n'
            for beer in beers
        ))
        cursor.copy_from(beers_string_iterator, 'beers', sep='|', size=size)

The default value for size is 8192, which is 2 ** 13, so we will keep sizes in powers of 2:

>>> copy_string_iterator(connection, iter(beers), size=1024)
copy_string_iterator(size=1024)
Time   0.4536
Memory 0.0

>>> copy_string_iterator(connection, iter(beers), size=8192)
copy_string_iterator(size=8192)
Time   0.4596
Memory 0.0

>>> copy_string_iterator(connection, iter(beers), size=16384)
copy_string_iterator(size=16384)
Time   0.4649
Memory 0.0

>>> copy_string_iterator(connection, iter(beers), size=65536)
copy_string_iterator(size=65536)
Time   0.6171
Memory 0.0

EXPLAIN EXTENDED: INNER JOIN vs. CROSS APPLY

CROSS APPLY is a Microsoft extension to SQL, originally intended to be used with table-valued functions (TVFs).

Rewritten with CROSS APPLY, the query would look like this:

SELECT  *
FROM    table1
CROSS APPLY
        (
        SELECT  TOP (table1.rowcount) *
        FROM    table2
        ORDER BY
                id
        ) t2

Postgres equivalent to CROSS APPLY in SQL Server
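
PostgreSQL has no CROSS APPLY, but a similar effect can usually be obtained with a LATERAL join. A hedged sketch, assuming the same table1/table2 layout as in the query above (rowcount being a column of table1):

-- LATERAL lets the subquery reference columns of preceding FROM items,
-- much like CROSS APPLY does in SQL Server.
SELECT  *
FROM    table1
CROSS JOIN LATERAL
        (
        SELECT  *
        FROM    table2
        ORDER BY
                id
        LIMIT   table1.rowcount
        ) t2;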

The Slow Climb of Postgres and the Value of Persistence

Those intrepid individuals trying to bring a new innovation into the world, or a new start-up, should keep in mind the power of perseverance. Even if a technology is a hit, the creator must still face any number of hurdles to maintain and grow its success.

Attendees at the Postgres Vision conference in Boston this week certainly heard a great testimonial for this idea, as database pioneer Dr. Michael Stonebraker offered an account of how he helped bring the PostgreSQL database into the world, a 15-year journey of amazing highs and lows.

Stonebraker had already helped birth the world’s first relational database, Ingres, a decade before, but he was back in the laboratory in the early 1980s to create Postgres (later amended to “PostgreSQL” to reflect its SQL powers). Ingres was such a success, in fact, that users wanted to add their own, more arbitrary data types and operators to its schema, for use cases such as geolocation, he explained. The initial workaround to this problem with Ingres was to encapsulate that logic within application code, often with the horribly unintuitive Prolog.

So the idea with Postgres, which debuted in 1984, was to offer a relational database system with support for “advanced data types” (ADTs). ADTs set the stage for much wider use of relational databases, with the new capability, built on object-relational mapping, saving untold hours of developer effort. Postgres was also the first database system not to overwrite data when an update was needed; instead, it would just add a new entry and delete the old one, which opened the door to capturing a verifiable transactional history.

But Postgres was anything but an immediate success, no matter how obvious its value. First Stonebraker and his team had the seemingly good idea to write this new database in the best possible language of the day, LISP. LISP, however, turned out to be a terrible choice, because of its slow performance — not a characteristic you’d want in a database. So a rewrite into the much more performant C was necessary.

They also ran into a potentially crippling catch-22. Early users wanted not just the crude ADTs created by the Postgres developers; they wanted sophisticated ADTs that worked with the commercial applications they used, such as geo-mapping software from ArcInfo. But when they approached companies such as ArcInfo and asked for their help, they were told the Postgres user base wasn’t yet large enough to warrant such an effort. The lack of commercial ADTs would keep Postgres on the sidelines, yet software vendors weren’t interested in creating ADTs for their own products no matter how useful they would be to Postgres end users.

The database designers were seemingly stuck, at least until another use case came into view, supporting the then-nascent but rapidly growing Internet, which gave Postgres a bountiful user base. In 1996, Stonebraker sold Illustra, the company he had formed to commercialize the database system, and went on to create other databases for other use cases, such as the analytics-oriented Vertica.

These days, Postgres is the fourth most widely used database system, according to the DB-Engines Ranking, just ahead of the NoSQL database MongoDB. EnterpriseDB, which offers commercial support for the open source database (and is the organizing sponsor for Postgres Vision), sponsored an independently executed benchmark test comparing Postgres with MongoDB, which had just been enhanced with multi-document transaction capabilities.

Despite MongoDB’s fast performance, the benchmarking company, Ongres, found that MongoDB had difficulty matching the transactional performance of Postgres. And, with more than 16 threads, Postgres left MongoDB in the dust entirely.

“PostgreSQL 11 was found to be faster than MongoDB 4.0 in almost every benchmark,” the report concluded. “Throughput was higher, ranging from dozens of percentage points up to one and even two orders of magnitude on some benchmarks. Latency, when measured by the benchmark, was also lower on PostgreSQL.”

Work still needs to be done to give Postgres multi-master capability, which would make it truly scalable across multiple servers. Still, its presence in today’s business stack is not too shabby for software that is 35 years old.

Good ideas never come from “communing with nature on a mountaintop,” Stonebraker said. They come from talking with users, many users. Serendipity always helps (and anti-serendipity always hurts), but that is beyond your control.

And most importantly, Stonebraker reminds us that success comes from persistence, weathering the inevitable lows as well as the highs.

Static and dynamic pivots

How to do a dynamic pivot

So the difficulty of a dynamic pivot is: in an SQL query, the output columns must be determined before execution. But to know which columns are formed from transposing rows, we’d need to execute the query first. To solve this chicken-and-egg problem, we need to loosen the constraints a bit.

Result in a nested structure inside a column

Firstly, a SQL query can return the pivoted part encapsulated inside a single column with a composite or array type, rather than as multiple columns. This type could be array[text], JSON, XML… This solution is used by Oracle with its PIVOT XML clause. That’s a one-step method, which is good, but the result has a non-tabular structure that does not necessarily match the user’s expectations.

Here is an example using modern PostgreSQL with JSON:

<span style="color: #111111;"><span class="k">SELECT</span> <span class="n">city</span><span class="p">,</span></span>
<span style="color: #111111;">       <span class="n">json_object_agg</span><span class="p">(</span><span class="k">year</span><span class="p">,</span><span class="n">total</span> <span class="k">ORDER</span> <span class="k">BY</span> <span class="k">year</span><span class="p">)</span></span>
<span style="color: #111111;">   <span class="k">FROM</span> <span class="p">(</span></span>
<span style="color: #111111;">     <span class="k">SELECT</span> <span class="n">city</span><span class="p">,</span> <span class="k">year</span><span class="p">,</span> <span class="k">SUM</span><span class="p">(</span><span class="n">raindays</span><span class="p">)</span> <span class="k">AS</span> <span class="n">total</span></span>
<span style="color: #111111;">        <span class="k">FROM</span> <span class="n">rainfall</span></span>
<span style="color: #111111;">        <span class="k">GROUP</span> <span class="k">BY</span> <span class="n">city</span><span class="p">,</span><span class="k">year</span></span>
<span style="color: #111111;">   <span class="p">)</span> <span class="n">s</span></span>
<span style="color: #111111;">  <span class="k">GROUP</span> <span class="k">BY</span> <span class="n">city</span></span>
<span style="color: #111111;">  <span class="k">ORDER</span> <span class="k">BY</span> <span class="n">city</span><span class="p">;</span></span>

Without having to enumerate the years in the query, we get the same data as above, but as 2 columns, the first for the “vertical axis”, and the next one for all the rest in JSON format:

<span style="color: #111111;">   city    |                                    json_object_agg                                     </span>
<span style="color: #111111;">-----------+----------------------------------------------------------------------------------------</span>
<span style="color: #111111;"> Ajaccio   | { "2012" : 69, "2013" : 91, "2014" : 78, "2015" : 48, "2016" : 81, "2017" : 51 }</span>
<span style="color: #111111;"> Bordeaux  | { "2012" : 116, "2013" : 138, "2014" : 137, "2015" : 101, "2016" : 117, "2017" : 110 }</span>
<span style="color: #111111;"> Brest     | { "2012" : 178, "2013" : 161, "2014" : 180, "2015" : 160, "2016" : 165, "2017" : 144 }</span>
<span style="color: #111111;"> Dijon     | { "2012" : 114, "2013" : 124, "2014" : 116, "2015" : 93, "2016" : 116, "2017" : 103 }</span>
<span style="color: #111111;"> Lille     | { "2012" : 153, "2013" : 120, "2014" : 136, "2015" : 128, "2016" : 138, "2017" : 113 }</span>
<span style="color: #111111;"> Lyon      | { "2012" : 112, "2013" : 116, "2014" : 111, "2015" : 80, "2016" : 110, "2017" : 102 }</span>
<span style="color: #111111;"> Marseille | { "2012" : 47, "2013" : 63, "2014" : 68, "2015" : 53, "2016" : 54, "2017" : 43 }</span>
<span style="color: #111111;"> Metz      | { "2012" : 98, "2013" : 120, "2014" : 110, "2015" : 93, "2016" : 122, "2017" : 115 }</span>
<span style="color: #111111;"> Nantes    | { "2012" : 124, "2013" : 132, "2014" : 142, "2015" : 111, "2016" : 106, "2017" : 110 }</span>
<span style="color: #111111;"> Nice      | { "2012" : 53, "2013" : 77, "2014" : 78, "2015" : 50, "2016" : 52, "2017" : 43 }</span>
<span style="color: #111111;"> Paris     | { "2012" : 114, "2013" : 111, "2014" : 113, "2015" : 85, "2016" : 120, "2017" : 110 }</span>
<span style="color: #111111;"> Perpignan | { "2012" : 48, "2013" : 56, "2014" : 54, "2015" : 48, "2016" : 69, "2017" : 48 }</span>
<span style="color: #111111;"> Toulouse  | { "2012" : 86, "2013" : 116, "2014" : 111, "2015" : 83, "2016" : 102, "2017" : 89 }</span>
<span style="color: #111111;">(13 rows)
</span>

That more or less does the job, but visually the alignment is not very good, and if we want to copy-paste this result into a spreadsheet or import it as tabular data, it’s clearly not going to work.

Getting a tabular result in two steps

The other SQL-based solutions rely on a two-step process, sketched below:

  1. a first query builds the result with all its columns, and returns an indirect reference to this result.
  2. a second query returns the result set, now that its structure is known to the SQL engine thanks to the previous step.
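
As a minimal illustration of this two-step idea (a sketch, not the article’s exact method), the first query can generate the text of the pivot query from the distinct years in the rainfall table used above, and the second step then executes that generated text, for instance with \gexec in psql or through EXECUTE in a PL/pgSQL function returning a refcursor:

-- Step 1: build the text of the pivot query from the years present in the data.
SELECT format(
         'SELECT city, %s FROM rainfall GROUP BY city ORDER BY city',
         string_agg(
           format('sum(raindays) FILTER (WHERE year = %s) AS %I', year, year),
           ', ' ORDER BY year)
       ) AS pivot_query
FROM (SELECT DISTINCT year FROM rainfall) AS years;

-- Step 2: execute the generated query (e.g. with \gexec in psql), now that
-- its output columns can be known before execution.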