From two minutes to less than half a second!
Data written to an unlogged table will not be logged to the write-ahead log (WAL), making it ideal for intermediate tables. Note that UNLOGGED tables will not be restored in case of a crash, and will not be replicated.

Copy Data From a String Iterator with Buffer Size
In an attempt to squeeze out one final drop of performance, we notice that just like page_size, the copy command also accepts a similar argument called size:

size – size of the buffer used to read from the file.
Let’s add a size argument to the function:

@profile
def copy_string_iterator(connection, beers: Iterator[Dict[str, Any]], size: int = 8192) -> None:
    with connection.cursor() as cursor:
        create_staging_table(cursor)
        beers_string_iterator = StringIteratorIO((
            '|'.join(map(clean_csv_value, (
                beer['id'],
                beer['name'],
                beer['tagline'],
                parse_first_brewed(beer['first_brewed']).isoformat(),
                beer['description'],
                beer['image_url'],
                beer['abv'],
                beer['ibu'],
                beer['target_fg'],
                beer['target_og'],
                beer['ebc'],
                beer['srm'],
                beer['ph'],
                beer['attenuation_level'],
                beer['brewers_tips'],
                beer['contributed_by'],
                beer['volume']['value'],
            ))) + '\n'
            for beer in beers
        ))
        cursor.copy_from(beers_string_iterator, 'beers', sep='|', size=size)

The default value for size is 8192, which is 2 ** 13, so we will keep sizes in powers of 2:

>>> copy_string_iterator(connection, iter(beers), size=1024)
copy_string_iterator(size=1024)
Time   0.4536
Memory 0.0
>>> copy_string_iterator(connection, iter(beers), size=8192)
copy_string_iterator(size=8192)
Time   0.4596
Memory 0.0
>>> copy_string_iterator(connection, iter(beers), size=16384)
copy_string_iterator(size=16384)
Time   0.4649
Memory 0.0
>>> copy_string_iterator(connection, iter(beers), size=65536)
copy_string_iterator(size=65536)
Time   0.6171
Memory 0.0
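Two helpers used in the function above, create_staging_table and StringIteratorIO, are not shown in this excerpt. Here is a minimal sketch of each; the staging schema below is an illustrative subset, not the article's full column list:

```python
import io
from typing import Iterator, Optional


def create_staging_table(cursor) -> None:
    # UNLOGGED skips the write-ahead log: much faster to load, but the
    # table is truncated after a crash and is not replicated.
    # The column list here is a hypothetical subset of the beers fields.
    cursor.execute("""
        DROP TABLE IF EXISTS staging_beers;
        CREATE UNLOGGED TABLE staging_beers (
            id    INTEGER,
            name  TEXT,
            abv   DECIMAL
        );
    """)


class StringIteratorIO(io.TextIOBase):
    """A read-only file-like object fed by an iterator of strings.

    copy_from() pulls data from it in chunks of `size` characters, so the
    whole payload is never materialized in memory at once.
    """

    def __init__(self, it: Iterator[str]):
        self._iter = it
        self._buff = ''

    def readable(self) -> bool:
        return True

    def _read1(self, n: Optional[int] = None) -> str:
        # Refill the internal buffer from the iterator when it runs dry.
        while not self._buff:
            try:
                self._buff = next(self._iter)
            except StopIteration:
                break
        ret = self._buff[:n]
        self._buff = self._buff[len(ret):]
        return ret

    def read(self, n: Optional[int] = None) -> str:
        parts = []
        if n is None or n < 0:
            while True:
                chunk = self._read1()
                if not chunk:
                    break
                parts.append(chunk)
        else:
            while n > 0:
                chunk = self._read1(n)
                if not chunk:
                    break
                n -= len(chunk)
                parts.append(chunk)
        return ''.join(parts)
```

Because copy_from reads `size` characters at a time, the generator expression feeding StringIteratorIO is consumed lazily, which is why memory stays flat in the measurements above.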
EXPLAIN EXTENDED: INNER JOIN vs. CROSS APPLY
CROSS APPLY is a Microsoft extension to SQL which was originally intended to be used with table-valued functions (TVFs). The query above would look like this:
SELECT *
FROM table1
CROSS APPLY (
    SELECT TOP (table1.rowcount) *
    FROM table2
    ORDER BY id
) t2
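What makes CROSS APPLY different from a plain join is that the right side is re-evaluated once per row of the left side, with that row's columns in scope. An in-memory Python sketch of those semantics (hypothetical data, not the SQL engine's actual algorithm):

```python
from typing import Any, Dict, Iterable, List


def cross_apply(table1: Iterable[Dict[str, Any]],
                table2: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Emulate: table1 CROSS APPLY (SELECT TOP (table1.rowcount) *
                                    FROM table2 ORDER BY id)."""
    out = []
    ordered = sorted(table2, key=lambda r: r['id'])  # ORDER BY id
    for t1 in table1:
        # TOP (table1.rowcount): the limit depends on the *outer* row,
        # which is exactly what a plain JOIN cannot express.
        for t2 in ordered[:t1['rowcount']]:
            out.append({**t1, **{'t2_' + k: v for k, v in t2.items()}})
    return out
```

A left row producing zero inner rows simply disappears from the output, matching CROSS APPLY (as opposed to OUTER APPLY).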
The Slow Climb of Postgres and the Value of Persistence
Intrepid individuals trying to bring a new innovation or a new start-up into the world should keep in mind the power of perseverance. Even if a technology is a hit, its creator must still face any number of hurdles to maintain and grow its success.
Attendees at the PostgresVision conference in Boston this week certainly heard a great testimonial for this idea, as database pioneer Dr. Michael Stonebraker offered an account of how he helped bring the PostgreSQL database into the world, a 15-year journey of amazing highs and lows.
Stonebraker had already helped birth the world’s first relational database, Ingres, a decade before, but he was back in the laboratory in the early 1980s to create Postgres (later amended to “PostgreSQL” to reflect its SQL powers). Ingres was such a success, in fact, that users wanted to add their own, more arbitrary data types and operators to its schema, for use cases such as geolocation, he explained. The initial workaround to this problem with Ingres was to encapsulate that logic within application code, often in the horribly unintuitive Prolog.
So the idea with Postgres, which debuted in 1984, was to offer a relational database system with support for “advanced data types” (ADTs). ADTs set the stage for much wider use of relational databases; the new capability, built on object-relational mapping, saved untold hours of developer effort. Postgres was also the first database system not to overwrite data when an update was needed, instead adding a new entry and deleting the old one, which opened the door to capturing a verifiable transactional history.
But Postgres was anything but an immediate success, no matter how obvious its value. First, Stonebraker and his team had the seemingly good idea to write this new database in the best possible language of the day, LISP. LISP, however, turned out to be a terrible choice because of its slow performance, not a characteristic you’d want in a database. So a rewrite into the much more performant C was necessary.
They also ran into a potentially crippling catch-22. Early users wanted not just the crude ADTs created by the Postgres developers; they wanted sophisticated ADTs that worked with the commercial applications they used, such as geo-mapping software from ArcInfo. But when they approached companies such as ArcInfo and asked for their help, they were told the Postgres user base wasn’t yet large enough to warrant such an effort. The lack of commercial ADTs kept Postgres on the sidelines, yet software vendors weren’t interested in creating ADTs for their own products, no matter how useful they would be to Postgres end-users.
Static and dynamic pivots
How to do a dynamic pivot
So the difficulty of a dynamic pivot is: in an SQL query, the output columns must be determined before execution. But to know which columns are formed from transposing rows, we’d need to execute the query first. To solve this chicken-and-egg problem, we need to loosen the constraints a bit.
Result in a nested structure inside a column
Firstly, an SQL query can return the pivoted part encapsulated inside a single column with a composite or array type, rather than as multiple columns. This type could be array[text], JSON, XML… This solution is used by Oracle with its PIVOT XML clause. That’s a one-step method, which is good, but the result has a non-tabular structure that does not necessarily match the user’s expectations.
Here is an example using modern PostgreSQL with JSON:
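The query itself is not included in this excerpt; assuming a hypothetical rainfall(city, year, raindays) table matching the output below, it could look like:

```sql
SELECT city,
       json_object_agg(year, raindays)
FROM rainfall
GROUP BY city
ORDER BY city;
```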
Without having to enumerate the years in the query, we get the same data as above, but as two columns: the first for the “vertical axis”, and the second holding all the rest in JSON format:
   city    |                                    json_object_agg
-----------+----------------------------------------------------------------------------------------
 Ajaccio   | { "2012" : 69, "2013" : 91, "2014" : 78, "2015" : 48, "2016" : 81, "2017" : 51 }
 Bordeaux  | { "2012" : 116, "2013" : 138, "2014" : 137, "2015" : 101, "2016" : 117, "2017" : 110 }
 Brest     | { "2012" : 178, "2013" : 161, "2014" : 180, "2015" : 160, "2016" : 165, "2017" : 144 }
 Dijon     | { "2012" : 114, "2013" : 124, "2014" : 116, "2015" : 93, "2016" : 116, "2017" : 103 }
 Lille     | { "2012" : 153, "2013" : 120, "2014" : 136, "2015" : 128, "2016" : 138, "2017" : 113 }
 Lyon      | { "2012" : 112, "2013" : 116, "2014" : 111, "2015" : 80, "2016" : 110, "2017" : 102 }
 Marseille | { "2012" : 47, "2013" : 63, "2014" : 68, "2015" : 53, "2016" : 54, "2017" : 43 }
 Metz      | { "2012" : 98, "2013" : 120, "2014" : 110, "2015" : 93, "2016" : 122, "2017" : 115 }
 Nantes    | { "2012" : 124, "2013" : 132, "2014" : 142, "2015" : 111, "2016" : 106, "2017" : 110 }
 Nice      | { "2012" : 53, "2013" : 77, "2014" : 78, "2015" : 50, "2016" : 52, "2017" : 43 }
 Paris     | { "2012" : 114, "2013" : 111, "2014" : 113, "2015" : 85, "2016" : 120, "2017" : 110 }
 Perpignan | { "2012" : 48, "2013" : 56, "2014" : 54, "2015" : 48, "2016" : 69, "2017" : 48 }
 Toulouse  | { "2012" : 86, "2013" : 116, "2014" : 111, "2015" : 83, "2016" : 102, "2017" : 89 }
(13 rows)
That somehow does the job, but visually the alignment is not very good, and if we want to copy-paste this result into a spreadsheet or import it as tabular data, it’s clear that it’s not going to work.
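If tabular data is what we need, one option is to flatten the JSON column client-side. A hypothetical Python sketch, taking (city, json) rows like the result above and producing CSV with one column per year:

```python
import csv
import io
import json
from typing import List, Tuple


def json_pivot_to_csv(rows: List[Tuple[str, str]]) -> str:
    """Flatten (label, json_object) rows into CSV, one column per JSON key."""
    parsed = [(city, json.loads(obj)) for city, obj in rows]
    # Collect every key seen across all rows, sorted, so every output
    # row shares the same column list even if some keys are missing.
    years = sorted({y for _, obj in parsed for y in obj})
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(['city'] + years)
    for city, obj in parsed:
        writer.writerow([city] + [obj.get(y, '') for y in years])
    return buf.getvalue()
```

This is still a two-step process in disguise: the first pass over the rows discovers the column set, and only then can the tabular output be written.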
Getting a tabular result in two steps
The other SQL-based solutions rely on the idea of a two-step process:
- a first query builds the result with all its columns, and returns an indirect reference to this result;
- a second query returns the result set, now that its structure is known to the SQL engine thanks to the previous step.
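The two steps above can also be driven from client code: first query the distinct key values, then generate a static pivot query from them, whose output columns are now known before execution. A sketch assuming the same hypothetical rainfall(city, year, raindays) table:

```python
from typing import List


def build_pivot_query(years: List[str]) -> str:
    """Step 2: given the distinct years found by step 1
    (SELECT DISTINCT year FROM rainfall ORDER BY year),
    build a static pivot query with one output column per year."""
    cols = ',\n  '.join(
        # FILTER (WHERE ...) keeps only that year's rows in each aggregate.
        f'sum(raindays) FILTER (WHERE year = {y}) AS "{y}"'
        for y in years
    )
    return (f'SELECT city,\n  {cols}\n'
            f'FROM rainfall\nGROUP BY city\nORDER BY city;')
```

The generated query is an ordinary static pivot, so its result is fully tabular; the dynamism lives entirely in the first, column-discovering step.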