FINE TUNING FULL TEXT SEARCH WITH POSTGRESQL 12

FLEXING POSTGRES 12

Trying to decide between to_tsquery, plainto_tsquery, and phraseto_tsquery can be difficult. It was fairly straightforward in our case, since we're not really searching on any phrases.

The Postgres team decided to be helpful in this regard, especially when it comes to web applications, so they created websearch_to_tsquery. It basically treats the input as if it were typed into a Google search: unquoted words are ANDed together, quoted text is matched as a phrase, "or" produces a disjunction, and a leading dash excludes a term. It never raises a syntax error, so it's safe to feed raw user input. In short, it's a bit more intelligent than plainto_tsquery and a little less strict than phraseto_tsquery.
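To get a feel for the difference, here's a rough sketch of how the three parsers handle the same input. The results are shown approximately as comments (the exact lexemes depend on your text search configuration), and the articles table with its body_tsv tsvector column at the end is hypothetical:

-- Unquoted words are simply ANDed together:
SELECT plainto_tsquery('english', 'fine tuning search');
-- roughly: 'fine' & 'tune' & 'search'

-- Every word must appear, in order, as a phrase:
SELECT phraseto_tsquery('english', 'fine tuning search');
-- roughly: 'fine' <-> 'tune' <-> 'search'

-- Web-style syntax: quotes for phrases, "or" for alternatives, a dash to exclude:
SELECT websearch_to_tsquery('english', '"fine tuning" or search -mysql');
-- roughly: 'fine' <-> 'tune' | 'search' & !'mysql'

-- Typical use in a WHERE clause (hypothetical table and tsvector column):
SELECT * FROM articles
WHERE body_tsv @@ websearch_to_tsquery('english', 'postgres "full text search"');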

Datatype for a URL in PostgreSQL

This is an extension for PostgreSQL that provides a uri data type. Advantages over using plain text for storing URIs include:

  • URI syntax checking
  • functions for extracting URI components
  • human-friendly sorting

The actual URI parsing is provided by the uriparser library, which supports URI syntax as per RFC 3986.

Note that this might not be the right data type to use if you want to store user-provided URI data, such as HTTP referrers, since they might contain arbitrary junk.
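As a minimal sketch of what using it might look like, assuming the extension installs under the name uri and using a hypothetical links table (uri_host is one of the component-extraction functions the extension ships):

CREATE EXTENSION uri;

CREATE TABLE links (
    id   serial PRIMARY KEY,
    href uri NOT NULL           -- syntax-checked on the way in
);

INSERT INTO links (href) VALUES ('https://www.postgresql.org/docs/12/datatype.html');
-- Something malformed, like 'http://exa mple.com', should be rejected by the syntax check.

-- Pull out a component:
SELECT uri_host(href) FROM links;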

Postgres TOAST

TOAST is a mechanism PostgreSQL uses to keep physical data rows from exceeding the size of a data block (typically 8KB). Postgres does not support physical rows that cross block boundaries, so the block size is a hard upper limit on row size. To allow user tables to have rows wider than this, the TOAST mechanism breaks up wide field values into smaller pieces, which are stored “out of line” in a TOAST table associated with the user table.

Each table you create has its own associated (unique) TOAST table, which may or may not ever end up being used, depending on the size of rows you insert. All of this is transparent to the user, and enabled by default.
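If you're curious which TOAST table backs one of your tables, the system catalogs will tell you; a quick sketch against pg_class:

-- Find the TOAST table for a given user table, and how big it is:
SELECT reltoastrelid::regclass AS toast_table,
       pg_size_pretty(pg_relation_size(reltoastrelid)) AS toast_size
FROM pg_class
WHERE relname = 'test_table_name'
  AND reltoastrelid <> 0;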

When a row that is to be stored is “too wide” (the threshold for that is 2KB by default), the TOAST mechanism first attempts to compress any wide field values. If that isn’t enough to get the row under 2KB, it breaks up the wide field values into chunks that get stored in the associated TOAST table. Each original field value is replaced by a small pointer that shows where to find this “out of line” data in the TOAST table. TOAST will attempt to squeeze the user-table row down to 2KB in this way, but as long as it can get below 8KB, that’s good enough and the row can be stored successfully.
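You can watch the compression part happen by comparing pg_column_size(), which reports the stored (possibly compressed) size of a value, against octet_length(), which reports the raw length; a small sketch:

-- A highly repetitive value compresses very well:
CREATE TEMP TABLE toast_demo (t text);
INSERT INTO toast_demo VALUES (repeat('abc', 100000));

SELECT octet_length(t)   AS raw_bytes,      -- 300000
       pg_column_size(t) AS stored_bytes    -- far smaller, thanks to compression
FROM toast_demo;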

All standard Postgres data types that could possibly have values wider than 2KB support being “TOASTed” in this way, and so do most potentially-wide extension data types.

You can view the current TOAST options for a table by opening psql and running

\d+ table_name
for example:

=> \d+ test_table_name
                 Table "name_space.test_table_name"
 Column |       Type        | Modifiers | Storage  | Stats target | Description
--------+-------------------+-----------+----------+--------------+-------------
 foo    | character(100000) |           | extended |              |
You can modify the storage type like this:

alter table test_table_name alter column foo set storage EXTENDED;

The default TOAST strategy for most TOAST-able data types is EXTENDED, which tells the system to first attempt to compress the data for that column and, if that isn't enough, to move it out of line into the TOAST table. The other options are EXTERNAL (out-of-line storage allowed, but no compression), MAIN (compression allowed, but the value stays in line whenever possible), and PLAIN (no compression and no out-of-line storage). Be careful about disabling out-of-line storage entirely, because an attempt to store an oversize row will then fail altogether. (Columns of non-TOASTable data types show the storage option as "plain", and you can't change it.)
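For reference, here's what picking each strategy looks like, sketched against the test table from above:

-- EXTENDED (default for most TOAST-able types): compress, then move out of line if needed
ALTER TABLE test_table_name ALTER COLUMN foo SET STORAGE EXTENDED;

-- EXTERNAL: out-of-line storage allowed, but no compression
-- (can speed up substring operations on large text/bytea values)
ALTER TABLE test_table_name ALTER COLUMN foo SET STORAGE EXTERNAL;

-- MAIN: compression allowed, but keep the value in line whenever possible
ALTER TABLE test_table_name ALTER COLUMN foo SET STORAGE MAIN;

-- PLAIN: no compression, no out-of-line storage (oversize rows will simply fail)
ALTER TABLE test_table_name ALTER COLUMN foo SET STORAGE PLAIN;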

Announcing Yugabyte DB 2.0 GA: Jepsen Tested, High-Performance Distributed SQL

We are excited to announce the general availability of Yugabyte DB 2.0! The highlight of this release is that it delivers production readiness for Yugabyte SQL (YSQL), our high-performance, fully-relational distributed SQL API. For those of you new to distributed SQL, Yugabyte DB is a Google Spanner-inspired, cloud-native distributed SQL database that is 100% open source. It puts a premium on high performance, data resilience, and geographic distribution, while ensuring PostgreSQL compatibility.