Maximum (usable) number of rows in a PostgreSQL table

The rule is: try to find the most common working set and see if it fits in RAM. Optimize hardware, PG/OS buffer settings, and PG indexes/clustering for it. Otherwise, look for aggregates, or, if that is not acceptable and you need fully random access, think about what hardware could scan the whole table for you in reasonable time.
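A minimal sketch of the buffer settings in question; the values are illustrative assumptions for a machine with 16 GB of RAM, not recommendations:

  -- Illustrative values only; size these to your actual RAM and workload.
  ALTER SYSTEM SET shared_buffers = '4GB';         -- PostgreSQL's own buffer cache
  ALTER SYSTEM SET effective_cache_size = '12GB';  -- planner's estimate of OS cache + shared_buffers
  SELECT pg_reload_conf();  -- note: shared_buffers only takes effect after a restart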

.. Are there any columns that are commonly used for filtering, such as state or date? Can you identify the working set that is most commonly used (like only the last month)? If so, consider partitioning or clustering on these columns, and definitely index them. Basically, you’re trying to make sure that as much of the working set as possible fits in RAM.
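A minimal sketch of partitioning plus indexing on the filter columns, using declarative partitioning (PostgreSQL 11+); the events table and its columns are made up for illustration:

  -- Hypothetical table, range-partitioned by date so the hot partition
  -- (e.g. the current month) is small enough to stay in RAM.
  CREATE TABLE events (
      id         bigserial,
      state      text NOT NULL,
      created_at date NOT NULL
  ) PARTITION BY RANGE (created_at);

  CREATE TABLE events_2024_01 PARTITION OF events
      FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

  -- Index the columns that are commonly used for filtering.
  CREATE INDEX ON events (state, created_at);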

Avoid scanning the table at all costs if it does not fit in RAM.
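EXPLAIN is the quickest way to see whether a query will fall back to a full scan; this assumes the hypothetical events table sketched above:

  EXPLAIN SELECT count(*) FROM events WHERE state = 'active';
  -- "Seq Scan" in the plan means a whole table (or partition) gets read;
  -- a "Bitmap Heap Scan" or "Index Scan" driven by the (state, created_at)
  -- index keeps the read set close to the size of the actual result.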

Postgres: Introduction to VACUUM, ANALYZE, EXPLAIN, and COUNT

The ‘MV’ in MVCC stands for Multi Version. This means that multiple versions of the same data will be kept any time that data changes. Oracle does this by rolling old data into an “undo log.” PostgreSQL doesn’t use an undo log; instead it keeps multiple versions of data in the base tables. This means that there is much less overhead when making updates, but it also means that the old versions must occasionally be removed. This is one of the things VACUUM does.
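That removal is a plain VACUUM (autovacuum does it automatically on modern versions); my_table here is a placeholder name:

  VACUUM VERBOSE my_table;   -- reclaim dead row versions and report what was found
  VACUUM ANALYZE my_table;   -- same, plus refresh the planner’s statistics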

The way PostgreSQL manages these multiple versions is by storing some extra information with every row. This information is used to determine what transactions should be able to see the row. If the row is an old version, there is information that tells PostgreSQL where to find the new version of the row.
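That per-row bookkeeping is exposed through PostgreSQL’s hidden system columns; a quick way to look at it (my_table is a placeholder):

  SELECT xmin,   -- id of the transaction that created this row version
         xmax,   -- id of the transaction that deleted/updated it (0 if still live)
         ctid,   -- physical location (page, item) of this version
         *
  FROM my_table
  LIMIT 5;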

.. Consider this scenario: a row is inserted into a table that has a couple of indexes, and that transaction commits. Several updates then happen on that row. Each update will create a new entry in every index, even if the index key didn’t change, and each update will also leave an old version of the row in the base table, one that has been updated to point to the location of the new version that replaces it. All of the old data will stick around until a vacuum is run on that table.
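A small demo of that churn (recent PostgreSQL versions can avoid the index writes via HOT updates when the key doesn’t change, but the dead heap tuples still accumulate):

  CREATE TABLE t (id int PRIMARY KEY, v int);
  INSERT INTO t VALUES (1, 0);
  SELECT ctid FROM t;                   -- e.g. (0,1)
  UPDATE t SET v = v + 1 WHERE id = 1;
  UPDATE t SET v = v + 1 WHERE id = 1;
  SELECT ctid FROM t;                   -- moved, e.g. (0,3); the old versions are dead
  SELECT n_dead_tup FROM pg_stat_user_tables
  WHERE relname = 't';                  -- dead tuples show up here (with a short delay)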

.. When the database needs to add new data to a table as the result of an INSERT or UPDATE, it needs to find someplace to store that data. There are 3 ways it could do this:

  1. Scan through the table to find some free space
  2. Just add the information to the end of the table
  3. Remember what pages in the table have free space available, and use one of them

Option 1 would obviously be extremely slow. Imagine potentially reading the entire table every time you wanted to add or update data! Option 2 is fast, but it would result in the table growing in size every time you added a row. That leaves option 3, which is where the free space map (FSM) comes in.
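The FSM can be inspected directly with the pg_freespacemap extension that ships with PostgreSQL, shown here against the demo table t from above:

  CREATE EXTENSION pg_freespacemap;
  SELECT * FROM pg_freespace('t');  -- free bytes tracked by the FSM, per page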

.. The only way pages are put into the FSM is via a VACUUM.
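Continuing the demo, running VACUUM on t should make the space freed by the dead row versions appear in the FSM:

  VACUUM t;
  SELECT * FROM pg_freespace('t');  -- the reclaimed space is now recorded as free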

.. you’ll recall that PostgreSQL’s MVCC (Multi-Version Concurrency Control) does away with the need for expensive read locks by keeping multiple versions of table rows that have been updated, and not immediately removing deleted rows. This is done by storing ‘visibility information’ in each row. But for performance reasons, this information is not stored in indexes. This means that every time a row is read from an index, the engine has to also read the actual row in the table to ensure that the row hasn’t been deleted.
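EXPLAIN ANALYZE makes that extra heap visit visible. Worth noting that PostgreSQL 9.2+ added index-only scans backed by a visibility map, which can skip the heap for pages known to be all-visible:

  EXPLAIN (ANALYZE, BUFFERS)
  SELECT id FROM t WHERE id = 1;
  -- A plain "Index Scan" visits the heap for every match to check visibility;
  -- an "Index Only Scan" reports "Heap Fetches: N" for the rows whose pages
  -- were not marked all-visible in the visibility map.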