Every person that reads newspapers, magazines or any other media of general interest has at least a basic idea of what Machine Learning is. And this is not only a fashion, Machine Learning is already part of our everyday life and will be much more in the future: from personalized advertisement on the Internet to robot dentists or autonomous cars, Machine Learning seems to be some kind of super power capable of everything.
But, what is Machine Learning really? It is mainly a set of statistical algorithms that, based on existing data, are capable of deriving insights out of them. These algorithms are basically divided into two families, supervised and unsupervised learning. In supervised learning, the objective is to perform some kind of prediction, such as, for example, if an e-mail message is spam or not (classification), how many beers will be sold next week in a supermarket (regression), etc. Unsupervised Learning, on the contrary, focuses on answering the question how are my cases divided in groups? What these algorithms do (each of them with their particularities) is to bring similar items as close as possible and keep items that differ from each other as far as possible.
Step 3: Kmeans in PostgreSQL in a nutshell
Functions written with PL/Python can be called like any other SQL function. As Python has endless libraries for Machine Learning, the integration is very simple. Moreover, apart from giving full support to Python, PL/Python also provides a set of convenience functions to run any parametrized query. So, executing Machine Learning algorithms can be a question of a couple of lines. Let’s take a look
CREATE OR replace FUNCTION kmeans(input_table text, columns text, clus_num int) RETURNS bytea AS
from pandas import DataFrame
from sklearn.cluster import KMeans
from cPickle import dumps
all_columns = “,”.join(columns)
if all_columns == “”:
all_columns = “*”
rv = plpy.execute(‘SELECT %s FROM %s;’ % (all_columns, plpy.quote_ident(input_table)))
frame = 
for i in rv:
df = DataFrame(frame).convert_objects(convert_numeric =True)
kmeans = KMeans(n_clusters=clus_num, random_state=0).fit(df._get_numeric_data())
$$ LANGUAGE plpythonu;
The recommended value is 25% of your total machine RAM. You should try some lower and higher values because in some cases we achieve good performance with a setting over 25%. The configuration really depends on your machine and the working data set. If your working set of data can easily fit into your RAM, then you might want to increase the shared_buffer value to contain your entire database, so that the whole working set of data can reside in cache. That said, you obviously do not want to reserve all RAM for PostgreSQL.
Replication is one of the well-known features that allows us to build an identical copy of a database. It is supported in almost every RDBMS. The advantages of replication may be huge, especially HA (High Availability) and load balancing. But what if we need to build replication between 2 heterogeneous databases like MySQL and PostgreSQL? Can we continuously replicate changes from a MySQL database to a PostgreSQL database? The answer to this question is pg_chameleon.
For replicating continuous changes, pg_chameleon uses the mysql-replication library to pull the row images from MySQL, which are transformed into a jsonb object. A pl/pgsql function in postgres decodes the jsonb and replays the changes into the postgres database. In order to setup this type of replication, your mysql binlog_format must be “ROW”.