MACHINE LEARNING IN POSTGRESQL PART 1: KMEANS CLUSTERING

Anyone who reads newspapers, magazines or any other general-interest media has at least a basic idea of what Machine Learning is. And it is more than a passing fashion: Machine Learning is already part of our everyday life and will be even more so in the future. From personalized advertising on the Internet to robot dentists and autonomous cars, Machine Learning seems to be some kind of superpower capable of anything.

But what is Machine Learning really? It is mainly a set of statistical algorithms that derive insights from existing data. These algorithms fall into two broad families: supervised and unsupervised learning. In supervised learning, the objective is to make some kind of prediction, for example whether an e-mail message is spam or not (classification) or how many beers a supermarket will sell next week (regression). Unsupervised learning, by contrast, focuses on answering the question "how are my cases divided into groups?" What these algorithms do (each with its own particularities) is bring similar items as close together as possible and keep items that differ from each other as far apart as possible.
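To make the grouping idea concrete, here is a minimal sketch of the Kmeans procedure in plain Python. The function name toy_kmeans, the sample points and the fixed iteration count are assumptions made for illustration only; it is not the implementation used by any library.

```python
def toy_kmeans(points, k, iterations=10):
    """Illustrative k-means on 2-D points; returns (centers, labels)."""
    # Assumption: start from the first k points (real libraries use smarter seeding).
    centers = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center
        # (squared Euclidean distance).
        labels = [
            min(
                range(k),
                key=lambda c: (p[0] - centers[c][0]) ** 2 + (p[1] - centers[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each center to the mean of its assigned points.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centers, labels


# Two visually obvious groups: one near (1, 1) and one near (8, 8).
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centers, labels = toy_kmeans(points, k=2)
print(labels)
```

The two alternating steps are the whole idea: points are pulled towards the closest center, and the centers then follow their points until the groups stop changing.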

Step 3: Kmeans in PostgreSQL in a nutshell

Functions written with PL/Python can be called like any other SQL function. Because Python has a rich ecosystem of Machine Learning libraries, the integration is very simple. Moreover, apart from giving full support to Python, PL/Python also provides a set of convenience functions to run any parametrized query. So executing a Machine Learning algorithm can be a matter of a couple of lines. Let’s take a look:

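CREATE OR REPLACE FUNCTION kmeans(input_table text, columns text[], clus_num int) RETURNS bytea AS
$$
from pandas import DataFrame
from sklearn.cluster import KMeans
from cPickle import dumps  # Python 2 module (plpythonu)

# Build the column list; an empty array means "all columns".
all_columns = ",".join(columns)
if all_columns == "":
    all_columns = "*"

# Fetch the rows; the table name is quoted as an identifier.
rv = plpy.execute('SELECT %s FROM %s;' % (all_columns, plpy.quote_ident(input_table)))

# Each result row behaves like a dict, so the rows map directly onto a DataFrame.
frame = []
for i in rv:
    frame.append(i)
df = DataFrame(frame).convert_objects(convert_numeric=True)

# Cluster on the numeric columns only and return the fitted model pickled as bytes.
kmeans = KMeans(n_clusters=clus_num, random_state=0).fit(df._get_numeric_data())
return dumps(kmeans)
$$ LANGUAGE plpythonu;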
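Since the function returns the fitted model as bytea, it helps to see what happens to those bytes on the consuming side. The following is a rough sketch of the round trip outside the database, assuming Python 3 (where the module is called pickle rather than cPickle) and a recent scikit-learn; the sample array stands in for the numeric columns of a table and is invented for illustration.

```python
import pickle

import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the numeric columns the SQL function would read from a table.
data = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                 [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

# Same call as in the function body; n_init is set explicitly so the
# behaviour does not depend on the scikit-learn version's default.
model = KMeans(n_clusters=2, random_state=0, n_init=10).fit(data)

# This is what the function returns: the fitted model as bytes (bytea in PostgreSQL).
model_bytes = pickle.dumps(model)

# On the consuming side, the bytes are turned back into a usable model.
restored = pickle.loads(model_bytes)
new_labels = restored.predict(np.array([[1.1, 0.9], [7.9, 8.0]]))
print(new_labels)
```

In SQL, the call would look something like SELECT kmeans('my_table', ARRAY['col_a', 'col_b'], 3); where the table and column names are placeholders. Keep in mind that unpickling executes code, so only load model bytes that come from a source you trust.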