Won’t You Be My Neighbor? Quickly Finding Who is Nearby

Many applications these days want us to know how close we are to things:

  • What are the three closest coffee shops to my current location?
  • Which is the nearest airport to the office?
  • What are the two closest subway stops to the restaurant?

and countless more examples.

Another way of asking these questions is “who are my nearest neighbors?” This maps to a classic algorithmic problem: efficiently finding the K nearest neighbors (K-NN), where K is a constant. For example, the first question is a 3-NN problem, since we are trying to find the 3 closest coffee shops.
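To make the problem concrete, here is a minimal brute-force sketch in Python. It is only an illustration of the K-NN question itself, not how a database answers it: the function name k_nearest and the shop coordinates are made up for this example, and distances are plain Euclidean distances on a flat grid.

```python
import heapq
import math


def k_nearest(origin, places, k):
    """Return the k (name, point) pairs closest to origin by Euclidean distance."""
    return heapq.nsmallest(k, places, key=lambda place: math.dist(origin, place[1]))


# Hypothetical coffee shops on a flat grid (illustrative coordinates only).
shops = [
    ("A", (1.0, 1.0)),
    ("B", (5.0, 5.0)),
    ("C", (0.0, 2.0)),
    ("D", (-1.0, 0.0)),
    ("E", (3.0, 4.0)),
]

nearest_three = k_nearest((0.0, 0.0), shops, 3)
print([name for name, _ in nearest_three])  # ['D', 'A', 'C']
```

This approach computes the distance to every candidate, which is fine for a handful of points but is exactly the cost that a spatial index tries to avoid.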

(If you are interested in learning more about K-NN problems in general, I highly recommend looking at how you can solve this using n-dimensional Voronoi diagrams, a wonderful data structure developed in the field of computational geometry.)

PostgreSQL defines a distance operator for geometric types, written “<->”, which, in the case of points, calculates the two-dimensional Euclidean distance. For example:

SELECT POINT(0,0) <-> POINT(1,1);

    ?column?
-----------------
 1.4142135623731
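For comparison, the same calculation can be sketched in Python. This only mirrors the formula for the two-dimensional Euclidean distance; the helper name euclidean_distance is an assumption for this example and says nothing about how PostgreSQL implements the operator.

```python
import math


def euclidean_distance(p, q):
    """2-D Euclidean distance between points p and q, each an (x, y) tuple."""
    return math.hypot(q[0] - p[0], q[1] - p[1])


d = euclidean_distance((0, 0), (1, 1))
print(d)  # approximately 1.4142, the square root of 2
```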


.. If we want to find the three friends who were closest to us on October 1, 2012 between 7:00am and 9:00am, we could construct a query like this:

SELECT visitor, visited_at, geocode
FROM visits
WHERE
    visited_at BETWEEN '2012-10-01 07:00' AND '2012-10-01 09:00'
ORDER BY POINT(40.7127263,-74.0066592) <-> geocode
LIMIT 3;


PostgreSQL 9.1 introduced the KNN-GiST index as a way to accelerate searching for neighbors. It has been implemented on several data types, including points and trigrams, and is also leveraged by the PostGIS geospatial extension.

.. You can use a KNN-GiST index simply by creating a GiST index on a column of a supported data type, which in this case is the geocode column:

CREATE INDEX visits_geocode_gist_idx ON visits USING gist(geocode);
VACUUM ANALYZE visits;

To demonstrate its power, let’s see what happens if we try to find the 3 nearest points to a given location:

EXPLAIN ANALYZE SELECT visitor, visited_at, geocode
FROM visits
ORDER BY POINT(40.7127263,-74.0066592) <-> geocode
LIMIT 3;

Voronoi diagram (Wikipedia)

Figure: 20 points and their Voronoi cells

In mathematics, a Voronoi diagram is a partitioning of a plane into regions based on distance to points in a specific subset of the plane. That set of points (called seeds, sites, or generators) is specified beforehand, and for each seed there is a corresponding region consisting of all points closer to that seed than to any other. These regions are called Voronoi cells. The Voronoi diagram of a set of points is dual to its Delaunay triangulation.

It is named after Georgy Voronoi, and is also called a Voronoi tessellation, a Voronoi decomposition, a Voronoi partition, or a Dirichlet tessellation (after Peter Gustav Lejeune Dirichlet). Voronoi diagrams have practical and theoretical applications in a large number of fields, mainly in science and technology, but also in visual art.[1][2] They are also known as Thiessen polygons.
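The definition above implies that finding the Voronoi cell containing a point is the same as finding that point's single nearest seed. A minimal Python sketch of this idea follows; the function name voronoi_cell and the seed coordinates are illustrative assumptions, and the code checks every seed rather than building the diagram.

```python
import math


def voronoi_cell(point, seeds):
    """Return the index of the seed nearest to point.

    Points exactly on a cell boundary are equidistant from two or more
    seeds; this sketch assigns them to the lowest seed index.
    """
    return min(range(len(seeds)), key=lambda i: math.dist(point, seeds[i]))


# Hypothetical seeds and query points.
seeds = [(0.0, 0.0), (4.0, 0.0), (2.0, 3.0)]
queries = [(1.0, 0.5), (3.5, 0.2), (2.0, 2.5)]

cells = [voronoi_cell(q, seeds) for q in queries]
print(cells)  # [0, 1, 2]
```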

MACHINE LEARNING IN POSTGRESQL PART 1: KMEANS CLUSTERING

Anyone who reads newspapers, magazines, or other general-interest media has at least a basic idea of what Machine Learning is. And it is not just a fashion: Machine Learning is already part of our everyday life and will be even more so in the future. From personalized advertising on the Internet to robot dentists and autonomous cars, Machine Learning seems to be some kind of superpower capable of anything.

But what is Machine Learning, really? It is mainly a set of statistical algorithms that derive insights from existing data. These algorithms are broadly divided into two families: supervised and unsupervised learning. In supervised learning, the objective is to make some kind of prediction, for example whether an e-mail message is spam or not (classification) or how many beers a supermarket will sell next week (regression). Unsupervised learning, by contrast, focuses on answering the question “how are my cases divided into groups?” What these algorithms do (each with its own particularities) is bring similar items as close together as possible and keep dissimilar items as far apart as possible.
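As an illustration of the unsupervised case, here is a minimal sketch of k-means clustering (Lloyd's algorithm) in plain Python. It is a teaching sketch under simplifying assumptions: the first k points serve as the initial centroids, whereas libraries such as scikit-learn default to the more careful k-means++ initialization, and the sample points are invented.

```python
import math


def kmeans(points, k, iterations=100):
    """Lloyd's algorithm. Returns (centroids, labels).

    Assumption for this sketch: the first k points are the initial centroids.
    """
    centroids = [tuple(p) for p in points[:k]]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, labels


# Hypothetical data: two visibly separate groups of points.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]
centroids, labels = kmeans(points, k=2)
print(labels)  # [0, 1, 0, 1, 0, 1]
```

Each point ends up labeled with the cluster whose centroid is nearest, which is the "similar items close together" idea in its simplest form.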

Step 3: Kmeans in PostgreSQL in a nutshell

Functions written in PL/Python can be called like any other SQL function. Because Python has a vast number of Machine Learning libraries, the integration is very simple. Moreover, apart from giving full support to Python, PL/Python also provides a set of convenience functions for running parameterized queries. So, executing a Machine Learning algorithm can be a matter of a couple of lines. Let’s take a look:

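CREATE OR REPLACE FUNCTION kmeans(input_table text, columns text[], clus_num int) RETURNS bytea AS
$$
# Assumes Python 3 (plpython3u); the original targeted Python 2 (plpythonu, cPickle).
from pandas import DataFrame, to_numeric
from sklearn.cluster import KMeans
from pickle import dumps

# Quote every identifier; an empty column list selects all columns.
all_columns = ",".join(plpy.quote_ident(c) for c in columns)
if all_columns == "":
    all_columns = "*"

rv = plpy.execute('SELECT %s FROM %s;' % (all_columns, plpy.quote_ident(input_table)))

# Each result row is a dict keyed by column name.
frame = []
for i in rv:
    frame.append(i)

# Coerce values to numbers and drop columns that contain no numeric data
# (convert_objects, used originally, is no longer available in recent pandas).
df = DataFrame(frame).apply(to_numeric, errors="coerce").dropna(axis=1, how="all")
kmeans = KMeans(n_clusters=clus_num, random_state=0).fit(df)
return dumps(kmeans)
$$ LANGUAGE plpython3u;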

K Means Clustering: Redefining Central Midfield Classifications

I included only matches in which a player started as a central midfielder, and only players who had played at least the equivalent of 10 matches (900 minutes) in that position.

From there, I decided on 25 features on which to measure the players. The features range from physical attributes (e.g., height, weight) to positional information (e.g., standard deviation of vertical movement) to passing, defensive, and shooting metrics. I tried to include features that cover the majority of a player’s actions during a match.
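Features like these live on very different scales (height in centimeters versus a rate of passes, for instance), and k-means works on distances, so a large-valued feature can dominate the result. The text does not say whether the features were rescaled, so the following is only a sketch of one common preparation step, z-score standardization. The function name standardize and the player values are hypothetical.

```python
import statistics


def standardize(rows):
    """Z-score each column: subtract the column mean, divide by its population standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    return [
        # A constant column has zero deviation; map it to 0.0 instead of dividing by zero.
        tuple((value - m) / s if s else 0.0 for value, m, s in zip(row, means, stdevs))
        for row in rows
    ]


# Hypothetical players: (height in cm, passes per 90 minutes).
players = [(170.0, 40.0), (180.0, 60.0), (190.0, 80.0)]
scaled = standardize(players)
print(scaled)
```

After scaling, both columns have mean 0 and standard deviation 1, so neither feature outweighs the other purely because of its units.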

.. After I had my 25 features, I needed to determine how many different classifications of players to define. There is no standard procedure for determining K, the number of clusters, when working with clustering algorithms.