
Getting Started: K-Means with SQL in BigQuery ML

Okay, so we get the idea of what K-Means is and what we’re trying to create, right? Now let’s get to the best part: doing it is surprisingly easy in BigQuery. You don’t need to install Python libraries or set up complex infrastructure. Everything is done with a couple of SQL commands. For those of us who use BigQuery for almost all of our data, this is a joy.

The first step is to “train” or create the K-Means model. We tell BigQuery what data to use, how many clusters we want (our ‘K’), and a few other options. This will generate the centroids in the model so that we can then call it for any data (existing or future) and it will tell us which cluster it belongs to and how far it is from all the centroids.

The basic syntax is like this:

 
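The original snippet didn’t survive here, so below is a minimal sketch of what the training statement looks like. The project, dataset, table, and column names (`my_project.my_dataset`, `sessions`, `conversion_rate`, `avg_order_value`) are placeholders — substitute your own:

```sql
-- Sketch of the CREATE MODEL statement for K-Means.
-- All project/dataset/table/column names here are placeholders.
CREATE OR REPLACE MODEL `my_project.my_dataset.kmeans_model`
OPTIONS (
  model_type = 'KMEANS',
  num_clusters = 5,
  kmeans_init_method = 'KMEANS++',
  standardize_features = TRUE
) AS
SELECT
  sessions,          -- numeric metric
  conversion_rate,   -- numeric metric
  avg_order_value    -- numeric metric
FROM
  `my_project.my_dataset.my_metrics_table`;
```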

Let’s analyze the `OPTIONS` we have used to understand them:

  • model_type='KMEANS':
    Tells BigQuery that we want a K-Means model.
  • num_clusters=5:
    Here you specify the ‘K’. This is the number of groups it will try to find.
  • kmeans_init_method='KMEANS++':
    This is how the algorithm chooses the starting points for the centroids. ‘KMEANS++’ is generally better than a completely random start (‘RANDOM’). Leave it as KMEANS++ and don’t give it any more thought.
  • standardize_features=TRUE:
    This is key! Imagine using “sessions” (which can be in the thousands) and “conversion rate” (which is typically a small number, like 0.02). Without standardization, sessions would completely dominate the distance calculation, and conversion rate would have almost no impact. Setting this to TRUE causes BigQuery to internally scale all metrics so they carry similar weight in the clustering. You’ll almost always want this. The only time you might disable it is when you’ve already scaled the metrics yourself, deliberately, to give some more weight than others.

And in the `SELECT`…
Be careful! Here you should only include the numeric columns you want the algorithm to use to measure the similarity between your elements. Don’t include IDs, names, URLs, or dates directly here (we’ll see how to preserve these in your queries later).

What if you don’t have a table already prepared with the numeric data needed to create the clusters?

It doesn’t matter. The training statement’s SELECT is just a regular SELECT, so you can calculate or derive the metrics there without any problem. However, if you do it this way, keep in mind that every time you want to calculate the cluster to which a piece of data belongs, you’ll have to repeat that SELECT exactly the same.
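As a hypothetical illustration, the metrics can be aggregated on the fly from a raw table instead of coming pre-computed. The table and column names (`raw_sessions`, `page_url`, `converted`) are made up for the example:

```sql
-- Sketch: computing the clustering metrics inside the training SELECT.
-- Table and column names are hypothetical placeholders.
CREATE OR REPLACE MODEL `my_project.my_dataset.kmeans_model`
OPTIONS (
  model_type = 'KMEANS',
  num_clusters = 5,
  standardize_features = TRUE
) AS
SELECT
  COUNT(*) AS sessions,
  SAFE_DIVIDE(COUNTIF(converted), COUNT(*)) AS conversion_rate
FROM
  `my_project.my_dataset.raw_sessions`
GROUP BY
  page_url;  -- grouping key only: page_url is NOT in the SELECT list,
             -- so the model never sees it as a feature
```

Note that grouping by a column without selecting it is exactly how you keep non-numeric identifiers out of the feature set while still aggregating per element.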

Step 2: Apply the model to classify the data (`ML.PREDICT`)

Once the previous query has finished (it may take a while if you have a lot of data), your K-Means model is created.

You can now use it to assign each of your data rows to one of the clusters. This is done with the `ML.PREDICT` function:
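The query itself was lost in extraction, so here is a minimal sketch of the call. The model and table names are the same placeholders as before, and `page_url` is a hypothetical pass-through column:

```sql
-- Sketch of ML.PREDICT with a K-Means model; names are placeholders.
SELECT
  *
FROM
  ML.PREDICT(
    MODEL `my_project.my_dataset.kmeans_model`,
    (
      SELECT
        page_url,         -- extra columns are passed through to the output
        sessions,         -- must match the metrics used in training
        conversion_rate
      FROM
        `my_project.my_dataset.my_metrics_table`
    )
  );
```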

This query takes your newly created model and the data table (it’s important that it has the same numerical metrics you used for training) and adds new columns to the result. Yes, the function simply respects your SELECT statement and appends the clustering columns to it.

The most important columns generated by `ML.PREDICT` for K-Means are:

  • NEAREST_CENTROID_ID:
    A number (starting at 1) indicating the cluster ID to which each element (row) has been assigned. This is the cluster label we were looking for!
  • CENTROID_DISTANCES:
    An array (a list) of numbers. It contains the Euclidean distance from the element to the centroid of each of the ‘K’ clusters: the first number is the distance to centroid 1, the second to centroid 2, and so on.
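Assuming the output shape described above (a plain array of distances ordered by centroid), the array can be flattened into one row per element-centroid pair with `UNNEST … WITH OFFSET`; names are the same placeholders as earlier:

```sql
-- Sketch: one row per (element, centroid) with its distance.
-- Assumes CENTROID_DISTANCES is an array of distances ordered by centroid ID.
SELECT
  p.nearest_centroid_id,
  off + 1 AS centroid_id,   -- array offsets start at 0, centroid IDs at 1
  dist AS distance
FROM
  ML.PREDICT(
    MODEL `my_project.my_dataset.kmeans_model`,
    (
      SELECT sessions, conversion_rate
      FROM `my_project.my_dataset.my_metrics_table`
    )
  ) AS p,
  UNNEST(p.centroid_distances) AS dist WITH OFFSET AS off;
```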

Now let’s see how to work with this data output.
