
The Challenges: Beware of Categories and Texts

K-Means doesn’t directly understand categories like “Country” (‘Spain’, ‘France’), “Device” (‘Mobile’, ‘Tablet’), or “Customer Type” (‘New’, ‘Recurring’). If you want to use this information for clustering, you’ll need to convert it to numeric format first.


Techniques like one-hot encoding (creating a 0/1 binary column for each category) can help, although BQML doesn’t do this automatically in K-Means (it does in other types of models). This adds complexity to data preparation.

However, it’s not advisable to turn categories into numbers arbitrarily. For example, if you use the category IDs from your website (let’s assume your website has 200 categories with IDs ranging from 1 to 200), K-Means will understand that category ID 1 is very close to category ID 2 and very far from category ID 200, and that doesn’t necessarily have to be true. One-hot encoding avoids this because it places every category at the same distance from every other. If you do encode a category as a single numeric column instead, order the values by a meaningful similarity first (for example, you could order the country IDs by turnover, by geographic coordinates, or by some measure of cultural distance).
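To make the distance problem concrete, here is a minimal Python sketch (outside BigQuery, standard library only). The category names and IDs are made up for illustration, and `one_hot` and `euclidean` are hypothetical helpers, not BQML functions. It compares the Euclidean distance K-Means would see with raw IDs against the distance it would see with one-hot vectors.

```python
import math


def one_hot(value, categories):
    """Return a 0/1 vector with a single 1 at the position of `value`."""
    return [1.0 if value == c else 0.0 for c in categories]


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical category IDs, as if taken from a website's category table.
category_ids = {"Shoes": 1, "Garden": 2, "Sandals": 200}
categories = sorted(category_ids)  # fixed column order for the one-hot vectors

# Raw IDs: "Shoes" looks almost identical to "Garden" and very far from "Sandals".
id_shoes_garden = euclidean([category_ids["Shoes"]], [category_ids["Garden"]])
id_shoes_sandals = euclidean([category_ids["Shoes"]], [category_ids["Sandals"]])

# One-hot: every pair of distinct categories is equally far apart.
oh_shoes_garden = euclidean(one_hot("Shoes", categories), one_hot("Garden", categories))
oh_shoes_sandals = euclidean(one_hot("Shoes", categories), one_hot("Sandals", categories))

print("raw IDs:", id_shoes_garden, id_shoes_sandals)
print("one-hot:", oh_shoes_garden, oh_shoes_sandals)
```

With raw IDs the two distances differ by two orders of magnitude purely because of how the IDs happened to be assigned. With one-hot vectors both distances are identical, so the category itself no longer implies any ordering.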

  • Free Text:
    Forget about directly entering text from a product description, article content, or a search query. K-Means can’t calculate distances on raw text. You need to extract numerical features from that text (e.g., length, number of specific keywords) or, as the most advanced and powerful solution, convert the text into embeddings (numerical vectors), as we mentioned before.
  • IDs and Dates: Don’t use unique identifiers (`user_id`, `product_sku`, `order_id`) as metrics for clustering. They have no numerical meaning for grouping! Don’t use dates directly either; if you want to use temporal information, calculate numerical metrics from them (e.g., “days since last visit,” “month of purchase”).
  • A Common Mistake:
    Trying to cram everything you have in the table into the `SELECT` of the `CREATE MODEL` statement. Remember: include only the numerical metrics that define the similarity you’re looking for. And even if you have a lot of numerical metrics, it’s quite likely that many of them won’t make sense to consider for clustering.
  • IKAUE’s tip: Before launching `CREATE MODEL`, think carefully: Which elements do I want to group? And based on what measurable *numerical characteristics* are those elements similar or different? If you have questions about how to prepare your categorical or textual data, or which metrics to choose, now’s a good time to ask us!
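The Free Text and IDs and Dates points above can be sketched in a few lines of Python. This is only an illustration under assumptions: the row, the keyword list, and the helpers `text_features` and `date_features` are hypothetical, and in practice you would probably compute the same values in the SQL that feeds the model. Embeddings are not shown here.

```python
from datetime import date


def text_features(text, keywords):
    """Turn raw text into simple numeric features: length and keyword counts."""
    words = text.lower().split()
    features = {"n_chars": len(text), "n_words": len(words)}
    for kw in keywords:
        features[f"kw_{kw}"] = words.count(kw)
    return features


def date_features(last_visit, purchase_date, today):
    """Replace raw dates with numeric metrics derived from them."""
    return {
        "days_since_last_visit": (today - last_visit).days,
        "month_of_purchase": purchase_date.month,
    }


# Hypothetical row; all values are invented for the example.
row = {
    "user_id": "u-123",  # identifier: useful to label the row, never a feature
    "search_query": "cheap running shoes for trail running",
    "last_visit": date(2024, 3, 1),
    "purchase_date": date(2024, 2, 14),
}

# Build the feature set that would go to clustering; user_id is left out on purpose.
features = {}
features.update(text_features(row["search_query"], ["running", "cheap"]))
features.update(
    date_features(row["last_visit"], row["purchase_date"], today=date(2024, 3, 31))
)
print(features)
```

Every value in `features` is numeric, and the identifier stays outside the feature set. Passing `today` explicitly keeps the result reproducible instead of depending on the day the code runs.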
