
Category Classification ChatGPT 3.5 vs. Embeddings

Currently, there are two obvious ways to classify text. GPT-3.5-turbo (the ChatGPT model) is probably the most popular at the moment. However, OpenAI recommends the Embeddings API for this purpose:

OpenAI's text embeddings measure the relatedness of text strings. Embeddings are commonly used for:

  • Search (where results are ranked by relevance to a query string)
  • Clustering (where text strings are grouped by similarity)
  • Recommendations (where items with related text strings are recommended)
  • Anomaly detection (where outliers with little relatedness are identified)
  • Diversity measurement (where similarity distributions are analyzed)
  • Classification (where text strings are classified by their most similar label)
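As a sketch of that last bullet: embedding-based classification assigns each text the label whose embedding is most similar to the text's embedding, typically by cosine similarity. The vectors below are toy stand-ins for real embedding API output:

```python
from math import sqrt

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector norms
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def classify(text_vec, label_vecs):
    # Pick the label whose embedding is most similar to the text embedding
    return max(label_vecs, key=lambda label: cosine_similarity(text_vec, label_vecs[label]))

# Toy 3-dimensional "embeddings" standing in for real API output
label_vecs = {
    "Sports": [0.9, 0.1, 0.0],
    "Finance": [0.1, 0.9, 0.2],
}
print(classify([0.8, 0.2, 0.1], label_vecs))  # -> Sports
```

With real embeddings the vectors have ~1,500 dimensions, but the nearest-label logic is exactly this.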

We want to categorize more than 3,000 websites into sixty categories. Since it is not clear which of the two approaches is more effective, both were tested using Python. The code used for this, as well as an analysis of the results, can be found below.


Code: GPT-3 (davinci)

Later edits:

  • fix for 404s
  • fixed columns
  • fix for non-www/www

import pandas as pd
import requests
from bs4 import BeautifulSoup
import openai
import csv

# Load the data from the CSV files
cat_df = pd.read_csv("cat.csv")
products_df = pd.read_csv("products.csv")

# Read the categories from the first column (index 0) of cat.csv
categories = cat_df.iloc[:, 0].tolist()

# Set up OpenAI API
openai.api_key = "MYWONDERFULAPIKEY"

# Function to categorize a URL based on the first 200 words
def categorize_url(url):
    try:
        # Scrape the page with a 10-second timeout
        response = requests.get(url, timeout=10)
    except requests.exceptions.RequestException:
        # If the first attempt fails, retry with the 'www' prefix toggled
        try:
            if "www." in url:
                url = url.replace("www.", "")
            else:
                url = url.replace("://", "://www.")
            response = requests.get(url, timeout=10)
        except requests.exceptions.RequestException as e:
            print(f"Error processing URL '{url}': {e}")
            return []

    # Check if the URL is reachable
    if response.status_code != 200:
        return []

    soup = BeautifulSoup(response.content, "html.parser")
    text = " ".join(soup.stripped_strings)[:200]

    # Perform an OpenAI request to categorize the URL based on the first 200 words
    response = openai.Completion.create(
        model="text-davinci-003",  # davinci completion model
        prompt=f"Categorize the text strictly based on the given categories: Categories: {', '.join(categories)}\n Text: {text}\n Top 3 categories based on this text are:",
        max_tokens=64,
        temperature=0,
    )

    # Extract the top 3 categories from the response
    top_categories = response.choices[0].text.strip().split(", ")

    return top_categories

# Write the output to output.csv
with open("output.csv", "w", newline="") as csvfile:
    csvwriter = csv.writer(csvfile)
    csvwriter.writerow(["url", "cat1", "cat2", "cat3", "cat4", "cat5"])

    # Loop through the first 100 rows of products.csv
    for i in range(100):
        product = products_df.iloc[i]

        # Get the URL from the second column (index 1)
        url = product.iloc[1]

        # Categorize the URL
        top_categories = categorize_url(url)

        # Prepare the output
        output = [url] + top_categories + [""] * (5 - len(top_categories))

        # Write the output to the CSV file
        csvwriter.writerow(output)

Fix for categories invented by the model

Even with models up to GPT-4, it is hardly possible to prevent the model from outputting categories that were not in the provided list. With this script, despite trying different prompts stating that only categories from the list may be chosen, the model picks a category not on the list 1-2% of the time.
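A quick way to measure how often the model invents labels is to check each returned category against the allowed set. A minimal sketch, with made-up categories and rows standing in for the real `cat.csv` and output data:

```python
allowed = {"Sports", "Finance", "Travel"}  # stands in for the sixty categories in cat.csv

rows = [
    ["example.com", "Sports", "Finance"],
    ["example.org", "Sports", "Cooking"],  # "Cooking" is not an allowed category
]

# Count categories (columns after the URL) that are not on the list
invented = sum(1 for row in rows for cat in row[1:] if cat not in allowed)
total = sum(len(row) - 1 for row in rows)
print(f"{invented}/{total} categories were not on the list")  # -> 1/4
```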

Script for removing duplicate entries from each row:

import csv

with open('outputdirty.csv', 'r') as infile, open('outputclean.csv', 'w', newline='') as outfile:
    reader = csv.reader(infile, delimiter=';')
    writer = csv.writer(outfile, delimiter=';')
    for row in reader:
        entries = row[0].split(';')
        cleaned_entries = []
        for entry in entries:
            if entry not in cleaned_entries:
                cleaned_entries.append(entry)
        writer.writerow(cleaned_entries)

Endlessly tweaking the prompt may help prevent this, but we now ask for 3 to 5 categories (the model almost always gives 5), of which the first 3 are the most relevant. It is therefore easier to compare the output afterwards against the categories in the list and clear every cell containing a category that is not on it. This was done with the following script (it could also have been incorporated into the main script):

import csv

# Read in the categories from cat.csv
with open('cat.csv', 'r') as cat_file:
    cat_reader = csv.reader(cat_file)
    categories = set([row[0] for row in cat_reader])

# Iterate through outputclean.csv and filter out unwanted categories
with open('outputclean.csv', 'r', newline='') as input_file, open('outputfinal.csv', 'w', newline='') as output_file:
    reader = csv.reader(input_file, delimiter=';')
    writer = csv.writer(output_file, delimiter=';')
    for row in reader:
        url = row[0]
        row_categories = set(row[1:])
        # Filter out unwanted categories
        filtered_categories = [c for c in row_categories if any([c == category or c.startswith(category + ';') for category in categories])]
        # Write the row to outputfinal.csv
        writer.writerow([url] + filtered_categories)
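To show what the filter in the script above keeps and drops, here is a minimal run of the same list comprehension with made-up categories and a single row:

```python
categories = {"Books", "Electronics"}  # stands in for the set read from cat.csv

row = ["example.com", "Books", "Gadgets", "Electronics"]

# Keep only cells that exactly match an allowed category
# (or start with one followed by ';', for merged cells)
filtered = [c for c in row[1:]
            if any(c == category or c.startswith(category + ';') for category in categories)]
print([row[0]] + filtered)  # "Gadgets" is dropped
```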