Text Classification for Survey Data Using SetFit
What is SetFit?
SetFit (Sentence Transformer Fine-Tuning) is a framework for few-shot text classification: you need only 8-16 labeled examples per category to train a high-quality classifier. It also works well with multilingual data (Bahasa Indonesia and English).
Step 1: Installation
Using uv (recommended)
pyproject.toml:
[project]
dependencies = [
"datasets>=4.8.4",
"setfit>=1.1.3",
"torch>=2.11.0",
"transformers<5",
"pandas",
"openpyxl",
"scikit-learn",
"matplotlib",
]
Then run:
uv sync
Using pip
pip install "setfit>=1.1.3" "transformers<5" "datasets>=4.8.4" "torch>=2.11.0" pandas openpyxl scikit-learn matplotlib
transformers<5 is required to avoid the default_logdir import error.
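To confirm the pin took effect, a small helper can check the installed version string's major version. The helper below is a convenience sketch, not part of any library; feed it the version reported by pip show transformers:

```python
def transformers_pin_ok(version: str) -> bool:
    """Return True if a version string satisfies the transformers<5 pin."""
    major = int(version.split(".")[0])
    return major < 5

# Example: check the version string reported by `pip show transformers`
print(transformers_pin_ok("4.44.2"))  # True: the <5 pin is satisfied
print(transformers_pin_ok("5.0.0"))   # False: would hit the default_logdir error
```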
Step 2: Download the Base Model Locally (Optional but Recommended)
Download the model once, then load it from the local path so it is not re-downloaded on every run:
from sentence_transformers import SentenceTransformer
# Download and save to local folder (run once)
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
model.save("./models/paraphrase-multilingual-MiniLM-L12-v2")
Then use the local path in SetFit:
from setfit import SetFitModel
# Load from local path instead of HuggingFace Hub
model_col1 = SetFitModel.from_pretrained(
"./models/paraphrase-multilingual-MiniLM-L12-v2"
)
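If the local folder might not exist yet (for example, on a fresh machine), a small fallback keeps the script runnable either way. The resolve_base_model helper is an assumption for illustration, not part of the SetFit API:

```python
import os

HUB_ID = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
LOCAL_DIR = "./models/paraphrase-multilingual-MiniLM-L12-v2"

def resolve_base_model(local_dir: str = LOCAL_DIR, hub_id: str = HUB_ID) -> str:
    """Prefer the local copy if it exists; otherwise fall back to the Hub id."""
    return local_dir if os.path.isdir(local_dir) else hub_id

# SetFitModel.from_pretrained(resolve_base_model()) then works on any machine
print(resolve_base_model("/definitely/missing/path"))  # falls back to the Hub id
```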
Step 3: Prepare Your Labeled Data
You need to manually label a small sample from your data.
Column 1: Suggestion Classification
Label ~8-16 examples per category:
from datasets import Dataset
# Example labeled data for Column 1
train_data_col1 = Dataset.from_dict({
"text": [
# === suggestion (label: 1) ===
"Tambahkan fitur timer agar peserta tahu sisa waktu",
"Sebaiknya ada practice test sebelum ujian dimulai",
"Improve the UI, it's hard to navigate",
"Akan lebih baik jika soal bisa di-review sebelum submit",
"Tolong perbaiki loading time, terlalu lama",
"Please add a progress bar",
"Mungkin bisa ditambahkan instruksi yang lebih jelas",
"Saran saya, buat tampilan lebih user friendly",
# === not-suggestion (label: 0) ===
"Tidak ada",
"-",
"Sudah bagus",
"No suggestion",
"N/A",
"Oke semua",
"Nothing to add",
"Sudah cukup baik",
],
"label": [1, 1, 1, 1, 1, 1, 1, 1,
0, 0, 0, 0, 0, 0, 0, 0],
})
# Label mapping
col1_labels = {0: "not-suggestion", 1: "suggestion"}
Column 2: Feedback Classification
Label examples for each of the 4 categories (4 per category minimum; 8-16 per category is better):
train_data_col2 = Dataset.from_dict({
"text": [
# === good no input (label: 0) ===
"Bagus",
"Good experience",
"Sudah oke",
"No feedback, everything is fine",
# === good with input (label: 1) ===
"Bagus, tapi waktu pengerjaan bisa ditambah sedikit",
"Good overall, the instructions were clear and helpful",
"Pengalaman baik, UI nya intuitif dan mudah dipahami",
"Great assessment, especially the coding section was well designed",
# === bad no input (label: 2) ===
"Jelek",
"Bad experience",
"Kurang bagus",
"Not good",
# === bad with input (label: 3) ===
"Pengalaman buruk karena loading sangat lambat dan sering error",
"Bad experience, the timer was too short for the number of questions",
"Kurang bagus, soalnya terlalu banyak dan tidak relevan",
"Poor experience because the system crashed twice during my test",
],
"label": [0, 0, 0, 0,
1, 1, 1, 1,
2, 2, 2, 2,
3, 3, 3, 3],
})
col2_labels = {
0: "good no input",
1: "good with input",
2: "bad no input",
3: "bad with input",
}
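Before training, it is worth sanity-checking that texts and labels line up and that every label appears in the mapping. The checker below is a convenience sketch (not part of the datasets API) that also returns per-class counts, so imbalances are easy to spot:

```python
def check_training_data(data: dict, label_names: dict) -> dict:
    """Validate a {'text': [...], 'label': [...]} dict; return counts per class."""
    texts, labels = data["text"], data["label"]
    assert len(texts) == len(labels), "text/label lengths differ"
    counts = {}
    for lbl in labels:
        assert lbl in label_names, f"label {lbl} missing from mapping"
        counts[label_names[lbl]] = counts.get(label_names[lbl], 0) + 1
    return counts

data = {"text": ["Bagus", "Jelek", "Sudah oke"], "label": [0, 2, 0]}
names = {0: "good no input", 1: "good with input", 2: "bad no input", 3: "bad with input"}
print(check_training_data(data, names))  # {'good no input': 2, 'bad no input': 1}
```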
Use real examples from YOUR data for better accuracy.
Pick clear, unambiguous examples for training.
Step 4: Prepare Evaluation Data
Create a separate labeled dataset (do NOT reuse training data):
# Evaluation data for Column 1
eval_data_col1 = Dataset.from_dict({
"text": [
# Examples NOT used in training
"Mohon tambahkan dark mode", # suggestion
"Bisa ditambah fitur bookmark", # suggestion
"Sudah baik", # not-suggestion
"Tidak ada saran", # not-suggestion
"Please make the font bigger", # suggestion
"-", # not-suggestion
"Sebaiknya waktu ditambah", # suggestion
"Everything is fine", # not-suggestion
],
"label": [1, 1, 0, 0, 1, 0, 1, 0],
})
# Evaluation data for Column 2
eval_data_col2 = Dataset.from_dict({
"text": [
"Mantap", # good no input
"Oke lah", # good no input
"Bagus, soalnya relevan dengan posisi", # good with input
"Good, clear instructions and fair time limit", # good with input
"Buruk", # bad no input
"Disappointing", # bad no input
"Jelek, soalnya tidak relevan dan waktu kurang", # bad with input
"Bad, the system lagged and I lost my answers", # bad with input
],
"label": [0, 0, 1, 1, 2, 2, 3, 3],
})
Step 5: Train the Models
Train Column 1 Model (Suggestion Classifier)
from setfit import SetFitModel, Trainer, TrainingArguments
# Load from local path (see Step 2) or from HuggingFace Hub
BASE_MODEL = "./models/paraphrase-multilingual-MiniLM-L12-v2"
# Or use HuggingFace Hub directly:
# BASE_MODEL = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
model_col1 = SetFitModel.from_pretrained(BASE_MODEL)
args = TrainingArguments(
batch_size=8,
num_epochs=3,
num_iterations=20,
eval_strategy="epoch", # evaluate after each epoch
)
trainer_col1 = Trainer(
model=model_col1,
args=args,
train_dataset=train_data_col1,
eval_dataset=eval_data_col1,
metric="accuracy",
)
trainer_col1.train()
# Save the trained model
model_col1.save_pretrained("./models/model_suggestion")
Train Column 2 Model (Feedback Classifier)
model_col2 = SetFitModel.from_pretrained(BASE_MODEL)
trainer_col2 = Trainer(
model=model_col2,
args=args,
train_dataset=train_data_col2,
eval_dataset=eval_data_col2,
metric="accuracy",
)
trainer_col2.train()
# Save the trained model
model_col2.save_pretrained("./models/model_feedback")
Step 6: Evaluate the Models
Quick Accuracy Check
# Evaluate Column 1
metrics_col1 = trainer_col1.evaluate(eval_data_col1)
print(f"Column 1 Accuracy: {metrics_col1['accuracy']:.2%}")
# Evaluate Column 2
metrics_col2 = trainer_col2.evaluate(eval_data_col2)
print(f"Column 2 Accuracy: {metrics_col2['accuracy']:.2%}")
Detailed Metrics (Precision, Recall, F1)
from sklearn.metrics import classification_report
# Column 1
preds_col1 = model_col1.predict(eval_data_col1["text"])
print("=== Column 1: Suggestion Classification ===")
print(classification_report(
eval_data_col1["label"],
preds_col1,
target_names=["not-suggestion", "suggestion"]
))
# Column 2
preds_col2 = model_col2.predict(eval_data_col2["text"])
print("=== Column 2: Feedback Classification ===")
print(classification_report(
eval_data_col2["label"],
preds_col2,
target_names=["good no input", "good with input", "bad no input", "bad with input"]
))
Example output (illustrative numbers, consistent with 7 of 8 correct):
=== Column 1: Suggestion Classification ===
                precision    recall  f1-score   support

not-suggestion       0.80      1.00      0.89         4
    suggestion       1.00      0.75      0.86         4

      accuracy                           0.88         8
     macro avg       0.90      0.88      0.87         8
  weighted avg       0.90      0.88      0.87         8
Confusion Matrix
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
# Column 1 Confusion Matrix
cm1 = confusion_matrix(eval_data_col1["label"], preds_col1)
disp1 = ConfusionMatrixDisplay(cm1, display_labels=["not-suggestion", "suggestion"])
disp1.plot()
plt.title("Column 1 - Suggestion Classification")
plt.tight_layout()
plt.show()
# Column 2 Confusion Matrix
cm2 = confusion_matrix(eval_data_col2["label"], preds_col2)
disp2 = ConfusionMatrixDisplay(
cm2,
display_labels=["good no input", "good with input", "bad no input", "bad with input"]
)
disp2.plot()
plt.title("Column 2 - Feedback Classification")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()
What to Look At
| Metric | What it Tells You |
|---|---|
| Accuracy | Overall correctness |
| Precision | Of all predicted as X, how many were actually X |
| Recall | Of all actual X, how many were predicted as X |
| F1-score | Balance of precision and recall (most important for imbalanced data) |
| Confusion Matrix | Exactly where the model makes mistakes |
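To make the table concrete, the per-class numbers can be computed by hand from true and predicted labels. This is a minimal sketch of what classification_report does internally, using raw true-positive/false-positive/false-negative counts:

```python
def per_class_metrics(y_true, y_pred, cls):
    """Precision, recall, and F1 for one class, computed from raw counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]  # one actual suggestion was missed
print(per_class_metrics(y_true, y_pred, 1))  # (1.0, 0.75, ~0.857)
```

High precision with lower recall, as here, means the model rarely cries "suggestion" falsely but misses some real ones; that asymmetry is invisible in accuracy alone.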
Step 7: Predict on Your Full Dataset
import pandas as pd
# Load your data
df = pd.read_excel("your_data.xlsx") # <-- change filename
# Column names
col1 = "What would you suggest to improve the Online Assessment experience? (It is okay to answer in Bahasa)"
col2 = "Please share any additional input or feedback regarding your experience with this Online Assessment. (It is okay to answer in Bahasa)"
# Handle missing/empty values
df[col1] = df[col1].fillna("").astype(str).str.strip()
df[col2] = df[col2].fillna("").astype(str).str.strip()
# Load saved models
model_col1 = SetFitModel.from_pretrained("./models/model_suggestion")
model_col2 = SetFitModel.from_pretrained("./models/model_feedback")
# Predict Column 1
texts_col1 = df[col1].replace("", "Tidak ada").tolist()
predictions_col1 = model_col1.predict(texts_col1)
df["suggestion_flag"] = [col1_labels[int(p)] for p in predictions_col1]
# Predict Column 2
texts_col2 = df[col2].replace("", "Tidak ada").tolist()
predictions_col2 = model_col2.predict(texts_col2)
df["feedback_flag"] = [col2_labels[int(p)] for p in predictions_col2]
# Save results
df.to_excel("results_classified.xlsx", index=False)
print("\nDone! Results saved to results_classified.xlsx")
print("\n=== Summary ===")
print("\nSuggestion flags:")
print(df["suggestion_flag"].value_counts())
print("\nFeedback flags:")
print(df["feedback_flag"].value_counts())
Step 8: Improve Accuracy
If accuracy is low, here's how to improve:
1. Add more training examples
Focus on patterns the model gets wrong (check the confusion matrix):
# Add more examples to training data
additional_data = Dataset.from_dict({
"text": [
# Add examples the model misclassified
"Mungkin bisa diperbaiki tampilannya", # suggestion
"Cukup", # not-suggestion
],
"label": [1, 0],
})
# Combine with original training data
from datasets import concatenate_datasets
train_data_col1 = concatenate_datasets([train_data_col1, additional_data])
2. Increase training iterations
args = TrainingArguments(
batch_size=8,
num_epochs=3,
num_iterations=40, # increased from 20
eval_strategy="epoch",
)
3. Try a larger base model
# Larger model = better accuracy, slower speed
model = SetFitModel.from_pretrained(
"sentence-transformers/paraphrase-multilingual-mpnet-base-v2"
)
4. Review random samples from predictions
# Sample 50 random rows to manually review
sample = df.sample(50, random_state=42)
sample[
[col1, "suggestion_flag", col2, "feedback_flag"]
].to_excel("review_sample.xlsx", index=False)
Full Pipeline (Copy-Paste Ready)
"""
Complete SetFit Classification Pipeline
For survey feedback analysis (Bahasa + English)
Requirements (pyproject.toml):
datasets>=4.8.4
setfit>=1.1.3
torch>=2.11.0
transformers<5
pandas
openpyxl
scikit-learn
matplotlib
Install with: uv sync
"""
import pandas as pd
from datasets import Dataset
from setfit import SetFitModel, Trainer, TrainingArguments
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
# ============================================================
# 1. CONFIG
# ============================================================
# Use local model path (see Step 2) or HuggingFace Hub
BASE_MODEL = "./models/paraphrase-multilingual-MiniLM-L12-v2"
# BASE_MODEL = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
# ============================================================
# 2. PREPARE TRAINING DATA
# Replace these examples with REAL samples from your data!
# ============================================================
train_col1 = Dataset.from_dict({
"text": [
# suggestion (1)
"Tambahkan fitur timer agar peserta tahu sisa waktu",
"Sebaiknya ada practice test sebelum ujian dimulai",
"Improve the UI, it's hard to navigate",
"Akan lebih baik jika soal bisa di-review sebelum submit",
"Tolong perbaiki loading time, terlalu lama",
"Please add a progress bar",
"Mungkin bisa ditambahkan instruksi yang lebih jelas",
"Saran saya, buat tampilan lebih user friendly",
# not-suggestion (0)
"Tidak ada", "-", "Sudah bagus", "No suggestion",
"N/A", "Oke semua", "Nothing to add", "Sudah cukup baik",
],
"label": [1,1,1,1,1,1,1,1, 0,0,0,0,0,0,0,0],
})
train_col2 = Dataset.from_dict({
"text": [
# good no input (0)
"Bagus", "Good experience", "Sudah oke", "No feedback",
# good with input (1)
"Bagus, tapi waktu pengerjaan bisa ditambah",
"Good overall, instructions were clear and helpful",
"Pengalaman baik, UI intuitif dan mudah dipahami",
"Great assessment, coding section was well designed",
# bad no input (2)
"Jelek", "Bad experience", "Kurang bagus", "Not good",
# bad with input (3)
"Buruk karena loading lambat dan sering error",
"Bad, timer was too short for the questions",
"Kurang bagus, soalnya terlalu banyak",
"Poor, system crashed twice during my test",
],
"label": [0,0,0,0, 1,1,1,1, 2,2,2,2, 3,3,3,3],
})
col1_labels = {0: "not-suggestion", 1: "suggestion"}
col2_labels = {
0: "good no input",
1: "good with input",
2: "bad no input",
3: "bad with input",
}
# ============================================================
# 3. PREPARE EVALUATION DATA
# ============================================================
eval_col1 = Dataset.from_dict({
"text": [
"Mohon tambahkan dark mode",
"Bisa ditambah fitur bookmark",
"Sudah baik",
"Tidak ada saran",
"Please make the font bigger",
"-",
"Sebaiknya waktu ditambah",
"Everything is fine",
],
"label": [1, 1, 0, 0, 1, 0, 1, 0],
})
eval_col2 = Dataset.from_dict({
"text": [
"Mantap",
"Oke lah",
"Bagus, soalnya relevan dengan posisi",
"Good, clear instructions and fair time limit",
"Buruk",
"Disappointing",
"Jelek, soalnya tidak relevan dan waktu kurang",
"Bad, the system lagged and I lost my answers",
],
"label": [0, 0, 1, 1, 2, 2, 3, 3],
})
# ============================================================
# 4. TRAIN MODELS
# ============================================================
args = TrainingArguments(
batch_size=8,
num_epochs=3,
num_iterations=20,
eval_strategy="epoch",
)
print("Training Column 1 model (suggestion)...")
model_col1 = SetFitModel.from_pretrained(BASE_MODEL)
trainer_col1 = Trainer(
model=model_col1,
args=args,
train_dataset=train_col1,
eval_dataset=eval_col1,
metric="accuracy",
)
trainer_col1.train()
model_col1.save_pretrained("./models/model_suggestion")
print("\nTraining Column 2 model (feedback)...")
model_col2 = SetFitModel.from_pretrained(BASE_MODEL)
trainer_col2 = Trainer(
model=model_col2,
args=args,
train_dataset=train_col2,
eval_dataset=eval_col2,
metric="accuracy",
)
trainer_col2.train()
model_col2.save_pretrained("./models/model_feedback")
# ============================================================
# 5. EVALUATE MODELS
# ============================================================
# Accuracy
metrics_col1 = trainer_col1.evaluate(eval_col1)
metrics_col2 = trainer_col2.evaluate(eval_col2)
print(f"\nColumn 1 Accuracy: {metrics_col1['accuracy']:.2%}")
print(f"Column 2 Accuracy: {metrics_col2['accuracy']:.2%}")
# Detailed Report
preds_col1 = model_col1.predict(eval_col1["text"])
preds_col2 = model_col2.predict(eval_col2["text"])
print("\n=== Column 1: Suggestion Classification ===")
print(classification_report(
eval_col1["label"], preds_col1,
target_names=["not-suggestion", "suggestion"]
))
print("\n=== Column 2: Feedback Classification ===")
print(classification_report(
eval_col2["label"], preds_col2,
target_names=["good no input", "good with input", "bad no input", "bad with input"]
))
# Confusion Matrices
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
cm1 = confusion_matrix(eval_col1["label"], preds_col1)
ConfusionMatrixDisplay(cm1, display_labels=["not-suggestion", "suggestion"]).plot(ax=axes[0])
axes[0].set_title("Column 1 - Suggestion")
cm2 = confusion_matrix(eval_col2["label"], preds_col2)
ConfusionMatrixDisplay(
cm2,
display_labels=["good\nno input", "good\nwith input", "bad\nno input", "bad\nwith input"]
).plot(ax=axes[1])
axes[1].set_title("Column 2 - Feedback")
plt.tight_layout()
plt.savefig("confusion_matrices.png", dpi=150)
plt.show()
# ============================================================
# 6. PREDICT ON FULL DATASET
# ============================================================
# Load your data
df = pd.read_excel("your_data.xlsx") # <-- change filename
col1 = "suggestion"  # <-- change to your actual column names
col2 = "feedback"
df[col1] = df[col1].fillna("").astype(str).str.strip()
df[col2] = df[col2].fillna("").astype(str).str.strip()
# Predict
preds1 = model_col1.predict(df[col1].replace("", "Tidak ada").tolist())
df["suggestion_flag"] = [col1_labels[int(p)] for p in preds1]
preds2 = model_col2.predict(df[col2].replace("", "Tidak ada").tolist())
df["feedback_flag"] = [col2_labels[int(p)] for p in preds2]
# Save results
df.to_excel("results_classified.xlsx", index=False)
print("\nDone! Results saved to results_classified.xlsx")
print("\n=== Summary ===")
print("\nSuggestion flags:")
print(df["suggestion_flag"].value_counts())
print("\nFeedback flags:")
print(df["feedback_flag"].value_counts())
Get Confidence Scores
Use predict_proba instead of predict to get confidence scores:
# predict gives labels only
predictions = model_col1.predict(["Tambahkan fitur timer"])
print(predictions) # [1]
# predict_proba gives confidence scores per class
probabilities = model_col1.predict_proba(["Tambahkan fitur timer"])
print(probabilities)
# [[0.12, 0.88]]
# meaning: 12% not-suggestion, 88% suggestion
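One practical use of these scores is routing low-confidence rows to manual review instead of trusting the predicted label. The helper and the 0.7 threshold below are assumptions to illustrate the pattern:

```python
def label_with_confidence(proba_row, label_names, threshold=0.7):
    """Map one predict_proba row to (label, confidence), flagging uncertain rows."""
    best = max(range(len(proba_row)), key=lambda i: proba_row[i])
    confidence = proba_row[best]
    label = label_names[best] if confidence >= threshold else "needs-review"
    return label, confidence

names = {0: "not-suggestion", 1: "suggestion"}
print(label_with_confidence([0.12, 0.88], names))  # ('suggestion', 0.88)
print(label_with_confidence([0.45, 0.55], names))  # ('needs-review', 0.55)
```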
Key Tips
- Use REAL examples from your data — the tutorial examples are just placeholders
- More examples = better accuracy — 8 per class is minimum, 16-20 is better
- Pick clear examples — avoid ambiguous cases for training data
- Empty/blank responses — always handle with .fillna("") before predicting
- Training time — usually 2-5 minutes on CPU for small training sets
- Prediction speed — ~100-500 rows/second depending on hardware
- Local models — download once, load from local path to avoid re-downloading
- transformers<5 — pin this to avoid the default_logdir import error
- Evaluate before deploying — always check accuracy on held-out data first
- Confusion matrix — shows exactly where the model fails, so you know what training examples to add
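The empty-response tip above can be pushed a bit further: placeholder answers like "-" or "N/A" behave like blanks, so normalizing them before prediction keeps the model's input consistent. The placeholder list here is an assumption; extend it with whatever actually appears in your data:

```python
PLACEHOLDERS = {"", "-", "n/a", "na", "."}

def normalize_response(text) -> str:
    """Strip whitespace and map blank/placeholder answers to a canonical token."""
    cleaned = str(text).strip() if text is not None else ""
    return "Tidak ada" if cleaned.lower() in PLACEHOLDERS else cleaned

print(normalize_response("  -  "))        # 'Tidak ada'
print(normalize_response(None))           # 'Tidak ada'
print(normalize_response("Sudah bagus"))  # 'Sudah bagus'
```

Applied column-wise (e.g. df[col1].map(normalize_response)), this replaces the fillna/replace chain in Step 7 with one place to maintain the placeholder list.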