Building a Singapore Food Classifier — Part 1: Data, Training, and Why 79% is Harder Than It Sounds


The story of training an image classifier for Singapore hawker food — from scraping images off the internet to figuring out why the model keeps confusing noodle dishes.


The problem: there’s no dataset for this

If you want to classify dogs, there are dozens of datasets. Flowers, cars, birds — all covered. Singapore hawker food? Nothing.

Food-101 exists (101 food categories, 101,000 images), but it has zero Singapore-specific dishes. No chicken rice. No laksa. No char kway teow. The closest matches are generic things like “fried rice” and “ramen” — not useful when you need to distinguish hokkien mee from prawn noodles.

So I had to build the dataset from scratch.


Step 1: Scraping images from DuckDuckGo

I wrote a Python script that searches DuckDuckGo Images for each dish using multiple queries:

"chicken_rice": [
    "chicken rice Singapore hawker",
    "Hainanese chicken rice plate",
    "Singapore chicken rice food",
]

Three queries per dish gives more visual variety than a single search. The script downloads up to 200 images per dish, deduplicates them by MD5 hash, validates they’re real images with Pillow (minimum 80x80 pixels), and splits them 80/10/10 into train/val/test sets.
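The cleanup steps above can be sketched roughly as follows. This is a minimal illustration, not the actual script: the function names, the fixed shuffle seed, and the choice to check the shorter image side against the 80-pixel minimum are all assumptions.

```python
import hashlib
import random
from pathlib import Path


def dedupe_by_md5(paths):
    """Keep the first file seen for each distinct MD5 digest of the file bytes."""
    seen = set()
    unique = []
    for path in paths:
        digest = hashlib.md5(Path(path).read_bytes()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(path)
    return unique


def is_valid_image(path, min_side=80):
    """Return True if Pillow can parse the file and both sides are >= min_side."""
    from PIL import Image  # Pillow; imported here so the other helpers run without it

    try:
        with Image.open(path) as img:
            width, height = img.size
            img.verify()  # raises if the file is truncated or not a real image
    except Exception:
        return False
    return min(width, height) >= min_side


def split_80_10_10(items, seed=42):
    """Shuffle deterministically (seed is an assumption) and split 80/10/10."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * 0.8)
    n_val = int(len(items) * 0.1)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # the remainder, so no image is dropped
    return train, val, test
```

Hashing the raw bytes only catches exact duplicates; the same photo re-encoded or resized will slip through, which is one more source of the noise discussed in Step 2.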

30 dishes x 200 images gives a ceiling of 6,000 images; the actual haul was smaller, since several dishes fell short of the cap. Total download time: about 20 minutes with rate limiting to avoid getting blocked.

What I got

Most dishes collected 150-200 images without issues. A few were harder:

Dish           Images collected   Issue
Chicken Rice   175                Easy, very popular dish
Hokkien Mee    96                 Fewer results, mixed with pad thai
Duck Rice      108                Mixed with Peking duck images
Mee Rebus      105                Not as commonly photographed

Good enough to start training. The real problems would show up later.


Step 2: The data is noisy (and that’s normal)

Web-scraped image datasets are always noisy. When you search “carrot cake Singapore hawker”, you get:

  • Actual photos of chai tow kway (what we want)
  • Photos of Western carrot cake with cream cheese frosting (wrong)
  • Restaurant menus and signboards (useless)
  • Stock photos of random food (misleading)
  • Instagram posts with heavy filters (questionable)

I knew the data needed cleaning, but decided to train first and clean later — let the model’s mistakes tell me where the noise was worst.


Step 3: Training with transfer learning

Why EfficientNetV2-S?

I needed a model that:

  1. Works well with small datasets (4,000 images, not 4 million)
  2. Runs fast on CPU (no GPU budget for serving)
  3. Is small enough to deploy cheaply (78MB weights)

EfficientNetV2-S checks all three boxes. It was pretrained on ImageNet (about 1.28 million training images across 1,000 classes in the standard ImageNet-1k release), so it already understands visual features like edges, textures, colours, and shapes. I just needed to teach it “these features mean chicken rice, those features mean laksa.”

Two-phase training

Rather than fine-tuning the entire network at once, I trained in two phases:

Phase 1 (Epochs 1-9): Head only

  • Freeze the entire backbone (the feature extraction layers)
  • Only train the new classifier head (1,280 inputs → 30 outputs)
  • Learning rate: 1e-3 (relatively aggressive)
  • Purpose: let the classifier learn to map ImageNet features to hawker dishes

Phase 2 (Epochs 10-30): Backbone unfreezing

  • Unfreeze the last 3 blocks of EfficientNetV2
  • Drop learning rate to 1e-4 (10x lower — don’t destroy pretrained features)
  • Cosine annealing schedule (learning rate gradually decreases to near zero)
  • Purpose: fine-tune the feature extraction for food-specific patterns

This two-phase approach is standard for transfer learning. Phase 1 is fast and gets you to a reasonable baseline. Phase 2 is where the real accuracy gains happen.
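To make the two-phase schedule concrete, here is a small sketch of the learning rate per epoch. The epoch boundaries and the two starting rates come from the description above. The constant rate during phase 1, the floor of 1e-6, and the decision to let the cosine curve bottom out exactly at epoch 30 are assumptions. The curve itself is the standard cosine annealing formula.

```python
import math

HEAD_EPOCHS = 9       # phase 1: epochs 1-9, head only
TOTAL_EPOCHS = 30     # phase 2: epochs 10-30, last blocks unfrozen
HEAD_LR = 1e-3        # phase 1 learning rate
FINETUNE_LR = 1e-4    # phase 2 starting learning rate
MIN_LR = 1e-6         # assumed floor for "near zero"


def learning_rate(epoch: int) -> float:
    """Learning rate for a 1-indexed epoch under the two-phase schedule."""
    if not 1 <= epoch <= TOTAL_EPOCHS:
        raise ValueError(f"epoch must be in 1..{TOTAL_EPOCHS}")
    if epoch <= HEAD_EPOCHS:
        return HEAD_LR  # assumed constant while only the head trains
    step = epoch - HEAD_EPOCHS - 1          # 0 at epoch 10
    span = TOTAL_EPOCHS - HEAD_EPOCHS - 1   # 20 steps from epoch 10 to 30
    cosine = 0.5 * (1 + math.cos(math.pi * step / span))
    return MIN_LR + (FINETUNE_LR - MIN_LR) * cosine


if __name__ == "__main__":
    for epoch in (1, 9, 10, 20, 30):
        print(f"epoch {epoch:2d}: lr = {learning_rate(epoch):.2e}")
```

In a real training loop the same curve would normally come from the framework's built-in scheduler; the point here is just to show the step down at epoch 10 followed by the smooth decay.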

Augmentation

Training images are randomly modified each epoch to help the model generalise:

  • Random crop (scale 0.7-1.0) — the dish might not be centred
  • Horizontal flip — food looks the same left-to-right
  • Colour jitter — lighting varies between hawker centres
  • Rotation (up to 15 degrees) — photos aren’t always perfectly level
  • Label smoothing (0.1), a loss setting rather than an image augmentation: prevents the model from being overconfident
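Label smoothing is the least visual item in that list, so a worked example may help. The sketch below uses the common formulation in which the smoothing mass is spread uniformly over all classes, including the true one. The function names are illustrative, and the post does not say which framework implementation was used.

```python
import math


def smoothed_targets(label: int, num_classes: int, smoothing: float = 0.1):
    """Mix a one-hot target with a uniform distribution over all classes."""
    off_value = smoothing / num_classes
    targets = [off_value] * num_classes
    targets[label] += 1.0 - smoothing
    return targets


def cross_entropy(logits, targets):
    """Cross-entropy between softmax(logits) and a target distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return -sum(t * (x - log_z) for x, t in zip(logits, targets))


if __name__ == "__main__":
    # 30 classes, true class 0, and a model that is extremely sure of class 0.
    confident_logits = [10.0] + [0.0] * 29
    hard = smoothed_targets(0, 30, smoothing=0.0)
    soft = smoothed_targets(0, 30, smoothing=0.1)
    print("loss with hard targets:    ", cross_entropy(confident_logits, hard))
    print("loss with smoothed targets:", cross_entropy(confident_logits, soft))
```

With smoothed targets, a very confident prediction still pays a noticeable loss, so the model is not rewarded for pushing its logits to extremes. That matters with noisy web-scraped labels, where some of the "true" labels are simply wrong.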

The training run

Rented an A100 GPU on RunPod (spot instance, ~$3 total):

Device: cuda (A100 80GB)
Classes: 30 | Train: 4,200 | Val: 520
Training for 30 epochs...

Epoch 1:  train_loss=2.8  val_acc=31.2%
Epoch 5:  train_loss=1.4  val_acc=58.4%
Epoch 10: train_loss=0.9  val_acc=68.1%  ← backbone unfrozen
Epoch 15: train_loss=0.5  val_acc=74.6%
Epoch 20: train_loss=0.3  val_acc=77.8%
Epoch 30: train_loss=0.2  val_acc=80.8%

Best val accuracy: 80.8%

45 minutes. Total cost for 2 training runs was $3. Not bad.


Step 4: Evaluating the results

80% accuracy sounds decent until you look at which dishes it’s getting wrong. I evaluated on 531 held-out test images and built a confusion matrix.

Visually distinctive dishes like bak kut teh, chilli crab, kaya toast, and tau huay hit 100% accuracy — you can’t confuse a whole crab with a bowl of noodles. But similar-looking noodle dishes (bak chor mee vs mee pok, hokkien mee vs char kway teow) and category-ambiguous dishes like economy rice dragged accuracy down significantly.
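For readers who want to reproduce this kind of breakdown, here is a minimal sketch of tallying per-class accuracy and the most frequent confusions from two lists of labels. The function names and the toy labels are made up for illustration; they are not the real test results.

```python
from collections import Counter


def per_class_accuracy(true_labels, predicted_labels):
    """Fraction of correctly predicted images for each true class."""
    totals = Counter(true_labels)
    correct = Counter(
        t for t, p in zip(true_labels, predicted_labels) if t == p
    )
    return {cls: correct[cls] / totals[cls] for cls in totals}


def top_confusions(true_labels, predicted_labels, n=5):
    """Most common (true, predicted) pairs where the prediction was wrong."""
    errors = Counter(
        (t, p) for t, p in zip(true_labels, predicted_labels) if t != p
    )
    return errors.most_common(n)


if __name__ == "__main__":
    # Toy data only, chosen to mimic the noodle confusion described above.
    true = ["laksa", "laksa", "mee_pok", "mee_pok", "mee_pok", "chilli_crab"]
    pred = ["laksa", "laksa", "bak_chor_mee", "bak_chor_mee", "mee_pok", "chilli_crab"]
    print(per_class_accuracy(true, pred))
    print(top_confusions(true, pred, n=3))
```

The off-diagonal pairs are the useful part: they say which class is being mistaken for which, and therefore which folders to review first.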

I go into the detailed case-by-case breakdown in Part 3.


Step 5: Data cleaning — the highest-ROI step

Armed with the confusion matrix, I built a browser-based image review tool. It opens a local web server, shows each training image, and lets you keep or delete it with keyboard shortcuts (arrow keys for speed).

I reviewed about 1,500 images across the 10 worst-performing classes:

Dish            Deleted   What was wrong
Economy Rice    88        Buffet photos, food courts without plates, wrong dishes entirely
Carrot Cake     43        Western carrot cake images (frosted cake, not fried radish cake)
Mee Pok         35        Bak chor mee photos mislabelled as mee pok
Nasi Goreng     26        Generic “fried rice” from non-Singaporean cuisines
Prawn Noodles   25        Thai/Vietnamese prawn soups, not Singapore-style

About 300 images removed in total — roughly 7% of the training set.

The retraining

After cleaning, some classes had fewer images (economy rice dropped to 72), so the overall accuracy actually dipped slightly, from about 80% to 79.3%. But the per-class changes told a better story:

Improved:

  • Mee rebus: 50% → 64.3% (+14.3)
  • Nasi lemak: 80% → 93.3% (+13.3)
  • Chicken rice: 70% → 80% (+10)
  • Duck rice and chendol: both hit 100%

Got worse:

  • Economy rice: 70% → 35% (-35) — lost too many training images

The cleaning worked for classes that kept enough data. Economy rice was overcleaned — I removed 55% of its training images and couldn’t collect replacements (DuckDuckGo rate-limited the re-collection).
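The 55% figure follows directly from the numbers above, assuming all 88 deleted economy rice images came out of the training split and 72 is what remained. The helper name is illustrative:

```python
def removed_fraction(deleted: int, remaining: int) -> float:
    """Share of a class's original images that were deleted during cleaning."""
    return deleted / (deleted + remaining)


# Economy rice: 88 deleted, 72 left, so 88 of the original 160 were removed.
economy_rice = removed_fraction(deleted=88, remaining=72)
print(f"{economy_rice:.0%}")  # prints 55%
```

A quick check like this per class, before retraining, would have flagged economy rice as the one class where cleaning cut the data by more than half.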


Lessons learned

1. Data quality beats data quantity

Removing about 300 bad images lifted several classes by 10 points or more, even though the headline accuracy barely moved. If I could only do one thing, I’d clean the data before touching the model.

2. Web scraping gives you a dataset, not a good dataset

DuckDuckGo results include wrong dishes, restaurant interiors, menus, stock photos, and heavily filtered Instagram posts. Plan for 20-30% of your scraped data to be useless.

3. Transfer learning is absurdly efficient

78MB model, roughly 4,000 training images, 45 minutes of training, $3, and 24 of the 30 dishes come out above 70% accuracy. Five years ago this would have needed tens of thousands of images and days of training.

4. The confusion matrix is your best friend

Overall accuracy (79.3%) hides the real story. The confusion matrix tells you exactly which dishes are confused with which, so you can target your data cleaning where it matters most. I dig into the specific cases in Part 3.


Try the live demo

Next: Part 2: Deployment and Architecture