Building a Singapore Food Classifier — Part 1: Data, Training, and Why 79% is Harder Than It Sounds
The story of training an image classifier for Singapore hawker food — from scraping images off the internet to figuring out why the model keeps confusing noodle dishes.
The problem: there’s no dataset for this
If you want to classify dogs, there are dozens of datasets. Flowers, cars, birds — all covered. Singapore hawker food? Nothing.
Food-101 exists (101 food categories, 101,000 images), but it has zero Singapore-specific dishes. No chicken rice. No laksa. No char kway teow. The closest matches are generic things like “fried rice” and “ramen” — not useful when you need to distinguish hokkien mee from prawn noodles.
So I had to build the dataset from scratch.
Step 1: Scraping images from DuckDuckGo
I wrote a Python script that searches DuckDuckGo Images for each dish using multiple queries:
"chicken_rice": [
    "chicken rice Singapore hawker",
    "Hainanese chicken rice plate",
    "Singapore chicken rice food",
]
Three queries per dish gives more visual variety than a single search. The script downloads up to 200 images per dish, deduplicates them by MD5 hash, validates they’re real images with Pillow (minimum 80x80 pixels), and splits them 80/10/10 into train/val/test sets.
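The deduplication and split steps can be sketched with the standard library alone. This is a minimal illustration, not the actual script: the function names `dedupe_by_md5` and `split_80_10_10` and the fixed seed are assumptions, and the Pillow size check is left out to keep the sketch dependency-free.

```python
import hashlib
import random
from pathlib import Path


def dedupe_by_md5(paths):
    """Keep the first file seen for each distinct MD5 digest of the raw bytes."""
    seen = set()
    unique = []
    for path in paths:
        digest = hashlib.md5(Path(path).read_bytes()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(path)
    return unique


def split_80_10_10(items, seed=42):
    """Shuffle deterministically, then cut into train/val/test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10   # 80% for training
    n_val = len(items) // 10         # 10% for validation
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]   # whatever remains, roughly 10%
    return train, val, test
```

Hashing the raw bytes only catches exact duplicates. The same photo re-encoded or resized by a different site hashes differently and slips through, which is one reason manual review is still needed later.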
30 dishes x up to 200 images puts the ceiling at 6,000 images; the final dataset came to about 5,250 (4,200 train, 520 val, 531 test). Total download time: about 20 minutes with rate limiting to avoid getting blocked.
What I got
Most dishes collected 150-200 images without issues. A few were harder:
| Dish | Images collected | Issue |
|---|---|---|
| Chicken Rice | 175 | Easy — very popular dish |
| Hokkien Mee | 96 | Fewer results, mixed with pad thai |
| Duck Rice | 108 | Mixed with Peking duck images |
| Mee Rebus | 105 | Not as commonly photographed |
Good enough to start training. The real problems would show up later.
Step 2: The data is noisy (and that’s normal)
Web-scraped image datasets are always noisy. When you search “carrot cake Singapore hawker”, you get:
- Actual photos of chai tow kway (what we want)
- Photos of Western carrot cake with cream cheese frosting (wrong)
- Restaurant menus and signboards (useless)
- Stock photos of random food (misleading)
- Instagram posts with heavy filters (questionable)
I knew the data needed cleaning, but decided to train first and clean later — let the model’s mistakes tell me where the noise was worst.
Step 3: Training with transfer learning
Why EfficientNetV2-S?
I needed a model that:
- Works well with small datasets (about 4,200 training images, not 4 million)
- Runs fast on CPU (no GPU budget for serving)
- Is small enough to deploy cheaply (78MB weights)
EfficientNetV2-S checks all three boxes. It was pretrained on ImageNet-1k (about 1.28 million images across 1,000 classes), so it already understands visual features like edges, textures, colours, and shapes. I just needed to teach it “these features mean chicken rice, those features mean laksa.”
Two-phase training
Rather than fine-tuning the entire network at once, I trained in two phases:
Phase 1 (Epochs 1-9): Head only
- Freeze the entire backbone (the feature extraction layers)
- Only train the new classifier head (1,280 inputs → 30 outputs)
- Learning rate: 1e-3 (relatively aggressive)
- Purpose: let the classifier learn to map ImageNet features to hawker dishes
Phase 2 (Epochs 10-30): Backbone unfreezing
- Unfreeze the last 3 blocks of EfficientNetV2
- Drop learning rate to 1e-4 (10x lower — don’t destroy pretrained features)
- Cosine annealing schedule (learning rate gradually decreases to near zero)
- Purpose: fine-tune the feature extraction for food-specific patterns
This two-phase approach is standard for transfer learning. Phase 1 is fast and gets you to a reasonable baseline. Phase 2 is where the real accuracy gains happen.
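The learning-rate side of the two phases can be written as a plain function of the epoch number. This is a sketch under assumptions: the rate is held constant during Phase 1, the cosine schedule spans only Phase 2, and the floor `MIN_LR` is zero. The constant and function names are illustrative, not taken from the training script.

```python
import math

HEAD_EPOCHS = 9       # Phase 1: epochs 1-9, backbone frozen
TOTAL_EPOCHS = 30     # Phase 2: epochs 10-30, last blocks unfrozen
HEAD_LR = 1e-3
FINETUNE_LR = 1e-4
MIN_LR = 0.0          # assumed floor for the cosine schedule


def learning_rate(epoch):
    """Return the learning rate for a 1-indexed epoch."""
    if not 1 <= epoch <= TOTAL_EPOCHS:
        raise ValueError(f"epoch must be in 1..{TOTAL_EPOCHS}")
    if epoch <= HEAD_EPOCHS:
        return HEAD_LR
    # Cosine annealing over the 21 fine-tuning epochs.
    steps = TOTAL_EPOCHS - HEAD_EPOCHS
    t = epoch - HEAD_EPOCHS - 1   # 0 at epoch 10, 20 at epoch 30
    cosine = 0.5 * (1.0 + math.cos(math.pi * t / steps))
    return MIN_LR + (FINETUNE_LR - MIN_LR) * cosine


if __name__ == "__main__":
    for epoch in (1, 9, 10, 20, 30):
        print(epoch, f"{learning_rate(epoch):.2e}")
```

With these settings the rate starts Phase 2 at 1e-4 and ends the final epoch just above zero, matching the "gradually decreases to near zero" description.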
Augmentation
Training images are randomly modified each epoch to help the model generalise:
- Random crop (scale 0.7-1.0) — the dish might not be centred
- Horizontal flip — food looks the same left-to-right
- Colour jitter — lighting varies between hawker centres
- Rotation (up to 15 degrees) — photos aren’t always perfectly level
- Label smoothing (0.1, a loss setting rather than an image transform) to keep the model from being overconfident
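To see what label smoothing does numerically, here is a small worked sketch for 30 classes. It assumes the common uniform convention, where the smoothing mass is spread evenly over all classes including the true one; other formulations exist, and the helper names are illustrative.

```python
import math


def smoothed_targets(true_class, num_classes, smoothing=0.1):
    """Target distribution: (1 - smoothing) on the true class plus a uniform share."""
    uniform_share = smoothing / num_classes
    targets = [uniform_share] * num_classes
    targets[true_class] = 1.0 - smoothing + uniform_share
    return targets


def cross_entropy(logits, targets):
    """Cross-entropy between softmax(logits) and a target distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return -sum(t * (x - log_z) for x, t in zip(logits, targets))


if __name__ == "__main__":
    targets = smoothed_targets(true_class=3, num_classes=30, smoothing=0.1)
    print(round(targets[3], 4), round(targets[0], 4))
```

With a smoothing value of 0.1 the true class gets a target of about 0.903 and every other class about 0.003. Pushing the true-class logit ever higher then increases the loss instead of driving it to zero, which is what discourages overconfident predictions.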
The training run
Rented an A100 GPU on RunPod (spot instance, ~$3 total):
Device: cuda (A100 80GB)
Classes: 30 | Train: 4,200 | Val: 520
Training for 30 epochs...
Epoch 1: train_loss=2.8 val_acc=31.2%
Epoch 5: train_loss=1.4 val_acc=58.4%
Epoch 10: train_loss=0.9 val_acc=68.1% ← backbone unfrozen
Epoch 15: train_loss=0.5 val_acc=74.6%
Epoch 20: train_loss=0.3 val_acc=77.8%
Epoch 30: train_loss=0.2 val_acc=80.8%
Best val accuracy: 80.8%
45 minutes. Total cost for 2 training runs was $3. Not bad.
Step 4: Evaluating the results
80% accuracy sounds decent until you look at which dishes it’s getting wrong. I evaluated on 531 held-out test images and built a confusion matrix.
Visually distinctive dishes like bak kut teh, chilli crab, kaya toast, and tau huay hit 100% accuracy — you can’t confuse a whole crab with a bowl of noodles. But similar-looking noodle dishes (bak chor mee vs mee pok, hokkien mee vs char kway teow) and category-ambiguous dishes like economy rice dragged accuracy down significantly.
I go into the detailed case-by-case breakdown in Part 3.
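Per-class accuracy is what exposes the gap between distinctive dishes and look-alike noodle dishes. A minimal sketch of computing it from paired true and predicted labels follows; the function name and the toy labels in the usage block are illustrative, not results from this project.

```python
from collections import defaultdict


def per_class_accuracy(true_labels, predicted_labels):
    """Fraction of correctly predicted images for each true class."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    totals = defaultdict(int)
    correct = defaultdict(int)
    for true, predicted in zip(true_labels, predicted_labels):
        totals[true] += 1
        if true == predicted:
            correct[true] += 1
    return {label: correct[label] / totals[label] for label in totals}


if __name__ == "__main__":
    # Toy labels for illustration only.
    true = ["laksa", "laksa", "mee_pok", "mee_pok", "chilli_crab"]
    pred = ["laksa", "laksa", "mee_pok", "bak_chor_mee", "chilli_crab"]
    for dish, acc in sorted(per_class_accuracy(true, pred).items()):
        print(f"{dish}: {acc:.0%}")
```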
Step 5: Data cleaning — the highest-ROI step
Armed with the confusion matrix, I built a browser-based image review tool. It opens a local web server, shows each training image, and lets you keep or delete it with keyboard shortcuts (arrow keys for speed).
I reviewed about 1,500 images across the 10 worst-performing classes:
| Dish | Deleted | What was wrong |
|---|---|---|
| Economy Rice | 88 | Buffet photos, food courts without plates, wrong dishes entirely |
| Carrot Cake | 43 | Western carrot cake images (frosted cake, not fried radish cake) |
| Mee Pok | 35 | Bak chor mee photos mislabelled as mee pok |
| Nasi Goreng | 26 | Generic “fried rice” from non-Singaporean cuisines |
| Prawn Noodles | 25 | Thai/Vietnamese prawn soups, not Singapore-style |
About 300 images removed in total — roughly 7% of the training set.
The retraining
After cleaning, some classes had fewer images (economy rice dropped to 72), so the overall accuracy actually went down slightly to 79.3%. But the per-class changes told a better story:
Improved:
- Mee rebus: 50% → 64.3% (+14.3)
- Nasi lemak: 80% → 93.3% (+13.3)
- Chicken rice: 70% → 80% (+10)
- Duck rice and chendol: both hit 100%
Got worse:
- Economy rice: 70% → 35% (-35) — lost too many training images
The cleaning worked for classes that kept enough data. Economy rice was overcleaned — I removed 55% of its training images and couldn’t collect replacements (DuckDuckGo rate-limited the re-collection).
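A simple guard can flag a class before it gets cleaned this hard. The sketch below uses the economy rice numbers from above (88 deleted, 72 remaining); the thresholds `max_removed` and `min_remaining` are hypothetical values chosen for illustration, not tuned ones.

```python
def removal_fraction(deleted, remaining):
    """Share of a class's original training images that were deleted."""
    total = deleted + remaining
    if total == 0:
        raise ValueError("class has no images")
    return deleted / total


def is_overcleaned(deleted, remaining, max_removed=0.5, min_remaining=100):
    """Flag a class that lost too large a share or fell below a minimum size."""
    return removal_fraction(deleted, remaining) > max_removed or remaining < min_remaining


if __name__ == "__main__":
    # Economy rice: 88 deleted, 72 left, so 88 / 160 = 55% removed.
    print(f"{removal_fraction(88, 72):.0%}", is_overcleaned(88, 72))
```

A flagged class is a signal to collect replacement images before retraining, not to stop deleting bad ones.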
Lessons learned
1. Data quality beats data quantity
Removing about 300 bad images raised accuracy by 10 to 14 points on mee rebus, nasi lemak, and chicken rice, even though overall accuracy dipped slightly because economy rice was overcleaned. If I could only do one thing, I’d clean the data before touching the model.
2. Web scraping gives you a dataset, not a good dataset
DuckDuckGo results include wrong dishes, restaurant interiors, menus, stock photos, and heavily filtered Instagram posts. Plan for 20-30% of your scraped data to be useless: in the classes I reviewed, about 300 of 1,500 images (20%) had to go.
3. Transfer learning is absurdly efficient
A 78MB model, about 4,200 training images, 45 minutes of training, and $3 are enough to get 24 out of 30 dishes above 70% accuracy. Five years ago this would likely have needed tens of thousands of images and days of training.
4. The confusion matrix is your best friend
Overall accuracy (79.3%) hides the real story. The confusion matrix tells you exactly which dishes are confused with which, so you can target your data cleaning where it matters most. I dig into the specific cases in Part 3.
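One way to turn the confusion matrix into a cleaning to-do list is to rank the off-diagonal cells by count. This is a minimal sketch; `most_confused` is an illustrative name and the labels in the usage block are toy data, not this project's predictions.

```python
from collections import Counter


def most_confused(true_labels, predicted_labels, top_n=3):
    """Return the most frequent (true, predicted) pairs where the two differ."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    errors = Counter(
        (true, predicted)
        for true, predicted in zip(true_labels, predicted_labels)
        if true != predicted
    )
    return errors.most_common(top_n)


if __name__ == "__main__":
    # Toy labels for illustration only.
    true = ["mee_pok", "mee_pok", "hokkien_mee", "laksa"]
    pred = ["bak_chor_mee", "bak_chor_mee", "char_kway_teow", "laksa"]
    for (actual, guessed), count in most_confused(true, pred):
        print(f"{actual} predicted as {guessed}: {count}")
```

The pairs at the top of this list are the classes whose training folders are worth reviewing first.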