Real-Time Inventory Management with Vision LLMs

Manual inventory management is a pain. Staff manually counting items, typing product names, and dealing with inevitable typos - it's slow, error-prone, and often means your inventory records are outdated before they're even saved. But what if you could just point a camera at your warehouse shelf and have everything automatically identified and logged into your ERP system?
That's exactly what we built. This project uses Vision Language Models (VLMs) to transform how businesses track inventory, turning a tedious manual process into something that happens in real-time with a quick photo.
Why This Matters
Traditional inventory management has three major problems:
Human Error: Typos in product names and miscounts create inconsistent data
Time Lag: Records get updated hours or days after stock arrives
Wasted Effort: Staff spend valuable time on data entry instead of core tasks
By using computer vision, we eliminate all three issues. The system identifies products from images and syncs them instantly with your ERP.
How It Works
The solution is a full-stack TypeScript application designed for reliability and real-world use. Here's the stack:
Frontend (React 19): Built with Vite, uses the browser's Camera API for high-resolution captures. Clean, responsive UI with Tailwind CSS and shadcn/ui that works on mobile and desktop
Backend (NestJS): Handles image processing, API authentication, and ERP synchronization with robust business logic
Database (PostgreSQL): Uses the pg_trgm extension for fuzzy matching - critical for aligning AI-detected names with existing products
AI (Vision LLM): Currently using Gemini 3 Flash for exceptional speed and accuracy. With Google's free tier offering 1,500 requests/day, it's incredibly cost-effective for small to medium businesses
The Secret Sauce: Prompt Engineering
Getting structured data from the AI is crucial. We use a carefully crafted prompt that enforces a strict JSON schema, making the response easy to parse without fragile string manipulation. This prompt works across industries - groceries, hardware, electronics, medical supplies, you name it.
Analyze this image and identify all visible inventory items or products.
For each item, provide:
1. Product name (be specific, include size or variant if visible)
2. Brand (if visible on the packaging)
3. Category (e.g., "Industrial", "Dairy", "Office Supplies")
4. Quantity or Size (e.g., "1kg", "12 units", "Box of 50")
5. Confidence score (a number from 0 to 100 based on your certainty)
Return ONLY a valid JSON array. Do not include markdown formatting or extra text.
Structure:
[
  {
    "name": "string",
    "brand": "string or null",
    "category": "string",
    "quantity": "string or null",
    "confidence": number
  }
]
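On the TypeScript side, this schema maps naturally to an interface. Here's a minimal sketch (the `DetectedItem` name matches the type used later in the matching service; the validation helper is illustrative, not part of the original code) that checks the parsed array before it reaches any business logic:

```typescript
// Shape of one item, mirroring the prompt's JSON schema.
interface DetectedItem {
  name: string;
  brand: string | null;
  category: string;
  quantity: string | null;
  confidence: number; // 0-100
}

// Narrow unknown parsed JSON into DetectedItem[], dropping malformed entries
// instead of letting them crash downstream code.
function validateDetectedItems(parsed: unknown): DetectedItem[] {
  if (!Array.isArray(parsed)) return [];
  return parsed.filter((item): item is DetectedItem =>
    typeof item === 'object' && item !== null &&
    typeof (item as any).name === 'string' &&
    typeof (item as any).category === 'string' &&
    typeof (item as any).confidence === 'number'
  );
}
```

Filtering rather than throwing means one garbled entry doesn't discard an otherwise good scan.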
Implementation: The Two-Service Approach
Vision Service
The VisionService talks to the Vision LLM and processes base64-encoded images:
import { Injectable } from '@nestjs/common';
import { GoogleGenAI } from '@google/genai';

@Injectable()
export class VisionService {
  private client: GoogleGenAI;

  constructor() {
    this.client = new GoogleGenAI({
      apiKey: process.env.GEMINI_API_KEY,
    });
  }

  async analyzeImage(imageBase64: string) {
    // Strip the data URL prefix if the frontend sent one
    const base64Data = imageBase64.replace(/^data:image\/\w+;base64,/, '');

    const result = await this.client.models.generateContent({
      model: 'gemini-3-flash',
      contents: [{
        role: 'user',
        parts: [
          { text: VISION_PROMPT }, // the prompt shown in the previous section
          {
            inlineData: {
              data: base64Data,
              mimeType: 'image/jpeg',
            },
          },
        ],
      }],
    });

    const text = result.text;
    if (!text) throw new Error('No response from Vision LLM');

    // Pull the JSON array out of the response text
    const jsonMatch = text.match(/\[[\s\S]*\]/);
    if (!jsonMatch) throw new Error('Invalid JSON format');
    return JSON.parse(jsonMatch[0]);
  }
}
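One fragile spot in the service above is the bare regex extraction: despite the "no markdown" instruction, models occasionally wrap their JSON in code fences anyway. A hardened standalone helper (the `extractJsonArray` name is ours, not from the original code) might look like:

```typescript
// Extract the first JSON array from model output, tolerating markdown
// code fences and surrounding prose; throws if nothing parseable is found.
function extractJsonArray(text: string): unknown[] {
  // Strip ```json ... ``` fences if the model added them anyway.
  const unfenced = text.replace(/```(?:json)?/g, '').trim();
  const match = unfenced.match(/\[[\s\S]*\]/);
  if (!match) throw new Error('No JSON array in model response');
  const parsed = JSON.parse(match[0]);
  if (!Array.isArray(parsed)) throw new Error('Parsed value is not an array');
  return parsed;
}
```

Centralizing this in one helper also gives you a single place to add retries or logging when a response fails to parse.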
Smart Product Matching
The AI rarely gives us exact database matches. The ProductMatchingService uses a three-tier approach to handle this:
async findBestMatch(item: DetectedItem): Promise<ProductMatch> {
  // Tier 1: exact match on name + brand
  const exactMatch = await this.productRepository.findOne({
    where: {
      name: item.name,
      brand: item.brand,
    },
  });

  if (exactMatch) {
    return {
      productId: exactMatch.id,
      matchScore: 100,
      isManualAddNeeded: false,
    };
  }

  // Tier 2: pg_trgm fuzzy match on the product name
  const fuzzyMatches = await this.productRepository.query(
    `SELECT *, similarity(name, $1) AS score
     FROM products
     WHERE similarity(name, $1) > 0.3
     ORDER BY score DESC
     LIMIT 1`,
    [item.name]
  );

  if (fuzzyMatches.length > 0) {
    const score = Math.round(fuzzyMatches[0].score * 100);
    return {
      productId: fuzzyMatches[0].id,
      matchScore: score,
      isManualAddNeeded: score < 75, // low-confidence matches go to human review
    };
  }

  // Tier 3: no match at all - flag for manual product creation
  return {
    productId: null,
    matchScore: 0,
    isManualAddNeeded: true,
  };
}
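The fuzzy tier only works if pg_trgm is actually enabled in the database. A migration sketch in the TypeORM style (class and index names are our own; in a real project you'd implement TypeORM's `MigrationInterface` directly) shows the idea, including a trigram GIN index so `similarity()` lookups stay fast as the products table grows:

```typescript
// Minimal stand-in for TypeORM's QueryRunner (only the method used here);
// in a real project, import MigrationInterface and QueryRunner from 'typeorm'.
interface QueryRunnerLike {
  query(sql: string): Promise<unknown>;
}

// Enables pg_trgm so similarity() is available, and adds a trigram
// GIN index on products.name to keep fuzzy lookups fast at scale.
export class EnablePgTrgm {
  public async up(queryRunner: QueryRunnerLike): Promise<void> {
    await queryRunner.query(`CREATE EXTENSION IF NOT EXISTS pg_trgm`);
    await queryRunner.query(
      `CREATE INDEX IF NOT EXISTS idx_products_name_trgm
       ON products USING gin (name gin_trgm_ops)`
    );
  }

  public async down(queryRunner: QueryRunnerLike): Promise<void> {
    await queryRunner.query(`DROP INDEX IF EXISTS idx_products_name_trgm`);
    await queryRunner.query(`DROP EXTENSION IF EXISTS pg_trgm`);
  }
}
```

Without the index, `similarity(name, $1)` forces a sequential scan over the whole table on every photo.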
Real-World Impact
This isn't just a proof of concept - it's deployed and running in production. Here's what we've seen:
Massive Time Savings: Scanning a shelf takes seconds instead of minutes of manual entry
Better Data Quality: AI + fuzzy matching reduces inventory "noise" significantly
Smart Workflows: Items with confidence scores below 75% get flagged for human review automatically
Real-Time Visibility: ERP reflects physical stock changes almost instantly, enabling better forecasting
What's Next: Going Local with Ollama
While cloud-based Vision LLMs like Gemini 3 Flash are great for production, there's growing interest in running these models locally for privacy, cost control, or offline operation. That's where Ollama comes in.
Why Run Vision Models Locally?
Privacy First: Your inventory data never leaves your servers
Zero API Costs: No per-request charges or rate limits
Offline Capability: Works without internet connectivity
You can refer to ollama.com/search?c=vision for a list of vision-capable models.
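Swapping Gemini for a local model is mostly a transport change: Ollama's REST API accepts base64-encoded images on its /api/generate endpoint. Here's a sketch (the model name is just an example from that search page - use whichever vision model you've pulled; the request builder is split out as a pure function so it can be tested without a running server):

```typescript
// Build the request body for Ollama's /api/generate endpoint.
// Pure function: no network access, easy to unit-test.
function buildOllamaRequest(prompt: string, imageBase64: string) {
  return {
    model: 'llama3.2-vision', // example model; any vision model from ollama.com works
    prompt,
    // Ollama expects raw base64, so strip any data URL prefix
    images: [imageBase64.replace(/^data:image\/\w+;base64,/, '')],
    format: 'json', // ask Ollama to constrain output to valid JSON
    stream: false,
  };
}

// Sketch of the call itself (assumes Ollama on its default port 11434).
async function analyzeImageLocally(prompt: string, imageBase64: string) {
  const res = await fetch('http://localhost:11434/api/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(buildOllamaRequest(prompt, imageBase64)),
  });
  const data = await res.json();
  return JSON.parse(data.response); // Ollama returns the generated text in `response`
}
```

Because the prompt and parsing logic stay the same, the rest of the pipeline - validation, fuzzy matching, ERP sync - doesn't need to know which backend produced the JSON.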
Conclusion
Vision LLMs are transforming inventory management from a manual chore into an automated, real-time process. Whether you choose cloud-based models like Gemini 3 Flash for their speed and accuracy, or local models via Ollama for privacy and cost control, the technology is here and ready for production use.
The future? Even better. As models get smaller and faster, we'll see more edge deployment, multi-camera automation, and seamless ERP integration becoming the standard rather than the exception.
Want to try it yourself? The code is at github.com/yo9e5h/inventory-vision-demo



