Real-Time Inventory Management with Vision LLMs

Manual inventory management is a pain. Staff manually counting items, typing product names, and dealing with inevitable typos - it's slow, error-prone, and often means your inventory records are outdated before they're even saved. But what if you could just point a camera at your warehouse shelf and have everything automatically identified and logged into your ERP system?
That's exactly what we built. This project uses Vision Language Models (VLMs) to transform how businesses track inventory, turning a tedious manual process into something that happens in real-time with a quick photo.
Why This Matters
Traditional inventory management has three major problems:
Human Error: Typos in product names and miscounts create inconsistent data
Time Lag: Records get updated hours or days after stock arrives
Wasted Effort: Staff spend valuable time on data entry instead of core tasks
By using computer vision, we eliminate all three issues. The system identifies products from images and syncs them instantly with your ERP.
How It Works
The solution is a full-stack TypeScript application designed for reliability and real-world use. Here's the stack:
Frontend (React 19): Built with Vite, uses the browser's Camera API for high-resolution captures. Clean, responsive UI with Tailwind CSS and shadcn/ui that works on mobile and desktop
Backend (NestJS): Handles image processing, API authentication, and ERP synchronization with robust business logic
Database (PostgreSQL): Uses the pg_trgm extension for fuzzy matching - critical for aligning AI-detected names with existing products
AI (Vision LLM): Currently using Gemini 3 Flash for exceptional speed and accuracy. With Google's free tier offering 1,500 requests/day, it's incredibly cost-effective for small to medium businesses
The Secret Sauce: Prompt Engineering
Getting structured data from the AI is crucial. We use a carefully crafted prompt that enforces a strict JSON schema, making the response easy to parse without fragile string manipulation. This prompt works across industries - groceries, hardware, electronics, medical supplies, you name it.
Analyze this image and identify all visible inventory items or products.
For each item, provide:
1. Product name (be specific, include size or variant if visible)
2. Brand (if visible on the packaging)
3. Category (e.g., "Industrial", "Dairy", "Office Supplies")
4. Quantity or Size (e.g., "1kg", "12 units", "Box of 50")
5. Confidence score (a number from 0 to 100 based on your certainty)
Return ONLY a valid JSON array. Do not include markdown formatting or extra text.
Structure:
[
  {
    "name": "string",
    "brand": "string or null",
    "category": "string",
    "quantity": "string or null",
    "confidence": number
  }
]
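On the TypeScript side, this schema maps naturally to an interface. Here's a minimal sketch (the `DetectedItem` name matches the type used later in the matching service; the validation helper is illustrative, not part of the original code) that checks the parsed array before it reaches any business logic:

```typescript
// Shape of one item, mirroring the prompt's JSON schema.
interface DetectedItem {
  name: string;
  brand: string | null;
  category: string;
  quantity: string | null;
  confidence: number; // 0-100
}

// Narrow unknown parsed JSON into DetectedItem[], dropping malformed entries
// instead of letting them crash downstream code.
function validateDetectedItems(parsed: unknown): DetectedItem[] {
  if (!Array.isArray(parsed)) return [];
  return parsed.filter((item): item is DetectedItem =>
    typeof item === 'object' && item !== null &&
    typeof (item as any).name === 'string' &&
    typeof (item as any).category === 'string' &&
    typeof (item as any).confidence === 'number'
  );
}
```

Filtering rather than throwing means one garbled entry doesn't discard an otherwise good scan.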
Implementation: The Two-Service Approach
Vision Service
The VisionService talks to the Vision LLM and processes base64-encoded images:
import { Injectable } from '@nestjs/common';
import { GoogleGenAI } from '@google/genai';

@Injectable()
export class VisionService {
  private client: GoogleGenAI;

  constructor() {
    this.client = new GoogleGenAI({
      apiKey: process.env.GEMINI_API_KEY,
    });
  }

  async analyzeImage(imageBase64: string) {
    // Strip the data URL prefix if the frontend sent one
    const base64Data = imageBase64.replace(/^data:image\/\w+;base64,/, '');

    const result = await this.client.models.generateContent({
      model: 'gemini-3-flash',
      contents: [{
        role: 'user',
        parts: [
          { text: VISION_PROMPT }, // the prompt shown in the previous section
          {
            inlineData: {
              data: base64Data,
              mimeType: 'image/jpeg',
            },
          },
        ],
      }],
    });

    const text = result.text;
    if (!text) throw new Error('No response from Vision LLM');

    // Pull the JSON array out of the response text
    const jsonMatch = text.match(/\[[\s\S]*\]/);
    if (!jsonMatch) throw new Error('Invalid JSON format');
    return JSON.parse(jsonMatch[0]);
  }
}
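One fragile spot in the service above is the bare regex extraction: despite the "no markdown" instruction, models occasionally wrap their JSON in code fences anyway. A hardened standalone helper (the `extractJsonArray` name is ours, not from the original code) might look like:

```typescript
// Extract the first JSON array from model output, tolerating markdown
// code fences and surrounding prose; throws if nothing parseable is found.
function extractJsonArray(text: string): unknown[] {
  // Strip ```json ... ``` fences if the model added them anyway.
  const unfenced = text.replace(/```(?:json)?/g, '').trim();
  const match = unfenced.match(/\[[\s\S]*\]/);
  if (!match) throw new Error('No JSON array in model response');
  const parsed = JSON.parse(match[0]);
  if (!Array.isArray(parsed)) throw new Error('Parsed value is not an array');
  return parsed;
}
```

Centralizing this in one helper also gives you a single place to add retries or logging when a response fails to parse.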
Smart Product Matching
The AI rarely gives us exact database matches. The ProductMatchingService uses a three-tier approach to handle this:
async findBestMatch(item: DetectedItem): Promise<ProductMatch> {
  // Tier 1: exact match on name + brand
  const exactMatch = await this.productRepository.findOne({
    where: {
      name: item.name,
      brand: item.brand,
    },
  });

  if (exactMatch) {
    return {
      productId: exactMatch.id,
      matchScore: 100,
      isManualAddNeeded: false,
    };
  }

  // Tier 2: pg_trgm fuzzy match on the product name
  const fuzzyMatches = await this.productRepository.query(
    `SELECT *, similarity(name, $1) AS score
     FROM products
     WHERE similarity(name, $1) > 0.3
     ORDER BY score DESC
     LIMIT 1`,
    [item.name]
  );

  if (fuzzyMatches.length > 0) {
    const score = Math.round(fuzzyMatches[0].score * 100);
    return {
      productId: fuzzyMatches[0].id,
      matchScore: score,
      isManualAddNeeded: score < 75, // low-confidence matches go to human review
    };
  }

  // Tier 3: no match at all - flag for manual product creation
  return {
    productId: null,
    matchScore: 0,
    isManualAddNeeded: true,
  };
}
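The fuzzy tier only works if pg_trgm is actually enabled in the database. A migration sketch in the TypeORM style (class and index names are our own; in a real project you'd implement TypeORM's `MigrationInterface` directly) shows the idea, including a trigram GIN index so `similarity()` lookups stay fast as the products table grows:

```typescript
// Minimal stand-in for TypeORM's QueryRunner (only the method used here);
// in a real project, import MigrationInterface and QueryRunner from 'typeorm'.
interface QueryRunnerLike {
  query(sql: string): Promise<unknown>;
}

// Enables pg_trgm so similarity() is available, and adds a trigram
// GIN index on products.name to keep fuzzy lookups fast at scale.
export class EnablePgTrgm {
  public async up(queryRunner: QueryRunnerLike): Promise<void> {
    await queryRunner.query(`CREATE EXTENSION IF NOT EXISTS pg_trgm`);
    await queryRunner.query(
      `CREATE INDEX IF NOT EXISTS idx_products_name_trgm
       ON products USING gin (name gin_trgm_ops)`
    );
  }

  public async down(queryRunner: QueryRunnerLike): Promise<void> {
    await queryRunner.query(`DROP INDEX IF EXISTS idx_products_name_trgm`);
    await queryRunner.query(`DROP EXTENSION IF EXISTS pg_trgm`);
  }
}
```

Without the index, `similarity(name, $1)` forces a sequential scan over the whole table on every photo.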
Real-World Impact
This isn't just a proof of concept - it's deployed and running in production. Here's what we've seen:
Massive Time Savings: Scanning a shelf takes seconds instead of minutes of manual entry
Better Data Quality: AI + fuzzy matching reduces inventory "noise" significantly
Smart Workflows: Items with confidence scores below 75% get flagged for human review automatically
Real-Time Visibility: ERP reflects physical stock changes almost instantly, enabling better forecasting
What's Next: Going Local with Ollama
While cloud-based Vision LLMs like Gemini 3 Flash are great for production, there's growing interest in running these models locally for privacy, cost control, or offline operation. That's where Ollama comes in.
Why Run Vision Models Locally?
Privacy First: Your inventory data never leaves your servers
Zero API Costs: No per-request charges or rate limits
Offline Capability: Works without internet connectivity
You can refer to ollama.com/search?c=vision for a list of vision-capable models.
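Swapping Gemini for a local model is mostly a transport change: Ollama's REST API accepts base64-encoded images on its /api/generate endpoint. Here's a sketch (the model name is just an example from that search page - use whichever vision model you've pulled; the request builder is split out as a pure function so it can be tested without a running server):

```typescript
// Build the request body for Ollama's /api/generate endpoint.
// Pure function: no network access, easy to unit-test.
function buildOllamaRequest(prompt: string, imageBase64: string) {
  return {
    model: 'llama3.2-vision', // example model; any vision model from ollama.com works
    prompt,
    // Ollama expects raw base64, so strip any data URL prefix
    images: [imageBase64.replace(/^data:image\/\w+;base64,/, '')],
    format: 'json', // ask Ollama to constrain output to valid JSON
    stream: false,
  };
}

// Sketch of the call itself (assumes Ollama on its default port 11434).
async function analyzeImageLocally(prompt: string, imageBase64: string) {
  const res = await fetch('http://localhost:11434/api/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(buildOllamaRequest(prompt, imageBase64)),
  });
  const data = await res.json();
  return JSON.parse(data.response); // Ollama returns the generated text in `response`
}
```

Because the prompt and parsing logic stay the same, the rest of the pipeline - validation, fuzzy matching, ERP sync - doesn't need to know which backend produced the JSON.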
Conclusion
Vision LLMs are transforming inventory management from a manual chore into an automated, real-time process. Whether you choose cloud-based models like Gemini 3 Flash for their speed and accuracy, or local models via Ollama for privacy and cost control, the technology is here and ready for production use.
The future? Even better. As models get smaller and faster, we'll see more edge deployment, multi-camera automation, and seamless ERP integration becoming the standard rather than the exception.
Want to try it yourself? The code is at github.com/yo9e5h/inventory-vision-demo



