Custom Natural Language Processing: How to Choose a Provider
Every NLP demo works on ten clean documents. Production text is multilingual, PII-laden, and malformed. Choose a custom NLP provider on how it handles the messy documents: evaluation, redaction, and structured-output reliability.

Every natural language processing demo works. Feed a model ten clean documents and it classifies them perfectly, and everyone in the room nods. Production text is nothing like those ten documents. It arrives in three languages, half of it contains personal data you are not allowed to store, and a slice of it is malformed in ways no one predicted. A custom NLP provider earns its fee on that messy slice, not on the demo, and that is where you should be looking.
Where custom NLP solutions actually break
Modern NLP is mostly a pipeline now: a large language model doing the reading, wrapped in code that feeds it the right text and checks what comes back. The reading is the easy, commoditised part. The pipeline is where projects live or die, and the failure points are boringly consistent. Mixed languages the model handles unevenly. Personal data that has to be detected and redacted before anything is stored, under GDPR. Outputs that are supposed to be structured and occasionally are not. Volume that turns a fine per-document cost into an unaffordable monthly one.
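To make the redact-before-storage step concrete, here is a minimal Python sketch. It assumes only two kinds of personal data (email addresses and phone-like numbers) matched with simple regular expressions. The pattern set and the `redact` function are hypothetical, and real detection needs much broader coverage than this.

```python
import re

# Hypothetical, deliberately small pattern set. Real PII detection needs far
# more than two regexes (names, addresses, national ID numbers, and so on).
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def redact(text: str) -> tuple[str, dict[str, int]]:
    """Replace each match with a [LABEL] placeholder and count the hits."""
    counts = {}
    for label, pattern in PII_PATTERNS.items():
        text, n = pattern.subn(f"[{label}]", text)
        counts[label] = n
    return text, counts


raw = "Contact Ana at ana.silva@example.com or +351 912 345 678 by Friday."
clean, found = redact(raw)
print(clean)  # Contact Ana at [EMAIL] or [PHONE] by Friday.
print(found)  # {'EMAIL': 1, 'PHONE': 1}
```

Note that the name "Ana" survives redaction. That gap is exactly the kind of thing to ask a provider about: regexes catch formats, not people.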
A provider selling custom NLP solutions on demo quality is selling you the easy 20%. The question that separates them is what happens to the other 80%: the documents that are not clean, not in one language, and not safe to log. Ask for the messy examples, not the polished ones.
Bring three of your own ugliest documents to the first meeting. A real NLP provider wants to see those. A demo-driven one wants to show you theirs.
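Once a provider has run your documents, uneven handling of mixed languages is easy to check. The sketch below, in Python, computes accuracy per language on a small labelled sample. The sample rows, labels, and the `accuracy_by_language` helper are all hypothetical, and accuracy stands in for whatever metric matches your outcome.

```python
from collections import defaultdict

# Hypothetical labelled sample: (language, expected label, predicted label).
# In practice these rows come from your own documents, not a benchmark.
sample = [
    ("pt", "contract", "contract"),
    ("pt", "invoice", "contract"),
    ("pt", "invoice", "invoice"),
    ("en", "contract", "contract"),
    ("en", "invoice", "invoice"),
]


def accuracy_by_language(rows):
    """Return {language: fraction of rows where prediction == expected}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for lang, expected, predicted in rows:
        total[lang] += 1
        correct[lang] += int(expected == predicted)
    return {lang: correct[lang] / total[lang] for lang in total}


scores = accuracy_by_language(sample)
for lang, acc in sorted(scores.items()):
    print(f"{lang}: {acc:.2f}")
```

A single blended accuracy figure would hide the gap between the two languages, which is why the split matters.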
Five things a serious NLP development provider gets right
Score a provider on production concerns, not benchmark scores.
- Evaluation on your documents. Public benchmark numbers say nothing about your text. A serious provider evaluates on a sample of your real documents and reports a metric tied to your outcome, not a leaderboard.
- Multilingual honesty. If your text mixes Portuguese and English, ask how accuracy differs by language. Models are uneven across languages, and a provider who claims uniform performance has not measured it.
- PII detection and redaction. Under GDPR, personal data in text has to be found and handled before storage. Ask how they detect it, what gets redacted, and where data sits. Vagueness here is a compliance risk you inherit.
- Structured-output reliability. If the pipeline must return clean JSON or a fixed schema, ask what happens when the model returns something malformed. Schema validation, retries, and a fallback are the marks of a system built for production, not a notebook.
- A human-review loop for low confidence. The pipeline should know when it is unsure and route those cases to a person rather than guess. Confident wrong answers are worse than flagged uncertain ones.
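The last two points can be sketched together in Python: validate the model's output against a schema, retry when it is malformed, and route failures or low-confidence results to a person. This is an illustration under stated assumptions, not a reference implementation. The two-field schema, the 0.8 threshold, the `call_model` callable, and the idea that the model reports its own confidence are all hypothetical.

```python
import json

# Hypothetical schema: field name -> accepted Python type(s).
REQUIRED_FIELDS = {"category": str, "confidence": (int, float)}
CONFIDENCE_THRESHOLD = 0.8  # assumed cut-off; tune it on your own documents
MAX_ATTEMPTS = 3


def parse_and_validate(raw: str) -> dict:
    """Parse model output as JSON and check it against the expected schema."""
    data = json.loads(raw)  # raises json.JSONDecodeError, a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            raise ValueError(f"missing or mistyped field: {field}")
    return data


def classify(document: str, call_model) -> dict:
    """Retry on malformed output, then fall back to human review."""
    for _ in range(MAX_ATTEMPTS):
        try:
            result = parse_and_validate(call_model(document))
        except ValueError:
            continue  # malformed output: ask the model again
        if result["confidence"] < CONFIDENCE_THRESHOLD:
            return {"route": "human_review", "reason": "low_confidence", **result}
        return {"route": "auto", **result}
    # Fallback: never guess when every attempt failed validation.
    return {"route": "human_review", "reason": "invalid_output"}


# Stub standing in for a real model call: first reply is truncated JSON,
# the second is valid.
replies = iter(['{"category": "invoice"', '{"category": "invoice", "confidence": 0.93}'])
outcome = classify("some document text", lambda doc: next(replies))
print(outcome)  # {'route': 'auto', 'category': 'invoice', 'confidence': 0.93}
```

The design choice worth noticing is that both failure paths end at a person. The pipeline never emits an unvalidated answer.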
What disciplined NLP looks like when it ships
A good text pipeline is judged on precision under real conditions, not on a demo. LexAlert, the legislative monitoring system we built for a Portuguese law firm, reads Portuguese legal text every three hours, matches it against a keyword and decree set calibrated to the firm’s practice areas, separates critical changes from routine ones, and deduplicates against everything it has ever flagged. The discipline that makes it work is the calibration: it was tuned on real legislation until the lawyers trusted that every alert was worth opening. That is production NLP, precision on messy real text, not a benchmark score.
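The general shape of such a pipeline (match, grade, deduplicate) can be sketched in a few lines of Python. This is not LexAlert's implementation. The keyword sets, the `triage` function, and the in-memory `seen_hashes` store are hypothetical stand-ins chosen for illustration.

```python
import hashlib

# Hypothetical keyword sets; a real system would calibrate these with the
# people who read the alerts.
CRITICAL_KEYWORDS = {"revoga", "entra em vigor"}
ROUTINE_KEYWORDS = {"retifica", "nomeia"}

seen_hashes = set()  # in production this would be persistent storage


def triage(text: str):
    """Return 'critical', 'routine', or None (no match or already flagged)."""
    # Normalise case and whitespace so trivial variations hash identically.
    normalized = " ".join(text.lower().split())
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return None  # duplicate of something already flagged
    if any(k in normalized for k in CRITICAL_KEYWORDS):
        level = "critical"
    elif any(k in normalized for k in ROUTINE_KEYWORDS):
        level = "routine"
    else:
        return None
    seen_hashes.add(digest)
    return level


print(triage("Decreto-Lei que revoga o artigo 5."))   # critical
print(triage("Decreto-Lei  que REVOGA o artigo 5."))  # None (duplicate)
print(triage("Despacho que nomeia o diretor."))       # routine
```

Exact-hash deduplication only catches near-identical text. Catching reworded duplicates would need fuzzier matching, which is part of the calibration work described above.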
There is an earlier decision worth making before you hire an NLP provider at all: whether your problem needs a trained model or just a well-built pipeline. We cover that in our guide to choosing a company to build custom machine learning models. For the broader category of language-driven automation, see our article on what AI automation is and is not. Most NLP needs in 2026 are pipeline problems, not training problems.
When off-the-shelf or no NLP wins
- A hosted API covers it. Common tasks like sentiment, generic entity extraction, or translation are solved products. Call one before commissioning a bespoke pipeline.
- A rule beats a model. If the text is structured and predictable, a few regular expressions or a lookup can outperform NLP at a fraction of the cost and with none of the unpredictability.
- The volume is tiny. If a person handles the documents in an hour a week, automating it is a hobby project, not an investment. Spend the effort where the volume is.
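As an example of the "rule beats a model" case, the Python sketch below pulls identifiers out of predictable text with one regular expression. The invoice-number format (`INV-YYYY-NNNNN`) and the `extract_invoice_ids` helper are assumptions made up for the example.

```python
import re

# Assumed format: invoice numbers look like "INV-2024-00123" (hypothetical).
INVOICE_RE = re.compile(r"\bINV-\d{4}-\d{5}\b")


def extract_invoice_ids(text: str) -> list[str]:
    """Return every invoice ID in the text, in order of appearance."""
    return INVOICE_RE.findall(text)


ids = extract_invoice_ids("Paid INV-2024-00123 and INV-2024-00124; INV-24-1 is invalid.")
print(ids)  # ['INV-2024-00123', 'INV-2024-00124']
```

The rule is deterministic, costs nothing per document, and fails in an obvious way when the format changes, which is often a better trade than a model for text this regular.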
Key takeaways
- Every NLP demo works. Production text is multilingual, PII-laden, and malformed, and that is where providers separate.
- Modern NLP is an LLM pipeline; the reading is commodity, the pipeline is where projects live or die.
- Score a provider on five things: evaluation on your own documents, multilingual honesty, PII redaction, structured-output reliability, and a low-confidence review loop.
- LexAlert shows disciplined NLP: precision on messy real legal text, earned through calibration, not a benchmark.
- Skip a custom provider when a hosted API fits, a rule beats a model, or the document volume is tiny.
The right custom NLP provider is the one who asks to see your messiest documents first. gamgi’s audit starts there, with your real text and the volume, languages, and privacy constraints around it, before scoping a pipeline. What does your hardest-to-read pile of documents actually cost you in hours each week?

