// ML, data science, computer vision

Custom ML, neural networks, and real data science

We build production ML for problems where off-the-shelf AI is not enough. Gradient boosting, custom neural networks, computer vision pipelines, fine-tuned foundation models, and RAG systems - shipped with the inference layer and the monitoring around it. Auditable models, real engineering, no LLM hype.

Scope a project
See a recent ML case
// what we build

Six things we do under the ML and data science umbrella.

Tabular ML and gradient boosting

XGBoost, LightGBM, CatBoost on raw transactional data. Churn prediction, risk scoring, reactivation probability, fraud detection, demand forecasting. Auditable scores with the features that drove them - not a black box. This is what we use when the problem is structured and the cost of a wrong answer is real money.

// examples
iGaming retention scoring (live pilot), risk and fraud scoring, demand and propensity models

Custom neural networks from scratch

When the problem has temporal structure or representation needs that gradient boosting cannot capture, we build the architecture. PyTorch and TensorFlow. Time series, sequence modeling, signal generation. Backtested before deployment, monitored after.

// examples
Trading signal generation (shipped), time series forecasting with exogenous variables
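
"Backtested before deployment" for time series usually means the model is only ever evaluated on data that comes after its training window. A minimal sketch of rolling walk-forward splits, assuming equally spaced observations indexed 0..n-1; the function name and window sizes are illustrative, not a fixed recipe:

```python
from typing import Iterator, List, Tuple


def walk_forward_splits(
    n_samples: int, train_size: int, test_size: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) pairs for a rolling-window backtest.

    Every test window starts right after its training window ends, so the
    model is never evaluated on data older than what it was trained on.
    """
    if train_size <= 0 or test_size <= 0:
        raise ValueError("train_size and test_size must be positive")
    start = 0
    while start + train_size + test_size <= n_samples:
        train_end = start + train_size
        yield (
            list(range(start, train_end)),
            list(range(train_end, train_end + test_size)),
        )
        start += test_size  # slide forward by one test window


# Hypothetical sizes: 10 observations, train on 4, test on the next 2.
splits = list(walk_forward_splits(n_samples=10, train_size=4, test_size=2))
for train_idx, test_idx in splits:
    print(train_idx, "->", test_idx)
```

Each fold is fit and scored independently, and the per-fold metrics show whether performance holds up across periods or only in one of them.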

Computer vision pipelines

Document understanding, OCR, layout analysis, and image classification. Claude Vision and Gemini for general document AI, custom OpenCV and small vision models when the task is constrained and latency matters. We choose the model that fits the constraint, not the model that sounds good.

// examples
Invoice and delivery-note extraction across varied supplier formats, document verification, structured QC checks

Foundation model fine-tuning and adaptation

Fine-tuning where the platform supports it (OpenAI, open-weight Llama and Mistral). Prompt engineering, retrieval-augmented generation, and prompt caching where it does not (Claude). We do not pretend fine-tuning is the answer to every domain problem - often RAG with the right chunking and reranking outperforms a fine-tune at a fraction of the cost and risk.

// examples
Domain-specific assistants, customer support deflection, internal knowledge agents

Retrieval-augmented generation and context engineering

Embedding selection, chunking strategy, hybrid search, reranking, and grounded generation. We design the retrieval stack so the LLM has the right context, not all the context. This is usually the highest-leverage move on real client problems before reaching for fine-tuning.

// examples
Customer support over product docs, legal and compliance Q&A, sales enablement
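
Hybrid search has to merge a keyword ranking and an embedding ranking into one list before reranking. One common way to do that is reciprocal rank fusion, sketched below. The document ids and both input rankings are hypothetical, and `k=60` is a conventional default, not a tuned value:

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]], k: int = 60
) -> List[str]:
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


keyword_hits = ["doc_a", "doc_b", "doc_c"]  # hypothetical BM25 order
vector_hits = ["doc_c", "doc_a", "doc_d"]   # hypothetical embedding order
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Documents that rank well in both lists rise to the top, and only that short fused list needs to go to the reranker and the LLM.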

Production ML serving and monitoring

FastAPI inference services, batch and streaming pipelines, model versioning, feature stores when justified, drift monitoring, and retraining loops. We ship the model and the surrounding system, not a Jupyter notebook the client cannot run.

// examples
Daily scoring jobs, real-time risk APIs, A/B-tested model rollouts
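
Drift monitoring can start as a comparison between the score distribution at training time and the distribution seen in production. A minimal sketch using the population stability index (PSI), assuming scores in [0, 1] and fixed bin edges; the sample values are made up, and the 0.25 alert level mentioned below is a common rule of thumb, not a universal threshold:

```python
import math
from typing import List, Sequence


def bin_fractions(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Share of values in each bin; edges are the interior cut points."""
    if not values:
        raise ValueError("values must not be empty")
    counts = [0] * (len(edges) + 1)
    for v in values:
        idx = sum(1 for e in edges if v >= e)  # number of edges at or below v
        counts[idx] += 1
    return [c / len(values) for c in counts]


def population_stability_index(
    baseline: Sequence[float],
    current: Sequence[float],
    edges: Sequence[float],
    eps: float = 1e-6,
) -> float:
    """PSI = sum over bins of (cur - base) * ln(cur / base)."""
    base = bin_fractions(baseline, edges)
    cur = bin_fractions(current, edges)
    psi = 0.0
    for b, c in zip(base, cur):
        b, c = max(b, eps), max(c, eps)  # avoid log(0) on empty bins
        psi += (c - b) * math.log(c / b)
    return psi


# Hypothetical score samples: training-time baseline vs. a shifted live window.
baseline_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
shifted_scores = [0.6, 0.7, 0.8, 0.9, 0.8, 0.9, 0.85, 0.95]
edges = [0.25, 0.5, 0.75]

psi_same = population_stability_index(baseline_scores, baseline_scores, edges)
psi_shifted = population_stability_index(baseline_scores, shifted_scores, edges)
print(f"same: {psi_same:.3f}  shifted: {psi_shifted:.3f}")
```

A PSI near zero means the live distribution matches the baseline. Values above roughly 0.25 are often treated as a signal to investigate and possibly retrain.
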
// how we work

Four principles that shape every ML project we ship.

01

Start with the data, not the model

Most ML projects fail at the data layer. We audit account-type flags, consent state, label leakage, time-window correctness, and class imbalance before any model is trained. A clean dataset with a simple model beats a dirty dataset with a sophisticated one.
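
Two of these checks are easy to sketch: time-window correctness (no feature may use data from after the moment the score would have been produced) and class imbalance. The `FeatureRow` fields below are hypothetical names for illustration, not a required schema:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class FeatureRow:
    entity_id: str
    feature_observed_at: datetime  # latest event used to compute the features
    prediction_cutoff: datetime    # moment the score would have been produced


def find_leaky_rows(rows: List[FeatureRow]) -> List[str]:
    """Return ids of rows whose features use data from after the cutoff."""
    return [r.entity_id for r in rows if r.feature_observed_at > r.prediction_cutoff]


def positive_rate(labels: List[int]) -> float:
    """Share of positive labels, a first look at class imbalance."""
    if not labels:
        raise ValueError("labels must not be empty")
    return sum(labels) / len(labels)


# Hypothetical rows: u2's features were computed from events after the cutoff.
rows = [
    FeatureRow("u1", datetime(2024, 1, 10), datetime(2024, 1, 31)),
    FeatureRow("u2", datetime(2024, 2, 3), datetime(2024, 1, 31)),
]
leaky = find_leaky_rows(rows)
rate = positive_rate([0, 0, 0, 1])
print("leaky rows:", leaky, "positive rate:", rate)
```

Any row flagged here would let the model see the future during training, which inflates offline metrics and disappears in production.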

02

Choose the simplest model that hits the bar

Gradient boosting on tabular data beats neural networks more often than people admit. We do not reach for transformers when XGBoost solves the problem. The right model is the one your team can audit, deploy, and retrain - not the one with the most-cited paper.

03

Do not use LLMs where auditable ML is the correct tool

LLMs hallucinate. If the answer drives money (a risk score, a credit decision, a churn probability), we use auditable ML. We use LLMs where hallucination is recoverable: drafting copy a human will review, summarizing for an analyst, generating explanations of an auditable score.

04

Ship the model and the system around it

Inference API, monitoring, retraining cadence, rollback path. A model in a notebook is not a deliverable. We do not call something production-ready unless it can run without us.

// when this is the right call
  • You have a structured prediction problem (churn, risk, fraud, demand, propensity) and off-the-shelf scoring APIs do not understand your data.
  • You need a custom architecture because the problem has temporal, multimodal, or domain-specific structure that pre-trained models do not capture.
  • You have a large document corpus or a vision pipeline where Claude Vision or GPT-4o alone is too expensive, too slow, or not accurate enough at production scale.
  • You want to fine-tune or adapt a foundation model for a domain (legal, medical, finance, gaming) and need someone to scope realistically what fine-tuning will and will not buy you.
  • You have a working LLM prototype and need to turn it into a production system with retrieval, evaluation, monitoring, and cost control.
// when it is not
  • Your problem is fully solved by an off-the-shelf API (OpenAI, Claude, an existing vendor model) and you do not need customization. We will tell you to use it and not bill you for what you do not need.
  • You do not have data yet. ML needs labels and history. If you are pre-data, we will build the automation first and the model later.
  • You want to fine-tune a foundation model when a well-designed prompt plus RAG would deliver the same outcome cheaper and faster. We will redirect you.
// recent ML work

Two cases that show how we actually build models.

// stack

What we use, when it actually fits.

ML and modeling
PyTorch, scikit-learn, XGBoost, LightGBM, CatBoost, Hugging Face Transformers
Foundation models
Claude (Anthropic), GPT-4 / GPT-4o (OpenAI), Llama 3, Mistral, Gemini
Vision
Claude Vision, Gemini Vision, OpenCV, YOLO, custom CNN backbones when warranted
Retrieval and RAG
pgvector, Qdrant, Pinecone, BM25 hybrid search, Cohere and Voyage rerankers
Serving and infra
FastAPI, Postgres, Redis, Vercel, Modal, AWS, on-prem when compliance requires
Tracking and ops
Weights & Biases, MLflow, Prefect, dbt for feature pipelines
// FAQ

Questions we actually get.

Do you train models from scratch or just use APIs?
Both, depending on the problem. For most structured prediction problems we train custom models from scratch (gradient boosting, custom neural networks). For language and vision problems we usually start with a foundation model and adapt it through prompting, RAG, or fine-tuning where supported. We do not reach for one approach by default.
Can you fine-tune Claude?
Anthropic does not currently expose general-purpose fine-tuning for Claude. We adapt Claude through prompt engineering, retrieval-augmented generation, prompt caching, and context engineering. For platforms that do support fine-tuning (OpenAI, open-weight models like Llama 3 and Mistral), we fine-tune when there is a real benefit over RAG. We will tell you honestly when fine-tuning is not worth the cost.
How do you decide between gradient boosting and a neural network?
Tabular data, up to a few million rows, usually goes to gradient boosting: training is faster, the model is auditable, and it tends to win on real-world structured prediction tasks. We move to neural networks when the problem has temporal structure (long sequences, time series with exogenous variables), multimodal inputs, or representation needs that boosting cannot capture.
What is your stance on LLM hallucination for scoring or decisions?
We do not use LLMs to produce risk scores, churn probabilities, or other numbers that drive money. A hallucinated risk score is a real loss. For decisioning we use auditable ML (gradient boosting, calibrated probabilities, explanations from feature importances or SHAP). We use Claude or GPT for tasks where hallucination is recoverable: drafting copy, summarizing for a human reviewer, generating explanations of an already-auditable score.
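
"Calibrated probabilities" can be checked directly: among cases scored around 0.9, roughly 90% should turn out positive. A minimal sketch of a reliability table and expected calibration error, assuming binary labels and equal-width probability bins; the sample scores and labels are made up:

```python
from typing import List, Sequence, Tuple


def reliability_table(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 5
) -> List[Tuple[float, float, int]]:
    """Per non-empty bin: (mean predicted probability, observed rate, count)."""
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    table = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            rate = sum(y for _, y in members) / len(members)
            table.append((mean_p, rate, len(members)))
    return table


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 5
) -> float:
    """Count-weighted average gap between predicted and observed rates."""
    total = len(probs)
    return sum(
        count / total * abs(mean_p - rate)
        for mean_p, rate, count in reliability_table(probs, labels, n_bins)
    )


# Hypothetical scores and outcomes.
probs = [0.1, 0.1, 0.9, 0.9]
labels = [0, 0, 1, 1]
table = reliability_table(probs, labels)
ece = expected_calibration_error(probs, labels)
print(table, round(ece, 3))
```

A large gap in any bin means the score cannot be read as a probability without recalibration, which matters when thresholds map to money.
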
Do you handle deployment and monitoring or just the model?
We ship the model and the system around it. Inference API, model versioning, drift monitoring, retraining cadence, rollback path. A Jupyter notebook is not a deliverable. We do not call something production-ready unless it can run without us.
How do you scope a custom ML project?
We start with a data audit and a clearly defined target metric. Then we agree on the smallest model that has a chance of hitting the bar, agree on the offline evaluation method, and ship that. Most projects are six to twelve weeks from data audit to a deployed v1. Larger systems with retraining and monitoring add four to eight weeks.

Have a problem an off-the-shelf model cannot solve?

Bring us the data and the target. We will scope honestly - including the cases where you do not need us.

Book a 30-min scoping call