LinkedIn is one of the richest sources of buying signals available to any technology services company. A LinkedIn job post for a Senior Data Engineer with Databricks and Azure experience is not a hiring ad. It is a company revealing a strategic bet, a new data platform, a modernization push, a capability gap they are paying to fill. That gap is an opportunity.
The problem is scale. A mid-sized consultancy might see dozens of these signals daily across its target verticals. Reading each posting, deciding whether it fits, finding internal proof points that demonstrate credibility, and writing a personalized email takes 30 to 45 minutes per opportunity when done properly. Done quickly, the outreach reads like it was rushed.
This post describes the system we built to close that gap. It takes a raw LinkedIn job posting, understands what the company is actually looking for, searches our internal knowledge base for the most relevant evidence, and generates a tailored outreach email, grounded in real projects, ready for review and sending.
We will keep the focus conceptual: the reasoning behind the architecture rather than the implementation details. If you are building something similar, we hope the thinking here is useful.
The Signal in the Noise
LinkedIn job posts contain far more information than a hiring manager intends to reveal.
Take a posting for a Cloud Infrastructure Lead that lists Kubernetes, Terraform, AWS, and drops a line about “modernizing our legacy on-premises environment.” That single post tells you where the company is in their cloud journey, what they are struggling with, and roughly what kind of help they need. That is genuinely useful for an outreach conversation, if you can extract it reliably.
The problem is that job posts are written for candidates, not for analysis. They mix must-haves with nice-to-haves, use inconsistent terminology, and vary wildly in how much detail they include. Simple keyword matching misses most of this nuance. You end up either over-filtering (missing real opportunities) or under-filtering (pursuing things that are not a fit).
We started from a different angle: treat the job post as a problem description, and treat our knowledge base as evidence that we have solved similar problems. That framing (job post as problem statement, internal knowledge as evidence) drove most of the design decisions that followed.
The System at a Glance
Here is the end-to-end flow before we get into the details. A job post goes in; a personalised, evidence-backed email draft comes out.

Each stage is an independently testable, observable unit. Let's walk through each of them in turn.
Stage 1: The Relevance Gate
Not every job post is worth pursuing. Before running retrieval and generation, which takes time and compute, we do a quick AI-based relevance check: does this posting fall within the industries and technology domains where we actually have experience?
Filtering early is not just about efficiency. An email written for an opportunity that is genuinely not a fit will always feel thin, regardless of how well the rest of the pipeline works. Getting this check right at the start means everything downstream is working on real opportunities rather than wasted motion.
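The relevance gate can be sketched roughly as below. The function and constant names (`call_llm`, `SERVICE_DOMAINS`, `TARGET_INDUSTRIES`) and the JSON verdict format are illustrative assumptions, not the real system's interface; `call_llm` is a stub standing in for an actual model client.

```python
# Sketch of the AI-based relevance gate, with a stubbed LLM call.
import json

SERVICE_DOMAINS = {"data engineering", "cloud infrastructure", "devops"}
TARGET_INDUSTRIES = {"finance", "healthcare", "manufacturing"}

def call_llm(prompt: str) -> str:
    """Stub standing in for a real LLM client; returns a canned verdict."""
    return json.dumps({"relevant": True, "domain": "data engineering",
                       "industry": "finance"})

def is_relevant(job_post: str) -> bool:
    """Cheap yes/no gate run before retrieval and generation."""
    prompt = (
        "Decide if this job post falls within our service domains "
        f"({', '.join(sorted(SERVICE_DOMAINS))}) and target industries "
        f"({', '.join(sorted(TARGET_INDUSTRIES))}).\n\n{job_post}\n"
        'Answer as JSON: {"relevant": bool, "domain": str, "industry": str}'
    )
    verdict = json.loads(call_llm(prompt))
    # Double-check the model's labels against the allow-lists rather than
    # trusting free-form output.
    return bool(verdict["relevant"]
                and verdict["domain"] in SERVICE_DOMAINS
                and verdict["industry"] in TARGET_INDUSTRIES)

print(is_relevant("Senior Data Engineer, Databricks + Azure, banking client"))
```

Validating the model's labels against a fixed allow-list is what makes the gate cheap to trust: a hallucinated domain name fails closed instead of slipping downstream.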
Stage 2: Understanding What the Job Post Really Wants
A job post is just text. To search against it meaningfully, you need to pull structured information out of it first.
We pass the job post through an LLM with carefully managed prompts that extract a set of structured fields. The goal is to turn what a recruiter wrote in natural language into something the retrieval system can actually work with.

The summary is a short distillation of what the role is really about — this is what goes into semantic search. The technology areas and industry tell us which parts of our knowledge base are even worth looking at. The key technologies are the raw terms that help with keyword-level matching on top of semantic similarity.
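The structured fields might look something like the sketch below. The exact schema and field names are assumptions; the post describes the categories of information, not the concrete data model. Parsing the model's JSON into a typed record before retrieval keeps malformed output from propagating.

```python
# Illustrative shape of the structured fields extracted from a job post.
import json
from dataclasses import dataclass, field

@dataclass
class JobPostAnalysis:
    summary: str                                           # feeds semantic search
    technology_areas: list = field(default_factory=list)   # scopes the knowledge base
    industry: str = ""                                     # scopes the knowledge base
    key_technologies: list = field(default_factory=list)   # feeds keyword matching

def parse_analysis(raw_llm_output: str) -> JobPostAnalysis:
    """Validate the model's JSON into a typed record before retrieval runs."""
    data = json.loads(raw_llm_output)
    return JobPostAnalysis(
        summary=data["summary"],
        technology_areas=data.get("technology_areas", []),
        industry=data.get("industry", ""),
        key_technologies=data.get("key_technologies", []),
    )

example = parse_analysis(json.dumps({
    "summary": "Lead a migration of on-prem data pipelines to Azure Databricks.",
    "technology_areas": ["data engineering"],
    "industry": "finance",
    "key_technologies": ["Databricks", "Azure", "Spark"],
}))
print(example.key_technologies)
```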
Prompts are managed centrally in Langfuse rather than buried in the codebase, which means we can update them, version them, and run experiments without a code deployment. We also inject live reference data into the prompts — the current list of technology domains and industries — so the model is working from a shared vocabulary rather than making up its own terms.
Stage 3: Building the Knowledge Base
Good retrieval depends on good source material. We built our knowledge base from three types of internal content:
Case Studies
- Crawled and ingested from our company website
- Tagged with the technologies and industry vertical each project covers
- The most direct proof that we have solved a specific type of problem before
- Carry the highest weight in scoring when there is a strong match
Client Project References
- Structured project data from internal records
- Covers engagements that may not have a public case study
- Adds breadth across projects and technology areas
- Particularly valuable for less common or emerging technology stacks
Certifications
- Synced automatically from our HR system
- Demonstrates team-wide competence, not just project history
- Useful signal when a job post asks for vendor-specific expertise
- Treated as supplemental evidence rather than primary proof
Everything in the knowledge base is stored with structured metadata alongside it, not just the text. Which technologies it covers, what industry it is from, when it was published. This ends up being just as important as the content itself, because the retrieval and scoring steps lean heavily on it to separate genuinely relevant results from ones that are merely similar-sounding.
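One possible shape for a knowledge-base entry is sketched below: the text plus the metadata that later stages filter and score on. All field names and values are illustrative assumptions, not the actual storage schema.

```python
# Illustrative knowledge-base entry: content plus the metadata that the
# retrieval and scoring stages depend on.
from dataclasses import dataclass

@dataclass
class EvidenceRecord:
    doc_id: str
    kind: str            # "case_study" | "project_reference" | "certification"
    text: str
    technologies: tuple  # tags used for scoped retrieval
    industry: str        # vertical, used for industry alignment
    published: str       # ISO date, used by the recency signal

record = EvidenceRecord(
    doc_id="cs-017",
    kind="case_study",
    text="Migrated a retail bank's reporting stack to Databricks on Azure.",
    technologies=("Databricks", "Azure"),
    industry="finance",
    published="2024-03-01",
)
print(record.kind)
```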
Stage 4: Finding the Right Evidence
With a structured understanding of the job post, and a metadata-rich knowledge base, we can now retrieve the most relevant evidence. This is where most of the architectural complexity lives.
Why not just do a similarity search?
Semantic search is a good starting point, but it has a real blind spot in this context: it tells you what is conceptually similar, not what is actually useful. A broad cloud infrastructure case study and a very specific Kubernetes migration case study might score almost the same on embedding similarity — but if the job post is specifically about Kubernetes, one of them is a much stronger reference than the other. Similarity alone does not capture that difference.
So we built a multi-stage retrieval pipeline that layers several signals on top of each other.

Scoping the search space
Before any search runs, we narrow the candidate pool using the metadata from the deconstruction step. A DevOps case study should not show up in an email about data analytics, full stop, even if the wording happens to be vaguely similar. So we filter by technology area and industry upfront, rather than hoping the ranking will push irrelevant results far enough down.
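The pre-filter amounts to a metadata intersection check, sketched here over plain dicts; a vector database would apply the same filter natively at query time. The field names are illustrative.

```python
# Metadata pre-filter: shrink the candidate pool before any search runs.
def scope_candidates(records, technology_areas, industry):
    """Keep only evidence tagged with an overlapping tech area and the same industry."""
    wanted = set(technology_areas)
    return [r for r in records
            if wanted & set(r["technology_areas"]) and r["industry"] == industry]

records = [
    {"id": "cs-01", "technology_areas": ["data analytics"], "industry": "finance"},
    {"id": "cs-02", "technology_areas": ["devops"], "industry": "finance"},
    {"id": "cs-03", "technology_areas": ["data analytics"], "industry": "retail"},
]
scoped = scope_candidates(records, ["data analytics"], "finance")
print([r["id"] for r in scoped])  # the DevOps and retail entries are dropped
```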
Hybrid search: semantic + keyword
Within that scoped pool, we run two searches in parallel: a dense semantic search on the job summary (good at capturing conceptual overlap) and a sparse keyword search on the raw technology terms (good at catching specific tool names that semantic models sometimes lump together or miss entirely). Both ranked lists are then merged using Reciprocal Rank Fusion, a straightforward technique for combining ranked lists without over-committing to either signal.
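Reciprocal Rank Fusion itself is a few lines: each document's fused score is the sum of `1 / (k + rank)` over the rankings it appears in. A minimal sketch, with `k=60` as the commonly used constant; the toy document IDs are invented for illustration.

```python
# Reciprocal Rank Fusion: merge a semantic ranking and a keyword ranking
# without hand-tuning weights between the two signals.
def rrf_merge(rankings, k=60):
    """rankings: list of ranked doc-id lists. Returns the fused ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["cs-07", "cs-02", "cs-11"]  # dense search on the job summary
keyword  = ["cs-02", "cs-19", "cs-07"]  # sparse search on raw tech terms
print(rrf_merge([semantic, keyword]))
```

Note how a document that places well in both lists (here "cs-02") outranks one that tops only a single list, which is exactly the behavior we want from fusion.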
Dynamic scoring
The merged results are re-scored using a combination of signals:
- Semantic similarity – How closely does the evidence content match the job post semantically?
- Primary technology match – Does this piece of evidence cover the specific technology areas the job post prioritizes?
- Industry alignment – Is the evidence from a project in the same industry vertical as the target company?
- Recency – More recent case studies and projects are weighted slightly higher, reflecting current capabilities.
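A re-scoring pass over those four signals might look like the sketch below. The weights, the two-year recency window, and the field names are illustrative assumptions; the post names the signals but not their relative importance.

```python
# Dynamic re-scoring sketch combining the four signals with assumed weights.
from datetime import date

WEIGHTS = {"semantic": 0.4, "tech": 0.3, "industry": 0.2, "recency": 0.1}

def recency_factor(published: date, today: date, window_days: int = 730) -> float:
    """Linearly decay toward 0 over roughly two years (an assumed window)."""
    age = (today - published).days
    return max(0.0, 1.0 - age / window_days)

def dynamic_score(similarity, doc, job, today):
    # Fraction of the job post's key technologies this evidence covers.
    tech_match = (len(set(doc["technologies"]) & set(job["key_technologies"]))
                  / max(1, len(job["key_technologies"])))
    industry_match = 1.0 if doc["industry"] == job["industry"] else 0.0
    return (WEIGHTS["semantic"] * similarity
            + WEIGHTS["tech"] * tech_match
            + WEIGHTS["industry"] * industry_match
            + WEIGHTS["recency"] * recency_factor(doc["published"], today))

doc = {"technologies": ["Databricks", "Azure"], "industry": "finance",
       "published": date(2024, 6, 1)}
job = {"key_technologies": ["Databricks", "Azure", "Spark"], "industry": "finance"}
score = dynamic_score(0.8, doc, job, today=date(2025, 6, 1))
print(round(score, 3))
```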
Diversity reranking
High scores alone can leave you with a redundant set: five case studies that all describe essentially the same type of project, just worded differently. We add a diversity step that penalises evidence too similar to something already selected, so the final set covers the job post's requirements from different angles rather than repeating the same point.
Why all these stages?
Each one catches something the others miss. Keyword search picks up specific tool names that semantic search glosses over. Dynamic scoring adds context that raw similarity scores ignore. Diversity reranking stops the same project from dominating the results. None of these is enough on its own — the value is in running them together.
Stage 5: Generating the Email
By the time we get here, the interesting work is done. The job post has been understood, the relevant evidence has been found and ranked, and the most useful proof points are sitting in a structured set. Generation is deliberately the simplest part of the pipeline: the model's job is to write a clear, professional email around the evidence, not to invent anything.

The prompt gives the model the job post summary, the company’s industry and technology context, and the curated evidence. It is told to reference specific projects and certifications rather than make generic claims, and to frame everything around the company’s stated challenge rather than our own services.
What comes out is an email grounded in something real: actual projects, specific technologies, verifiable outcomes. The sales team reviewing it can see exactly what evidence each point is drawing from, which makes the review fast and gives them confidence in what they are sending.
Grounding over generation
We deliberately constrain the LLM to draw only on the retrieved evidence. The value of the email is in its specificity, and specificity requires grounding in real facts. A model left to generate freely will produce fluent but unverifiable claims — exactly the kind of email that gets deleted.
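One way to enforce that constraint is to have the prompt require inline evidence citations and to reject drafts that cite anything outside the retrieved set. The bracketed-ID tag format and function below are assumptions for illustration, not the system's actual mechanism.

```python
# Grounding check sketch: the generation prompt asks the model to cite
# evidence IDs inline (e.g. "[cs-017]"), and a post-check verifies them.
import re

def citations_valid(draft: str, allowed_ids: set) -> bool:
    """True iff the draft cites at least one piece of evidence and
    every cited ID was actually retrieved."""
    cited = set(re.findall(r"\[([a-z]+-\d+)\]", draft))
    return bool(cited) and cited <= allowed_ids

draft = ("We recently migrated a bank's reporting stack to Databricks on "
         "Azure [cs-017] and hold several Azure certifications [cert-003].")
print(citations_valid(draft, {"cs-017", "cert-003", "pr-101"}))
print(citations_valid("We are experts in everything.", {"cs-017"}))
```

A draft with no citations at all fails the check too, which catches the fluent-but-unverifiable failure mode described above.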
The Automated Pipeline
Everything above can also be triggered on demand through an API, which is handy for testing or one-off assessments. But the main use case runs on a schedule: every morning, the system pulls the previous day’s job posts from our CRM, runs them through the pipeline, and queues up email drafts for the sales team.

Email tasks are distributed across team members based on current capacity, so the work spreads evenly rather than piling into one inbox. By the time the team sits down in the morning, the batch is done, relevant opportunities flagged, irrelevant ones filtered out, drafts ready to review and send. What used to take a salesperson 30 to 45 minutes per opportunity (reading the post, searching for relevant projects, drafting and editing) now takes under 5 minutes for review and send. The research is done before anyone opens their laptop.
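Capacity-based distribution can be as simple as a greedy policy that always hands the next draft to the member with the most remaining relative capacity. The policy and names below are an assumption for illustration; the real assignment logic is not described in detail.

```python
# Capacity-based assignment sketch: greedy, lowest-relative-load-first.
import heapq

def distribute(drafts, capacity):
    """capacity: {member: slots per day}. Returns {member: [draft, ...]}."""
    # Min-heap keyed on load relative to each member's capacity.
    heap = [(0.0, member) for member in capacity]
    heapq.heapify(heap)
    assignments = {member: [] for member in capacity}
    for draft in drafts:
        load, member = heapq.heappop(heap)
        assignments[member].append(draft)
        heapq.heappush(heap, (load + 1.0 / capacity[member], member))
    return assignments

out = distribute(["d1", "d2", "d3", "d4"], {"alice": 2, "bob": 2})
print({member: len(queue) for member, queue in out.items()})
```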
Making It Production-Ready
Getting something working in a controlled environment is one thing. Keeping it working reliably when real data comes in every day is another. A few things we focused on to close that gap:
Observability throughout
Every LLM call is traced end-to-end in Langfuse, which gives us full visibility into prompt inputs, model outputs, token usage, and latency. On the infrastructure side, we run OpenTelemetry with Grafana, Loki, and Prometheus for metrics, logs, and distributed tracing across the entire application. When something goes wrong (and in production AI systems, things do go wrong), we can diagnose it quickly without guesswork.
Evaluation as a CI/CD gate
Prompt changes and retrieval logic changes can silently degrade output quality — a model might produce fluent text that is no longer accurate or relevant. We treat evaluation as a first-class engineering practice: a suite of tests runs against ground-truth datasets on every pull request, and a minimum average score threshold must be met before a merge is allowed. Regressions are caught before they reach production.
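The shape of such a gate is straightforward: score each pipeline output against ground truth, average, and fail the build below a threshold. The threshold value, the toy dataset, and the exact-match scorer below are illustrative assumptions.

```python
# Evaluation-as-a-CI-gate sketch: the CI job asserts on the average score.
MIN_AVG_SCORE = 0.75  # merge is blocked below this (assumed value)

def evaluate(outputs, ground_truth, score_fn):
    """Average per-example score over a ground-truth dataset."""
    scores = [score_fn(out, gold) for out, gold in zip(outputs, ground_truth)]
    return sum(scores) / len(scores)

# Toy scorer: exact match on the top retrieved document.
score_fn = lambda out, gold: 1.0 if out == gold else 0.0
avg = evaluate(["cs-01", "cs-02", "cs-03", "cs-09"],
               ["cs-01", "cs-02", "cs-03", "cs-04"], score_fn)
print(avg, avg >= MIN_AVG_SCORE)
```

In practice the scorer would be one of the evaluation metrics (or the LLM judge described next) rather than exact match, but the gate logic is the same.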
LLM-as-a-judge evaluation
For subjective output quality, like whether a retrieved case study is truly relevant to a given job post, we use a separate LLM call as an automated judge. This allows us to evaluate thousands of retrieval decisions without manual review, while still maintaining a principled quality signal tied to a numeric score.
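A judge call boils down to asking a second model for a graded verdict in a parseable format. In the sketch below, `judge_call` is a stub standing in for a real judge-model client, and the 1-to-5 scale and "Score:" convention are assumptions for illustration.

```python
# LLM-as-a-judge sketch: grade one retrieval decision on a numeric scale.
import re

def judge_call(prompt: str) -> str:
    """Stub standing in for a real judge-model client."""
    return "The case study addresses the same stack and problem. Score: 4"

def judge_relevance(job_summary: str, evidence_text: str) -> int:
    prompt = (f"Job post: {job_summary}\nEvidence: {evidence_text}\n"
              "Rate the evidence's relevance from 1 (unrelated) to 5 "
              "(direct proof). End your answer with 'Score: <n>'.")
    match = re.search(r"Score:\s*([1-5])", judge_call(prompt))
    if not match:
        raise ValueError("judge reply did not contain a parseable score")
    return int(match.group(1))

print(judge_relevance("Databricks migration lead", "Azure Databricks case study"))
```

Forcing a fixed output convention and failing loudly on unparseable replies is what turns a subjective judgment into the numeric score the CI gate can aggregate.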
Provider flexibility
We built the system to be agnostic to the underlying LLM provider. Switching between a locally hosted open-source model and a cloud-hosted commercial model requires a configuration change, not a code change. This matters for cost management, data privacy requirements, and resilience — if a cloud provider has an outage, the system can fall back to a local model.
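The pattern behind that flexibility is a thin client factory driven by configuration, sketched below. The class and provider names are illustrative; the real backends would wrap an actual local inference server and a hosted API behind the same `complete` interface.

```python
# Provider-agnostic client sketch: the backend is chosen by configuration,
# so switching providers is a config change, not a code change.
from dataclasses import dataclass

@dataclass
class LocalModel:
    name: str
    def complete(self, prompt: str) -> str:
        return f"[local:{self.name}] ..."   # would call a local inference server

@dataclass
class CloudModel:
    name: str
    def complete(self, prompt: str) -> str:
        return f"[cloud:{self.name}] ..."   # would call a hosted API

def make_client(config: dict):
    providers = {"local": LocalModel, "cloud": CloudModel}
    return providers[config["provider"]](name=config["model"])

client = make_client({"provider": "local", "model": "llama-3-70b"})
print(client.complete("draft the email"))
```

The same factory is also the natural place for outage fallback: catch the cloud client's failure and retry the call through the local one.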
What We Learned
Building this system taught us things that apply well beyond the specific problem of sales outreach automation. A few that stand out:
Structure is more valuable than fluency
The most impactful investment was not in the email generation prompt — it was in the job post deconstruction and evidence structuring. A well-structured job analysis and a well-tagged knowledge base do most of the work. The LLM’s job is just to write clearly around the structure you give it.
Retrieval quality determines email quality
The generation step cannot compensate for poor retrieval. If the evidence set contains tangentially relevant or redundant items, the email will reflect that. We spent more time tuning the retrieval and scoring pipeline than any other part of the system, and it paid off proportionally.
Human review still matters, at every level
The system produces drafts, not finished emails. Every one gets reviewed by a person before it goes out, and that is intentional. The sender should understand what they are signing off on. Automation does the research; people still own the relationship.
But this principle runs deeper than the final output. When you change a prompt, adjust a scoring weight, or modify retrieval logic, automated metrics will tell you whether aggregate scores improved or dropped, and that matters. But they do not tell the full story. Actually reading through individual results (what the system pulled, whether the evidence set makes sense for this particular job post, whether the email feels right) gives you a kind of understanding that a dashboard score simply cannot.
The same goes for knowing your data. Spending time reading actual job posts, case studies, and generated emails — not skimming, but paying attention — surfaces problems that evals miss. You start to recognise where the data is patchy, where the model is confidently wrong, and where the pipeline is doing something quietly clever that you did not plan for. That familiarity is hard to quantify, but it is what makes you good at improving the system when something goes wrong. Metrics are a tool, not a substitute for knowing what you built.
Closing Thoughts
What started as a narrow problem, reducing the time it takes to write one cold email, turned into a broader exercise in building reliable, observable, evidence-driven AI applications. The architecture we built is not novel in any single component; hybrid retrieval, multi-stage reranking, and prompt management are all well-understood techniques. What matters is how they are assembled, and whether the system behaves predictably in production.
The result is a pipeline that a small sales team uses daily to handle a volume of outreach that would have required a much larger team doing this manually. The emails it produces are specific, credible, and grounded, because they have to be backed by something real.
If you are building systems that connect enterprise data signals to sales or business development workflows, we would be glad to compare notes. The hard problems in this space are worth talking about.