LinkedIn is one of the richest sources of organizational intelligence available today. A job post for a Senior Data Engineer with Databricks and Azure experience is not just a hiring ad. It’s a company revealing a strategic bet, a new data platform, a modernization push, a capability gap they’re paying to fill. That gap carries meaning, whether you’re in sales, HR, partnerships, or business development.
The problem is scale. A mid-sized organization might see dozens of these signals daily across its target verticals. Reading each posting, deciding if it fits, finding internal proof points that demonstrate credibility, and writing a personalized message takes 30 to 45 minutes per opportunity when done properly. Done quickly, the outreach reads like it was rushed.
This post describes the system we built to close that gap. It takes a raw LinkedIn job posting, understands what the company is actually looking for, searches our internal knowledge base for the most relevant evidence, and generates a tailored message grounded in real projects, ready for review and sending.
We’ll keep the focus conceptual, the reasoning behind the architecture rather than the implementation details. If you’re building something similar, we hope the thinking here is useful.
The Signal in the Noise
LinkedIn job posts contain far more information than a hiring manager intends to reveal.
Take a posting for a Cloud Infrastructure Lead that lists Kubernetes, Terraform, AWS, and drops a line about “modernizing our legacy on-premises environment.” That single post tells you where the company is in their cloud journey, what they’re struggling with, and roughly what kind of help they need. That’s genuinely useful context, if you can extract it reliably.
The problem is that job posts are written for candidates, not for analysis. They mix must-haves with nice-to-haves, use inconsistent terminology, and vary wildly in how much detail they include. Simple keyword matching misses most of this nuance. You end up either over-filtering (missing real opportunities) or under-filtering (pursuing things that aren’t a fit).
We started from a different angle: treat the job post as a problem description, and treat our knowledge base as evidence that we've solved similar problems. That framing (job post as problem statement, internal knowledge as evidence) drove most of the design decisions that followed.
The System at a Glance
Here is the end-to-end flow before we get into the details. A job post goes in; a personalized, evidence-backed email draft comes out.

Each stage is an independently testable, observable unit. Let's walk through them one by one.
Stage 1: The Relevance Gate
Not every job post is worth pursuing. Before running retrieval and generation, which takes time and compute, we do a quick AI-based relevance check: does this posting fall within the industries and technology domains where we actually have experience?
Filtering early isn’t just about efficiency. A message written for an opportunity that’s genuinely not a fit will always feel thin, regardless of how well the rest of the pipeline works. Getting this check right at the start means everything downstream is working on real opportunities rather than wasted motion.
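The gate itself can be sketched as a small wrapper around one LLM call. Everything below is illustrative: the prompt wording, function names, and the stubbed classifier are assumptions, not the production code.

```python
# Sketch of the relevance gate. The LLM call is abstracted behind `classify`
# so the gate logic itself can be exercised without a model.
RELEVANCE_PROMPT = (
    "You assess whether a job posting falls within our delivery experience.\n"
    "Our industries: {industries}\n"
    "Our technology domains: {domains}\n"
    "Job posting:\n{posting}\n"
    "Answer with exactly RELEVANT or NOT_RELEVANT."
)

def relevance_gate(posting: str, industries: list[str],
                   domains: list[str], classify) -> bool:
    """Return True if the posting is worth running through the full pipeline."""
    prompt = RELEVANCE_PROMPT.format(
        industries=", ".join(industries),
        domains=", ".join(domains),
        posting=posting,
    )
    return classify(prompt).strip().upper() == "RELEVANT"

# Stub standing in for the real LLM call, just for demonstration:
stub = lambda prompt: "RELEVANT" if "Databricks" in prompt else "NOT_RELEVANT"
relevance_gate("Senior Data Engineer, Databricks/Azure",
               ["Manufacturing"], ["Data Platforms"], stub)
```

Keeping the gate as a pure function of its inputs is also what makes it independently testable, which matters once the check sits at the front of a daily batch.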
Stage 2: Understanding What the Job Post Really Wants
A job post is just text. To search against it meaningfully, you need to pull structured information out of it first.
We pass the job post through an LLM with carefully managed prompts that extract a set of structured fields. The goal is to turn what a recruiter wrote in natural language into something the retrieval system can actually work with.

The summary is a short distillation of what the role is really about – this is what goes into semantic search. The technology areas and industry tell us which parts of our knowledge base are even worth looking at. The key technologies are the raw terms that help with keyword-level matching on top of semantic similarity.
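The extracted fields can be sketched as a small typed record that the model is prompted to fill in as JSON. The field names here are assumptions for illustration, not the production schema.

```python
# Illustrative shape of the deconstruction output. The LLM is prompted to
# return JSON matching this structure; parsing fails loudly if the shape drifts.
import json
from dataclasses import dataclass, field

@dataclass
class JobPostAnalysis:
    summary: str                 # short distillation -> the semantic search query
    technology_areas: list[str]  # scopes which parts of the knowledge base to search
    industry: str                # used later for industry-alignment scoring
    key_technologies: list[str] = field(default_factory=list)  # raw terms for keyword search

def parse_analysis(raw_llm_output: str) -> JobPostAnalysis:
    """Parse the model's JSON response into a typed object."""
    data = json.loads(raw_llm_output)
    return JobPostAnalysis(
        summary=data["summary"],
        technology_areas=data["technology_areas"],
        industry=data["industry"],
        key_technologies=data.get("key_technologies", []),
    )
```

Turning the model's free-form output into a typed object this early means every downstream stage works with named fields rather than re-parsing text.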
Prompts are managed centrally in Langfuse rather than buried in the codebase, which means we can update them, version them, and run experiments without a code deployment. We also inject live reference data into the prompts (the current list of technology domains and industries) so the model is working from a shared vocabulary rather than making up its own terms.
Stage 3: Building the Knowledge Base
Good retrieval depends on good source material. We built our knowledge base from three types of internal content:
Case Studies
- Crawled and ingested from our company website, tagged with the technologies and industry vertical each project covers. The most direct proof that we’ve solved a specific type of problem before. They carry the highest weight in scoring when there’s a strong match.
Client Project References
- Structured project data from internal records, covering engagements that may not have a public case study. Adds breadth across projects and technology areas, and is particularly valuable for less common or emerging technology stacks.
Certifications
- Synced automatically from our HR system. Demonstrates team-wide competence, not just project history. Useful signal when a job post asks for vendor-specific expertise. Treated as supplemental evidence rather than primary proof.
Everything in the knowledge base is stored with structured metadata alongside it, not just the text. Which technologies it covers, what industry it’s from, when it was published. This ends up being just as important as the content itself, because the retrieval and scoring steps lean heavily on it to separate genuinely relevant results from ones that are merely similar-sounding.
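A knowledge-base entry might look something like the record below. The field names and source weights are illustrative assumptions; the point is that the metadata travels with the text so retrieval can filter and score on it.

```python
# Illustrative shape of a knowledge-base entry with its structured metadata.
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Evidence:
    doc_id: str
    text: str
    source: str                       # "case_study" | "project_reference" | "certification"
    technology_areas: frozenset[str]  # tags used for scoping and scoring
    industry: str
    published: date                   # feeds the recency signal

# Case studies carry the highest weight; certifications are supplemental.
# These numbers are hypothetical, for illustration only.
SOURCE_WEIGHT = {"case_study": 1.0, "project_reference": 0.8, "certification": 0.5}
```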
Stage 4: Finding the Right Evidence
With a structured understanding of the job post and a metadata-rich knowledge base, we can now retrieve the most relevant evidence. This is where most of the architectural complexity lives.
Why not just do a similarity search?
Semantic search is a good starting point, but it has a real blind spot in this context: it tells you what’s conceptually similar, not what’s actually useful. A broad cloud infrastructure case study and a very specific Kubernetes migration case study might score almost the same on embedding similarity — but if the job post is specifically about Kubernetes, one of them is a much stronger reference than the other. Similarity alone doesn’t capture that difference.
So we built a multi-stage retrieval pipeline that layers several signals on top of each other.

Scoping the search space
Before any search runs, we narrow the candidate pool using metadata from the deconstruction step. A DevOps case study shouldn’t show up in a message about data analytics, even if the wording happens to be vaguely similar. We filter by technology area and industry upfront, rather than hoping the ranking pushes irrelevant results far enough down.
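The pre-filter amounts to a metadata predicate applied before any vector or keyword search runs. A minimal sketch, with illustrative field names:

```python
# Metadata pre-filter: evidence that shares no technology area with the job
# post never enters the search at all.
def scope_candidates(pool, technology_areas, industry=None):
    wanted = set(technology_areas)
    return [
        e for e in pool
        if wanted & set(e["technology_areas"])            # at least one shared area
        and (industry is None or e["industry"] == industry)
    ]

pool = [
    {"doc_id": "cs_devops", "technology_areas": ["DevOps"], "industry": "Retail"},
    {"doc_id": "cs_analytics", "technology_areas": ["Data Analytics"], "industry": "Retail"},
]
scope_candidates(pool, ["Data Analytics"], "Retail")  # only the analytics case study survives
```

In practice this filtering happens inside the vector store's query, but the logic is the same: hard constraints first, ranking second.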
Hybrid search: semantic + keyword
Within that scoped pool, we run two searches in parallel: a dense semantic search on the job summary (good at capturing conceptual overlap) and a sparse keyword search on the raw technology terms (good at catching specific tool names that semantic models sometimes lump together or miss entirely). Both ranked lists are then merged using Reciprocal Rank Fusion, a straightforward technique for combining ranked lists without over-committing to either signal.
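Reciprocal Rank Fusion is simple enough to sketch in full. Each document's fused score is the sum of 1/(k + rank) over every ranked list it appears in; k = 60 is the commonly used constant, though our actual value may differ.

```python
# Reciprocal Rank Fusion: merge ranked lists by summing 1/(k + rank) per list.
def rrf_merge(ranked_lists, k=60):
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results from the two parallel searches:
semantic = ["cs_kubernetes", "cs_cloud_general", "cs_devops"]
keyword  = ["cs_kubernetes", "cs_devops"]
rrf_merge([semantic, keyword])  # "cs_kubernetes" first: it tops both lists
```

Because RRF works purely on ranks, it never has to reconcile a cosine similarity with a BM25 score, which is exactly why it avoids over-committing to either signal.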
Dynamic scoring
The merged results are re-scored using a combination of signals:
- Semantic similarity – How closely does the evidence content match the job post semantically?
- Primary technology match – Does this piece of evidence cover the specific technology areas the job post prioritizes?
- Industry alignment – Is the evidence from a project in the same industry vertical as the target company?
- Recency – More recent case studies and projects are weighted slightly higher, reflecting current capabilities.
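The four signals above combine into a single score. The weights and the decay rate below are illustrative assumptions, not our production values; the shape of the computation is what matters.

```python
# Weighted re-scoring sketch. Weights are hypothetical, not production values.
def rescore(similarity, tech_overlap, industry_match, age_years):
    recency = max(0.0, 1.0 - 0.1 * age_years)   # gentle decay per year of age
    return (0.50 * similarity                    # semantic similarity
            + 0.25 * tech_overlap                # share of the post's tech areas covered
            + 0.15 * (1.0 if industry_match else 0.0)
            + 0.10 * recency)
```

Tuning these weights against ground-truth examples is where much of the retrieval work went, which is why the scoring lives in one small, testable function rather than being spread across the pipeline.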
Diversity reranking
High scores alone can leave you with a redundant set: five case studies that all describe essentially the same type of project, just worded differently. We add a diversity step that penalizes evidence too similar to something already selected, so the final set covers the job post’s requirements from different angles rather than repeating the same point.
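A maximal-marginal-relevance style selection captures the idea; the post doesn't name the exact algorithm we use, so treat this as one plausible mechanism rather than the implementation. `sim` measures evidence-to-evidence similarity, and `lam` trades relevance against redundancy.

```python
# MMR-style diversity pass: greedily pick the candidate with the best
# trade-off between its own score and similarity to what's already selected.
def diversify(candidates, scores, sim, top_n=5, lam=0.7):
    selected = []
    pool = list(candidates)
    while pool and len(selected) < top_n:
        def mmr(c):
            redundancy = max((sim(c, s) for s in selected), default=0.0)
            return lam * scores[c] - (1 - lam) * redundancy
        best = max(pool, key=mmr)
        selected.append(best)
        pool.remove(best)
    return selected
```

With two near-duplicate top scorers, the pass picks one of them and then prefers a lower-scoring but different document over the duplicate, which is exactly the behavior we want from the final evidence set.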
Why all these stages?
Each stage catches something the others miss. Keyword search picks up specific tool names that semantic search glosses over. Dynamic scoring adds context that raw similarity scores ignore. Diversity reranking stops the same project from dominating the results. None of these is enough on its own – the value is in running them together.
Stage 5: Generating the Email
By the time we get here, the interesting work is done. The job post has been understood, the relevant evidence has been found and ranked, and the most useful proof points are sitting in a structured set. Generation is deliberately the simplest part of the pipeline — the model’s job is to write a clear, professional message around the evidence, not to invent anything.

The prompt gives the model the job post summary, the company’s industry and technology context, and the curated evidence. It’s told to reference specific projects and certifications rather than make generic claims, and to frame everything around the company’s stated challenge rather than our own services.
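Assembling that prompt is deliberately mechanical. The template text below is illustrative, but the structure mirrors what the paragraph describes: summary, industry context, and curated evidence with traceable identifiers.

```python
# Sketch of generation-prompt assembly. The wording is illustrative; the real
# template lives in Langfuse, not in code.
def build_generation_prompt(summary, industry, evidence):
    bullets = "\n".join(f"- [{e['doc_id']}] {e['text']}" for e in evidence)
    return (
        "Write a short, professional outreach email.\n"
        f"The company's challenge (from their job post): {summary}\n"
        f"Their industry: {industry}\n"
        "Reference ONLY the evidence below, and cite the bracketed IDs so the "
        "reviewer can trace every claim:\n"
        f"{bullets}\n"
        "Frame the message around their challenge, not our services."
    )
```

Carrying the evidence IDs through into the draft is what lets a reviewer see exactly which project backs each claim.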
What comes out is a message grounded in something real: actual projects, specific technologies, verifiable outcomes. The person reviewing it can see exactly what evidence each point draws from, which makes review fast and gives them confidence in what they’re sending.
Grounding over generation
We deliberately constrain the LLM to draw only on the retrieved evidence. The value of the message is in its specificity, and specificity requires grounding in real facts. A model left to generate freely produces fluent but unverifiable claims – exactly the kind of message that gets ignored.
The Automated Pipeline
Everything above can also be triggered on demand through an API, which is handy for testing or one-off assessments. But the main use case runs on a schedule: every morning, the system pulls the previous day’s job posts from our CRM, runs them through the pipeline, and queues up message drafts for the team.

Tasks are distributed across team members based on current capacity, so the work spreads evenly rather than piling into one inbox. By the time the team sits down in the morning, the batch is done, relevant opportunities flagged, irrelevant ones filtered out, drafts ready to review and send. What used to take 30 to 45 minutes per opportunity – reading the post, searching for relevant projects, drafting and editing – now takes under 5 minutes for review and send. The research is done before anyone opens their laptop.
Making It Production-Ready
Getting something working in a controlled environment is one thing. Keeping it working reliably when real data comes in every day is another.
Observability throughout
Every LLM call is traced end-to-end in Langfuse, which gives us full visibility into prompt inputs, model outputs, token usage, and latency. On the infrastructure side, we run OpenTelemetry with Grafana, Loki, and Prometheus for metrics, logs, and distributed tracing across the entire application. When something goes wrong — and in production AI systems, things go wrong — we can diagnose it quickly without guesswork.
Evaluation as a CI/CD gate
Prompt changes and retrieval logic changes can silently degrade output quality. A model might produce fluent text that’s no longer accurate or relevant. We treat evaluation as a first-class engineering practice: a suite of tests runs against ground-truth datasets on every pull request, and a minimum average score threshold must be met before a merge is allowed. Regressions are caught before they reach production.
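The gate itself reduces to a small check that CI can run on every pull request. The threshold value here is a placeholder assumption; the real one is calibrated against our ground-truth datasets.

```python
# CI evaluation gate: block the merge if the average eval score falls below
# the threshold. The 0.8 default is a hypothetical value for illustration.
def evaluation_gate(scores, threshold=0.8):
    """Return (passed, average) for a batch of evaluation scores."""
    average = sum(scores) / len(scores)
    return average >= threshold, average
```

Wired into the pipeline as a required status check, this turns "the prompt change felt fine" into a measurable pass/fail decision before anything reaches production.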
LLM-as-a-judge evaluation
For subjective output quality, like whether a retrieved case study is truly relevant to a given job post, we use a separate LLM call as an automated judge. This lets us evaluate thousands of retrieval decisions without manual review, while still maintaining a principled quality signal tied to a numeric score.
Provider flexibility
We built the system to be agnostic to the underlying LLM provider. Switching between a locally hosted open-source model and a cloud-hosted commercial model requires a configuration change, not a code change. This matters for cost management, data privacy requirements, and resilience.
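In spirit, the switch looks like the sketch below: provider details live in configuration keyed by an environment variable, so swapping models touches no application code. The provider names, endpoints, and models here are illustrative assumptions (a routing library such as LiteLLM could fill the same role).

```python
# Configuration-driven provider selection; illustrative values only.
import os

PROVIDERS = {
    "cloud": {"base_url": "https://api.openai.com/v1", "model": "gpt-4o"},
    "local": {"base_url": "http://localhost:11434/v1", "model": "llama3"},
}

def resolve_provider():
    """Pick the LLM provider from the environment; switching is a config change."""
    name = os.environ.get("LLM_PROVIDER", "local")
    return name, PROVIDERS[name]
```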
What We Learned
Building this system taught us things that apply well beyond the specific problem of sales outreach automation. A few that stand out:
Structure is more valuable than fluency
The most impactful investment wasn’t in the message generation prompt – it was in the job post deconstruction and evidence structuring. A well-structured job analysis and a well-tagged knowledge base do most of the work. The LLM’s job is just to write clearly around the structure you give it.
Retrieval quality determines email quality
The generation step can’t compensate for poor retrieval. If the evidence set contains tangentially relevant or redundant items, the output will reflect that. We spent more time tuning the retrieval and scoring pipeline than any other part of the system, and it paid off proportionally.
Human review still matters, at every level
The system produces drafts, not finished messages. Every one gets reviewed by a person before it goes out, and that’s intentional. The sender should understand what they’re signing off on. Automation does the research; people still own the relationship.
But this principle runs deeper than the final output. When you change a prompt, adjust a scoring weight, or modify retrieval logic, automated metrics will tell you whether aggregate scores improved or dropped – and that matters. But it doesn’t tell you the full story. Actually reading through individual results – what did the system pull, does this evidence set make sense for this particular job post, does the message feel right – gives you a kind of understanding that a dashboard score simply can’t.
The same goes for knowing your data. Spending time reading actual job posts, case studies, and generated messages – not skimming, but paying attention – surfaces problems that evals miss. You start to recognize where the data is patchy, where the model is confidently wrong, and where the pipeline is doing something quietly clever that you didn’t plan for. That familiarity is hard to quantify, but it’s what makes you good at improving the system when something goes wrong. Metrics are a tool, not a substitute for knowing what you built.
Closing Thoughts
What started as a narrow problem – reducing the time it takes to act on a single signal – turned into a broader exercise in building reliable, observable, evidence-driven AI applications. The architecture we built isn’t novel in any single component; hybrid retrieval, multi-stage reranking, and prompt management are all well-understood techniques. What matters is how they’re assembled, and whether the system behaves predictably in production.
The result is a pipeline that a small team uses daily to handle a volume of work that would have required a much larger team doing it manually. The outputs it produces are specific, credible, and grounded – because they have to be backed by something real.
If you’re building systems that connect organizational signals to action, we’d be glad to compare notes.