Your Vendor Scorecard Is Grading the Wrong Company

A traditional scorecard measures the supplier. Your AI risk lives in the model — and in the accountability you were never allowed to outsource.

Jun 30, 2026

On 18 November 2025, the three European Supervisory Authorities did something they had never done before. They published a list of the companies the financial system cannot afford to lose. The press release is dry: the ESAs “publish today the list of designated critical ICT third-party providers (CTPPs) under the Digital Operational Resilience Act.” Read past the acronyms and it is remarkable: a banking regulator, an insurance regulator and a markets regulator agreed, jointly, on the firms whose failure would threaten European finance. None of them are banks. They are the cloud, data and infrastructure providers everyone else quietly runs on.

The criteria are worth holding onto, because a vendor scorecard never asks about them. The assessment weighed each provider’s “systemic importance, its role in supporting critical or important functions for financial entities, and the level of substitutability of its services.” Not revenue. Not certifications. Substitutability — how badly you are stuck if it changes or fails.

DORA had applied since January 2025; November is when it bit, the moment third-party concentration stopped being a slide in a risk deck and became a named supervisory object. Earlier this month the Financial Stability Board put the same anxiety into a global frame, consulting on twelve “sound practices” for AI adoption — a report it says draws on “financial institutions and their technology vendors.” The regulators have noticed where the dependency now lives. Most procurement functions have not caught up.

The consensus is sharpening the wrong instrument

Every third-party briefing this quarter is asking the same question: how do we get AI vendors into our vendor risk management process? Tighten the scorecard. Add a row for “AI.” Bolt a model-risk annex onto the supplier questionnaire and move on.

That instinct treats the scorecard as incomplete. It is not incomplete. It is pointed at the wrong entity. A traditional vendor scorecard was built to answer one question: will this supplier still be standing next year? It answers that question well. Financial health, uptime, certifications, support responsiveness. But AI did not add a row to that question. It moved the risk to two places the scorecard has no column for: the model, which you cannot see into, and your own accountability, which you were never permitted to hand over. You cannot patch a measurement problem by measuring the same thing harder.

Here is the thesis, and it is meant to land somewhere uncomfortable. A vendor scorecard tells you whether the company will survive. Your regulator is measuring whether you will — and those are not the same audit. You can buy the capability. You cannot sell the accountability.

What follows:

- Why AI relocates risk from the supplier to the model — and why “the vendor is certified” stops being an answer

- The two metrics your selection process is missing: opacity and concentration

- The two clauses that do the work a scorecard cannot: on-demand audit, and flow-down to the model behind your model

- The two operating metrics that matter after signature: compensatory testing in your context, and time-to-substitute

- A seventy-two-hour diagnostic on your last three AI purchases that names which company your scorecard was really grading

The framework: score the model, write the access, measure the dependency

Three stages, following the life of a vendor relationship: selection, contract, operation. Each carries two metrics the traditional scorecard omits — and the omission is not carelessness but an artefact of the question the old scorecard was built to answer.

The contradiction underneath rewards naming. You want the vendor’s capability (that is why you are buying rather than building), but you cannot accept the vendor’s opacity, because your regulator holds you to account regardless. Outsource the build; insource the verification and the accountability. That separation is the whole design, and every missing metric falls straight out of it.

Stage One — Selection: score the model, not the company

1. Measure opacity, not just assurance

The universal pattern. Procurement rewards what it can inspect — certifications, audited accounts, reference calls. The trouble with AI is that its core risk sits in the part you cannot inspect: the training data, the weights, the behaviour at the edges. A scorecard that only credits the visible is structurally blind to what matters most.

The regulated-industries manifestation. The Singapore MAS information paper on AI Model Risk Management names this directly. Third-party AI, it observes, brings “the lack of transparency” as “a key challenge,” because providers “may be reluctant to disclose proprietary information about their training data or algorithms, hindering banks’ efforts in risk assessment and ongoing monitoring.” The Bank of England and FCA’s 2024 survey of AI in UK financial services found the predictable consequence: 46% of firms reported only “partial understanding” of the AI technologies they use, the gap concentrated in the third-party models they bought rather than built.

Your translation. Opacity becomes a scored dimension in selection, not a footnote discovered later. The RFP asks for training-data provenance, evaluation access, model-card depth and change-notification policy — and a vendor that will not answer scores worse than one that answers badly, because a known weakness is governable and an unknown one is not. Procurement, model risk and architecture own it together, not in sequence.

What to build by Friday. A model-transparency disclosure section appended to your AI RFP template; six to eight questions a vendor answers in writing before the commercial conversation starts.
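The scoring rule in this section can be sketched in a few lines. Treat everything specific here as an assumption for illustration: the question keys, the 0 to 3 grading scale and the refusal penalty of -1 are invented, not taken from MAS or any regulator. The only property carried over from the text is that a vendor who will not answer ranks below one who answers badly.

```python
from typing import Optional

# Hypothetical disclosure questions; replace with your own RFP section.
QUESTIONS = [
    "training_data_provenance",
    "evaluation_access",
    "model_card_depth",
    "change_notification_policy",
]

# Assumption: declining to answer ranks below the weakest possible answer (0).
REFUSAL_SCORE = -1


def opacity_score(answers: dict[str, Optional[int]]) -> int:
    """Sum per-question grades; None or a missing key means the vendor declined.

    Answered questions are graded by the reviewer from 0 (weak) to 3 (strong).
    """
    total = 0
    for question in QUESTIONS:
        grade = answers.get(question)
        if grade is None:
            total += REFUSAL_SCORE
        else:
            if not 0 <= grade <= 3:
                raise ValueError(f"grade for {question} must be 0..3, got {grade}")
            total += grade
    return total


answers_badly = {q: 0 for q in QUESTIONS}       # answers everything, weakly
will_not_answer = {q: None for q in QUESTIONS}  # declines every question
```

With these placeholder inputs, `opacity_score(answers_badly)` is higher than `opacity_score(will_not_answer)`, which is the whole point: a known weakness lands on the scorecard as a number someone can govern, and silence is penalised rather than left blank.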

2. Map the concentration, not just the contract

The universal pattern. Resilience engineering measures shared dependencies, not the quality of individual parts. Ten excellent components that all rest on one power supply is not a robust system; it is one failure wearing ten green lights.

The regulated-industries manifestation. This is the risk DORA’s designation was built to surface, with substitutability as a named criterion. The European Banking Authority’s outsourcing guidelines require firms to weigh “concentration risks at the sector level, e.g. where multiple institutions or payment institutions make use of a single service provider or a small group of service providers.” And it is already measurable: the Bank of England and FCA survey found the top three providers account for **73%, 44% and 33%** of all reported cloud, model and data providers respectively, while a third of AI use cases are now third-party implementations, up from 17% two years earlier.

Your translation. Concentration is mapped across the AI estate, not assessed one contract at a time — because a per-contract review structurally cannot see a cross-portfolio risk. Architecture owns the map; risk reads it. The question it answers: how many of our critical use cases trace back to the same foundation model, cloud region, or data provider.

What to build by Friday. A one-page concentration map. List your live AI use cases; behind each, name the underlying model and cloud region; count the use cases sharing a single model. That number is your finding.
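A minimal sketch of that count, assuming the map is kept as a simple list of rows. The use cases, model names and regions below are placeholders, not real products; the helper name `concentration` is also invented for illustration.

```python
from collections import Counter

# Hypothetical inventory: (use case, underlying model, cloud region).
use_cases = [
    ("fraud triage", "model-a", "eu-west"),
    ("complaints summarisation", "model-a", "eu-west"),
    ("KYC document extraction", "model-a", "eu-central"),
    ("credit memo drafting", "model-b", "eu-west"),
]


def concentration(rows, column):
    """Count use cases per value of the chosen column (1 = model, 2 = region)."""
    return Counter(row[column] for row in rows)


by_model = concentration(use_cases, 1)
top_model, shared_count = by_model.most_common(1)[0]
share = shared_count / len(use_cases)
print(f"{shared_count} of {len(use_cases)} use cases ({share:.0%}) depend on {top_model}")
```

The same helper run on column 2 gives the cloud-region view. A per-contract review would show four healthy vendors here; the count shows three of them resting on one model.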

Stage Two — Contract: write the access you cannot see into the agreement

3. The audit-and-notification clause

The universal pattern. When you cannot inspect something continuously, you contract for the right to inspect it on demand, and to be told when it changes. Every discipline that depends on something it cannot watch every second — aviation maintenance, food safety, reinsurance — writes the right to look, and the duty to disclose, into the agreement.

The regulated-industries manifestation. MAS sets this out almost as a checklist. Legal agreements with third-party AI providers, the AI MRM paper advises, should include clauses on “performance guarantees, data protection, the right to audit, and notification when AI is introduced (or not incorporating AI without the bank’s agreement) in existing third-party providers’ solutions.” That last clause is the quietly radical one — the right to be told when a vendor *adds* AI to a product you already bought without it. A traditional scorecard has no mechanism for a supplier silently becoming an AI vendor between renewals.

Your translation. These clauses become a standard addendum, drafted once and attached to every AI procurement, not renegotiated deal by deal — because a right you must argue for each time is a right you will eventually trade away under deadline. Legal, procurement and model risk own the template jointly.

What to build by Friday. A one-page AI contract addendum naming four non-negotiable clauses: right to audit, performance guarantees, data-protection terms, and change-notification — including when AI is newly introduced into an existing service.

4. The flow-down clause: reach the model behind your model

The universal pattern. A guarantee that stops at your direct supplier is not a guarantee; it is a handoff. The risk lives one layer down, in their suppliers, and obligations that do not pass downstream leak out the bottom of the chain.

The regulated-industries manifestation. The EU AI Act writes this into law. Under Article 25, the provider of a high-risk AI system and a third party supplying a component “shall, by written agreement, specify the necessary information, capabilities, technical access and other assistance” needed for compliance. The EBA guidelines make the same demand of sub-outsourcing: a subcontractor must grant “the same contractual rights of access and audit” as the primary provider. Translated: your right to audit is worthless if it stops at the reseller and never reaches the foundation-model provider doing the actual work.

Your translation. The contract requires your vendor to pass audit and notification rights through to the model provider behind them — and you first establish which of your vendors are, in truth, reselling someone else’s model. Legal and architecture own this together, because you cannot write a flow-down clause for a dependency you have not mapped.

What to build by Friday. A flow-down clause for your addendum, plus a short list of which current AI vendors are intermediaries for a larger model provider. The list is usually longer than procurement expects.

Stage Three — Operation: measure the dependency, not just the performance

5. Compensatory testing in your own context

The universal pattern. You test the bridge under your traffic, not the manufacturer’s. A vendor’s benchmark describes the vendor’s conditions; your risk lives in yours — your data, your customers, your edge cases — and the gap between the two is where the unpleasant surprises wait.

The regulated-industries manifestation. MAS calls this compensatory testing: “conducting rigorous testing of third-party AI models using various datasets and scenarios to verify the model’s robustness and stability in the bank’s context, and to detect potential biases.” The phrase “in the bank’s context” carries the whole point. A vendor’s 94% accuracy on their evaluation set tells you almost nothing about behaviour on the population you actually serve.

Your translation. You own an evaluation harness, run it on your data on a cadence — and you treat the vendor’s dashboard as a marketing artefact, not an assurance one. Model risk and engineering own the harness; its output is a drift metric you control, not one the vendor reports to you.

What to build by Friday. A context-specific test set for your highest-materiality third-party model, and one drift metric you own — measured by you, on your data, on a fixed interval.
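One possible shape for a drift metric you own is sketched below: accuracy on a fixed, labelled test set drawn from your own population, compared against a baseline run. This is an assumption about what the metric could be, not a method MAS prescribes; the labels, the two runs and the 0.10 tolerance are all invented for illustration.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the labels from your own test set."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and the same length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def drift(baseline_accuracy, current_accuracy):
    """Positive values mean the model got worse on your data since the baseline."""
    return baseline_accuracy - current_accuracy


# Hypothetical labels and two runs of the vendor model on the same fixed test set.
labels = ["approve", "refer", "decline", "refer", "approve", "decline", "refer", "approve"]
baseline_run = ["approve", "refer", "decline", "refer", "approve", "decline", "approve", "approve"]
current_run = ["approve", "refer", "refer", "refer", "approve", "approve", "approve", "approve"]

TOLERANCE = 0.10  # assumption: the drop your model-risk function has agreed to tolerate

baseline_acc = accuracy(baseline_run, labels)
current_acc = accuracy(current_run, labels)
breach = drift(baseline_acc, current_acc) > TOLERANCE
```

Accuracy is the simplest candidate; a real harness would likely add per-segment breakdowns and a bias check. What matters for the argument is that both runs use your labels on your schedule, so the number cannot be restated by the vendor.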

6. Substitutability: measure time-to-exit

The universal pattern. Resilience is measured by recovery, not by uptime. For any dependency, the metric that matters is not how well it performs on a good day but how quickly you can replace it on a bad one. A supplier you cannot leave is not a supplier; it is a single point of failure with an invoice attached.

The regulated-industries manifestation. This is the criterion the European regulators placed at the centre of the DORA designation — substitutability — and the contingency MAS asks firms to plan for: “developing robust contingency plans to address potential failures, unexpected behaviour of third-party AI, or discontinuing of support by vendors.” The accountability spine sits underneath it, and three regulators say it in nearly identical words. The HKMA: outsourcing transfers “day-to-day managerial responsibility, but not accountability.” The EBA: “the outsourcing of functions cannot result in the delegation of the management body’s responsibilities.” MAS, on outsourcing, expects an arrangement “managed as if the services were still managed by the Bank.” If the model fails and you cannot exit, the regulator does not turn to your vendor. It turns to you.

Your translation. Time-to-substitute becomes a tracked metric for high-materiality use cases — the measured cost, in days, of moving off the current model. Architecture and the product owner own it. A dependency you have never rehearsed leaving is one you do not actually understand.

What to build by Friday. A vendor-exit runbook for your single most critical AI use case, ending in one number: the realistic time-to-substitute. The first time you measure it, it will be longer than anyone guessed.
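To show how a runbook can end in one number, here is a hedged sketch that treats the runbook as steps with durations and prerequisites and takes the longest chain as the time-to-substitute. The step names and durations are invented, and the approach assumes the runbook has no circular dependencies and that independent steps really can run in parallel.

```python
from functools import lru_cache

# Hypothetical exit runbook: step -> (duration in working days, prerequisite steps).
RUNBOOK = {
    "select replacement model": (10, []),
    "contract and security review": (30, ["select replacement model"]),
    "rebuild prompts and integration": (15, ["select replacement model"]),
    "re-run evaluation on own data": (10, ["rebuild prompts and integration"]),
    "model-risk sign-off": (10, ["contract and security review", "re-run evaluation on own data"]),
    "cutover": (5, ["model-risk sign-off"]),
}


@lru_cache(maxsize=None)
def finish_day(step: str) -> int:
    """Earliest day a step can finish: its own duration after its slowest prerequisite."""
    duration, prerequisites = RUNBOOK[step]
    return duration + max((finish_day(p) for p in prerequisites), default=0)


# The longest dependency chain sets the realistic exit time.
time_to_substitute = max(finish_day(step) for step in RUNBOOK)
print(f"time-to-substitute: {time_to_substitute} working days")
```

In this made-up runbook the engineering work is not the bottleneck; the contract and security review is. That is a common reason the first measured number is longer than the guess.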

None of this is unique to finance. A pharmaceutical company validating a vendor’s model for adverse-event triage, an energy operator buying a model for load-balancing, a hospital deploying a supplier’s ambient documentation — each carries an accountability no contract can transfer, under a different regulator using the same logic. Regulated intelligence is wider than financial services, and the scorecard problem is identical across it.

A note on confidence

The reader deserves the line between what is documented and what is argued. Confirmed, and linked below: the DORA designation and its substitutability criterion; the FSB consultation and its dates; the EBA guidelines on non-delegation and sector concentration; the HKMA and MAS accountability language; the MAS AI Model Risk Management paper’s third-party section; the Bank of England and FCA statistics. Strong inference, from repeated observation rather than measurement: that most firms’ AI vendor scorecards are inherited wholesale from SaaS and IT procurement, carrying no column for model opacity, concentration, or time-to-exit. I have seen this across jurisdictions; I have not seen it counted at scale, and will not dress a pattern as a statistic. Design hypothesis, for argument: that time-to-substitute is the highest-leverage metric to add, and that six across three stages is the right granularity — fewer misses the model, more rebuilds the bureaucracy you were escaping. Disagree with that last one.

What the column count tells you

A diagnostic that takes ten minutes and tends to settle the argument. Pull the vendor scorecard your team used on its last three AI purchases and sort its columns into two piles: those describing the vendor — revenue, headcount, certifications, SLA, support tier — and those describing the model — evaluation results on your own data, training-data provenance, audit rights actually exercised, measured time-to-substitute. The first pile is almost always full, the second almost always empty. That lopsidedness is not a gap in diligence. It is a portrait of which company you were really grading — rarely the one whose model is now making decisions about your customers.
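If the scorecard lives in a spreadsheet, the two-pile sort can be roughed out with a keyword match. This is a crude sketch under stated assumptions: the keyword lists and the sample column names are invented, and anything the keywords miss falls into an unclassified pile for a human to place.

```python
# Hypothetical keyword lists; extend them to match your own scorecard's wording.
VENDOR_TERMS = ("revenue", "headcount", "certification", "sla", "support")
MODEL_TERMS = ("evaluation", "provenance", "audit right", "time-to-substitute")


def sort_columns(columns):
    """Split scorecard column names into vendor, model and unclassified piles."""
    piles = {"vendor": [], "model": [], "unclassified": []}
    for column in columns:
        name = column.lower()
        if any(term in name for term in MODEL_TERMS):
            piles["model"].append(column)
        elif any(term in name for term in VENDOR_TERMS):
            piles["vendor"].append(column)
        else:
            piles["unclassified"].append(column)
    return piles


# Hypothetical scorecard header row from a past AI purchase.
scorecard = ["Annual revenue", "SOC 2 certification", "SLA uptime", "Support tier", "Reference calls"]
piles = sort_columns(scorecard)
print({pile: len(names) for pile, names in piles.items()})
```

For this sample header row the model pile comes back empty, which is the lopsidedness the diagnostic is looking for.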

What the kit does

Knowing the scorecard is pointed at the wrong entity is the easy part. Rebuilding it — turning “score the model” into six questions your procurement and risk teams can run on Monday — is the work, and it is exactly the structured, repetitive work a well-built prompt does fast. This week’s Prompt Kit is four sequenced prompts: the first audits your existing scorecard for the missing columns, the second builds the model-transparency disclosure for your RFP, the third drafts the contract addendum and flow-down clause, the fourth produces the concentration map and time-to-substitute template. It is this essay’s framework rendered as something you run against your real vendor list, not admire as a diagram.

Run this in seventy-two hours

Take your three most important AI vendors. For each, answer four questions without calling the vendor. Have we tested this model on our own data, or are we trusting their benchmark? Do we hold a written, exercisable right to audit it and to be told when it changes? If they retired it next quarter, how many days to substitute? And how many of our other critical use cases sit on the same underlying model? Most readers reach the recognition by the second vendor: the scorecard answered none of these. It certified the vendor was solvent and SOC 2 compliant — which is to say it graded a company that was never the source of the risk. That accountability, every regulator above agrees, did not move when the contract was signed. It is still sitting on your side of the table.

If you are rebuilding AI vendor governance — and want a structured outside read on your selection metrics, your contract addendum, or the concentration map of your AI estate — mention it in your reply. I take one such conversation per month. Specific beats general: name the vendor or use case that is keeping you up, and we will start there.

What the Prompt Kit does

This week’s Prompt Kit 6 is four prompts to rebuild your AI vendor scorecard around the metrics that actually de-risk you.

Replies to this post reach me directly. I read all of them.
