California Management Review
California Management Review is a premier professional management journal for practitioners published at UC Berkeley Haas School of Business.
Kevin Schmitt and Ivo Blohm
While Generative Artificial Intelligence (GenAI) Proof of Concepts (POCs) may show promise, transitioning them from controlled environments to enterprise-wide deployment presents new challenges. Issues such as hallucinations, the inscrutability of training data, and GenAI’s high opacity hinder scaling. However, leading companies are overcoming these barriers by following four key rules: grounding GenAI outputs in verifiable data, measuring performance with contextual metrics, adapting agile processes to be more exploratory, and teaming up across functions to sustain scaling momentum. Instead of resisting GenAI’s probabilistic nature, business leaders must learn to manage it, turning a black box into a structured, value-generating corporate capability.
Rebecka C. Ångström et al., “Getting AI Implementation Right: Insights from a Global Survey,” California Management Review 66, no. 1 (Fall 2023): 5–22.
In September 2023, Amazon set out to transform Alexa from a simple command-based interface into a GenAI-powered virtual assistant.1 However, as the rollout expanded around August 2024, unexpected issues emerged. Users reported bizarre and inaccurate responses, known as “hallucinations.” Alexa, for example, falsely claimed that a research facility in Alaska generated the northern lights and overstated UK National Health Service waiting lists by approximately one million people.2 Such errors underscore the challenge of ensuring GenAI provides accurate and reliable information. It took Amazon until February 2025 to develop the necessary breakthroughs to address these challenges and successfully scale its GenAI-powered Alexa.3
Amazon’s experience is hardly unique. Scaling a GenAI pilot into an enterprise-wide capability is far more complex than proving a concept works. It is not just about expansion; it is about sustaining and compounding value as complexity increases. While companies are pouring huge investments into GenAI, few manage to translate early successes into lasting impact.
So why does scaling GenAI so often fail? And more importantly, what can business leaders do differently?
Most organizations assume that scaling GenAI is primarily a technical challenge, one that can be solved by adding computing power or deploying more advanced GenAI models. But in reality, scaling fails not due to technical limitations, but because companies overlook GenAI’s dark side:
Versatility: GenAI can generate text, code, images, video, and music. Its flexibility makes it powerful but difficult to control.4 Small input variations can produce drastically different outputs, complicating reliable business use. GenAI is also prone to producing false information and remains vulnerable to manipulation. A ChatGPT-powered Chevrolet dealership chatbot, for example, mistakenly offered an $81,000 Tahoe for just one dollar.5
Inscrutability of Training Data: GenAI learns from vast datasets sourced from the internet, books, and other materials. Yet organizations have little visibility into what their GenAI learns.6 This lack of transparency increases risks of bias, misinformation, and copyright violations. The New York Times lawsuit against OpenAI and Microsoft over unauthorized use of content highlights the growing legal and ethical concerns surrounding GenAI’s training data.7
High Opacity: GenAI’s inner workings remain very opaque, making it difficult to trace how outputs are generated from inputs. This lack of explainability erodes trust, particularly in high-risk industries like medicine, finance, and law, where decision transparency is critical.8 Without clear interpretability, companies risk deploying GenAI applications they cannot fully understand and control.
GenAI’s dark side traces back to a single challenge: its probabilistic nature. Organizations, however, dislike probabilistic systems; they prefer controllable systems that can be easily audited and optimized. While companies may tolerate some unpredictability during experimentation in the name of innovation, GenAI’s strengths quickly become liabilities when proofs of concept must scale.
Over the last three years (2022–2025), we conducted 87 interviews and followed 23 companies in Switzerland, including retail banks, investment banks, insurers, service providers for the financial industry, energy and utility firms, and fashion companies. Among our case companies, the successful scalers do not try to eliminate GenAI’s probabilistic nature; they learn to manage it. They follow four key rules to turn GenAI from an unpredictable wildcard into a disciplined, business value-generating corporate capability.
GenAI’s power lies in its ability to generate open-ended responses across a wide range of tasks. Left unchecked, that power can produce unreliable or even misleading information, eroding trust and effectiveness. The solution? Ground GenAI in the realities of use cases. Companies that successfully scale GenAI establish a structured, reliable reference point, whether a curated database, verified company policies, or expert-reviewed knowledge.
While grounding is critical for both GenAI Products and Customized GenAI Solutions (see Figure 1), the approach differs. In GenAI Products, grounding ensures outputs are validated against an external source of truth. In Customized GenAI Solutions, the ground truth is embedded directly into the model itself via fine-tuning (see Figure 1). Fine-tuning enhances an existing GenAI model by training it on domain-specific data, improving accuracy and relevance for a targeted use case.
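The external-source-of-truth variant of grounding can be sketched in a few lines. This is a minimal illustration, not a production retrieval pipeline: the knowledge base, the word-overlap scoring rule, and the threshold are all hypothetical assumptions standing in for curated databases and embedding-based retrieval.

```python
# Minimal sketch of grounding a GenAI answer in a curated reference source.
# KNOWLEDGE_BASE, the overlap score, and the threshold are illustrative
# assumptions, not a real retrieval system.

KNOWLEDGE_BASE = {
    "refund_policy": "Refunds are issued within 14 days of purchase.",
    "support_hours": "Customer support is available weekdays 8:00-18:00.",
}

def retrieve(query: str) -> tuple[str, float]:
    """Return the best-matching reference text and a word-overlap score."""
    q_words = set(query.lower().split())
    best_text, best_score = "", 0.0
    for text in KNOWLEDGE_BASE.values():
        t_words = set(text.lower().rstrip(".").split())
        score = len(q_words & t_words) / max(len(q_words), 1)
        if score > best_score:
            best_text, best_score = text, score
    return best_text, best_score

def grounded_answer(query: str, threshold: float = 0.2) -> str:
    """Answer only when a curated source supports the query; otherwise defer."""
    text, score = retrieve(query)
    if score >= threshold:
        return text  # answer is backed by a verifiable company source
    return "I cannot answer this from verified company sources."
```

The key design choice is the refusal path: when no verified source supports the query, the system declines rather than improvising, which is precisely what unchecked open-ended generation fails to do.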
For example, Swisscoding and IBM developed a Customized GenAI Solution to automate medical coding. Their goal: translate hospital records into standardized ICD billing codes, a painstaking process that typically takes 25 minutes per case. The biggest hurdle? Data access and quality. Hospitals hesitated to share patient records. Also, many medical documents were incomplete. Of the approximately one million records sourced, only 60,000 met the quality standard required for fine-tuning. But the effort paid off. Once deployed, the GenAI application slashed processing time by 99% — “That means you go down from twenty-five minutes to two and a half seconds,” the CEO noted.

Figure 1. Generative AI Products and Customized Solutions
Traditional IT systems operate on clear pass/fail rules: either they work as expected, or they do not. GenAI, however, does not follow these rules. Its outputs are probabilistic, meaning performance is not always black and white. A response might be acceptable, but not truly optimal.
Companies that successfully scale GenAI move beyond exhaustive test cases and adopt a three-pronged approach to performance measurement:
Measuring GenAI performance is not about pass/fail tests, it is about continuous learning and improvement. That is exactly how a leading Swiss bank and Paretolabs approached their GenAI-powered customer service email assistant. Instead of relying on rigid test cases, they put employees in the loop. Customer service agents rated GenAI-generated responses on a 5-star scale, creating a continuous feedback loop that enabled rapid improvements.
To further refine accuracy, they moved beyond traditional benchmarks, using semantic similarity metrics to evaluate whether GenAI-generated responses captured meaning rather than merely following rules. The impact? Over time, the system became more aligned with business needs, and employees transitioned from interacting with the bot to overseeing it. As one bank manager put it: “Every front- and back-office employee will soon be working with bots—not just using them but managing them.”
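A crude stand-in for the semantic-similarity evaluation described above is cosine similarity over word-count vectors: it rewards responses that capture the reference answer's meaning even when the wording differs. Real deployments would use sentence embeddings; the sample texts here are invented for illustration.

```python
# Illustrative semantic-similarity metric: cosine similarity over
# bag-of-words counts. A production system would use sentence embeddings;
# the reference/generated texts below are made-up examples.
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between two texts' word-count vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

reference = "your refund will arrive within five business days"
generated = "the refund should arrive in five business days"
score = cosine_similarity(reference, generated)  # high overlap in meaning-bearing words
```

Unlike a pass/fail test, the score is continuous, so it can be tracked over time and combined with the 5-star employee ratings to show whether the system is actually improving.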
Agile methodologies, like the Scaled Agile Framework (SAFe), work well in structured environments where each iteration reduces uncertainty. But when scaling GenAI, that playbook breaks down. Unlike traditional IT, which follows mostly predictable scaling patterns, GenAI operates in an open-ended problem space. Its outputs are probabilistic, its behavior evolves, and new edge cases constantly emerge. Thus, companies must learn to navigate ongoing high levels of uncertainty (see Table 2).
When one major Swiss bank and Paretolabs set out to integrate GenAI into their agile workflows, they quickly realized that traditional IT scaling principles did not apply. Unlike conventional IT, where teams refine predictable blueprints, GenAI demands continuous experimentation. “With Generative AI, you never really know what is going to happen next. There is always an element of uncertainty,” one AI engineer admitted. Another AI engineer at one of Switzerland’s largest insurers put it more bluntly: “With GenAI, there is so much unexpected, so much unclear. It is extremely exploratory […] it cannot be planned. We are working in a completely different mode.”
Instead of forcing GenAI into a rigid roadmap, successful GenAI scalers embraced a guardrail approach. Rather than prescribing every step, they focused on defining clear boundaries, outlining what is and what is not permissible, while giving teams the flexibility to experiment within those limits. The result? Faster iterations, smarter risk-taking, and greater opportunities to explore the technology’s full potential.
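In code, a guardrail looks less like a script and more like a boundary check that every output must pass before reaching a customer. The rules below, a price floor and a list of banned commitments, are hypothetical examples (the price floor nods to the $1 Tahoe incident), not an actual company's policy.

```python
# Illustrative guardrail check: define hard boundaries an output must satisfy,
# then let teams experiment freely inside them. PRICE_FLOOR and
# BANNED_PHRASES are hypothetical rules, not a real policy.
import re

PRICE_FLOOR = 1000  # reject implausible prices, e.g. a $1 vehicle offer
BANNED_PHRASES = ["legally binding", "guaranteed approval"]

def passes_guardrails(output: str) -> bool:
    """Return True only if the output stays inside the defined boundaries."""
    lowered = output.lower()
    for phrase in BANNED_PHRASES:
        if phrase in lowered:
            return False
    # any quoted dollar amount must be above the plausibility floor
    for amount in re.findall(r"\$([\d,]+)", output):
        if int(amount.replace(",", "")) < PRICE_FLOOR:
            return False
    return True
```

Because the boundaries are explicit and testable, teams can iterate on prompts and models quickly while the guardrail, not a fixed roadmap, protects the business.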

Table 2. Difference Between Agile and Generative AI Development
GenAI is an ongoing radical innovation that cuts across the organization, demanding a new approach to resource allocation. Scaling GenAI is not a one-and-done project; it requires ongoing investment. “The [GenAI] use case needs continuous refinement. Without this mindset, teams may think the use case is finished,” one GenAI engineer explained.
Yet many organizations struggle because GenAI resources remain trapped in functional silos. Nearly all of our case companies faced this exact challenge: individual business units launched GenAI pilots, but scaling stalled due to a lack of access to critical resources, ranging from legal and ethical guidance to IT and business support. Without a structured mechanism to coordinate efforts across departments, promising initiatives never advanced beyond the proof-of-concept stage. “The greater challenge lies in guiding the business units that are not functioning optimally today. […] We […] must collaborate more closely,” as the CIO of a health insurer pointed out.
The most successful GenAI scaler established a GenAI Stream to break this cycle. Every two weeks, leaders from AI ethics, IT, legal, and business functions met in this GenAI Stream to align priorities, troubleshoot roadblocks, and reallocate scaling resources where they were needed most. Also, over time, experts within the GenAI Stream gained greater decision-making authority, even securing veto power on key decisions. One middle manager described the shift: “I deliberately challenged the CEO to see how serious he was about [the right he gave employees to veto his decisions]. He responded without hesitation, there was no debate. But having veto power is not about saying, ‘Stop, I do not care anymore.’ It also means clearly explaining why a delay is necessary, when additional resources are required, or when more funding is needed.” This approach paid off and allowed the bank to iteratively improve its customer service center to become GenAI-driven: “[…] we are quite advanced among banks in the Swiss market because we are able to automate approximately 80% of all customer inquiries,” the bank’s COO noted.
Companies that successfully scale GenAI do more than enhance productivity and efficiency, they redefine industry standards. But scaling is not about expanding GenAI initiatives indiscriminately. The key is to channel GenAI’s probabilistic nature through rules that transform it into a controlled driver of business value.
Organizations that thrive in this new era follow four essential rules:
If history teaches business leaders one lesson, it is that companies excel not by resisting technological change but by shaping it to meet their business needs. With the proper rules, GenAI can become a powerful business value-generating corporate capability, not an unpredictable wildcard.