AI-Infused Development Needs More Than Prompts

The current conversation about AI in software development is still happening at the wrong layer.

Most of the attention goes to code generation. Can the model write a method, scaffold an API, refactor a service, or generate tests? Those things matter, and they are often useful. But they are not the hard part of enterprise software delivery. In real organizations, teams rarely fail because nobody could produce code quickly enough. They fail because intent is unclear, architectural boundaries are weak, local decisions drift away from platform standards, and verification happens too late.

That becomes even more obvious once AI enters the workflow. AI does not just accelerate implementation. It accelerates whatever conditions already exist around the work. If the team has clear constraints, good context, and strong verification, AI can be a powerful multiplier. If the team has ambiguity, tacit knowledge, and undocumented decisions, AI amplifies those too.

That is why the next phase of AI-infused development will not be defined by prompt cleverness. It will be defined by how well teams can make intent explicit and how effectively they can keep control close to the work.

This shift has become clearer to me through recent work around IBM Bob, an AI-powered development partner I have been working with closely for a couple of months now, and the broader patterns emerging in AI-assisted development.

The real value is not that a model can write code. The real value appears when AI operates inside a system that exposes the right context, limits the action space, and verifies outcomes before bad assumptions spread.

The code generation story is too small

The market likes simple narratives, and “AI helps developers write code faster” is a simple narrative. It demos well. You can measure it in isolated tasks. It produces screenshots and benchmark charts. It also misses the point.

Enterprise development is not primarily a typing problem. It is a coordination problem. It is an architecture problem. It is a constraints problem.

A useful change in a large Java codebase is rarely just a matter of producing syntactically correct code. The change has to fit an existing domain model, respect service boundaries, align with platform rules, use approved libraries, satisfy security requirements, integrate with CI and testing, and avoid creating support headaches for the next team that touches it. The code is only one artifact in a much larger system of intent.

Human developers understand this instinctively, even if they do not always document it well. They know that a “working” solution can still be wrong because it violates conventions, leaks responsibility across modules, introduces fragile coupling, or conflicts with how the organization actually ships software.

AI systems do not infer those boundaries reliably from a vague instruction and a partial code snapshot. If the intent is not explicit, the model fills in the gaps. Sometimes it fills them in well enough to look impressive. Sometimes it fills them in with plausible nonsense. In both cases, the danger is the same. The system appears more certain than the surrounding context justifies.

This is why teams that treat AI as an ungoverned autocomplete layer eventually run into a wall. The first wave feels productive. The second wave exposes drift.

AI amplifies ambiguity

There is a phrase I keep coming back to because it captures the problem cleanly. If intent is missing, the model fills the gap.

That is not a flaw unique to one product or one model. It is a predictable property of probabilistic systems operating in underspecified environments. The model will produce the most likely continuation of the context it sees. If the context is incomplete, contradictory, or detached from the architectural reality of the system, the output may still look polished. It may even compile. But it is working from an invented understanding.

This becomes especially visible in enterprise modernization work. A legacy system is full of patterns shaped by old constraints, partial migrations, local workarounds, and decisions nobody wrote down. A model can inspect the code, but it cannot magically recover the missing intent behind every design choice. Without guidance, it may preserve the wrong things, simplify the wrong abstractions, or generate a modernization path that looks efficient on paper but conflicts with operational reality.

The same pattern shows up in greenfield projects, just faster. A team starts with a few useful AI wins, then gradually notices inconsistency. Different services solve the same problem differently. Similar APIs drift in style. Platform standards are applied unevenly. Security and compliance checks move to the end. Architecture reviews become cleanup exercises instead of design checkpoints.

AI did not create those problems. It accelerated them.

That is why the real question is no longer whether AI can generate code. It can. The more important question is whether the development system around the model can express intent clearly enough to make that generation trustworthy.

Intent needs to become a first-class artifact

For a long time, teams treated intent as something informal. It lived in architecture diagrams, old wiki pages, Slack threads, code reviews, and the heads of senior developers. That has always been fragile, but human teams could compensate for some of it through conversation and shared experience.

AI changes the economics of that informality. A system that acts at machine speed needs machine-readable guidance. If you want AI to operate effectively in a codebase, intent has to move closer to the repository and closer to the task.

That does not mean every project needs a heavy governance framework. It means the important rules can no longer stay implicit.

Intent, in this context, includes architectural boundaries, approved patterns, coding conventions, domain constraints, migration goals, security rules, and expectations about how work should be verified. It also includes task scope. One of the most effective controls in AI-assisted development is simply making the task smaller and sharper. The moment AI is attached to repository-local guidance, scoped instructions, architectural context, and tool-mediated workflows, the quality of the interaction changes. The system is no longer guessing in the dark based on a chat transcript and a few visible files. It is operating inside a shaped environment.

One practical expression of this shift is spec-driven development. Instead of treating requirements, boundaries, and expected behavior as loose background context, teams make them explicit in artifacts that both humans and AI systems can work from. The specification stops being passive documentation and becomes an operational input to development.
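As a concrete illustration, here is a minimal sketch of what a specification might look like once it becomes a machine-readable operational input. Everything in it is hypothetical: the field names, the rules, and the `task_context` helper are invented for illustration, not drawn from any particular tool.

```python
# Hypothetical sketch: a repository-local spec that both humans and an AI
# workflow can read. Field names and rules are illustrative, not a standard.
SPEC = {
    "service": "billing-api",
    "boundaries": ["billing may not import from shipping"],
    "approved_libraries": ["jackson", "resilience4j"],
    "verification": ["unit tests must pass", "no new public endpoints"],
}

def task_context(spec: dict, task: str) -> str:
    """Render the spec into explicit context that travels with each task,
    so the model is not left to infer constraints from scattered hints."""
    lines = [f"Task: {task}"]
    for key in ("boundaries", "approved_libraries", "verification"):
        for rule in spec[key]:
            lines.append(f"{key}: {rule}")
    return "\n".join(lines)

print(task_context(SPEC, "add retry to invoice client"))
```

The point is not the format but the discipline: the same artifact is reviewable by humans and consumable by the system.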

That is a much more useful model for enterprise development.

The important pattern is not tool-specific. It applies across the category. AI becomes more reliable when intent is externalized into artifacts the system can actually use. That can include local guidance files, architecture notes, workflow definitions, test contracts, tool descriptions, policy checks, specialized modes, and bounded task instructions. The exact format matters less than the principle. The model should not have to reverse engineer your engineering system from scattered hints.

Cost is a complexity problem disguised as a sizing problem

This becomes even clearer when you look at migration work and try to attach cost to it.

One of the recent discussions I had with a colleague was about how to size modernization work in token/cost terms. At first glance, lines of code look like the obvious anchor. They are easy to count, easy to compare, and simple to put into a table. The problem is that they do not explain the work very well.

What we are seeing in migration exercises matches what most experienced engineers would expect. Cost is often less about raw application size and more about how the application is built. A 30,000 line application with old security, XML-heavy configuration, custom build logic, and a messy integration surface can be harder to modernize than a much larger codebase with cleaner boundaries and healthier build and test behavior.

That gap matters because it exposes the same flaw as the code-generation narrative. Superficial output measures are easy to report, but they are weak predictors of real delivery effort.

If AI-infused development is going to be taken seriously in enterprise modernization, it needs better effort signals than repository size alone. Size still matters, but only as one input. The more useful indicators capture framework and runtime distance, expressed through signals such as the number of modules or deployables, the age of the dependencies, or the number of files actually touched.

This is an architectural discussion. Complexity lives in boundaries, dependencies, side effects, and hidden assumptions. Those are exactly the areas where intent and control matter most.

Measured facts and inferred effort should not be collapsed into one story

There is another lesson here that applies beyond migrations. Teams often ask AI systems to produce a single comprehensive summary at the end of a workflow. They want the sequential list of changes, the observed results, the effort estimate, the pricing logic, and the business classification all in one polished report. It sounds efficient, but it creates a problem. Measured facts and inferred judgment get mixed together until the output looks more precise than it really is.

A better pattern is to separate workflow telemetry from sizing recommendations. The first artifact should describe what actually happened. How many files were analyzed or modified. How many lines changed, and over what elapsed time. How many tokens were actually consumed. Which prerequisites were installed or verified. That is factual telemetry. It is useful because it is grounded.

The second artifact should classify the work. How large and complex was the migration. How broad was the change. How much verification effort is likely required. That is interpretation. It can still be useful, but it should be presented as a recommendation, not as observed truth.
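One way to make that separation concrete is to keep the two artifacts as distinct types that never share fields, so measured facts and inferred judgment cannot silently blend. The sketch below is illustrative only; the class and field names are invented.

```python
# Illustrative sketch: telemetry (measured) and sizing (inferred) as two
# separate artifacts. All names are hypothetical, not from any product.
from dataclasses import dataclass

@dataclass(frozen=True)
class MigrationTelemetry:
    """What actually happened: grounded, observed facts."""
    files_analyzed: int
    files_modified: int
    lines_changed: int
    tokens_consumed: int
    elapsed_minutes: float

@dataclass(frozen=True)
class SizingRecommendation:
    """What the system infers: presented as judgment, not observation."""
    size_band: str        # e.g. "S", "M", "L"
    complexity_band: str  # e.g. "low", "medium", "high"
    rationale: str

def report(telemetry: MigrationTelemetry, rec: SizingRecommendation) -> str:
    """Render both, but keep the measured/inferred boundary explicit."""
    return (
        "MEASURED: "
        f"{telemetry.files_modified}/{telemetry.files_analyzed} files modified, "
        f"{telemetry.lines_changed} lines, {telemetry.tokens_consumed} tokens\n"
        f"RECOMMENDED (inferred): size={rec.size_band}, "
        f"complexity={rec.complexity_band}, rationale: {rec.rationale}"
    )
```

Because the two artifacts are separate types, a downstream consumer can always tell which numbers were recorded and which were estimated.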

AI is very good at producing complete-sounding narratives, but enterprise teams need systems that are equally good at separating what was measured from what was inferred.

A two-axis model is closer to real modernization work

If we want AI-assisted modernization to be economically credible, a one-dimensional sizing model will not be enough. A much more realistic model is at least two-dimensional. The first axis is size, meaning the overall scope of the repository or modernization target. The second axis is complexity, which covers things like legacy depth, security posture, integration breadth, test quality, and the amount of ambiguity the system must absorb.

That model reflects real modernization work far better than a single label driven by lines of code (LOC). It also gives architects and engineering leaders a much more honest explanation for why two similarly sized applications can land in very different token ranges.
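The two-axis idea can be sketched in a few lines. The thresholds and weights below are invented for illustration and would need calibration against real migration data; the point is only that size and complexity are scored separately and combined at the end.

```python
# Minimal sketch of a two-axis sizing model. All thresholds and weights
# are made up for illustration; they are not calibrated values.
def size_band(loc: int) -> str:
    """Axis 1: raw scope of the modernization target."""
    if loc < 50_000:
        return "S"
    if loc < 500_000:
        return "M"
    return "L"

def complexity_band(modules: int, dependency_age_years: float,
                    files_touched: int) -> str:
    """Axis 2: legacy depth, integration breadth, blast radius."""
    score = 0
    score += modules // 10               # integration breadth
    score += int(dependency_age_years)   # framework/runtime distance
    score += files_touched // 100        # actual change surface
    if score < 5:
        return "low"
    if score < 15:
        return "medium"
    return "high"

def sizing_label(loc, modules, dep_age, files_touched):
    return f"{size_band(loc)}/{complexity_band(modules, dep_age, files_touched)}"

# Two applications of different size can invert in effort:
print(sizing_label(30_000, 40, 12.0, 800))   # prints S/high
print(sizing_label(200_000, 6, 2.0, 150))    # prints M/low
```

The example at the bottom mirrors the article's point: the 30,000-line application with old dependencies and a broad integration surface lands in a harder band than the much larger but cleaner codebase.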

And it reinforces the core point: Complexity is where missing intent becomes expensive.

A code assistant can produce output quickly in both projects. But the project with deeper legacy assumptions, more security changes, and more fragile integrations will demand far more control. It will need tighter scope, better architectural guidance, more explicit task framing, and stronger verification. In other words, the economic cost of modernization is directly tied to how much intent must be recovered and how much control must be imposed to keep the system safe. That is a much more useful way to think about AI-infused development than raw generation speed.

Control is what makes AI scale

Control is what turns AI assistance from an interesting capability into an operationally useful one. In practice, control means the AI does not just have broad access to generate output. It works through constrained surfaces. It sees selected context. It can take actions through known tools. It can be checked against expected outcomes. Its work can be verified continuously instead of inspected only at the end.
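A minimal sketch of one such constrained surface: the model can only act through an allow-listed set of named tools, and a policy check runs before any action executes. The tool names and the toy policy are hypothetical.

```python
# Hedged sketch: an allow-listed tool registry as one concrete form of
# control. Tool names and the policy rule are invented for illustration.
ALLOWED_TOOLS = {
    "read_file": lambda path: f"<contents of {path}>",
    "run_tests": lambda path: "tests passed",
}

def dispatch(tool: str, arg: str) -> str:
    """The model never gets broad access; every action goes through a
    known tool, and policy runs before anything happens."""
    if tool not in ALLOWED_TOOLS:
        raise PermissionError(f"tool not in contract: {tool}")
    if arg.startswith("/etc"):  # toy scope rule: stay inside the project
        raise PermissionError(f"path outside project scope: {arg}")
    return ALLOWED_TOOLS[tool](arg)
```

The value of the pattern is that refusals are structural, not conversational: an out-of-contract action fails at the dispatch layer regardless of how the request was phrased.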

A lot of recent excitement around agents misses this point. The ambition is understandable. People want systems that can take higher-level goals and move work forward with less direct supervision. But in software development, open-ended autonomy is usually the least interesting form of automation. Most enterprise teams do not need a model with more freedom. They need a model operating inside better boundaries.

That means scoped tasks, local rules, architecture-aware context, and tool contracts, all with verification built directly into the flow. It also means being careful about what we ask the model to report. In migration work, some data is directly observed, such as files changed, elapsed time, or recorded token use. Other data is inferred, such as migration complexity or likely cost. If a prompt asks the model to present both as one seamless summary, it can create false confidence by making estimates sound like facts. A better workflow requires the model to separate measured results from recommendations and to avoid claiming precision the system did not actually record.

Once you look at it this way, the center of gravity shifts. The hard problem is no longer how to prompt the model better. The hard problem is how to engineer the surrounding system so the model has the right inputs, the right limits, and the right feedback loops. That is a software architecture problem.

This is not prompt engineering

Prompt engineering suggests that the main lever is wording. Ask more precisely. Structure the request better. Add examples. Those techniques help at the margins, and they can be useful for isolated tasks. But they are not a durable answer for complex development environments.

The more scalable approach is to improve the surrounding system with explicit context (like repository and architecture constraints), constrained actions (via workflow-aware tools and policies), and integrated tests and validation.

This is why “intent and control” is a more useful framing than “better prompting.” It moves the conversation from tricks to systems. It treats AI as one component in a broader engineering loop rather than as a magic interface that becomes trustworthy if phrased correctly.

That is also the frame enterprise teams need if they want to move from experimentation to adoption. Most organizations do not need another internal workshop on how to write smarter prompts. They need better ways to encode standards and context, constrain AI actions, and implement verification that separates facts from recommendations.

A more realistic maturity model

The pattern I expect to see more often over the next few months is fairly simple. Teams will begin with chat-based assistance and local code generation because it is easy to try and immediately useful. Then they will discover that generic assistance plateaus quickly in larger systems.

In theory, the next step is repository-aware AI, where models can see more of the code and its structure. In practice, we are only starting to approach that stage now. Some leading models only recently moved to 1 million-token context windows, and even that does not mean unlimited codebase understanding. Google describes 1 million tokens as enough for roughly 30,000 lines of code at once, and Anthropic only recently added 1 million-token support to Claude 4.6 models.

That sounds large until you compare it with real enterprise systems. Many legacy Java applications are much larger than that, sometimes by an order of magnitude. One case cited by vFunction describes a 20-year-old Java EE monolith with more than 10,000 classes and roughly 8 million lines of code. Even smaller legacy estates often include multiple modules, generated sources, XML configuration, old test assets, scripts, deployment descriptors, and integration code that all compete for attention.
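A quick back-of-envelope calculation, using the roughly 30,000 lines per 1 million-token figure cited above, shows why that comparison matters. This is illustrative arithmetic only; real token counts vary by language and coding style.

```python
# Back-of-envelope arithmetic: if a 1 million-token window holds roughly
# 30,000 lines of code, how many windows would a large legacy estate need?
# Illustrative only; actual tokens-per-line ratios vary.
LINES_PER_MILLION_TOKENS = 30_000

def windows_needed(lines_of_code: int) -> float:
    """How many 1M-token context windows the code alone would occupy."""
    return lines_of_code / LINES_PER_MILLION_TOKENS

print(windows_needed(30_000))             # prints 1.0 (one full window)
print(round(windows_needed(8_000_000)))   # prints 267 (the 8M-line monolith)
```

And that counts only source lines, before tool definitions, retrieved documentation, conversation history, and generated output compete for the same window.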

So repository-aware AI today usually does not mean that the agent fully ingests and truly understands the whole repository. More often, it means the system retrieves and focuses on the parts that look relevant to the current task. That is useful, but it is not the same as holistic awareness. Sourcegraph makes this point directly in its work on coding assistants: Without strong context retrieval, models fall back to generic answers, and the quality of the result depends heavily on finding the right code context for the task. Anthropic describes a similar constraint from the tooling side, where tool definitions alone can consume tens of thousands of tokens before any real work begins, forcing systems to load context selectively and on demand.

That is why I think the industry should be careful with the phrase “repository-aware.” In many real workflows, the model is not aware of the repository in any complete sense. It is aware of a working slice of the repository, shaped by retrieval, summarization, tool selection, and whatever the agent has chosen to inspect so far. That is progress, but it still leaves plenty of room for blind spots, especially in large modernization efforts where the hardest problems often sit outside the files currently in focus.

After that, the important move is making intent explicit through local guidance, architectural rules, workflow definitions, and task shaping. Then comes stronger control, which means policy-aware tools, bounded actions, better telemetry, and built-in verification. Only after those layers are in place does broader agentic behavior start to make operational sense.

This sequence matters because it separates visible capability from durable capability. Many teams are trying to jump directly to autonomous flows without doing the quieter work of exposing intent and engineering control. That will produce impressive demos and uneven outcomes. The teams that get real leverage from AI-infused development will be the ones that treat intent as infrastructure.

The architecture question that matters now

For the last year, the question has often been, “What can the model generate?” That was a reasonable place to start because generation was the obvious breakthrough. But it is not the question that will determine whether AI becomes dependable in real delivery environments.

The better question is: “What intent can the system expose, and what control can it enforce?”

That is the level where enterprise value starts to become durable. It is where architecture, platform engineering, developer experience, and governance meet. It is also where the work becomes most interesting, not as a story about an assistant producing code but as part of a larger shift toward intent-rich, controlled, tool-mediated development systems.

AI is making discipline more visible.

Teams that understand this will not just ship code faster. They will build development systems that are more predictable, more scalable, more economically legible, and far better aligned with how enterprise software actually gets delivered.
