AI Coding in Large Codebases Is Not Won by the Model Alone
A large codebase is not just “a lot of files.” It is more like an underground maze that has been under construction for decades: some tunnels live in one giant Git monorepo, some are scattered across dozens of services, and some corners still contain C, C++, C#, Java, and PHP, the languages people assume AI coding tools would rather avoid. With recent model versions, Claude Code often performs better in these environments than teams expect.
Whether Claude Code can actually work in that environment is not only about whether the model is smart enough. The harsher question is: when the model walks in the door, does it know where to go? When it hits legacy systems, internal tools, different test commands in different subdirectories, same-name functions, generated files, and permission rules, is there a track that guides it to the right place?
That track is the part large organizations most often underestimate when they adopt AI coding.
Do Not Stuff the Whole City Into a Vector Database
One common approach is to chunk the whole codebase, embed it, build an index, then retrieve fragments that look relevant at query time. This feels intuitive in a small project, but in a large organization it can turn into a time capsule: engineers commit new code every day, while the index may be stuck hours, days, or even weeks in the past. The function it retrieves may have been renamed two weeks ago, the module may have been deleted last sprint, and the system still says, with a straight face, “this is relevant.”
Claude Code’s path is closer to how an engineer enters a repository: walk the filesystem, read files, search for clues, and follow references. It sees live code on the developer’s machine, without needing to upload the whole repository first or maintain a central index.
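The staleness argument is easy to see in a toy Python sketch. The “index” here is a dict of function names captured once. It stands in for an embedding index, though no embeddings are involved. billing.py, charge_card, and capture_payment are made-up names. The live search walks the directory tree at query time, roughly the way an agent or an engineer with grep would.

```python
import os
import re
import tempfile

DEF_PATTERN = re.compile(r"^def (\w+)\(", re.MULTILINE)


def _read(path: str) -> str:
    with open(path, encoding="utf-8", errors="ignore") as fh:
        return fh.read()


def build_snapshot_index(root: str) -> dict[str, list[str]]:
    """Capture, once, which files define which functions (the toy 'index')."""
    index: dict[str, list[str]] = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            for fn in DEF_PATTERN.findall(_read(path)):
                index.setdefault(fn, []).append(os.path.relpath(path, root))
    return index


def live_search(root: str, pattern: str) -> list[str]:
    """Walk the tree at query time and return files whose current text matches."""
    regex = re.compile(pattern)
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if regex.search(_read(path)):
                hits.append(os.path.relpath(path, root))
    return sorted(hits)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as repo:
        src = os.path.join(repo, "billing.py")
        with open(src, "w", encoding="utf-8") as fh:
            fh.write("def charge_card(amount):\n    return amount\n")
        index = build_snapshot_index(repo)  # built "two weeks ago"

        with open(src, "w", encoding="utf-8") as fh:  # later: the function is renamed
            fh.write("def capture_payment(amount):\n    return amount\n")

        print(index.get("charge_card"))                     # ['billing.py'] (stale answer)
        print(live_search(repo, r"def charge_card\("))      # []
        print(live_search(repo, r"def capture_payment\("))  # ['billing.py']
```

After the rename, the snapshot still points at charge_card, but the live walk only finds capture_payment. Real agentic search also opens files and follows references, and this sketch leaves that part out.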
But that does not mean you can drop an agent into a billion lines of code and say, “find that weird pattern.” Without starting context, an agent looks like a new hire on day one being asked to fix a “probably simple” bug, except they do not even know where the system starts.
Clawd's hot take:
Stuffing a large codebase into a retrieval index sometimes feels like getting lost in a department store, then deciding the first step is to photograph every shelf in the building and turn it into an album. Sounds scientific. By the time the album is ready, the third-floor food court has been renovated, the bubble tea shop moved, and the restroom entrance is somewhere else. Agentic search is closer to hiring an engineer who can read signs on-site, but only if the site actually has signs.
The Operating Setup Matters as Much as the Model
The most common misconception in large deployments is treating Claude Code’s capability as “the model’s capability.” Model benchmark scores matter, obviously. But in a real codebase, the operating setup around the model often decides the outcome.
Forget the tool names for a moment. When a team puts Claude Code inside a large codebase, it is really answering three plain questions: which map should it read when it enters? Which repeated actions should not be left to the model’s memory? Which specialized tools should only come out when needed?
The first layer is the map. CLAUDE.md’s job is not to stuff all project knowledge into the model. Its job is to make sure that, from the moment it starts, Claude Code knows the broad direction, the important constraints, and the pitfalls it absolutely must avoid. Put global context at the repository root and local conventions in subdirectories: how to test this service, naming rules for this module, and what historical baggage lives here. These files load automatically at the beginning of a session, so every session pays for every line in them. Once they get bloated, the whole session slows down. This is not an encyclopedia. It is the floor plan by the entrance that says “do not go here.”
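A team can check mechanically that the map stays thin. This sketch assumes each team picks its own budget. The 150-line and 8,000-byte limits are hypothetical numbers and are not limits enforced by Claude Code.

```python
from pathlib import Path

MAX_LINES = 150    # hypothetical team budget, not a Claude Code limit
MAX_BYTES = 8_000  # hypothetical team budget, not a Claude Code limit


def over_budget(files: dict[str, str]) -> list[tuple[str, int, int]]:
    """Given {relative path: text}, return (path, lines, bytes) for files over budget."""
    offenders = []
    for rel, text in sorted(files.items()):
        lines = len(text.splitlines())
        size = len(text.encode("utf-8"))
        if lines > MAX_LINES or size > MAX_BYTES:
            offenders.append((rel, lines, size))
    return offenders


def collect_claude_md(repo_root: str) -> dict[str, str]:
    """Read every CLAUDE.md under repo_root, keyed by POSIX-style relative path."""
    root = Path(repo_root)
    return {
        p.relative_to(root).as_posix(): p.read_text(encoding="utf-8")
        for p in root.rglob("CLAUDE.md")
    }
```

In CI, over_budget(collect_claude_md(".")) returning a non-empty list could fail the build. That nudges authors to move detail into subdirectory files or Skills.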
The second layer is automation. Repeatable, verifiable actions should not depend on the model remembering them. Hooks are often first used as guardrails, such as blocking unsafe operations. Their more valuable use is making the setup improve itself: review newly learned context at the end of a session and suggest updates to CLAUDE.md; load team-specific settings by module at the start of a session; run deterministic rules like formatting, linting, and tests through hooks instead of praying the model remembers.
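Here is a sketch of a deterministic guardrail: a Python script that could be registered as a PreToolUse hook with a Bash matcher. It rests on two assumptions from my reading of the Claude Code hooks documentation. First, the pending tool call arrives as JSON on stdin, with tool_name and tool_input.command. Second, exit code 2 blocks the call and returns stderr to the model. Verify both against the current docs. The blocked patterns are hypothetical team rules. The --hook flag is just this sketch’s convention, so the file does nothing when imported or run without it.

```python
import json
import re
import sys

# Hypothetical team rules: commands the agent should never run unreviewed.
BLOCKED = [
    # rm with a recursive (r/R) and a force (f) flag in one cluster, e.g. -rf, -fr, -Rf
    (re.compile(r"\brm\s+-(?=[a-zA-Z]*[rR])(?=[a-zA-Z]*f)[a-zA-Z]+"), "recursive force delete"),
    # git push --force, but not --force-with-lease
    (re.compile(r"\bgit\s+push\b.*--force(?![-\w])"), "force push"),
    (re.compile(r"\bDROP\s+TABLE\b", re.IGNORECASE), "dropping a table"),
]


def check(payload: dict) -> tuple[int, str]:
    """Return (exit_code, message): 0 lets the tool call proceed, 2 blocks it."""
    if payload.get("tool_name") != "Bash":
        return 0, ""
    command = payload.get("tool_input", {}).get("command", "")
    for pattern, reason in BLOCKED:
        if pattern.search(command):
            return 2, f"Blocked by team hook ({reason}): {command}"
    return 0, ""


def main() -> None:
    """Hook entry point: read the pending call from stdin, report, and exit."""
    code, message = check(json.load(sys.stdin))
    if message:
        print(message, file=sys.stderr)  # assumed to be shown to the model on exit 2
    sys.exit(code)


if __name__ == "__main__" and "--hook" in sys.argv:
    main()
```

The decision logic sits in a pure check() function, so the rules can be unit-tested like any other code. Only main() touches stdin and the exit code. The hook command would be something like python3 .claude/hooks/guard.py --hook, and that path is hypothetical.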
The third layer is on-demand loading. In a large codebase, not every session needs security review, documentation updates, deployment flow, or data-processing rules. The value of Skills is progressive disclosure: load the security-review process only when doing security work; load the documentation workflow only when editing docs. A Skill can also bind to paths. For example, a deployment Skill for the payments service only activates inside the payments directory and does not wander into other modules.
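The path-binding idea fits in a few lines. This is a hypothetical toy, not Claude Code’s Skill loader or its file format. Each Skill has a short description and optional glob scopes. A Skill is considered only when a touched file falls inside its scope. The Skill names, descriptions, and paths are made up.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatch


@dataclass
class Skill:
    name: str
    description: str  # short text that is cheap to keep in context
    path_globs: list[str] = field(default_factory=list)  # empty means repo-wide


SKILLS = [  # hypothetical Skills
    Skill("security-review", "Checklist for auth, secrets, and input validation"),
    Skill("docs-workflow", "How we build and preview docs", ["docs/*"]),
    Skill("payments-deploy", "Deploy steps for the payments service", ["services/payments/*"]),
]


def skills_in_scope(touched_files: list[str], skills: list[Skill] = SKILLS) -> list[str]:
    """Names of Skills whose scope covers at least one touched file.

    Note: fnmatch's '*' also matches '/', so 'services/payments/*' covers
    every file nested anywhere under that directory.
    """
    in_scope = []
    for skill in skills:
        if not skill.path_globs or any(
            fnmatch(path, glob) for path in touched_files for glob in skill.path_globs
        ):
            in_scope.append(skill.name)
    return in_scope
```

With these definitions, a change under services/payments/ brings in payments-deploy and a change under services/search/ does not. The unscoped security-review stays available everywhere.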
Once those practices actually work, distribution becomes the next problem. The waste in an organization is often not that nobody made a good setup. It is that the good setup stayed on a few senior engineers’ laptops and became tribal knowledge. Plugins can package Skills, Hooks, and MCP configuration into installable bundles, so a new engineer gets the same capability on day one. In a case Anthropic shared, a large retailer packaged a Skill that connected to an internal analytics platform as a plugin, letting business analysts pull performance data without leaving their existing workflow. The value is not “one more feature.” The value is whether the capability can be copied.
Clawd highlights:
Short version: keep the map small and the toolbox on demand.
The easiest way to go wrong here is treating CLAUDE.md like a family-recipe tonic and stuffing everything into it. Then every session starts like eating a banquet for breakfast: before the model even starts fixing the bug, its context stomach has already exploded (;´Д`) A mature setup is more like a toolbox: keep the usual screwdriver by the entrance, put the welding torch and laser cutter in the cabinet, and take them out only when needed. gu-log has already unpacked Hooks, Skills, and plugins in separate posts. For this piece, keep the big picture: large codebases need a small always-on map, plus a pile of tools that only come out when needed.
Real Navigation Is Not Just Text Search
In a small project, searching for a function name probably still gives you a direction. In a large codebase, searching for a common name may return thousands of results. The agent starts opening files one by one, the context window fills up like a sink with the faucet left on, and it still may not find the right symbol.
At that point, change the mental model: do not just search strings. Let Claude Code use the symbol navigation your IDE already relies on. The familiar “go to definition” and “find all references” features in IDEs are usually backed by LSP, meaning a language server is analyzing the code structure in real time. Expose that layer to Claude Code, and it can follow definitions and references for the same symbol and distinguish same-name but unrelated functions across languages and modules. But this is not magic that falls from the sky. Teams still need to install the right code-analysis plugins and language-server executables for the relevant languages.
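Under the hood, a “go to definition” lookup asks about a position in a file rather than searching for a string. That is why it can tell two same-name functions apart. The sketch below builds the raw LSP messages for textDocument/definition and textDocument/references, using the base protocol’s Content-Length framing and zero-based positions. In practice the LSP integration and the language server do this for you. The file URIs and positions passed in are placeholders.

```python
import json


def frame(message: dict) -> bytes:
    """Wrap a JSON-RPC message in the LSP base-protocol header."""
    body = json.dumps(message, separators=(",", ":")).encode("utf-8")
    return b"Content-Length: " + str(len(body)).encode("ascii") + b"\r\n\r\n" + body


def unframe(data: bytes) -> dict:
    """Parse one framed message: headers, a blank line, then Content-Length bytes."""
    header, _, rest = data.partition(b"\r\n\r\n")
    length = None
    for line in header.split(b"\r\n"):
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            length = int(value.strip())
    if length is None:
        raise ValueError("missing Content-Length header")
    return json.loads(rest[:length].decode("utf-8"))


def _position_request(req_id: int, method: str, uri: str, line: int, character: int) -> dict:
    return {
        "jsonrpc": "2.0",
        "id": req_id,
        "method": method,
        "params": {
            "textDocument": {"uri": uri},
            "position": {"line": line, "character": character},  # both zero-based
        },
    }


def definition_request(req_id: int, uri: str, line: int, character: int) -> bytes:
    return frame(_position_request(req_id, "textDocument/definition", uri, line, character))


def references_request(req_id: int, uri: str, line: int, character: int,
                       include_declaration: bool = True) -> bytes:
    msg = _position_request(req_id, "textDocument/references", uri, line, character)
    msg["params"]["context"] = {"includeDeclaration": include_declaration}
    return frame(msg)
```

For example, definition_request(1, "file:///repo/src/billing.c", 41, 10) asks where the symbol at line 42, column 11 is defined. Those are the one-based numbers an editor shows.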
For large multilingual or strongly typed codebases, LSP is usually a high-return investment. Another deployment example Anthropic shared involved an enterprise software company that rolled out LSP integration across the organization before a broad Claude Code deployment, specifically to make C and C++ navigation reliable. In a large codebase, finding the wrong symbol is scarier than the model being dumb.
Clawd twists the knife:
The boundary matters here: this is deployment experience shared by Anthropic, not a public benchmark. It supports the conclusion that “LSP is worth investing in for large, multilingual codebases.” Do not read it as “some company’s C++ efficiency improved by X percent” (`・ω・´)
Another kind of navigation lives outside the code. Claude Code seeing files does not mean it sees internal docs, ticket systems, analytics platforms, custom company search, or deployment APIs. Mature teams wrap those internal tools as MCP servers, so Claude can fetch data through structured tool calls instead of relying on humans to copy and paste.
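To show what “structured tool calls” means, here is a transport-free Python sketch of the two JSON-RPC requests at the core of an MCP tool server: tools/list and tools/call. The search_tickets tool and its in-memory data are hypothetical stand-ins for an internal ticket system. A real server would use an official MCP SDK. The SDK also handles the initialize handshake and the stdio or HTTP transport, which this sketch omits.

```python
import json

TICKETS = [  # hypothetical stand-in for an internal ticket system
    {"id": "PAY-101", "title": "Refund webhook retries twice"},
    {"id": "PAY-117", "title": "Currency rounding in refund report"},
    {"id": "SRCH-9", "title": "Index rebuild is slow"},
]

TOOLS = [{
    "name": "search_tickets",
    "description": "Search internal tickets by keyword in the title.",
    "inputSchema": {  # JSON Schema describing the tool's arguments
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}]


def _reply(req_id, result=None, error=None) -> dict:
    msg = {"jsonrpc": "2.0", "id": req_id}
    if error is not None:
        msg["error"] = error
    else:
        msg["result"] = result
    return msg


def handle(request: dict) -> dict:
    """Answer a tools/list or tools/call request; other methods are rejected."""
    method, req_id = request.get("method"), request.get("id")
    params = request.get("params", {})
    if method == "tools/list":
        return _reply(req_id, {"tools": TOOLS})
    if method != "tools/call":
        return _reply(req_id, error={"code": -32601, "message": f"Method not found: {method}"})
    if params.get("name") != "search_tickets":
        return _reply(req_id, error={"code": -32602, "message": f"Unknown tool: {params.get('name')}"})
    query = params.get("arguments", {}).get("query", "").lower()
    hits = [t for t in TICKETS if query in t["title"].lower()]
    # Tool results come back as content items; plain text is the simplest kind.
    return _reply(req_id, {"content": [{"type": "text", "text": json.dumps(hits)}], "isError": False})
```

The model sees the tool’s name, description, and input schema. It asks for exactly the data it needs, so nobody has to copy and paste from a browser tab.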
Once maps, on-demand knowledge, symbol navigation, and internal tools are relatively stable, some teams start splitting large exploration tasks out to subagents. A subagent is an independent Claude instance with its own context window. It can do read-only exploration of a subsystem, write its findings into a file, then hand clean results back to the main agent for implementation. This is not about showing off. It is about keeping the main session from being clogged by the pathfinding process: exploration stays exploration, editing stays editing.
Clawd highlights:
Subagents are not the romantic version of “just call more AIs to help.” Without clear division of labor, multi-agent work is like adding three people halfway through a meeting, and every one of them says, “let me add some context.” Once you have an operating setup, subagents can become real delegation: one checks the warehouse, one reads the invoices, and the main agent sits at the counter handling the return.
Large Codebases Have to Become Readable First
Whether Claude Code can help is limited by whether it can find the right context. Load too much and performance drops. Load too little and it is walking through a maze blindfolded. The shared pattern across several mature deployments is the same: make the codebase readable to the agent.
CLAUDE.md should be thin and layered. The root should contain only the highest-level information and the critical traps; subdirectories can add local rules. When Claude Code starts in a subdirectory, it reads the CLAUDE.md files along the path upward to the root, so it does not have to start from the repository root every time. This is a little counterintuitive in a monorepo, because many tools expect to be run from the root. But for an agent, starting from the relevant subdirectory usually makes it easier to focus.
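The upward walk can be sketched without touching the filesystem. Given the directory where a session starts and the set of CLAUDE.md files that exist, list the ones on the path from the root down. The paths are hypothetical. The root-first ordering is this sketch’s choice for readability and says nothing about how Claude Code orders them internally.

```python
from pathlib import PurePosixPath


def applicable_claude_md(start_dir: str, existing: set[str]) -> list[str]:
    """Return CLAUDE.md paths on the way from the repo root down to start_dir."""
    start = PurePosixPath(start_dir)
    chain = [start, *start.parents]  # e.g. services/payments/api, services/payments, services, .
    found = []
    for directory in reversed(chain):  # root first, most specific last
        candidate = (directory / "CLAUDE.md").as_posix()
        if candidate in existing:
            found.append(candidate)
    return found
```

With CLAUDE.md files at the root, services/, services/payments/, and services/search/, a session starting in services/payments/api picks up the first three and never sees services/search/CLAUDE.md.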
Test and lint commands should be layered by subdirectory too. Running the whole company’s test suite after changing one service is like watering a tiny succulent on your desk with a fire truck: impressive pressure, dead plant. Each subdirectory CLAUDE.md should identify the build, test, and lint commands that apply to that area. Service-oriented architectures are usually easier here. Deeply intertwined compiled-language monorepos may need project-specific build configuration to make this work.
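If each subdirectory’s CLAUDE.md records its own commands, choosing the right ones for a change is a nearest-ancestor lookup. This sketch assumes a hypothetical COMMANDS table that mirrors those files. The directory names and commands are illustrative, and the whole-repo suite is only a fallback.

```python
from pathlib import PurePosixPath

# Hypothetical: what each subdirectory's CLAUDE.md might list.
COMMANDS = {
    "services/payments": {"test": "make -C services/payments test", "lint": "make -C services/payments lint"},
    "services/search": {"test": "pytest services/search -q", "lint": "ruff check services/search"},
    ".": {"test": "make test-all", "lint": "make lint-all"},  # whole-repo fallback
}


def commands_for(changed_files: list[str], table: dict = COMMANDS) -> dict[str, dict[str, str]]:
    """Map each nearest configured directory to its commands, one entry per directory."""
    selected = {}
    for path in changed_files:
        for directory in PurePosixPath(path).parents:  # nearest ancestor first
            key = directory.as_posix()
            if key in table:
                selected[key] = table[key]
                break
    return selected
```

Two files changed inside the payments service produce one payments entry, and the fire truck stays in the garage.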
Generated files, build artifacts, and third-party code should be excluded. Commit permissions.deny rules to .claude/settings.json, and the whole team shares the same noise filter. Exceptions still need an escape hatch: if some engineers are developing a code generator and generated files are the work target, their local settings can override the project-level exclusion rules without affecting everyone else.
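Here is a sketch of producing that shared file from Python. It assumes Claude Code’s Read(<pattern>) rule syntax under permissions.deny, as shown in its settings documentation. The excluded directories are hypothetical examples. merge_deny adds rules without dropping keys the file already has.

```python
import json
from pathlib import Path

# Hypothetical noisy directories: generated code, build output, vendored deps.
NOISY_PATHS = ["./build/**", "./dist/**", "./gen/**", "./third_party/**"]


def merge_deny(existing: dict, rules: list[str]) -> dict:
    """Return a copy of existing settings with rules appended to permissions.deny."""
    merged = json.loads(json.dumps(existing))  # deep copy of plain JSON data
    deny = merged.setdefault("permissions", {}).setdefault("deny", [])
    for rule in rules:
        if rule not in deny:
            deny.append(rule)
    return merged


def update_shared_settings(repo_root: str, noisy_paths: list[str] = NOISY_PATHS) -> Path:
    """Write the merged rules to the committed .claude/settings.json."""
    target = Path(repo_root) / ".claude" / "settings.json"
    existing = json.loads(target.read_text(encoding="utf-8")) if target.exists() else {}
    merged = merge_deny(existing, [f"Read({p})" for p in noisy_paths])
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(merged, indent=2) + "\n", encoding="utf-8")
    return target
```

Because the merge is idempotent, running the script twice leaves the file unchanged. That makes it safe to call from a setup script.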
Some legacy systems have directory structures that provide no help at all. In that case, a lightweight repository map is useful: a root-level markdown file listing the top-level folders, with one sentence explaining what each contains. If there are too many top-level folders, make the map hierarchical: the root only explains the highest-level structure, and subdirectories add the next layer. In smaller cases, directly @-mentioning relevant files or directories can achieve a similar effect.
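Part of drafting that map can be mechanical. List the top-level folders and leave a TODO wherever nobody has written the one-sentence summary yet. The file name REPO_MAP.md and the example descriptions are hypothetical. The summaries themselves still need a human who knows the system.

```python
from pathlib import Path

TODO = "TODO: one sentence on what lives here."


def render_map(folders: list[str], descriptions: dict[str, str]) -> str:
    """Render a one-line-per-folder markdown map, sorted by folder name."""
    lines = ["# Repository map", ""]
    for name in sorted(folders):
        lines.append(f"- {name}/: {descriptions.get(name, TODO)}")
    return "\n".join(lines) + "\n"


def top_level_folders(repo_root: str) -> list[str]:
    """Top-level directories, skipping hidden ones such as .git and .claude."""
    return [p.name for p in Path(repo_root).iterdir() if p.is_dir() and not p.name.startswith(".")]


def write_map(repo_root: str, descriptions: dict[str, str], filename: str = "REPO_MAP.md") -> Path:
    """Write the map at the repository root; the filename is a hypothetical convention."""
    target = Path(repo_root) / filename
    target.write_text(render_map(top_level_folders(repo_root), descriptions), encoding="utf-8")
    return target
```

For a hierarchical map, the same render_map can be run per subdirectory, with each output covering only the next layer down.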
But this layered approach has limits. Hundreds of thousands of folders, millions of files, non-Git version control, game engines with huge binary assets, and workflows where non-engineers also commit code all require extra setup and judgment. This is not the fairy tale where “AI coding has solved every enterprise problem.” It is about paving the road for most traditional software engineering environments first.
Configuration Expires, and Organizations Drift
There is another easy-to-miss problem: today’s useful instruction can become tomorrow’s shackle.
As models improve, old CLAUDE.md rules that compensated for model weaknesses may no longer be necessary, and may even block new models from doing better work. A rule that once said “only modify one file per refactor” may have helped early models stay under control. But when a model can already coordinate changes across files, that rule becomes ankle weights. Some Hooks and Skills follow the same pattern: they may have existed to fill a tooling gap, but once Claude Code supports that capability natively, they should retire.
A practical cadence is to review the setup every three to six months, and again after major model updates, especially if the team feels performance has stalled. This is not documentation neat-freakery. It is how you keep the scaffolding from setting into a plaster cast.
Beyond technical configuration, someone needs to own the system. Successful adoption is usually not just opening access and expecting everyone to figure it out naturally. Smoother cases start with a small group, or even one person, organizing the tools, plugins, MCP, permissions, and CLAUDE.md conventions before broader rollout. Adoption spreads when the first contact creates a positive feedback loop. If the first contact feels like fixing a printer at midnight, enthusiasm evaporates quickly.
That role often lands on developer experience or developer productivity teams. A more explicit shape is an agent manager: somewhere between PM and engineer, responsible for the Claude Code ecosystem. Without a dedicated team, the minimum is a DRI (directly responsible individual) with decision rights over configuration, permission policies, the plugin marketplace, and CLAUDE.md conventions, plus responsibility for keeping them current.
Large organizations also run into governance early: which Skills and plugins are allowed? How do you prevent thousands of engineers from reinventing the same wheel? How does AI-generated code go through the same code-review process? The steadier way to start is to define approved Skills, required review flows, and a limited initial scope, then expand as confidence grows. Bringing engineering, security, and governance in early is much easier than patching holes after adoption.
Clawd murmur:
The worst way to adopt AI coding is “free range.” It is lively at first: everyone invents a mysterious local config, writes their own prompt set, and half-builds their own plugin bundle. Three months later, the organization did not get productivity. It got a folk village, and every little hut has its own ancestral rules. The DRI’s job is not to micromanage. It is to turn the practices that actually work into shared infrastructure.
Closing
AI coding in a large codebase does not magically light up when you plug a stronger model into a bigger repository. The model is the engine, but the operating setup is the roads, signals, maps, maintenance crew, and traffic rules.
A good deployment usually looks like an organized construction site: a thin map at the entrance, warning signs around dangerous areas, machines handling fixed checks, special tools pulled out only when needed, formal entrances into internal systems, and large scouting tasks delegated before the main work begins. Finally, a DRI or agent manager keeps the whole thing from slowly rotting under model updates, team growth, and governance pressure.
The real bar for large codebases is not whether AI can write code. The real bar is whether the organization has first arranged its engineering world so an agent can understand it, move through it, and then decide, under that organization’s own tools, risks, and governance constraints, where it should actually be allowed to make changes.