
Dead End: Intel Itanium

Summary

Intel’s Itanium was the most expensive architectural failure in computing history. A joint project with Hewlett-Packard begun in 1989, the Itanium was designed to be the 64-bit architecture that replaced x86, not incrementally but completely. It introduced a radically new model, EPIC (Explicitly Parallel Instruction Computing), that moved scheduling decisions from the processor to the compiler. It shipped in 2001 after a decade of development. AMD introduced a 64-bit extension of x86 in 2003 that Intel was forced to adopt, and by 2012 Microsoft and Red Hat had dropped Itanium support from their operating systems. Intel continued making Itanium chips until 2021 for a legacy installed base that existed primarily in HP servers. The architecture’s failure illustrates what happens when theoretical elegance meets the installed base of the most successful instruction set in computing history.

The x86 Problem and the RISC Solution

By the late 1980s, Intel and the broader industry viewed x86 as a dead end. The x86 instruction set, designed in 1978 for an 8086 processor intended for embedded applications, had been extended by increments into 32-bit operation (the 80386, 1985) — accumulating complexity, irregular instructions, and architectural compromises along the way. RISC (Reduced Instruction Set Computing) processors — SPARC, MIPS, PA-RISC, IBM POWER — were cleaner, faster for scientific workloads, and easier to optimize for high-performance execution. DEC’s Alpha was achieving clock speeds x86 could not approach.

Intel was dominant in PC and server markets through x86’s ecosystem: the software base, the compiler infrastructure, the peripheral compatibility. But internally, engineers and executives believed x86 would eventually hit a wall and that a next-generation architecture was necessary. Hewlett-Packard had a similar problem with its PA-RISC architecture and had been researching EPIC concepts since 1989 under researchers Bob Rau and Josh Fisher, both pioneers of VLIW (Very Long Instruction Word) architectures; Fisher had developed the VLIW concept at Yale.

EPIC: The Compiler as Scheduler

The EPIC model inverted the conventional relationship between processor and compiler. In a conventional superscalar processor, the hardware contains complex logic to detect independent instructions, schedule them in parallel, and resolve dependencies at runtime. This hardware — the out-of-order execution unit — is expensive in silicon area and power.

EPIC’s premise was that compilers could do this work better and do it once at compile time, rather than having each processor repeat the analysis at runtime for every execution. A compiler with knowledge of the program’s full structure could determine that instruction A and instruction B could execute simultaneously, encode this in the instruction bundle, and the processor would simply execute both. The processor would be simpler and faster because it did no scheduling itself.

The idea was theoretically sound and practically treacherous. It assumed that compilers could accurately analyze parallelism in real programs — not just numerical loops but control-heavy code with unpredictable branches, pointer aliasing, and runtime dependencies. It assumed that the compiler’s static analysis could predict what the hardware would need to do. In practice, real-world programs defeated these assumptions constantly.
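
As a rough illustration of the compile-time scheduling described above, the following Python sketch packs a straight-line instruction sequence into bundles of mutually independent instructions. It is a toy model under stated assumptions, not the IA-64 encoding: the `Instr` type, the register names, and the greedy in-order packing rule are all invented for illustration, and only register dependences are checked. The three-slot bundle width does match the IA-64 bundle size.

```python
from dataclasses import dataclass

BUNDLE_WIDTH = 3  # IA-64 bundles carry three instruction slots


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction: one destination register, some sources."""
    name: str
    dest: str
    srcs: tuple


def depends_on(later, earlier):
    """True if `later` must not issue alongside `earlier` (register hazards only)."""
    return (
        earlier.dest in later.srcs      # read-after-write
        or later.dest in earlier.srcs   # write-after-read (treated conservatively)
        or later.dest == earlier.dest   # write-after-write
    )


def schedule(instrs):
    """Greedy static scheduler that keeps program order.

    An instruction joins the current bundle only if the bundle has a free
    slot and the instruction conflicts with nothing already in it;
    otherwise a new bundle is started.
    """
    bundles = [[]]
    for ins in instrs:
        current = bundles[-1]
        conflict = any(depends_on(ins, prev) for prev in current)
        if conflict or len(current) == BUNDLE_WIDTH:
            bundles.append([ins])
        else:
            current.append(ins)
    return [[i.name for i in b] for b in bundles if b]


program = [
    Instr("a", "r1", ("r10", "r11")),  # r1 = r10 op r11
    Instr("b", "r2", ("r12", "r13")),  # independent of a
    Instr("c", "r3", ("r1", "r2")),    # needs the results of a and b
    Instr("d", "r4", ("r14",)),        # independent of c
]

print(schedule(program))  # [['a', 'b'], ['c', 'd']]
```

Straight-line register code like this is the easy case. The sketch has no answer for a load and a store whose addresses are unknown at compile time, or for a branch whose direction depends on input, which is where the assumptions above break down.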

The Performance Disaster

The first Itanium processor, code-named “Merced,” shipped in June 2001 after slipping from 1998 and then from 2000. Its performance was poor: slower than the Intel Xeon x86 server processor on integer workloads, and competitive with SPARC and POWER only on scientific floating-point applications where compiler analysis was tractable. HP’s compiler team, which had spent years developing the HP-UX Itanium compiler, produced the best Itanium performance numbers; other compiler vendors never caught up.

Native code performance was acceptable for specific workloads. The problem was x86 compatibility. Early Itanium processors included hardware support for executing x86 instructions, which Intel later supplemented with a software emulator called the IA-32 Execution Layer. Running x86 code on Itanium was dramatically slower than running the same code on an x86 processor, in some benchmarks by a factor of 5 to 10. Since most software was x86 software, this made Itanium effectively useless for general-purpose server work: you could not run existing Windows applications efficiently, could not run most Linux applications without recompiling them for IA-64, and could not access the enormous body of compiled x86 code that formed the server software ecosystem.

Itanium Jokes

The processor acquired an informal name among engineers: “Itanic,” after the ship that was unsinkable until it wasn’t. Intel’s official response to questions about x86 emulation performance was to encourage users to recompile their software from source for IA-64. In an era when most server software was distributed as binaries, this was not a practical recommendation.

AMD64 and the Reversal

In April 2003, AMD shipped the Opteron processor implementing AMD64 — a 64-bit extension of the x86 instruction set that was fully backward compatible with 32-bit x86 software. The Opteron ran existing x86 binaries at full speed and ran 64-bit code when recompiled. Its memory bandwidth and multi-processor scaling were excellent for server workloads.

Microsoft announced Windows XP x64 Edition for AMD64. Linux distributions added x86_64 support. The software ecosystem aligned to AMD64 rather than Itanium with remarkable speed. Intel was forced to adopt AMD’s 64-bit extension under license, calling it EM64T (Extended Memory 64 Technology), and shipped it in Xeon and Pentium 4 processors in 2004. Intel had spent fifteen years and billions of dollars developing a 64-bit architecture. AMD negated it with a 64-bit extension that cost a fraction as much and was backward compatible.

The HP-Intel Partnership

The Itanium project was unusual in the semiconductor industry: two large companies co-designed the same processor from scratch. HP’s contribution was the PA-RISC compiler team’s expertise in EPIC scheduling and its extensive enterprise Unix customer base. Intel’s contribution was advanced CMOS fabrication and the capital required to bring a new architecture to market. The partnership was formally announced in June 1994 under the project codename “IA-64” (Intel Architecture 64-bit).

HP assigned over 1,000 engineers to Itanium compiler development — one of the largest compiler teams ever assembled for a single architecture. The compiler had to perform an analysis that no previous production compiler had attempted at scale: taking ordinary C, C++, and Fortran programs and automatically scheduling their instructions for parallel execution across Itanium’s instruction bundles. The HP compiler was good; it produced the best Itanium performance numbers in the industry. The problem was that “best Itanium performance” still compared unfavorably to the x86 processors Intel was simultaneously improving.

The management of the partnership was contentious. HP had expected Itanium to replace PA-RISC on a timeline compatible with HP’s enterprise sales cycles — meaning servers in production by 1998. Intel’s fabrication schedules slipped. The first Merced chip arrived in 2001, three years late, and at a process node (0.18 micron) that was already behind Intel’s x86 roadmap. Engineers from both companies who worked on the project described the partnership as a difficult marriage between two cultures that neither fully merged nor fully separated.

The Compiler Challenge in Practice

The EPIC model failed empirically, not theoretically. The compiler’s ability to find parallelism depends on proving that two operations are independent: that they do not read or write the same memory, that their results do not depend on each other, and that neither depends on a branch outcome. For numerical loops over arrays with static bounds, this analysis is tractable. For the general case of enterprise software (pointer-heavy object-oriented code, dynamic dispatch, external library calls) it is not.

Pointer aliasing was the central problem. In C and C++, two pointer variables may point to the same memory location. A compiler that does not know whether *p and *q are aliased cannot schedule operations on them in parallel; doing so would produce incorrect results if they overlap. Detecting aliasing statically requires whole-program analysis that was computationally expensive and incomplete for C’s memory model.
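
The hazard can be shown with a minimal Python sketch that models memory as a list and the pointers `p` and `q` as indices into it. The function names and the two-cell memory are hypothetical; the point is only that hoisting a load above a store is safe when the two addresses differ and wrong when they coincide.

```python
def store_then_load(mem, p, q):
    """Program order: *p = 1; x = *q."""
    mem = list(mem)  # work on a copy so calls do not affect each other
    mem[p] = 1
    return mem[q]


def load_then_store(mem, p, q):
    """Reordered: the load from *q is hoisted above the store to *p."""
    mem = list(mem)
    x = mem[q]
    mem[p] = 1
    return x


memory = [0, 0]

# p and q point to different cells: the reordering is invisible.
print(store_then_load(memory, 0, 1), load_then_store(memory, 0, 1))  # 0 0

# p and q alias the same cell: the reordered version reads a stale value.
print(store_then_load(memory, 0, 0), load_then_store(memory, 0, 0))  # 1 0
```

A static scheduler that cannot rule out the second case has to keep the original order, which forfeits exactly the parallelism EPIC depended on.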

Dynamic dispatch compounded this. An object method call in C++ or Java does not have a statically known target; the specific function to call is determined at runtime. The compiler cannot parallelize across a virtual function call without knowing what code will execute. Enterprise Java applications — which were supposed to be a major Itanium workload through J2EE servers — made constant use of virtual dispatch.

HP’s solution was to rely heavily on profiling: running programs once to gather execution statistics, then recompiling with those statistics to guide scheduling decisions. This worked for programs that were run frequently with predictable inputs (database servers, transaction processing applications) and failed for general-purpose code. It required a workflow — profile, recompile, deploy — that most developers were unwilling to adopt.
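
The profile, recompile, deploy loop can be sketched in a few lines of Python. This is a toy model, not HP's toolchain: the branch condition, the input lists, and the function names are all assumptions made for illustration. The "compiler" picks whichever arm of a branch the profiling run saw more often, and the hit rate measures how often that static choice is right on later inputs.

```python
def branch_taken(x):
    """Hypothetical branch condition in the program being compiled."""
    return x >= 0


def profile(inputs):
    """Profiling run: count taken and not-taken outcomes on training inputs."""
    taken = sum(1 for x in inputs if branch_taken(x))
    return taken, len(inputs) - taken


def choose_static_path(inputs):
    """'Recompile' step: favor the arm the profile saw more often."""
    taken, not_taken = profile(inputs)
    return taken >= not_taken  # True means the taken arm is scheduled as the fast path


def hit_rate(favored_taken, inputs):
    """Fraction of deployment inputs for which the static choice was right."""
    hits = sum(1 for x in inputs if branch_taken(x) == favored_taken)
    return hits / len(inputs)


training = [5, 3, 8, 1, -2, 7, 9, 4]
favored = choose_static_path(training)

similar = [6, 2, 9, -1, 3, 8, 7, 5]          # resembles the training run
different = [-4, -6, 2, -9, -1, -3, 5, -7]   # does not

print(hit_rate(favored, similar))    # 0.875
print(hit_rate(favored, different))  # 0.25
```

When deployment inputs resemble the profiling run, the static choice holds up; when they do not, the schedule baked in at compile time is wrong most of the time, and nothing at runtime corrects it.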

Windows on Itanium

Microsoft invested substantially in Itanium, producing Windows 2000 Advanced Server Limited Edition (2001) and Windows XP 64-bit Edition (2003) for IA-64. The enterprise target was clear: Windows on Itanium would be Microsoft’s answer to HP-UX and Solaris for high-end server workloads.

Windows on Itanium was slow for x86 applications (running under the IA-32 Execution Layer) and had an incomplete native application ecosystem. Microsoft’s own SQL Server and Exchange Server were available in Itanium-native versions, but most third-party Windows software existed only as x86 binaries. SQL Server benchmarks on Itanium were competitive for specific database workloads with large memory configurations — Itanium’s large physical address space (up to 1 TB, versus x86’s 32-bit limit of 4 GB before PAE extensions) was genuinely useful for in-memory database operations.
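
The address-space figures above follow directly from the number of address bits. A small Python check, using binary units (1 GB taken as 2^30 bytes, 1 TB as 2^40 bytes) and adding the 36-bit physical addressing of the original PAE extension as a comparison point that the text does not state:

```python
GIB = 2 ** 30
TIB = 2 ** 40


def addressable_bytes(address_bits):
    """Bytes reachable with the given number of address bits."""
    return 2 ** address_bits


print(addressable_bytes(32) // GIB)  # 4   (plain 32-bit x86: 4 GB)
print(addressable_bytes(36) // GIB)  # 64  (36-bit PAE: 64 GB)
print(addressable_bytes(40) // TIB)  # 1   (40 bits are enough for 1 TB)
```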

The AMD64 extension eliminated Itanium’s memory addressing advantage. AMD64 supported 64-bit virtual addresses on x86, reaching the same large memory configurations. Windows XP x64 Edition and Windows Server 2003 x64 Edition, released in 2005, brought 64-bit capability to the standard x86 server ecosystem. The performance and compatibility case for Windows on Itanium collapsed.

Later Generations: Montecito and Tukwila

Intel continued Itanium development through several generations after Merced’s disappointing launch:

  • McKinley (2002): Significantly improved performance over Merced; Intel’s own benchmarks showed competitive floating-point performance with IBM POWER4.
  • Madison (2003): Moved to 0.13 micron process; doubled the L3 cache; competitive with POWER4+ for technical computing.
  • Montecito (2006): The most capable generation to that point; dual-core; 1.72 billion transistors; 24 MB L3 cache; genuinely competitive with AMD Opteron and IBM POWER5+ for specific server workloads. Financial services customers with HP-UX deployments found Montecito acceptable for their existing workloads.
  • Tukwila (2010): Quad-core; 2 billion transistors; the fastest Itanium generation, but arriving in a market where the decision to not adopt Itanium had been made years earlier.
  • Poulson (2012): 8-core; new microarchitecture; Intel’s last major Itanium development investment.
  • Kittson (2017): A refresh of Poulson; effectively the last Itanium generation, continuing production through July 2021.

Montecito and Tukwila demonstrated that the EPIC architecture, given sufficient transistor budget, could produce competitive performance. They were too late. By 2006, the software ecosystem had re-aligned to AMD64/x86_64. Every new server deployment decision was made on the assumption of x86_64 compatibility; Itanium’s superior performance for HP-UX workloads was insufficient justification for betting a new application deployment on a non-standard architecture.

The Long Death

HP remained committed to Itanium for its HP-UX and NonStop server lines, and HP-UX on Itanium maintained a loyal installed base in financial services and telecommunications. Elsewhere, support fell away. Oracle announced in 2011 that it would stop developing new software releases for Itanium; HP sued and won a court ruling requiring Oracle to continue supporting the platform. Microsoft dropped Itanium support in Windows Server 2012. Red Hat dropped RHEL Itanium support in 2012. Intel continued producing Itanium chips, with diminishing investment and a shrinking customer base, until the final generation, the Poulson-derived Kittson, ended production in July 2021.

Hewlett Packard Enterprise (which split from HP in 2015) set the end of life for HP-UX and its Itanium-based systems for December 2025. When the last HP Itanium server goes dark, the architecture will be gone without successors, having cost Intel and HP an estimated $10 billion or more in combined development spending.

Dead End: When Compiler Theory Meets the Installed Base

The Itanium failure encodes a lesson about architectural transitions: the installed base of the incumbent architecture is not a constraint that better technology can overcome; it is the technology. x86’s 1978 instruction set was technically inferior to RISC architectures and far inferior to EPIC’s theoretical elegance. But x86 had decades of software investment, compiler optimization, hardware implementation experience, and peripheral ecosystem. Every cycle Intel put into Itanium was a cycle not put into making x86 faster, yet Intel’s x86 engineers still built out-of-order execution engines (the Pentium Pro, and later the Core architecture) that matched or exceeded the RISC competition.

The compiler-centric model failed for a second reason: hardware became better at hiding latency through larger caches, better branch predictors, and speculative execution. The transistor budget that EPIC’s simpler processor was meant to free up was instead spent on exactly these dynamic optimization techniques. By the time Itanium shipped, the hardware it was designed to replace had solved the problem EPIC was designed to address.

