Papers and technical articles from Weyl, our internal AI lab producing the R&D that powers Fleek.
C++23 std::mdspan meets CUTLASS CuTe layouts
One header. Zero cost. 26 theorems. 0 sorry. A zero-overhead bridge between C++23 std::mdspan and CUTLASS CuTe layouts, with a Lean 4 formalization of NVIDIA's layout algebra.
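The correspondence the bridge exploits is easy to state: an mdspan layout mapping and a CuTe layout both send a coordinate to a linear offset by a dot product with strides. A minimal sketch of that shared contract in standard C++23 (the `CuteStyleLayout` type is illustrative, not the bridge's actual API):

```cpp
#include <array>
#include <cstddef>
#include <mdspan>
#include <print>

// Illustrative CuTe-style layout: offset(coord) = sum of coord[i] * stride[i].
template <std::size_t Rank>
struct CuteStyleLayout {
    std::array<std::size_t, Rank> shape;
    std::array<std::size_t, Rank> stride;

    constexpr std::size_t operator()(std::array<std::size_t, Rank> coord) const {
        std::size_t offset = 0;
        for (std::size_t i = 0; i < Rank; ++i) offset += coord[i] * stride[i];
        return offset;
    }
};

int main() {
    // Column-major 4x8: shape (4,8), stride (1,4).
    constexpr CuteStyleLayout<2> layout{{4, 8}, {1, 4}};

    // The same mapping expressed as a std::layout_stride mdspan.
    std::array<float, 32> buf{};
    std::layout_stride::mapping<std::extents<std::size_t, 4, 8>> map{
        std::extents<std::size_t, 4, 8>{}, std::array<std::size_t, 2>{1, 4}};
    std::mdspan md{buf.data(), map};

    // Both agree where coordinate (2,3) lands: 2*1 + 3*4 = 14.
    md[2, 3] = 1.5f;
    std::println("offsets: {} and {}", layout({2, 3}), map(2, 3));     // 14 and 14
    std::println("value via layout offset: {}", buf[layout({2, 3})]); // 1.5
}
```

Both objects describe the same column-major layout, written (4,8):(1,4) in CuTe notation; the offset computations are constant-foldable, which is where a zero-cost bridge becomes plausible.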
Build containers with nix2gpu that run on any GPU market
Introducing nix2gpu, a tool that builds containers capable of running on any GPU market. It leverages Nix's reproducibility to create portable GPU workloads that run seamlessly across infrastructure providers.
Meet Nimi
A tiny process manager that takes NixOS 25.11's modular services spec and runs it anywhere you need. Nimi reads JSON configuration, launches services with clean environments, streams logs to console, and handles shutdown and restart policies consistently.
The Operating System of the Drone War and The UTF-8 of AI
Part 1 of the series. Constraints dominate resources, and the lattice doesn't negotiate. Explores NVFP4, infrastructure, DeepSeek, CUDA, Nix, embedded AI, and quantization.
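On the quantization thread: NVFP4 stores 4-bit E2M1 elements with a shared scale per small block. A rough sketch of that arithmetic, assuming the commonly described parameters (E2M1 magnitudes {0, 0.5, 1, 1.5, 2, 3, 4, 6}, 16-element blocks); the float-valued scale and naive rounding are simplifications, since deployments store the scale in FP8 (E4M3), and `quantize_e2m1` / `fake_quantize_block` are hypothetical names:

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <print>
#include <span>
#include <vector>

// The eight non-negative magnitudes representable in E2M1.
constexpr std::array<float, 8> kE2M1 = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};

// Round x to the nearest E2M1 value, keeping the sign.
// (Ties round toward the smaller magnitude; hardware rounding differs.)
float quantize_e2m1(float x) {
    float best = kE2M1[0];
    for (float v : kE2M1)
        if (std::abs(std::abs(x) - v) < std::abs(std::abs(x) - best)) best = v;
    return std::copysign(best, x);
}

// Quantize one 16-element block: choose a scale so the block maximum
// maps onto the largest magnitude (6.0), quantize, then dequantize.
std::vector<float> fake_quantize_block(std::span<const float, 16> block) {
    float amax = 0.0f;
    for (float x : block) amax = std::max(amax, std::abs(x));
    const float scale = (amax > 0.0f) ? amax / 6.0f : 1.0f;
    std::vector<float> out;
    for (float x : block) out.push_back(quantize_e2m1(x / scale) * scale);
    return out;
}

int main() {
    std::array<float, 16> block{};
    for (int i = 0; i < 16; ++i) block[i] = 0.1f * static_cast<float>(i);
    auto deq = fake_quantize_block(block);
    std::println("{} -> {}", block[7], deq[7]);  // 0.7 snaps to 0.75 on this grid
}
```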
Jensen's Razor and the malevolent combinatorics of CUDA architecture
Encoding NVIDIA's theorems as types through Gibson's lens. A comprehensive series on CUDA architecture and tensor cores, examining the mathematical foundations of GPU computing.
Layouts, Coordinate Spaces, and the CuTe Contract
The tensor core at the center of the Gothic folly. Examining layouts and coordinate spaces in CUTLASS's CuTe library through formal methods.
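For the flat (non-nested) case the core object is small enough to state outright: a layout is a shape/stride pair, and it maps a multi-dimensional coordinate to a linear index by dot product. A minimal Lean 4 sketch under that assumption (`Layout.index` is an illustrative name; CuTe's layouts are in general nested trees, which this omits):

```lean
-- A flat layout: shape and stride lists of equal length.
structure Layout where
  shape  : List Nat
  stride : List Nat

-- index(c) = Σ cᵢ · dᵢ, the dot product of coordinate and stride.
def Layout.index (L : Layout) (coord : List Nat) : Nat :=
  (List.zipWith (· * ·) coord L.stride).foldl (· + ·) 0

-- The column-major 4×8 layout (4,8):(1,4) sends (2,3) to 2·1 + 3·4 = 14.
#eval Layout.index ⟨[4, 8], [1, 4]⟩ [2, 3]  -- 14
```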
Coalescence, Noetherian Reduction, and Why the Gothic Folly Terminates
Examining memory-coalescence patterns in GPU architectures and proving, via Noetherian reduction, that layout coalescing terminates.
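The termination argument is concrete: each coalescing step fuses two adjacent contiguous modes into one, so the mode list strictly shrinks and the rewrite relation is well-founded. A C++ sketch assuming the usual contiguity condition stride[i+1] = shape[i] · stride[i] (CuTe's actual coalesce also drops size-1 modes and fixes an ordering, which this omits):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

using Mode = std::pair<std::size_t, std::size_t>;  // (shape, stride)

// Fuse adjacent modes while any pair is contiguous. Each successful pass
// erases one mode, so the loop is Noetherian: it terminates because the
// list length strictly decreases on every change.
std::vector<Mode> coalesce(std::vector<Mode> modes) {
    bool changed = true;
    while (changed) {
        changed = false;
        for (std::size_t i = 0; i + 1 < modes.size(); ++i) {
            auto [s0, d0] = modes[i];
            auto [s1, d1] = modes[i + 1];
            if (d1 == s0 * d0) {                    // contiguous: fuse
                modes[i] = {s0 * s1, d0};
                modes.erase(modes.begin() + i + 1);
                changed = true;
                break;
            }
        }
    }
    return modes;
}
```

For example, the column-major layout (4,8):(1,4) satisfies 4 = 4·1, so its two modes fuse into the single mode 32:1.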
Complementation, the FTTC, and the Holes in Your Iteration Space
The theorem that should terrify you. Exploring complementation and the Fundamental Theorem of Tensor Contraction in iteration space analysis.
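For orientation, the complement property as it is commonly stated in CuTe expositions (informal, and not necessarily the article's exact theorem): under the appropriate admissibility conditions, the complement of an injective layout A with respect to a size M is the layout A* that makes the concatenation (A, A*) a bijection onto the whole interval,

```latex
\[
  (A,\, A^{*}) : \bigl[0,\ \mathrm{size}(A)\cdot\mathrm{size}(A^{*})\bigr)
  \;\xrightarrow{\;\sim\;}\; [0, M),
\]
```

so A* enumerates exactly the offsets A misses: the holes in your iteration space.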
Composition, the Tensor Core Cathedral, and Jensen's Razor
Never attribute to search what can be proven by construction. The culmination of the series examining composition and the razorgirl theorem.
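The functional contract being composed is one line; what the algebra supplies is a closed-form shape/stride representation of it. Stated informally (notation assumed here, not quoted from the article):

```latex
\[
  (A \circ B)(c) \;=\; A\bigl(B(c)\bigr)
  \qquad \text{for every coordinate } c \text{ in the domain of } B,
\]
```

and the razor is that the shape and strides of A ∘ B are computed by construction from those of A and B, never found by search.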
The Lattice Hypothesis
Proposing that neural networks fundamentally operate on discrete floating-point lattices rather than as continuous functions on ℝⁿ. Traditional continuous analysis is reconceptualized as the approximation, with the discrete lattice structure being the computational reality.
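The lattice is directly observable: adjacent representable fp32 values are unevenly spaced, and nothing exists between them. A minimal illustration with std::nextafter:

```cpp
#include <cmath>
#include <print>

int main() {
    float a = 1.0f;
    float b = 1024.0f;
    // std::nextafter yields the adjacent fp32 lattice point, so the
    // difference is the local lattice spacing (one ULP).
    std::println("gap near 1.0:    {}", std::nextafter(a, 2.0f) - a);     // 2^-23 ≈ 1.19e-07
    std::println("gap near 1024.0: {}", std::nextafter(b, 2048.0f) - b);  // 2^-13 ≈ 1.22e-04
}
```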
The Hallway Hypothesis
Examining how constraints that reduce wrong moves matter more than constraints that reduce total moves. Analysis of phenomena like LoRA effectiveness and quantization failure modes reveals the geometry of transformation corridors in neural network pipelines.
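For reference, the standard LoRA update (Hu et al.), which is background for the article rather than its own result: the frozen weight W is only ever corrected through a rank-r corridor,

```latex
\[
  W' \;=\; W + \tfrac{\alpha}{r}\, B A,
  \qquad W \in \mathbb{R}^{d \times k},\;
  B \in \mathbb{R}^{d \times r},\;
  A \in \mathbb{R}^{r \times k},\;
  r \ll \min(d, k),
\]
```

a constraint that restricts which directions an update can take rather than how far it can go.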
The Landauer Hypothesis
Treating precision not as a hyperparameter to optimize but as a physical quantity to measure. Explores how the thermodynamic cost of bit erasure constrains low-precision quantization schemes, with implications for NVFP4 deployment at the edge.
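The physical floor in question is Landauer's bound: erasing one bit at temperature T dissipates at least

```latex
\[
  E_{\min} \;=\; k_B T \ln 2 \;\approx\; 2.9 \times 10^{-21}\ \mathrm{J}
  \quad \text{at } T = 300\,\mathrm{K},
\]
```

which turns "how many bits does this layer actually need?" into a question with a physical unit attached.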