I Don't Know ML. Claude Does. 0.871 F1 on Predicting Linux Game Compatibility.
I’m a software engineer who knows ML at a surface level — a few papers read (never understood), zero serious ML projects shipped. Over two weeks, I built a system that predicts Linux game compatibility, achieving 0.871 F1 on 350K+ community reports. The key innovation came from a statistical technique I’d never heard of before Claude suggested it.
This post is two stories woven together: how the project works, and how a human-AI research loop made it possible.
The problem
Linux gaming has gotten remarkably good thanks to Valve’s Proton compatibility layer, but “will this game work on my hardware?” remains a frustrating question.
Why it’s hard to answer
Proton translates Windows API calls (DirectX, Win32) to Linux equivalents (Vulkan, native APIs). Whether this works depends on a messy combination of factors:
- Game engine — Unity, Unreal, proprietary engines all have different compatibility profiles
- Graphics API — DirectX 9 vs 11 vs 12 vs Vulkan native, each with different translation paths
- DRM and anticheat — kernel-level anticheat like EAC or BattlEye may or may not support Proton
- GPU vendor and driver — AMD’s open-source Mesa stack vs NVIDIA’s proprietary driver behave very differently
- Proton version — official, GE-Proton (community patches), or experimental
- Kernel version — newer kernels bring driver fixes and regressions alike
- Media codecs, launchers, overlays — can fail for reasons entirely unrelated to the game itself
A game that works flawlessly on AMD with GE-Proton might crash on NVIDIA with the same Proton version because of a driver-specific shader compilation bug. There’s no deterministic compatibility matrix.
ProtonDB: crowdsourced signal
ProtonDB tries to solve this with crowdsourcing. Users report whether a game is borked (broken), needs tinkering (requires tweaks), or works out of the box, along with their system info and notes. There are 350K+ such reports for 31K games.
The catch: these labels are subjective. One user says “I chose GE-Proton instead of the default — tinkering.” Another does the same and says “works out of the box.” This isn’t random noise — it’s systematic disagreement about what “tinkering” means, and it poisons any model you train on this data.
What I built
A two-stage cascade classifier that predicts game compatibility per (game, hardware) pair.
The full pipeline:
The system ingests ProtonDB data dumps, enriches them from Steam, PCGamingWiki, and other sources, trains LightGBM models with 123 features, and serves predictions via a FastAPI endpoint.
But the interesting part isn’t the architecture. It’s the research journey that got me from 0.593 F1 to 0.871 — a 47% relative improvement — and how Claude guided that journey.
The research loop
I started this project by opening Claude Code and saying something like “let’s plan a simple web API that predicts game compatibility from ProtonDB data.” From that point, a pattern emerged that repeated over 7,000+ messages across two weeks:
Claude didn’t just advise — it wrote all the code. The entire project — worker, preprocessing pipeline, LLM normalization, ML training, FastAPI server, CLI — every line was written by Claude.
Beyond implementation, it also conducted research into external data sources (Steam API, PCGamingWiki, AreWeAntiCheatYet, Steam PICS bulk protocol), investigated which features could be extracted, and ran feature importance analysis. My role was direction, domain judgment, and quality control — not writing code.
Phase 1: The baseline and the ceiling (F1 = 0.555 → 0.593)
The first model was straightforward — a single 3-class LightGBM on hardware features plus game metadata.
F1 = 0.555. The model was essentially guessing on works_oob — 56% error rate, with most samples misclassified as tinkering. borked was somewhat separable, but tinkering and works_oob were a single smeared blob.
Claude analyzed the error clustering and proposed splitting the problem: use a two-stage cascade — first ask “is it broken?”, then for non-broken games ask “does it need tweaking?” This restructuring took F1 from 0.555 to 0.593. A modest gain, but the right framing.
Phase 2: Feature archaeology (F1 = 0.593 → 0.724)
Claude proposed mining untapped data fields. ProtonDB reports contain free-text notes — “had to set PROTON_USE_WINED3D=1” or “just hit play and it worked perfectly.” We hadn’t touched this data.
The breakthrough came in layers:
- Text features — keyword detection, sentiment, note lengths: +0.099 F1
- Game metadata enrichment — engine type, DRM, anticheat from PCGamingWiki/Steam: +0.024 F1
- SVD embeddings — factorizing a GPU-family × game co-occurrence matrix into dense vectors: +0.010 F1
- Rule-based relabeling — if a tinkering report mentions zero actual tweaks, relabel it as works_oob: +0.010 F1
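The text features in that first bullet are nothing exotic: regex detectors over the free-text notes. A sketch of what they might look like; these pattern lists are illustrative guesses, not the project’s actual rules.

```python
import re

# Hypothetical keyword detectors over a report's free-text notes.
PATTERNS = {
    "mentions_crash": re.compile(r"\b(crash(es|ed)?|freez\w*|hang(s|ed)?)\b", re.I),
    "mentions_fix": re.compile(r"\b(fix(ed|es)?|workaround|tweak\w*)\b", re.I),
    "mentions_perfect": re.compile(r"\b(perfect(ly)?|flawless\w*|out of the box)\b", re.I),
    "mentions_env_var": re.compile(r"\b[A-Z][A-Z0-9_]+=\S+"),
}

def text_features(note: str) -> dict:
    """Binary keyword flags plus a raw length feature."""
    feats = {name: int(bool(p.search(note))) for name, p in PATTERNS.items()}
    feats["note_length"] = len(note)
    return feats

f = text_features("Had to set PROTON_USE_WINED3D=1, then it ran flawlessly.")
```

Crude as it looks, this family of features delivered the single largest feature-engineering gain in the project (+0.099 F1 as a group).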
The SVD embeddings idea was particularly interesting — I wouldn’t have thought to treat hardware-game compatibility as a collaborative filtering problem, but Claude framed it that way and it worked.
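The collaborative-filtering framing can be sketched with a plain truncated SVD over a toy co-occurrence matrix. The counts below and the 2-dimensional embedding are purely illustrative; the project used 16- and 20-dimensional embeddings over much larger matrices.

```python
import numpy as np

# Toy GPU-family x game co-occurrence counts: rows = GPU families,
# columns = games, values = number of reports (made-up numbers).
cooc = np.array([
    [40,  2,  9, 0],
    [38,  1, 11, 1],
    [ 3, 25,  4, 9],
    [ 2, 30,  2, 8],
], dtype=float)

# Dampen raw counts, then factorize with a truncated SVD.
M = np.log1p(cooc)
U, S, Vt = np.linalg.svd(M, full_matrices=False)
k = 2                                      # embedding dimension
gpu_emb = U[:, :k] * S[:k]                 # one dense vector per GPU family
game_emb = (Vt[:k, :] * S[:k, None]).T     # one dense vector per game

# Families 0 and 1 have similar report patterns, so their embeddings
# should land close together; family 2 should be farther away.
d01 = np.linalg.norm(gpu_emb[0] - gpu_emb[1])
d02 = np.linalg.norm(gpu_emb[0] - gpu_emb[2])
```

The dense vectors then become ordinary tabular features for LightGBM, letting the model generalize to (game, GPU) pairs it has never seen a report for — the same trick recommender systems use for unseen (user, item) pairs.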
LLM as a preprocessing tool (the part that didn't work)
I invested significant effort into using LLMs for data cleaning. Claude built the entire infrastructure — CLI commands, batch processing, multiple backends. But the honest result: none of it improved ML metrics.
GPU/CPU normalization. ~35K unique GPU strings. We built an LLM pipeline to normalize them, but a heuristic normalizer based on regex covers 99.7% of data with no hallucinations. The LLM hallucinated numbers. Production uses the heuristic.
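A sketch of what such a heuristic normalizer might look like. These three rules are illustrative only; the real rule set that covers 99.7% of the ~35K strings is far larger.

```python
import re

# Each rule: (pattern over the raw driver string, canonical-name builder).
RULES = [
    (re.compile(r"(geforce\s*)?(rtx|gtx)\s*(\d{3,4})", re.I),
     lambda m: f"nvidia_{m.group(2).lower()}_{m.group(3)}"),
    (re.compile(r"radeon\s*rx\s*(\d{3,4})", re.I),
     lambda m: f"amd_rx_{m.group(1)}"),
    (re.compile(r"\bintel\b.*\b(uhd|iris)\b", re.I),
     lambda m: f"intel_{m.group(1).lower()}"),
]

def normalize_gpu(raw: str) -> str:
    """First matching rule wins; unmatched strings fall through."""
    for pattern, build in RULES:
        m = pattern.search(raw)
        if m:
            return build(m)
    return "unknown"

n1 = normalize_gpu("NVIDIA GeForce RTX 3070/PCIe/SSE2")
n2 = normalize_gpu("AMD Radeon RX 6800 XT (RADV NAVI21)")
```

The appeal over an LLM is exactly what the paragraph says: a regex either matches or it doesn’t, so it can never hallucinate a model number that isn’t in the input.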
Launch options parsing. ~16K unique Steam launch option strings → structured JSON. The pipeline works, but was never integrated. Per-game aggregate features come from simpler regex extraction.
Text extraction. A three-layer pipeline for free-text reports. Architecture complete — prompts, schemas, validation. But the extracted_data table has 0 records. The text features that did improve F1 (+0.099) are simple keyword regexes — no LLM involved.
LLM verdict inference. We tried having the LLM directly classify reports. Result: −0.001 F1. IRT solves this better.
Phase 3: The wall (F1 = 0.724)
Then we hit a wall. Over Phases 9-11, we tried 14+ techniques — focal loss, ordinal regression, knowledge distillation, CatBoost, XGBoost, stacking ensembles, Node2Vec graph embeddings, target encoding — and every single one was neutral or negative.
This is the part that would have killed the project if I were working alone. When you’ve tried everything you know and nothing works, the natural instinct is to give up or start over.
I asked Claude to generate visualizations — t-SNE, UMAP, confusion matrices, calibration curves. Claude not only produced the plots but analyzed them itself — identifying that errors lived almost entirely on the tinkering↔works_oob boundary and were concentrated around specific (game, GPU family) pairs.
The conversation shifted from “what feature should we add next?” to “what’s the nature of this noise?”
Phase 4: The IRT breakthrough (F1 = 0.724 → 0.771)
This is the moment that made the project.
I had a hypothesis: the noise isn’t random — it’s annotator bias. Different people have genuinely different thresholds for “tinkering.” I asked Claude for a technique that could decompose subjective labels into annotator bias and ground truth.
Claude recognized this as a known problem in psychometrics and proposed Item Response Theory. The core idea: decompose each label into two components:
- θ (strictness): how strict the annotator is (their personal threshold)
- d (difficulty): how objectively difficult the game actually is to run

The formula, in 1PL (Rasch) form:

P(report says “tinkering”) = σ(θ_annotator + d_game)

where σ is the logistic function.
A strict annotator (high θ) is more likely to say “tinkering” for the same game that a lenient annotator (low θ) would call “works out of the box.”
Once you have θ and d, you can:
- Use them as features — the model knows “this report comes from a strict annotator”
- Relabel intelligently — if a very strict annotator (high θ) says tinkering but there’s no evidence of actual tweaks, confidently relabel it as works_oob
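A minimal sketch of fitting the 1PL model by gradient ascent on toy reports (label 1 = tinkering, 0 = works out of the box; annotator 0 is strict, annotator 2 lenient, game 1 harder than game 0). Both the optimizer and the data are illustrative, not the project’s implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_1pl(annot, game, y, n_annot, n_games, lr=0.5, steps=500):
    """Fit P(says 'tinkering') = sigmoid(theta[annotator] + d[game])
    by gradient ascent on the Bernoulli log-likelihood."""
    theta = np.zeros(n_annot)   # annotator strictness
    d = np.zeros(n_games)       # game difficulty
    cnt_a = np.bincount(annot, minlength=n_annot).clip(min=1)
    cnt_g = np.bincount(game, minlength=n_games).clip(min=1)
    for _ in range(steps):
        resid = y - sigmoid(theta[annot] + d[game])  # observed - predicted
        g_a = np.zeros(n_annot); np.add.at(g_a, annot, resid)
        g_g = np.zeros(n_games); np.add.at(g_g, game, resid)
        theta += lr * g_a / cnt_a
        d += lr * g_g / cnt_g
        theta -= theta.mean()   # fix the shift ambiguity in theta + d
    return theta, d

# Toy reports: annotator 0 always says tinkering, 2 never does,
# 1 only on the harder game (index 1).
annot = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
game  = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
y     = np.array([1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0], dtype=float)
theta, d = fit_1pl(annot, game, y, 3, 2)
```

After fitting, the recovered θ ranks the annotators by strictness and d ranks the games by difficulty, which is exactly what the relabeling rule and the two new features consume.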
Phase 5: Aggregation (F1 = 0.771 → 0.871)
The final insight was almost embarrassingly simple: users don’t ask “what does this one report predict?” — they ask “will this game work?”
If you have 15 reports for a game and your model gets 12 right, majority vote gives you the correct answer even though per-report accuracy is imperfect.
Per-game majority voting boosted F1 from 0.780 to 0.871 — a +0.091 improvement, the largest absolute gain in the project. Individual errors cancel out.
Per-report model: F1 = 0.780. Per-game majority vote: F1 = 0.871.
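The voting step itself is only a few lines. The game ids and verdicts below are made up; the point is just that one wrong per-report prediction gets outvoted.

```python
from collections import Counter, defaultdict

# Hypothetical per-report model predictions, keyed by game id.
per_report = [
    ("half_life_2", "works_oob"), ("half_life_2", "works_oob"),
    ("half_life_2", "tinkering"), ("half_life_2", "works_oob"),
    ("apex",        "borked"),    ("apex",        "borked"),
]

by_game = defaultdict(list)
for game, verdict in per_report:
    by_game[game].append(verdict)

# Most common per-report verdict becomes the game-level answer.
game_verdicts = {g: Counter(v).most_common(1)[0][0] for g, v in by_game.items()}
```

One of the four half_life_2 predictions is wrong, but the aggregated verdict is still correct, which is the whole mechanism behind the +0.091 jump.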
Everything we tried
Here’s the full inventory across 22 phases. Most things didn’t work — that’s the point.
Features that made the cut
| Feature | Source | Phase | ΔF1 |
|---|---|---|---|
| IRT game difficulty (d) | IRT 1PL on contributor×game | 12 | +0.030 |
| IRT contributor strictness (θ) | IRT 1PL on contributor×game | 12 | (part of above) |
| Per-game customization rates (protontricks %, config %, custom proton %) | Aggregated from reports | 9.2 | +0.024 |
| Per-game launch flag rates (esync, fsync, wined3d, nvapi, etc.) | Aggregated from reports | 9.2 | +0.008 |
| Proton variant (official / GE / experimental / native) | Report Proton type | 1 | Top Stage 2 feature by SHAP |
| Game SVD embeddings (20d) | GPU×Variant×Engine×Deck co-occurrence SVD | 8 | +0.008 vs original |
| GPU SVD embeddings (16d) | GPU-family×Game co-occurrence SVD | 1 | ΔF1 −0.003 when removed |
| has_concluding_notes, concluding_notes_length, fault_notes_count | Report text metadata | 5 | +0.077 (group) |
| mentions_crash, mentions_fix, mentions_perfect | Regex on notes | 5 | +0.045 (group) |
| mentions_env_var, mentions_proton_version, mentions_performance | Regex on notes | 5 | (part of above) |
| sentiment_positive_words, sentiment_negative_words | Word counting | 5 | (part of above) |
| kernel_major | System info | 1 | SHAP 0.20 |
| nvidia_driver_version, mesa_driver_version | System info + normalization | 4 | SHAP 0.21/0.18 |
| is_apu, is_igpu, is_mobile, is_steam_deck | Hardware classification | 5 | Form factor signals |
| report_age_days | Timestamp delta | 5 | SHAP 1.10 — top feature overall |
| gpu_family | Normalized GPU string | 1 | SHAP 0.10 |
| contributor-aware relabeling | IRT thresholds | 13.2 | +0.017 |
| label smoothing | Training technique | 9.1 | +0.002 |
| cross-entropy objective | LightGBM config | 9.1 | +0.018 |
| Cleanlab noise removal (3%) | Confident learning | 9.3 | +0.021 |
Features and techniques that didn't work (20 experiments)
| Feature / Technique | Source | Phase | ΔF1 | Why it failed |
|---|---|---|---|---|
| Focal loss | Training technique | 9.4 | −0.133 | Amplifies noise for noisy labels |
| Factorization Machines | Model architecture | 15.7 | −0.013 | Overfits on interaction terms |
| Game verdict trend (temporal slope) | Aggregated time series | 15.3 | −0.010 | Introduced temporal bias |
| Proton×Game SVD embeddings | Co-occurrence | 15.6 | −0.017 | report_age_days already captures temporal signal |
| Node2Vec graph embeddings | Game-hardware graph | 9.5 | −0.008 | SVD already captures structure |
| Proton×DX interaction features | Cross-features | 15.4 | −0.008 | Overfitting |
| Variant sub-models (separate model per Proton type) | Model architecture | 9.4 | −0.041 | Fragments training data |
| Game best Proton version, regression flags | Temporal analysis | 15.3b | −0.007 | Temporal bias |
| CatBoost, XGBoost, HistGBM | Alternative models | 11 | −0.002 | LightGBM already optimal |
| Ordinal regression | Model architecture | 9.4 | −0.002 | Already well-calibrated |
| Text distillation (teacher→student) | Knowledge distillation | 9.4 | −0.004 | Text features weak at inference |
| Hierarchical target encoding | Feature engineering | 9.5 | −0.002 | Overfitting |
| ProtonDB tier + community score | External signal | 9.5 | −0.002 | Target leakage |
| CPU SVD embeddings (16d) | CPU×Game co-occurrence | 1 | ≈0 | CPUs too homogeneous, no signal |
| Text embeddings (sentence-transformers, 32d) | Verdict notes | 10 | +0.005 but marginal | Low coverage (32%), redundant with keywords |
| Dawid-Skene | Annotator noise model | 9.3 | −0.058 | Needs per-annotator confusion matrix; too many parameters |
| Confidence weighting | Training technique | 13.3 | −0.001 | Introduces weighting bias |
| Threshold optimization | Post-hoc calibration | 18 | ≈0 | Already well-calibrated |
| LLM verdict inference | OpenRouter | 19 | −0.001 | IRT already optimal |
| Time-decay sample weighting | Training technique | 15.2 | ≈0 | No improvement |
Features that looked great but were target leakage
| Feature | Source | Gain | Why it’s leakage |
|---|---|---|---|
| tried_oob | Report form field | 2.5M | Part of the label definition — user marks this in the same form as verdict |
| duration (playtime) | Report form field | 252K | Borked games have short duration by definition |
| Per-report fault booleans (8 fields) | Report form | 3.8M total | User describes faults alongside verdict — same observation |
| Per-report customization flags (11 fields) | Report form | 31K | “Did you tweak?” is the definition of tinkering |
| pct_works_oob (per-game) | Aggregated verdicts | SHAP 1.06 | Literally the target aggregated |
Dropped as redundant (Phase 7 ablation)
| Feature | ΔF1 when dropped | Reason |
|---|---|---|
| engine | +0.004 (improved!) | High cardinality, noisy; game embeddings capture it better |
| gpu_vendor | ≈0 | Redundant with gpu_family |
| ram_gb | −0.001 | Weak signal |
| cpu_generation, cpu_vendor | ≈0 | Redundant, no signal |
| os_family | ≈0 | Almost always Linux |
| developer, publisher | ≈0 | High cardinality noise |
| genre, release_year | ≈0 | Weak |
| graphics_api_* (5 binary) | −0.002 | Captured by game embeddings |
| anticheat, drm (game-level) | −0.001 | Sparse coverage |
| deck_status, deck battery/readable | −0.002 | 13% coverage |
| gpu_tier | −0.002 | Redundant with gpu embeddings |
Steam PICS features (Phase 14): 18 features from Steam’s bulk content protocol. Every single one was redundant with existing signals. ΔF1 ≈ 0.
What Claude is good at (and what it isn’t)
After 7,000+ messages I have a clear picture of where Claude excels and where it doesn’t.
Try it
The project is open source: protondb-game-compatibility-prediction. The IRT implementation is in protondb_settings/ml/irt.py, and the full experiment history is documented in docs/PLAN_ML_*.md.
If you’re considering using an AI coding assistant for ML research — not just code generation, but actual research — my advice is: focus on the loop, not the output. Claude will retrieve every known technique that fits your problem. But it won’t tell you when to stop, when the problem is framed wrong, or when a technically correct feature is actually target leakage. That’s still your job.
The magic is in the combination: Claude’s library plus your judgment, iterated 50 times instead of 5.
This workflow is becoming a pattern. Andrej Karpathy’s autoresearch project automates ML experiment loops — 100+ experiments overnight. The autoresearch skill for Claude Code generalizes this into a reusable skill. I didn’t use either — my workflow was more ad-hoc, with me in the loop at every step — but if you’re starting a similar project, they’re worth a look.
The data Valve actually has
Valve is sitting on the ideal dataset and they don’t need ProtonDB at all. On Steam Deck, there’s an opt-in prompt for FPS telemetry — frame times, crash events, hardware state — collected automatically.
With that data you could model not just compatibility but performance (does it hold 60fps? where does it stutter?). And because Valve controls the entire stack — Proton, Mesa, the kernel, the hardware — they could A/B test changes at every layer. Ship a new DXVK version to 5% of users, measure the FPS delta across 10,000 games, roll back automatically if regressions appear.
We’re working with ProtonDB because that’s what’s publicly available. But the real solution to “will this game work on Linux?” is telemetry at scale, and Valve is the only entity positioned to do it.