I Don't Know ML. Claude Does. 0.871 F1 on Predicting Linux Game Compatibility.

I’m a software engineer who knows ML at a surface level — a few papers read (never understood), zero serious ML projects shipped. Over two weeks, I built a system that predicts Linux game compatibility, achieving 0.871 F1 on 350K+ community reports. The key innovation came from a statistical technique I’d never heard of before Claude suggested it.

This post is two stories woven together: how the project works, and how a human-AI research loop made it possible.

The problem

Linux gaming has gotten remarkably good thanks to Valve’s Proton compatibility layer, but “will this game work on my hardware?” remains a frustrating question.

Why it’s hard to answer

Proton translates Windows API calls (DirectX, Win32) to Linux equivalents (Vulkan, native APIs). Whether this works depends on a messy combination of factors:

  • Game engine — Unity, Unreal, proprietary engines all have different compatibility profiles
  • Graphics API — DirectX 9 vs 11 vs 12 vs Vulkan native, each with different translation paths
  • DRM and anticheat — kernel-level anticheat like EAC or BattlEye may or may not support Proton
  • GPU vendor and driver — AMD’s open-source Mesa stack vs NVIDIA’s proprietary driver behave very differently
  • Proton version — official, GE-Proton (community patches), or experimental
  • Kernel version — newer kernels bring driver fixes and regressions alike
  • Media codecs, launchers, overlays — can fail for reasons entirely unrelated to the game itself

A game that works flawlessly on AMD with GE-Proton might crash on NVIDIA with the same Proton version because of a driver-specific shader compilation bug. There’s no deterministic compatibility matrix.

ProtonDB: crowdsourced signal

ProtonDB tries to solve this with crowdsourcing. Users report whether a game is borked (broken), needs tinkering (requires tweaks), or works out of the box, along with their system info and notes. There are 350K+ such reports for 31K games.

The catch: these labels are subjective. One user says “I chose GE-Proton instead of the default — tinkering.” Another does the same and says “works out of the box.” This isn’t random noise — it’s systematic disagreement about what “tinkering” means, and it poisons any model you train on this data.

What I built

A two-stage cascade classifier that predicts game compatibility per (game, hardware) pair:

Input: game + hardware
→ Stage 1: is it broken? yes → borked (F1 = 0.846)
→ Stage 2: does it need tweaking? yes → tinkering, no → works_oob (F1 = 0.880)
→ Per-game majority vote over report-level predictions → production F1 = 0.871
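Stripped of the models themselves, the cascade's decision logic is just two thresholded probabilities. A minimal sketch (the 0.5 thresholds are illustrative defaults, not the tuned values):

```python
def cascade_predict(p_borked: float, p_tinkering: float,
                    t1: float = 0.5, t2: float = 0.5) -> str:
    """Stage 1 decides borked vs not; Stage 2 runs only on
    non-borked reports and decides tinkering vs works_oob."""
    if p_borked >= t1:
        return "borked"
    return "tinkering" if p_tinkering >= t2 else "works_oob"
```

Each stage gets its own binary classifier, which is what lets the hard `tinkering`/`works_oob` boundary be attacked separately from the easier `borked` split.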

The full pipeline:

ProtonDB dump → SQLite → preprocessing → training → API

The system ingests ProtonDB data dumps, enriches them from Steam, PCGamingWiki, and other sources, trains LightGBM models with 123 features, and serves predictions via a FastAPI endpoint.

But the interesting part isn’t the architecture. It’s the research journey that got me from 0.593 F1 to 0.871 — a 47% relative improvement — and how Claude guided that journey.

The research loop

I started this project by opening Claude Code and saying something like “let’s plan a simple web API that predicts game compatibility from ProtonDB data.” From that point, a pattern emerged that repeated over 7,000+ messages across two weeks:

Describe problem → Claude proposes → approve (or modify) → implement + experiment → interpret results → repeat

Claude didn’t just advise — it wrote all the code. The entire project — worker, preprocessing pipeline, LLM normalization, ML training, FastAPI server, CLI — every line was written by Claude.

Beyond implementation, it also conducted research into external data sources (Steam API, PCGamingWiki, AreWeAntiCheatYet, Steam PICS bulk protocol), investigated which features could be extracted, and ran feature importance analysis. My role was direction, domain judgment, and quality control — not writing code.

Phase 1: The baseline and the ceiling (F1 = 0.555 → 0.593)

The first model was straightforward — a single 3-class LightGBM on hardware features plus game metadata.

F1 = 0.555. The model was essentially guessing on works_oob — 56% error rate, with most samples misclassified as tinkering. borked was somewhat separable, but tinkering and works_oob were a single smeared blob.

Claude analyzed the error clustering and proposed splitting the problem: use a two-stage cascade — first ask “is it broken?”, then for non-broken games ask “does it need tweaking?” This gave us +0.009 F1 (to 0.593). A modest gain, but the right framing.


Phase 2: Feature archaeology (F1 = 0.593 → 0.724)

Claude proposed mining untapped data fields. ProtonDB reports contain free-text notes — “had to set PROTON_USE_WINED3D=1” or “just hit play and it worked perfectly.” We hadn’t touched this data.

The breakthrough came in layers:

  • Text features — keyword detection, sentiment, note lengths: +0.099 F1
  • Game metadata enrichment — engine type, DRM, anticheat from PCGamingWiki/Steam: +0.024 F1
  • SVD embeddings — factorizing a GPU-family × game co-occurrence matrix into dense vectors: +0.010 F1
  • Rule-based relabeling — if a tinkering report mentions zero actual tweaks, relabel it as works_oob: +0.010 F1

The SVD embeddings idea was particularly interesting — I wouldn’t have thought to treat hardware-game compatibility as a collaborative filtering problem, but Claude framed it that way and it worked.
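To make the collaborative-filtering framing concrete: build a GPU-family × game co-occurrence matrix and factorize it. This illustration uses plain numpy; the project's actual implementation may differ:

```python
import numpy as np

def svd_embeddings(cooc: np.ndarray, k: int = 16):
    """Factorize a GPU-family x game co-occurrence matrix into dense
    k-dimensional row (GPU) and column (game) embeddings."""
    m = np.log1p(cooc.astype(float))  # damp counts from blockbuster games
    u, s, vt = np.linalg.svd(m, full_matrices=False)
    k = min(k, s.size)
    gpu_emb = u[:, :k] * np.sqrt(s[:k])
    game_emb = vt[:k, :].T * np.sqrt(s[:k])
    return gpu_emb, game_emb
```

Each GPU family and each game gets a dense vector; nearby vectors behave similarly, exactly as in user/item factorization for recommenders.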

LLM as a preprocessing tool (the part that didn't work)

I invested significant effort into using LLMs for data cleaning. Claude built the entire infrastructure — CLI commands, batch processing, multiple backends. But the honest result: none of it improved ML metrics.

GPU/CPU normalization. ~35K unique GPU strings. We built an LLM pipeline to normalize them, but a regex-based heuristic normalizer covers 99.7% of the data with zero hallucinations, while the LLM occasionally invented model numbers. Production uses the heuristic.
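The heuristic's general shape is an ordered list of regex rules mapping raw driver strings to canonical families. These three patterns are only illustrative; the real normalizer has to cover ~35K distinct strings:

```python
import re

# Illustrative rules, checked in order; the production list is much longer.
GPU_PATTERNS = [
    (re.compile(r"(?:nvidia|geforce).*?(rtx|gtx)\s*(\d{3,4})", re.I),
     lambda m: f"nvidia_{m.group(1).lower()}_{m.group(2)}"),
    (re.compile(r"(?:amd|radeon).*?rx\s*(\d{3,4})", re.I),
     lambda m: f"amd_rx_{m.group(1)}"),
    (re.compile(r"intel.*?(?:uhd|iris)", re.I),
     lambda m: "intel_igpu"),
]

def normalize_gpu(raw: str) -> str:
    """Map a raw GPU string (e.g. from glxinfo) to a canonical family."""
    for pattern, fmt in GPU_PATTERNS:
        m = pattern.search(raw)
        if m:
            return fmt(m)
    return "unknown"
```

Unlike an LLM, this can never emit a model number that wasn't in the input, which is the whole point.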

Launch options parsing. ~16K unique Steam launch option strings → structured JSON. The pipeline works, but was never integrated. Per-game aggregate features come from simpler regex extraction.

Text extraction. A three-layer pipeline for free-text reports. Architecture complete — prompts, schemas, validation. But extracted_data has 0 records. The text features that did improve F1 (+0.099) are simple keyword regex — no LLM involved.

LLM verdict inference. We tried having the LLM directly classify reports. Result: −0.001 F1. IRT solves this better.


Phase 3: The wall (F1 = 0.724)

Then we hit a wall. Over Phases 9-11, we tried 14+ techniques — focal loss, ordinal regression, knowledge distillation, CatBoost, XGBoost, stacking ensembles, Node2Vec graph embeddings, target encoding — and every single one was neutral or negative.

This is the part that would have killed the project if I were working alone. When you’ve tried everything you know and nothing works, the natural instinct is to give up or start over.

I asked Claude to generate visualizations — t-SNE, UMAP, confusion matrices, calibration curves. Claude not only produced the plots but analyzed them itself — identifying that errors lived almost entirely on the tinkering ↔ works_oob boundary and were concentrated around specific (game, GPU family) pairs.

Error type breakdown — 76% of errors are tinkering ↔ works_oob:

  • works_oob → tinkering: 40%
  • tinkering → works_oob: 36%
  • borked → tinkering: 12%
  • tinkering → borked: 7%
  • borked → works_oob: 3%
  • works_oob → borked: 2%

The conversation shifted from “what feature should we add next?” to “what’s the nature of this noise?”


Phase 4: The IRT breakthrough (F1 = 0.724 → 0.771)

This is the moment that made the project.

I had a hypothesis: the noise isn’t random — it’s annotator bias. Different people have genuinely different thresholds for “tinkering.” I asked Claude for a technique that could decompose subjective labels into annotator bias and ground truth.

Claude recognized this as a known problem in psychometrics and proposed Item Response Theory. The core idea: decompose each label into two components:

  • θ (theta): how strict the annotator is (their personal threshold)
  • d (difficulty): how objectively difficult the game actually is to run

The formula:

P(tinkering) = σ(θ − d)

A strict annotator (high θ) is more likely to say “tinkering” for the same game that a lenient annotator (low θ) would call “works out of the box.”

Once you have θ and d, you can:

  1. Use them as features — the model knows “this report comes from a strict annotator”
  2. Relabel intelligently — if a very strict annotator (θ > 1.5) says tinkering but there’s no evidence of actual tweaks, confidently relabel it as works_oob
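A minimal 1PL (Rasch) fit is just gradient ascent on the Bernoulli log-likelihood of P(tinkering) = σ(θ − d). This is an illustrative sketch, not the project's `irt.py` — the function name, learning rate, and epoch count are assumptions:

```python
import numpy as np

def fit_irt_1pl(reports, n_annotators, n_items, lr=0.05, epochs=200):
    """Fit a 1PL (Rasch) model P(tinkering) = sigmoid(theta - d).
    `reports` is a list of (annotator_idx, item_idx, said_tinkering)
    triples, where an item is a game (or game x GPU-family pair)."""
    a = np.array([r[0] for r in reports])
    i = np.array([r[1] for r in reports])
    y = np.array([r[2] for r in reports], dtype=float)
    theta = np.zeros(n_annotators)  # annotator strictness
    d = np.zeros(n_items)           # item difficulty
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(theta[a] - d[i])))
        resid = y - p                     # residual of the Bernoulli model
        np.add.at(theta, a, lr * resid)   # d/dtheta log L = y - p
        np.add.at(d, i, -lr * resid)      # d/dd log L = -(y - p)
    return theta, d
```

With θ and d fitted, both become per-report features, and extreme θ values drive the relabeling rule in point 2 above.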
Fitted parameter distributions: contributor strictness θ (mean = 1.71) and game×GPU difficulty d (mean = −1.11).

Phase 5: Aggregation (F1 = 0.771 → 0.871)

The final insight was almost embarrassingly simple: users don’t ask “what does this one report predict?” — they ask “will this game work?”

If you have 15 reports for a game and your model gets 12 right, majority vote gives you the correct answer even though per-report accuracy is imperfect.

Per-game majority voting boosted F1 from 0.780 to 0.871 — a +0.091 improvement, the largest absolute gain in the project. Individual errors cancel out.
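The aggregation itself is a few lines. One sketch — the tie-breaking rule (prefer the more cautious label) is my assumption, not necessarily what the project does:

```python
from collections import Counter

SEVERITY = {"borked": 0, "tinkering": 1, "works_oob": 2}

def per_game_verdict(report_predictions: list) -> str:
    """Majority vote over per-report predictions; ties go to the
    more cautious (lower-severity) label."""
    counts = Counter(report_predictions)
    return max(counts, key=lambda v: (counts[v], -SEVERITY[v]))
```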

Confusion matrices over the three classes (borked, tinkering, works_oob): per-report predictions score F1 = 0.780; per-game majority vote scores F1 = 0.871.

Cumulative pipeline progress: original baseline 0.725 → Phase 11 baseline 0.754 (+0.029) → IRT (Phase 12) 0.771 (+0.017) → relabeling (Phase 13) 0.778 (+0.007) → class weighting (Phase 16) and HP tuning (Phase 17) 0.780 (+0.002) → per-game vote (Phase 21) 0.871 (+0.091).

Everything we tried

Here’s the full inventory across 22 phases. Most things didn’t work — that’s the point.

Features that made the cut

| Feature | Source | Phase | ΔF1 |
|---|---|---|---|
| IRT game difficulty (d) | IRT 1PL on contributor×game | 12 | +0.030 |
| IRT contributor strictness (θ) | IRT 1PL on contributor×game | 12 | (part of above) |
| Per-game customization rates (protontricks %, config %, custom proton %) | Aggregated from reports | 9.2 | +0.024 |
| Per-game launch flag rates (esync, fsync, wined3d, nvapi, etc.) | Aggregated from reports | 9.2 | +0.008 |
| variant (official / GE / experimental / native) | Report Proton type | 1 | Top Stage 2 feature by SHAP |
| Game SVD embeddings (20d) | GPU×Variant×Engine×Deck co-occurrence SVD | 8 | +0.008 vs original |
| GPU SVD embeddings (16d) | GPU-family×Game co-occurrence SVD | 1 | −0.003 when removed |
| has_concluding_notes, concluding_notes_length, fault_notes_count | Report text metadata | 5 | +0.077 (group) |
| mentions_crash, mentions_fix, mentions_perfect | Regex on notes | 5 | +0.045 (group) |
| mentions_env_var, mentions_proton_version, mentions_performance | Regex on notes | 5 | (part of above) |
| sentiment_positive_words, sentiment_negative_words | Word counting | 5 | (part of above) |
| kernel_major | System info | 1 | SHAP 0.20 |
| nvidia_driver_version, mesa_driver_version | System info + normalization | 4 | SHAP 0.21 / 0.18 |
| is_apu, is_igpu, is_mobile, is_steam_deck | Hardware classification | 5 | Form factor signals |
| report_age_days | Timestamp delta | 5 | SHAP 1.10 — top feature overall |
| gpu_family | Normalized GPU string | 1 | SHAP 0.10 |
| Contributor-aware relabeling | IRT θ thresholds | 13.2 | +0.017 |
| Label smoothing (α = 0.15) | Training technique | 9.1 | +0.002 |
| Cross-entropy objective | LightGBM config | 9.1 | +0.018 |
| Cleanlab noise removal (3%) | Confident learning | 9.3 | +0.021 |
Features and techniques that didn't work (20 experiments)

| Feature / Technique | Source | Phase | ΔF1 | Why it failed |
|---|---|---|---|---|
| Focal loss | Training technique | 9.4 | −0.133 | Amplifies noise for noisy labels |
| Factorization Machines | Model architecture | 15.7 | −0.013 | Overfits on interaction terms |
| Game verdict trend (temporal slope) | Aggregated time series | 15.3 | −0.010 | Introduced temporal bias |
| Proton×Game SVD embeddings | Co-occurrence | 15.6 | −0.017 | report_age_days already captures temporal signal |
| Node2Vec graph embeddings | Game-hardware graph | 9.5 | −0.008 | SVD already captures structure |
| Proton×DX interaction features | Cross-features | 15.4 | −0.008 | Overfitting |
| Variant sub-models (separate model per Proton type) | Model architecture | 9.4 | −0.041 | Fragments training data |
| Game best Proton version, regression flags | Temporal analysis | 15.3b | −0.007 | Temporal bias |
| CatBoost, XGBoost, HistGBM | Alternative models | 11 | −0.002 | LightGBM already optimal |
| Ordinal regression | Model architecture | 9.4 | −0.002 | Already well-calibrated |
| Text distillation (teacher→student) | Knowledge distillation | 9.4 | −0.004 | Text features weak at inference |
| Hierarchical target encoding | Feature engineering | 9.5 | −0.002 | Overfitting |
| ProtonDB tier + community score | External signal | 9.5 | −0.002 | Target leakage |
| CPU SVD embeddings (16d) | CPU×Game co-occurrence | 1 | ≈0 | CPUs too homogeneous, no signal |
| Text embeddings (sentence-transformers, 32d) | Verdict notes | 10 | +0.005 but marginal | Low coverage (32%), redundant with keywords |
| Dawid-Skene | Annotator noise model | 9.3 | −0.058 | Needs per-annotator confusion matrix; too many parameters |
| Confidence weighting | Training technique | 13.3 | −0.001 | Introduces weighting bias |
| Threshold optimization | Post-hoc calibration | 18 | ≈0 | Already well-calibrated |
| LLM verdict inference | OpenRouter | 19 | −0.001 | IRT already optimal |
| Time-decay sample weighting | Training technique | 15.2 | ≈0 | No improvement |
Features that looked great but were target leakage

| Feature | Source | Gain | Why it's leakage |
|---|---|---|---|
| tried_oob | Report form field | 2.5M | Part of the label definition — user marks this in the same form as the verdict |
| duration (playtime) | Report form field | 252K | Borked games have short duration by definition |
| Per-report fault booleans (8 fields) | Report form | 3.8M total | User describes faults alongside the verdict — same observation |
| Per-report customization flags (11 fields) | Report form | 31K | “Did you tweak?” is the definition of tinkering |
| pct_works_oob (per-game) | Aggregated verdicts | SHAP 1.06 | Literally the target, aggregated |
Dropped as redundant (Phase 7 ablation)

| Feature | ΔF1 when dropped | Reason |
|---|---|---|
| engine | +0.004 (improved!) | High cardinality, noisy; game embeddings capture it better |
| gpu_vendor | ≈0 | Redundant with gpu_family |
| ram_gb | −0.001 | Weak signal |
| cpu_generation, cpu_vendor | ≈0 | Redundant, no signal |
| os_family | ≈0 | Almost always Linux |
| developer, publisher | ≈0 | High cardinality noise |
| genre, release_year | ≈0 | Weak |
| graphics_api_* (5 binary) | −0.002 | Captured by game embeddings |
| anticheat, drm (game-level) | −0.001 | Sparse coverage |
| deck_status, deck battery/readable | −0.002 | 13% coverage |
| gpu_tier | −0.002 | Redundant with gpu embeddings |

Steam PICS features (Phase 14): 18 features from Steam’s bulk content protocol. Every single one was redundant with existing signals. ΔF1 ≈ 0.

What Claude is good at (and what it isn’t)

After 7,000+ messages I have a clear picture of where Claude excels and where it doesn’t.

Try it

The project is open source: protondb-game-compatibility-prediction. The IRT implementation is in protondb_settings/ml/irt.py, and the full experiment history is documented in docs/PLAN_ML_*.md.

If you’re considering using an AI coding assistant for ML research — not just code generation, but actual research — my advice is: focus on the loop, not the output. Claude will retrieve every known technique that fits your problem. But it won’t tell you when to stop, when the problem is framed wrong, or when a technically correct feature is actually target leakage. That’s still your job.

The magic is in the combination: Claude’s library plus your judgment, iterated 50 times instead of 5.

This workflow is becoming a pattern. Andrej Karpathy’s autoresearch project automates ML experiment loops — 100+ experiments overnight. The autoresearch skill for Claude Code generalizes this into a reusable skill. I didn’t use either — my workflow was more ad-hoc, with me in the loop at every step — but if you’re starting a similar project, they’re worth a look.

The data Valve actually has

Valve is sitting on the ideal dataset and they don’t need ProtonDB at all. On Steam Deck, there’s an opt-in prompt for FPS telemetry — frame times, crash events, hardware state — collected automatically.

With that data you could model not just compatibility but performance (does it hold 60fps? where does it stutter?). And because Valve controls the entire stack — Proton, Mesa, the kernel, the hardware — they could A/B test changes at every layer. Ship a new DXVK version to 5% of users, measure the FPS delta across 10,000 games, roll back automatically if regressions appear.

We’re working with ProtonDB because that’s what’s publicly available. But the real solution to “will this game work on Linux?” is telemetry at scale, and Valve is the only entity positioned to do it.