Your strongest research contribution and the one that opens doors to CSSLab, KTH, and research hiring. Maia-2 (NeurIPS 2024) built a unified model of human chess behavior across skill levels but acknowledged it doesn't produce coherent improvement pathways. This project builds that missing map.
What you build
Take the full Lichess DB. Segment games by rating into 100-Elo bands. For each band, extract: time allocation per game phase, move-type distribution, tactical vs positional preference, blunder frequency by position type, opening diversity, and endgame conversion rate. Build a continuous behavioral map showing what specifically changes from one band to the next — not "play better moves" but concrete behavioral shifts.
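A minimal sketch of the per-game extraction step, assuming moves have already been parsed upstream (e.g. with python-chess) into records carrying think time and engine evals. The record schema, phase cutoffs, and blunder threshold here are all illustrative placeholders, not the real pipeline's:

```python
from dataclasses import dataclass
from statistics import mean

# One record per move, as produced by an upstream PGN parser.
# All field names and thresholds are illustrative.
@dataclass
class MoveRecord:
    ply: int            # half-move number, 1-based
    think_time: float   # seconds spent on this move
    eval_before: float  # engine eval (pawns, mover's POV) before the move
    eval_after: float   # engine eval after the move

def game_phase(ply: int) -> str:
    """Crude phase split by ply count; a real pipeline would use material."""
    if ply <= 20:
        return "opening"
    if ply <= 60:
        return "middlegame"
    return "endgame"

def extract_features(moves: list, blunder_threshold: float = 2.0) -> dict:
    """Per-game behavioural features: time share per phase and blunder rate."""
    time_per_phase = {"opening": 0.0, "middlegame": 0.0, "endgame": 0.0}
    blunders = 0
    for m in moves:
        time_per_phase[game_phase(m.ply)] += m.think_time
        # An eval drop of >= threshold pawns for the mover counts as a blunder.
        if m.eval_before - m.eval_after >= blunder_threshold:
            blunders += 1
    total = sum(time_per_phase.values()) or 1.0
    return {
        "time_share": {k: v / total for k, v in time_per_phase.items()},
        "blunder_rate": blunders / len(moves),
        "mean_think_time": mean(m.think_time for m in moves),
    }
```

Aggregating these per-game dicts within each 100-Elo band gives the band-level feature distributions the atlas is built from.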
Why it's publishable
CSSLab explicitly said in the Maia-2 paper that coherent improvement pathways are the missing piece. You're building a data-driven answer to that gap using the same Lichess corpus they use. A workshop paper at KDD or CHI submission on human-AI learning in chess is realistic with solid results.
Personal data connection
Once the atlas exists, plot your own behavioral fingerprint against it. See exactly where you sit, which band's patterns you're closest to, and which specific behavioral shifts would move you to the next band. Your personal project and research contribution become the same pipeline.
Lichess full DB · python-chess · Maia features · Postgres · NeurIPS/KDD target · CSSLab outreach
AI / ML concepts required
ML fundamentals: Large-scale feature extraction · Behavioural stylometry · Clustering (k-means / UMAP) · Longitudinal cohort analysis
Human-AI alignment: Skill-level modelling · Coherence across distributions · Maia-2 extension methodology
Statistical: Distributional shift analysis · Effect size measurement · Significance testing
Data engineering: Big data PGN processing · Elo band segmentation · Postgres time-series schema
Abstract, research questions & tasks
Abstract
Maia-2 (NeurIPS 2024) established that human chess behaviour can be modelled coherently across skill levels using a unified neural architecture, but explicitly acknowledged that the model does not produce actionable improvement pathways — it cannot tell a 1400-rated player which specific behavioural changes would move them to 1600. This project addresses that gap directly. Using the full Lichess database segmented into 100-Elo bands, we extract a multi-dimensional behavioural profile for each band: time allocation per game phase, move-type distribution, tactical vs positional preference, blunder category frequencies, and endgame conversion rates. The result is a continuous behavioural atlas that maps what concretely and measurably changes in how players think and decide as skill increases — turning the Maia-2 model into an actionable coaching instrument.
Research questions
- Which behavioural features — time allocation, move-type distribution, blunder category, positional preference — show the largest and most consistent change across adjacent 100-Elo bands in the Lichess population?
- Are the behavioural transitions between Elo bands gradual and continuous, or do they cluster around specific threshold ratings where qualitative shifts occur?
- Does a player's current behavioural profile predict their likely Elo band more accurately than their actual game outcomes over a short window, and can this be used for early identification of under/over-rated players?
- Which specific behavioural features at Elo band N best predict whether a player will progress to band N+1 within 6 months, based on longitudinal Lichess data?
Tasks
- Download and decompress Lichess DB monthly dumps — start with 6 months of data, ~50M games
- Build efficient PGN batch processor: extract per-game features at scale using python-chess + multiprocessing
- Segment all games into 100-Elo bands (both players) — store band-level aggregate feature distributions in Postgres
- Run Maia move-match probability at each band's Elo — use as a human-likeness baseline per band
- Compute effect sizes between adjacent bands for each feature — identify which features change most significantly
- Apply UMAP dimensionality reduction to band feature vectors — visualise the behavioural space as a 2D atlas
- Run significance testing on band-to-band feature deltas — filter to features with p < 0.01 and Cohen's d > 0.3
- Plot your personal C1 fingerprint onto the atlas — identify where you sit and which band's profile you're closest to
- Write up preliminary findings as a blog post with atlas visualisation — target r/chess and r/MachineLearning
- Email CSSLab (University of Toronto) with preliminary results and atlas methodology — propose collaboration or feedback
- Draft workshop paper for KDD 2027 or CHI 2027 — deadline typically September 2026
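The effect-size and filtering steps above can be sketched as follows. This keeps to the standard library; in the full pipeline the p < 0.01 half of the filter would come from a two-sample test (e.g. scipy.stats.ttest_ind), omitted here to stay dependency-free:

```python
from math import sqrt
from statistics import mean, stdev

def cohens_d(a: list, b: list) -> float:
    """Pooled-SD effect size between two adjacent bands' samples of one feature."""
    na, nb = len(a), len(b)
    pooled = sqrt(((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2)
                  / (na + nb - 2))
    return (mean(b) - mean(a)) / pooled

def salient_features(band_lo: dict, band_hi: dict, d_min: float = 0.3) -> dict:
    """Keep features whose band-to-band effect size clears |d| > d_min.
    In the full pipeline each survivor would also need p < 0.01."""
    out = {}
    for name in band_lo:
        d = cohens_d(band_lo[name], band_hi[name])
        if abs(d) > d_min:
            out[name] = d
    return out
```

Running this over every adjacent band pair yields, per feature, a curve of effect sizes across the rating ladder, which is exactly what the threshold-vs-gradual research question needs.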
The coaching product. Every chess improvement platform teaches GM mainlines. None personalise to your behavioral style. This uses your C1 fingerprint to find which GMs' decision-making profiles most resemble yours structurally — then pulls their games filtered to positions you actually reach and areas where you struggle.
The matching algorithm
Extract behavioral features from a curated set of historical GMs using the same C1 pipeline: time allocation patterns, positional vs tactical ratio, endgame preferences, opening diversity. Cosine similarity between your feature vector and each GM's vector across your problem position types. Output: top 3 style-matched GMs with confidence score per position category.
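A sketch of the matcher. Folding the weights into both vectors before taking cosine is one of several reasonable weighting schemes; the feature names, weights, and GM corpus here are placeholders:

```python
from math import sqrt

def cosine(u: list, v: list) -> float:
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v)))

def match_gms(player_vec: list, gm_vecs: dict, weights=None, top_k: int = 3):
    """Rank GMs by weighted cosine similarity to the player's feature vector.
    `weights` up-weights the dimensions tied to the player's weak position types."""
    w = weights or [1.0] * len(player_vec)
    p = [x * wi for x, wi in zip(player_vec, w)]
    scored = [(name, cosine(p, [x * wi for x, wi in zip(vec, w)]))
              for name, vec in gm_vecs.items()]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:top_k]
```

Per-position-category confidence scores then fall out of running the same match restricted to each category's feature subset.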
The curriculum layer
Filter matched GM's Lichess games to: your opening lines (ECO codes you play), your C2 problem position types, comparable opponent Elo. Extract the 10–15 most instructive examples. Claude generates natural language annotations explaining the GM's decision at each critical moment in terms of principles, not engine lines.
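The filtering step might look like this. The game-record schema (`eco`, `position_types`, `opponent_elo`) and the overlap-count instructiveness proxy are assumptions for illustration, not the real pipeline's:

```python
def select_study_games(games: list, player_ecos: set, weak_types: set,
                       elo_window=(1200, 1800), limit: int = 15) -> list:
    """Filter a matched GM's games to lines and positions the student
    actually reaches. `games` is a list of dicts with 'eco',
    'position_types', and 'opponent_elo' keys (assumed schema)."""
    lo, hi = elo_window
    keep = [g for g in games
            if g["eco"] in player_ecos
            and weak_types & set(g["position_types"])
            and lo <= g["opponent_elo"] <= hi]
    # Most instructive first: here, proxied by weak-type overlap count.
    keep.sort(key=lambda g: len(weak_types & set(g["position_types"])),
              reverse=True)
    return keep[:limit]
```

The surviving 10–15 games are what gets handed to the Claude annotation step.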
Why it doesn't compete with Chessable
Chessable is a content platform. This is a personalisation engine. B2B angle: sell to chess coaches as a tool that auto-generates a personalised study plan for each student. Coaches pay €30–50/month for such a tool. That's a different business model — recurring SaaS, not one-time course purchase.
C1 fingerprint · GM PGN corpus · Cosine similarity · Claude API · Coach B2B
AI / ML concepts required
Gen AI — RAG: Semantic retrieval · Context-window management · Personalised retrieval
Gen AI — LLM: Instructional annotation generation · Few-shot prompting · Long-context reasoning
Gen AI — Agents: Multi-step agentic pipeline · Memory-augmented agents (MemCollab) · Personalisation agents (PAHF)
ML fundamentals: Cosine similarity matching · Vector space modelling · Style transfer concepts
Embeddings: Behavioural feature vectors · Qdrant hybrid indexing
Abstract, research questions & tasks
Abstract
Chess coaching has long recognised that players improve fastest by studying GMs whose style resembles their own — a positional player learning from Petrosian internalises ideas more readily than from Tal. However, this matching has always been intuitive and manual, performed by experienced coaches. This project automates the match by computing cosine similarity between a student's C1 behavioural feature vector and the extracted profiles of a curated corpus of historical GMs, weighted by the student's specific weakness categories. The result is a ranked list of style-matched GMs, filtered to their games in positions the student actually reaches, annotated by an LLM with principle-based explanations rather than engine lines. This is the automated version of what a human second does — and the basis for a B2B coaching tool.
Research questions
- Can a player's behavioural feature vector (from C1) reliably identify GMs whose decision-making profiles are structurally similar across the position types that matter most to that player's weaknesses?
- Does studying games from a style-matched GM — filtered to the student's specific problem positions — produce measurable improvement in those position types compared to studying GM games chosen by theory relevance alone?
- How many annotated GM game examples are required per weakness category before a student shows measurable improvement on that category in their own games?
- Can LLM-generated principle-based annotations of GM decisions substitute for human coach annotations in terms of instructional value and player comprehension?
Tasks
- Build GM feature corpus: select 30–50 historical GMs with well-documented styles — extract behavioural feature vectors from their Lichess games using the C1 pipeline
- Implement weighted cosine similarity matcher: compare player's C1 vector against GM corpus, weighted by the player's C2 weak position categories
- Build GM game retrieval layer: for top-3 matched GMs, filter their Lichess games to player's ECO codes and weak position types
- Build position extractor: for each retrieved game, identify the 10-move window around the critical decision point
- Build Claude annotation pipeline: few-shot prompted with example annotations — produce principle-based explanation of GM's decision (not engine evaluation)
- Build curriculum sequencer: order annotated examples by difficulty (Stockfish complexity score) — serve easiest first, escalate
- Implement MemCollab memory layer: Claude and local model share Qdrant memory — Claude insights get distilled into cheaper local inference for repeat lookups
- Build coach-facing API: given a student username, return their style-matched GM, top 10 study games, and annotated curriculum — JSON output suitable for a coach dashboard
- Run pilot with 3–5 chess coaches — collect feedback on curriculum quality and relevance
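The sequencing and coach-facing output steps above, sketched together. Every field name in the payload is an assumption for illustration; the complexity score would come from the Stockfish-based scorer:

```python
import json

def build_curriculum(annotated_examples: list, student: str) -> str:
    """Order annotated examples easiest-first by complexity score and emit
    the JSON payload a coach dashboard would consume (assumed schema)."""
    ordered = sorted(annotated_examples, key=lambda ex: ex["complexity"])
    payload = {
        "student": student,
        "curriculum": [
            {"game_id": ex["game_id"],
             "annotation": ex["annotation"],
             "complexity": ex["complexity"]}
            for ex in ordered
        ],
    }
    return json.dumps(payload, indent=2)
```

Serving easiest-first and escalating is then just a matter of the client walking the `curriculum` array in order.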
The fair-play research community lacks labeled behavioral data capturing the spectrum of human move types — not just "human vs engine" but the subtler categories within human play. Your C8 work produces exactly this as a byproduct. Package it as a community contribution.
What you contribute
10K games with self-labeled move categories: opening prep (fast, theory-based), active calculation (slow, deliberate), intuitive decisions (fast, positional), time-pressure moves (fast, stressed), post-blunder moves (emotional state indicators). Combined with Maia move-match probabilities and Stockfish eval. Format compatible with Irwin and CSSLab datasets. Released on HuggingFace.
Why this matters for you
A well-documented dataset release gets cited. Citations lead to collaborations. Collaborations lead to research roles. It costs relatively little — it's a byproduct of C8 — but its research value is high because this type of granular self-labeled data genuinely doesn't exist publicly.
C8 output · Maia features · HuggingFace release · CSSLab compatible · Irwin format
AI / ML concepts required
ML fundamentals: Dataset curation and labelling · Inter-rater reliability · Behavioural annotation schema
Human-AI alignment: Fair-play detection methodology · Human vs engine stylometry · Subtle cheat pattern taxonomy
Data engineering: Dataset versioning (DVC) · HuggingFace dataset schema · Irwin-compatible format · Move-level metadata
Abstract, research questions & tasks
Abstract
Fair-play detection research is constrained by a fundamental data problem: existing labeled datasets either conflate all human moves as one category or use synthetically generated "cheating" patterns that don't reflect real subtle engine use. What's missing is a granular, move-level labeled dataset that distinguishes the spectrum of human move types — from confident theory recall to panicked time-pressure decisions — because these categories have radically different temporal and quality signatures that overlap with both clean play and subtle cheating. This project produces exactly that dataset as a byproduct of C8: 10,000 personal games with move-level labels across five behavioural categories, combined with Maia move-match probabilities and Stockfish evaluations, released publicly in a format compatible with the Irwin and CSSLab detection frameworks.
Research questions
- Do the five proposed move categories (prep, calculation, intuition, time-pressure, post-blunder) exhibit statistically distinct distributions across think time, Maia move-match probability, and eval delta — and are these distributions stable across different time controls?
- Can a classifier trained on these move-level categories distinguish subtle engine use (where only 30–40% of moves are engine-assisted) from clean play with greater accuracy than centipawn loss alone?
- What is the minimum number of labeled games required to train a reliable move-category classifier that generalises to unseen players, and how does this interact with player Elo?
- Do the temporal signatures of "preparation" moves (fast, theory-based) resemble engine-assisted moves in ways that cause false positives in current detection systems?
Tasks
- Define the labelling schema: 5 move categories with explicit decision rules for each — document edge cases and ambiguities
- Run C8 pipeline to generate think-time features for all 10K games
- Apply rule-based auto-labeller: assign initial labels using heuristics (move number + think time + clock remaining + Maia score)
- Manual review sample: review 500 auto-labeled moves for correctness — refine rules based on errors
- Compute Maia move-match probabilities at player's Elo for all moves — add as feature column
- Run Stockfish eval on all positions — add eval, best move, and depth-10 vs depth-20 eval delta as features
- Structure dataset as Parquet: game_id, move_num, FEN, move, think_time, clock_remaining, label, maia_prob, stockfish_eval, eval_delta, phase, ECO
- Version with DVC — track dataset lineage from raw PGN to final Parquet
- Write HuggingFace dataset card: schema description, labelling methodology, intended use, known limitations, citation format
- Release publicly and post to chess fair-play research forums and CSSLab GitHub discussions
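The rule-based auto-labeller reduces to a small decision cascade over the per-move features. Every threshold below is a placeholder to be tuned against the 500-move manual review, and the function name is illustrative:

```python
LABELS = ("opening_prep", "active_calculation", "intuitive",
          "time_pressure", "post_blunder")

def auto_label(move_num: int, think_time: float, clock_remaining: float,
               maia_prob: float, prev_eval_delta: float) -> str:
    """Heuristic first-pass label for one move. All thresholds are placeholders.
    `prev_eval_delta` is the eval loss (in pawns) of the player's previous move."""
    if prev_eval_delta >= 2.0:      # just blundered: emotional-state indicator
        return "post_blunder"
    if clock_remaining < 30.0:      # under 30s on the clock: fast, stressed
        return "time_pressure"
    if move_num <= 12 and think_time < 3.0 and maia_prob > 0.5:
        return "opening_prep"       # early, fast, typical human move: theory recall
    if think_time >= 15.0:
        return "active_calculation" # slow, deliberate
    return "intuitive"              # fast, positional default
```

Disagreements between these labels and the manual-review sample are exactly the signal for refining the rules before the full 10K-game pass.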