Your strongest research contribution and the one that opens doors to CSSLab, KTH, and research hiring. Maia-2 (NeurIPS 2024) built a unified model of human chess behavior across skill levels but acknowledged it doesn't produce coherent improvement pathways. This project builds that missing map.
What you build
Take the full Lichess DB. Segment games by rating into 100-Elo bands. For each band, extract: time allocation per game phase, move-type distribution, tactical vs positional preference, blunder frequency by position type, opening diversity, and endgame conversion rate. Build a continuous behavioral map showing what specifically changes from one band to the next — not "play better moves" but concrete behavioral shifts.
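A minimal sketch of the per-game extraction step, assuming moves have already been parsed upstream (e.g. with python-chess) into records carrying think time and engine evals. The record schema, phase cutoffs, and blunder threshold here are all illustrative placeholders, not the real pipeline's:

```python
from dataclasses import dataclass
from statistics import mean

# One record per move, as produced by an upstream PGN parser.
# All field names and thresholds are illustrative.
@dataclass
class MoveRecord:
    ply: int            # half-move number, 1-based
    think_time: float   # seconds spent on this move
    eval_before: float  # engine eval (pawns, mover's POV) before the move
    eval_after: float   # engine eval after the move

def game_phase(ply: int) -> str:
    """Crude phase split by ply count; a real pipeline would use material."""
    if ply <= 20:
        return "opening"
    if ply <= 60:
        return "middlegame"
    return "endgame"

def extract_features(moves: list, blunder_threshold: float = 2.0) -> dict:
    """Per-game behavioural features: time share per phase and blunder rate."""
    time_per_phase = {"opening": 0.0, "middlegame": 0.0, "endgame": 0.0}
    blunders = 0
    for m in moves:
        time_per_phase[game_phase(m.ply)] += m.think_time
        # An eval drop of >= threshold pawns for the mover counts as a blunder.
        if m.eval_before - m.eval_after >= blunder_threshold:
            blunders += 1
    total = sum(time_per_phase.values()) or 1.0
    return {
        "time_share": {k: v / total for k, v in time_per_phase.items()},
        "blunder_rate": blunders / len(moves),
        "mean_think_time": mean(m.think_time for m in moves),
    }
```

Aggregating these per-game dicts within each 100-Elo band gives the band-level feature distributions the atlas is built from.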
Why it's publishable
CSSLab explicitly said in the Maia-2 paper that coherent improvement pathways are the missing piece. You're building a data-driven answer to that gap using the same Lichess corpus they use. A workshop paper at KDD or CHI submission on human-AI learning in chess is realistic with solid results.
Personal data connection
Once the atlas exists, plot your own behavioral fingerprint against it. See exactly where you sit, which band's patterns you're closest to, and which specific behavioral shifts would move you to the next band. Your personal project and research contribution become the same pipeline.
Lichess full DB · python-chess · Maia features · Postgres · NeurIPS/KDD target · CSSLab outreach
AI / ML concepts required
ML fundamentals: Large-scale feature extraction · Behavioural stylometry · Clustering (k-means / UMAP) · Longitudinal cohort analysis
Human-AI alignment: Skill-level modelling · Coherence across distributions · Maia-2 extension methodology
Statistical: Distributional shift analysis · Effect size measurement · Significance testing
Data engineering: Big data PGN processing · Elo band segmentation · Postgres time-series schema
Abstract, research questions & tasks
Abstract
Maia-2 (NeurIPS 2024) established that human chess behaviour can be modelled coherently across skill levels using a unified neural architecture, but explicitly acknowledged that the model does not produce actionable improvement pathways — it cannot tell a 1400-rated player which specific behavioural changes would move them to 1600. This project addresses that gap directly. Using the full Lichess database segmented into 100-Elo bands, we extract a multi-dimensional behavioural profile for each band: time allocation per game phase, move-type distribution, tactical vs positional preference, blunder category frequencies, and endgame conversion rates. The result is a continuous behavioural atlas that maps what concretely and measurably changes in how players think and decide as skill increases — turning the Maia-2 model into an actionable coaching instrument.
Research questions
- Which behavioural features — time allocation, move-type distribution, blunder category, positional preference — show the largest and most consistent change across adjacent 100-Elo bands in the Lichess population?
- Are the behavioural transitions between Elo bands gradual and continuous, or do they cluster around specific threshold ratings where qualitative shifts occur?
- Does a player's current behavioural profile predict their likely Elo band more accurately than their actual game outcomes over a short window, and can this be used for early identification of under/over-rated players?
- Which specific behavioural features at Elo band N best predict whether a player will progress to band N+1 within 6 months, based on longitudinal Lichess data?
Tasks
- Download and decompress Lichess DB monthly dumps — start with 6 months of data, ~50M games
- Build efficient PGN batch processor: extract per-game features at scale using python-chess + multiprocessing
- Segment all games into 100-Elo bands (both players) — store band-level aggregate feature distributions in Postgres
- Run Maia move-match probability at each band's Elo — use as a human-likeness baseline per band
- Compute effect sizes between adjacent bands for each feature — identify which features change most significantly
- Apply UMAP dimensionality reduction to band feature vectors — visualise the behavioural space as a 2D atlas
- Run significance testing on band-to-band feature deltas — filter to features with p < 0.01 and Cohen's d > 0.3
- Plot your personal C1 fingerprint onto the atlas — identify where you sit and which band's profile you're closest to
- Write up preliminary findings as a blog post with atlas visualisation — target r/chess and r/MachineLearning
- Email CSSLab (University of Toronto) with preliminary results and atlas methodology — propose collaboration or feedback
- Draft workshop paper for KDD 2027 or CHI 2027 — deadline typically September 2026
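The effect-size and filtering steps above can be sketched as follows. This keeps to the standard library; in the full pipeline the p < 0.01 half of the filter would come from a two-sample test (e.g. scipy.stats.ttest_ind), omitted here to stay dependency-free:

```python
from math import sqrt
from statistics import mean, stdev

def cohens_d(a: list, b: list) -> float:
    """Pooled-SD effect size between two adjacent bands' samples of one feature."""
    na, nb = len(a), len(b)
    pooled = sqrt(((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2)
                  / (na + nb - 2))
    return (mean(b) - mean(a)) / pooled

def salient_features(band_lo: dict, band_hi: dict, d_min: float = 0.3) -> dict:
    """Keep features whose band-to-band effect size clears |d| > d_min.
    In the full pipeline each survivor would also need p < 0.01."""
    out = {}
    for name in band_lo:
        d = cohens_d(band_lo[name], band_hi[name])
        if abs(d) > d_min:
            out[name] = d
    return out
```

Running this over every adjacent band pair yields, per feature, a curve of effect sizes across the rating ladder, which is exactly what the threshold-vs-gradual research question needs.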
The coaching product. Every chess improvement platform teaches GM mainlines. None personalise to your behavioral style. This uses your C1 fingerprint to find which GMs' decision-making profiles most resemble yours structurally — then pulls their games filtered to positions you actually reach and areas where you struggle.
The matching algorithm
Extract behavioral features from a curated set of historical GMs using the same C1 pipeline: time allocation patterns, positional vs tactical ratio, endgame preferences, opening diversity. Cosine similarity between your feature vector and each GM's vector across your problem position types. Output: top 3 style-matched GMs with confidence score per position category.
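A sketch of the matcher. Folding the weights into both vectors before taking cosine is one of several reasonable weighting schemes; the feature names, weights, and GM corpus here are placeholders:

```python
from math import sqrt

def cosine(u: list, v: list) -> float:
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v)))

def match_gms(player_vec: list, gm_vecs: dict, weights=None, top_k: int = 3):
    """Rank GMs by weighted cosine similarity to the player's feature vector.
    `weights` up-weights the dimensions tied to the player's weak position types."""
    w = weights or [1.0] * len(player_vec)
    p = [x * wi for x, wi in zip(player_vec, w)]
    scored = [(name, cosine(p, [x * wi for x, wi in zip(vec, w)]))
              for name, vec in gm_vecs.items()]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:top_k]
```

Per-position-category confidence scores then fall out of running the same match restricted to each category's feature subset.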
The curriculum layer
Filter matched GM's Lichess games to: your opening lines (ECO codes you play), your C2 problem position types, comparable opponent Elo. Extract the 10–15 most instructive examples. Claude generates natural language annotations explaining the GM's decision at each critical moment in terms of principles, not engine lines.
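The filtering step might look like this. The game-record schema (`eco`, `position_types`, `opponent_elo`) and the overlap-count instructiveness proxy are assumptions for illustration, not the real pipeline's:

```python
def select_study_games(games: list, player_ecos: set, weak_types: set,
                       elo_window=(1200, 1800), limit: int = 15) -> list:
    """Filter a matched GM's games to lines and positions the student
    actually reaches. `games` is a list of dicts with 'eco',
    'position_types', and 'opponent_elo' keys (assumed schema)."""
    lo, hi = elo_window
    keep = [g for g in games
            if g["eco"] in player_ecos
            and weak_types & set(g["position_types"])
            and lo <= g["opponent_elo"] <= hi]
    # Most instructive first: here, proxied by weak-type overlap count.
    keep.sort(key=lambda g: len(weak_types & set(g["position_types"])),
              reverse=True)
    return keep[:limit]
```

The surviving 10–15 games are what gets handed to the Claude annotation step.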
Why it doesn't compete with Chessable
Chessable is a content platform. This is a personalisation engine. B2B angle: sell to chess coaches as a tool that auto-generates a personalised study plan for each student. Coaches pay €30–50/month for such a tool. That's a different business model — recurring SaaS, not one-time course purchase.
C1 fingerprint · GM PGN corpus · Cosine similarity · Claude API · Coach B2B
AI / ML concepts required
Gen AI — RAG: Semantic retrieval · Context-window management · Personalised retrieval
Gen AI — LLM: Instructional annotation generation · Few-shot prompting · Long-context reasoning
Gen AI — Agents: Multi-step agentic pipeline · Memory-augmented agents (MemCollab) · Personalisation agents (PAHF)
ML fundamentals: Cosine similarity matching · Vector space modelling · Style transfer concepts
Embeddings: Behavioural feature vectors · Qdrant hybrid indexing
Abstract, research questions & tasks
Abstract
Chess coaching has long recognised that players improve fastest by studying GMs whose style resembles their own — a positional player learning from Petrosian internalises ideas more readily than from Tal. However, this matching has always been intuitive and manual, performed by experienced coaches. This project automates the match by computing cosine similarity between a student's C1 behavioural feature vector and the extracted profiles of a curated corpus of historical GMs, weighted by the student's specific weakness categories. The result is a ranked list of style-matched GMs, filtered to their games in positions the student actually reaches, annotated by an LLM with principle-based explanations rather than engine lines. This is the automated version of what a human second does — and the basis for a B2B coaching tool.
Research questions
- Can a player's behavioural feature vector (from C1) reliably identify GMs whose decision-making profiles are structurally similar across the position types that matter most to that player's weaknesses?
- Does studying games from a style-matched GM — filtered to the student's specific problem positions — produce measurable improvement in those position types compared to studying GM games chosen by theory relevance alone?
- How many annotated GM game examples are required per weakness category before a student shows measurable improvement on that category in their own games?
- Can LLM-generated principle-based annotations of GM decisions substitute for human coach annotations in terms of instructional value and player comprehension?
Tasks
- Build GM feature corpus: select 30–50 historical GMs with well-documented styles — extract behavioural feature vectors from their Lichess games using the C1 pipeline
- Implement weighted cosine similarity matcher: compare player's C1 vector against GM corpus, weighted by the player's C2 weak position categories
- Build GM game retrieval layer: for top-3 matched GMs, filter their Lichess games to player's ECO codes and weak position types
- Build position extractor: for each retrieved game, identify the 10-move window around the critical decision point
- Build Claude annotation pipeline: few-shot prompted with example annotations — produce principle-based explanation of GM's decision (not engine evaluation)
- Build curriculum sequencer: order annotated examples by difficulty (Stockfish complexity score) — serve easiest first, escalate
- Implement MemCollab memory layer: Claude and local model share Qdrant memory — Claude insights get distilled into cheaper local inference for repeat lookups
- Build coach-facing API: given a student username, return their style-matched GM, top 10 study games, and annotated curriculum — JSON output suitable for a coach dashboard
- Run pilot with 3–5 chess coaches — collect feedback on curriculum quality and relevance
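The sequencing and coach-facing output steps above, sketched together. Every field name in the payload is an assumption for illustration; the complexity score would come from the Stockfish-based scorer:

```python
import json

def build_curriculum(annotated_examples: list, student: str) -> str:
    """Order annotated examples easiest-first by complexity score and emit
    the JSON payload a coach dashboard would consume (assumed schema)."""
    ordered = sorted(annotated_examples, key=lambda ex: ex["complexity"])
    payload = {
        "student": student,
        "curriculum": [
            {"game_id": ex["game_id"],
             "annotation": ex["annotation"],
             "complexity": ex["complexity"]}
            for ex in ordered
        ],
    }
    return json.dumps(payload, indent=2)
```

Serving easiest-first and escalating is then just a matter of the client walking the `curriculum` array in order.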
The fair-play research community lacks labeled behavioral data capturing the spectrum of human move types — not just "human vs engine" but the subtler categories within human play. Your C8 work produces exactly this as a byproduct. Package it as a community contribution.
What you contribute
10K games with self-labeled move categories: opening prep (fast, theory-based), active calculation (slow, deliberate), intuitive decisions (fast, positional), time-pressure moves (fast, stressed), post-blunder moves (emotional state indicators). Combined with Maia move-match probabilities and Stockfish eval. Format compatible with Irwin and CSSLab datasets. Released on HuggingFace.
Why this matters for you
A well-documented dataset release gets cited. Citations lead to collaborations. Collaborations lead to research roles. It costs relatively little — it's a byproduct of C8 — but its research value is high because this type of granular self-labeled data genuinely doesn't exist publicly.
C8 output · Maia features · HuggingFace release · CSSLab compatible · Irwin format
AI / ML concepts required
ML fundamentals: Dataset curation and labelling · Inter-rater reliability · Behavioural annotation schema
Human-AI alignment: Fair-play detection methodology · Human vs engine stylometry · Subtle cheat pattern taxonomy
Data engineering: Dataset versioning (DVC) · HuggingFace dataset schema · Irwin-compatible format · Move-level metadata
Abstract, research questions & tasks
Abstract
Fair-play detection research is constrained by a fundamental data problem: existing labeled datasets either conflate all human moves as one category or use synthetically generated "cheating" patterns that don't reflect real subtle engine use. What's missing is a granular, move-level labeled dataset that distinguishes the spectrum of human move types — from confident theory recall to panicked time-pressure decisions — because these categories have radically different temporal and quality signatures that overlap with both clean play and subtle cheating. This project produces exactly that dataset as a byproduct of C8: 10,000 personal games with move-level labels across five behavioural categories, combined with Maia move-match probabilities and Stockfish evaluations, released publicly in a format compatible with the Irwin and CSSLab detection frameworks.
Research questions
- Do the five proposed move categories (prep, calculation, intuition, time-pressure, post-blunder) exhibit statistically distinct distributions across think time, Maia move-match probability, and eval delta — and are these distributions stable across different time controls?
- Can a classifier trained on these move-level categories distinguish subtle engine use (where only 30–40% of moves are engine-assisted) from clean play with greater accuracy than centipawn loss alone?
- What is the minimum number of labeled games required to train a reliable move-category classifier that generalises to unseen players, and how does this interact with player Elo?
- Do the temporal signatures of "preparation" moves (fast, theory-based) resemble engine-assisted moves in ways that cause false positives in current detection systems?
Tasks
- Define the labelling schema: 5 move categories with explicit decision rules for each — document edge cases and ambiguities
- Run C8 pipeline to generate think-time features for all 10K games
- Apply rule-based auto-labeller: assign initial labels using heuristics (move number + think time + clock remaining + Maia score)
- Manual review sample: review 500 auto-labeled moves for correctness — refine rules based on errors
- Compute Maia move-match probabilities at player's Elo for all moves — add as feature column
- Run Stockfish eval on all positions — add eval, best move, and depth-10 vs depth-20 eval delta as features
- Structure dataset as Parquet: game_id, move_num, FEN, move, think_time, clock_remaining, label, maia_prob, stockfish_eval, eval_delta, phase, ECO
- Version with DVC — track dataset lineage from raw PGN to final Parquet
- Write HuggingFace dataset card: schema description, labelling methodology, intended use, known limitations, citation format
- Release publicly and post to chess fair-play research forums and CSSLab GitHub discussions
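The rule-based auto-labeller reduces to a small decision cascade over the per-move features. Every threshold below is a placeholder to be tuned against the 500-move manual review, and the function name is illustrative:

```python
LABELS = ("opening_prep", "active_calculation", "intuitive",
          "time_pressure", "post_blunder")

def auto_label(move_num: int, think_time: float, clock_remaining: float,
               maia_prob: float, prev_eval_delta: float) -> str:
    """Heuristic first-pass label for one move. All thresholds are placeholders.
    `prev_eval_delta` is the eval loss (in pawns) of the player's previous move."""
    if prev_eval_delta >= 2.0:      # just blundered: emotional-state indicator
        return "post_blunder"
    if clock_remaining < 30.0:      # under 30s on the clock: fast, stressed
        return "time_pressure"
    if move_num <= 12 and think_time < 3.0 and maia_prob > 0.5:
        return "opening_prep"       # early, fast, typical human move: theory recall
    if think_time >= 15.0:
        return "active_calculation" # slow, deliberate
    return "intuitive"              # fast, positional default
```

Disagreements between these labels and the manual-review sample are exactly the signal for refining the rules before the full 10K-game pass.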