r/kaggle | reddlx

▲ 3 r/kaggle+3 crossposts

How do you actually win a Kaggle competition?

Short of winning, how do you rank well?

u/nadim-srabon — 2 days ago

▲ 11 r/kaggle+5 crossposts

I engineered 102 leakage-free ML features from 49,000+ international football matches (1872–2026) and published it as a free dataset

Been working on a football prediction project and couldn't find a dataset that had the actual context needed to model match outcomes, just raw results everywhere. So I built one from scratch on top of the International Football Results dataset by Mart Jürisoo (the well-known one on Kaggle, with 49,000+ matches going back to 1872).

What I added:

**Elo ratings**: built from scratch, updated after every single match across 150 years. Both teams' ratings, their difference, and the expected win probability going into each match.

**Rolling form**: win rate, goals scored, goals conceded, goal difference, clean sheet rate, both-teams-scored rate, scoring rate, and win streak. Computed at three lookback windows (last 5, last 10, and last 20 matches) for both teams.

**Head-to-head history**: based on the last 10 meetings between those two specific teams. Some teams have persistent edges over specific opponents that their general form doesn't explain.

**Fatigue signals**: days since each team's last match and the difference between the two.

**Penalty reliance**: fraction of each team's historical goals that came from penalties, pulled from the goalscorer dataset.

**Shootout composure**: historical penalty shootout win rate for each team, from the shootouts dataset.

**Tournament context**: World Cup, qualifier, friendly, neutral venue, competition importance weight, confederation.
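
For readers who haven't built Elo before, here is a minimal sketch of the rating loop described above. It is not the dataset's actual code: the K-factor, the 1500 starting rating, and the lack of a home-advantage or tournament-weight term are all assumptions, and the function names are made up for illustration. Note that the standard Elo "expected win probability" is really an expected score, with a draw counted as half a win.

```python
from collections import defaultdict

K = 20.0        # assumed update step; the post does not state its K-factor
START = 1500.0  # assumed starting rating for a team with no history


def expected_home_score(r_home: float, r_away: float) -> float:
    """Standard Elo expectation for the home side (no home-advantage term)."""
    return 1.0 / (1.0 + 10.0 ** ((r_away - r_home) / 400.0))


def elo_features(matches):
    """matches: (home, away, result) tuples sorted by date, result in 'H'/'D'/'A'."""
    ratings = defaultdict(lambda: START)
    score = {"H": 1.0, "D": 0.5, "A": 0.0}
    rows = []
    for home, away, result in matches:
        r_h, r_a = ratings[home], ratings[away]
        exp_h = expected_home_score(r_h, r_a)
        # Record the pre-match view first ...
        rows.append({"elo_home": r_h, "elo_away": r_a,
                     "elo_diff": r_h - r_a, "exp_home": exp_h})
        # ... and only then update both ratings with the result.
        delta = K * (score[result] - exp_h)
        ratings[home] = r_h + delta
        ratings[away] = r_a - delta
    return rows, dict(ratings)
```

Each update moves the two teams by equal and opposite amounts, so the total rating across any pair stays constant.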

The thing I spent the most time on: every feature is computed in strict chronological order using only data that existed before that match was played. State updates happen after each row is recorded, never before. No lookahead, no leakage anywhere in the 102 columns.
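
The "record first, update after" rule is easy to get wrong, so here is a small sketch of it for one rolling-form feature. This is an illustration under assumptions, not the dataset's code: the tuple layout, the `rolling_win_rate` name, and the choice to return None for teams with no history are all hypothetical. It also assumes ISO-formatted date strings and that a team does not play twice on the same date.

```python
from collections import defaultdict, deque


def rolling_win_rate(matches, window=5):
    """matches: (date, home, away, result) tuples; result in 'H'/'D'/'A'.

    Returns one row per match with each side's win rate over its previous
    `window` matches, or None when the team has no earlier matches.
    """
    history = defaultdict(lambda: deque(maxlen=window))  # team -> recent 1/0 wins

    def rate(team):
        past = history[team]
        return sum(past) / len(past) if past else None

    rows = []
    for date, home, away, result in sorted(matches, key=lambda m: m[0]):
        # 1) Read features from state built only from earlier matches.
        rows.append({"date": date, "home_form": rate(home), "away_form": rate(away)})
        # 2) Only now fold this match's outcome into the state.
        history[home].append(1 if result == "H" else 0)
        history[away].append(1 if result == "A" else 0)
    return rows
```

Swapping steps 1 and 2 would leak the current result into its own features, which is exactly the lookahead the post is guarding against.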

102 features total across 49,094 rows, with the result column (H/D/A) included as the label. Use result as the target, drop it and date from the inputs, and the rest plugs into any classifier.

Dataset is fully documented with column descriptors for every feature.

Link: https://www.kaggle.com/datasets/kriishgulati/football-match-results-1872-2026-with-ml-features

Built on top of the original dataset by Mart Jürisoo; full credit and link are in the dataset description.

u/Kriish_Gulati — 3 days ago

▲ 7 r/kaggle+3 crossposts

A Question on Fairness in the Amazon ML Challenge Evaluation

I recently appeared for the Amazon ML Challenge on Unstop.

I was assigned a problem set involving Binary Search Trees (BST) and Fenwick Trees (Binary Indexed Trees). I successfully solved both coding questions with optimal solutions, completed all MCQs and the SOP, and still had 18 minutes remaining on the clock.

Despite this, I was not shortlisted for the next round.

At the same time, many participants reported receiving significantly easier problem sets involving basic arrays and strings.

This raises a genuine question:

How was the evaluation normalized across candidates who received vastly different levels of difficulty?

Was there difficulty-based scaling?

Was time remaining used as a tie-breaker?

Were different question sets weighted differently?

How can candidates be confident that they were evaluated on a level playing field?

I completely understand that large-scale assessments with 75,000+ participants require automated evaluation systems. However, transparency in the evaluation criteria is equally important, especially when candidates are given different difficulty levels.

This post is not about a rejection.

It is about understanding whether the selection process adequately accounts for variations in question difficulty and whether candidates are being compared fairly.

I would appreciate any clarification from the organizers regarding the evaluation methodology.

#AmazonMLChallenge #Unstop #CompetitiveProgramming #DataStructures #Algorithms #FairEvaluation #HiringChallenges

u/Wrong_Hall_3079 — 4 days ago

▲ 1 r/kaggle+1 crossposts

Same agent, same code, same Docker image 14 min apart: Kaggle scores still spread 0.802-0.821 even at temp 0. How many runs before you trust an agent-eval delta?

I run my own agent evals and keep getting burned by run-to-run drift, so here's a clean, honest data point and two real questions.

An LLM tool-use agent (a loop that writes and runs its own pipeline) solved one Kaggle task eight times. Same code path, same harness, same container. The eight scores, in run order:

The spread crosses three tiers on a single task. Gold landed on run 2 and nothing after came close. Runs 7 and 8 are the same Docker image 14 minutes apart and still differ: 0.80460 vs 0.80230.

The tiers are MLE-bench thresholds derived from the original Kaggle leaderboard percentiles, not Kaggle medals. The task is Spaceship Titanic, a Getting Started tabular comp that awards no medal at all. I'm calling this out up front because it matters for how much you should read into the numbers (see caveats below).

Why this isn't shocking, but the magnitude is. Single-run scores swing, and the spread stays above a full point even at temperature 0, so LLM sampling isn't the main driver. My working list of suspects, in rough order: inference-engine nondeterminism (batch size and batch invariance, the Thinking Machines angle), tool-result and state drift cascading through the agent loop, BLAS/CUDA and library version effects, GBDT thread scheduling, and data-load order. What got me is the size of it on a "solved" tutorial task. To trust a ~2% delta between two agents here, the variance says you'd want roughly nine runs each. So the gold is one lucky roll, and I'm labelling it as exactly that.
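
For anyone who wants to redo the runs-per-agent arithmetic on their own numbers, a rough sketch follows. It uses the textbook normal-approximation sample size for comparing two means; that choice, along with the 5% significance level and 80% power defaults, is an assumption and not necessarily how the "nine runs" figure above was derived. `runs_per_arm` is a hypothetical helper name.

```python
import math
from statistics import NormalDist


def runs_per_arm(sigma: float, delta: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Runs needed per agent for a two-sided, two-sample z-test to detect `delta`.

    sigma: run-to-run standard deviation of the score (assumed equal for both agents)
    delta: smallest score difference you want to be able to detect
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)
    z_beta = z.inv_cdf(power)
    n = 2.0 * ((z_alpha + z_beta) * sigma / delta) ** 2
    return math.ceil(n)


# Illustrative inputs only; sigma here is not computed from the post's run log.
print(runs_per_arm(sigma=0.015, delta=0.02))
```

The required count scales with (sigma / delta) squared, so halving the delta you care about roughly quadruples the number of runs.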

The reproducibility post-mortem. The submission.csv that cleared the gold bar lived in /tmp and got wiped. The exact winning artifact is gone and I can't reproduce it. I later wrote a clean, seed-pinned solver so the result is at least repeatable - but that's a reimplementation written after the fact. The run that actually won was the agent's stochastic loop, and it no longer exists.

Honest caveats: tiny n (8), no controlled sweep, and a tutorial comp with thousands of public solutions - so a real chunk of this score is recall/memorization, not reasoning. Top-percentile framing here is a strength reference only, not an achievement.

Two genuine questions:

1. When your seed is fixed but the pipeline still isn't deterministic run to run, what's usually the dominant culprit for you? Inference-engine batching, tool/state drift, BLAS, something else?
2. If you eval agents (or heavy ensembles), how many runs before you trust a delta? Does anyone actually report mean +/- std instead of best-of-k?
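
To make the second question concrete, here is a small sketch of the mean +/- std report next to best-of-k. The score list is a made-up placeholder, not the eight runs from this post, and `summarize` is a hypothetical helper.

```python
import math
import statistics


def summarize(scores):
    """Return mean, sample std, standard error, and best-of-k for one agent's runs."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample std (n - 1 in the denominator)
    return {
        "n": len(scores),
        "mean": mean,
        "std": std,
        "sem": std / math.sqrt(len(scores)),
        "best": max(scores),
    }


# Placeholder scores for illustration only; these are not the post's eight runs.
example = [0.80, 0.82, 0.81, 0.80, 0.81, 0.80]
report = summarize(example)
print(f"{report['mean']:.4f} +/- {report['std']:.4f} "
      f"(n={report['n']}), best-of-k {report['best']:.4f}")
```

Best-of-k can only stay level or rise as k grows, so it rewards whoever ran more times; the mean and its standard error do not have that bias.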

Links (my own work, for the curious - posted as repro artifacts, not the point):

Kaggle notebook: https://www.kaggle.com/code/georgymamarin/agents-grading-agents-spaceship-titanic-mle-bench
Repo (agent code + deterministic solver): https://github.com/dmagog/mle-purple-agent
Detailed writeup (RU): https://habr.com/ru/articles/1050562/

u/FishermanNo7658 — 10 days ago

▲ 2 r/kaggle

I reverse-engineered why ~1 in 4 teams score 0 in a Kaggle agent-security comp — it's a timeout wall, not a bug

Quick share for anyone in the "AI Agent Security – Multi-Step Tool Attacks" competition (or just curious how code-competition graders behave).

Lots of submissions complete green but score blank, and ~1 in 4 teams on the board sit at exactly 0. I dug into the grader: the blank is a timeout, not a code bug — and the binding limit is total decode tokens, not the number of attacks you submit.

I put it in a short, CPU-only notebook (runs in seconds, no GPU/model) that proves: only 2 of 4 attack types can ever score, the board is linear (score ≈ 0.09·N), every framing collapses onto one "decode-token wall", and the fix is output-suppression framing. Plus a graveyard of 6 dead-end levers.

Notebook: https://www.kaggle.com/code/souldrive/why-your-attack-completes-but-scores-blank

Happy to answer questions / take corrections.

u/Fukagami — 14 days ago

▲ 2 r/kaggle+2 crossposts

I built a full ML pipeline on a Kaggle dataset and proved it has zero predictive signal — and shipped the null result instead of faking accuracy

A failure mode I see constantly — in portfolios and in vendor models at work —
is reporting a great ROC-AUC without ever asking whether the dataset contains
any signal at all. So I built the opposite: a pipeline designed to falsify its
own results before trusting them.

I took a public BMW sales dataset (50k rows, 2010–2024) and ran the full stack:
econometrics, gradient boosting (XGB/LGBM/CatBoost), a tabular MLP, SHAP. Every
model landed at no-skill — regression R² ≈ 0, classification AUC ≈ 0.51.

Instead of torturing the data, I ran two checks I now apply by default:

- Permutation / label-shuffle test: refit on shuffled labels. If your "real"
score sits inside the shuffled distribution (here p ≈ 0.90), you have nothing.
- Positive control: push a synthetic target with known structure through the
exact same pipeline. It hit R² ≈ 0.86 — proving the pipeline is sound and the
data is the problem, not the code.
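
A stripped-down sketch of both checks, for anyone who wants to try them without the full stack. Big assumption: the "model" here is a one-feature least-squares fit scored by R², standing in for the real pipeline (XGB/LGBM/CatBoost, MLP), and all data is synthetic. The helper names and the 500-shuffle default are illustrative choices.

```python
import random


def r_squared(x, y):
    """R^2 of a one-feature least-squares fit (the squared Pearson correlation)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy * sxy / (sxx * syy)


def permutation_p_value(x, y, n_perm=500, seed=0):
    """Share of label shuffles that score at least as well as the real labels."""
    rng = random.Random(seed)
    real = r_squared(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if r_squared(x, shuffled) >= real:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # +1 keeps the estimate away from exactly 0


rng = random.Random(42)
x = [rng.gauss(0, 1) for _ in range(300)]
noise_target = [rng.gauss(0, 1) for _ in x]                # no relation to x
control_target = [2.0 * a + rng.gauss(0, 0.5) for a in x]  # planted structure

p_noise = permutation_p_value(x, noise_target)      # shuffle test on a no-signal target
p_control = permutation_p_value(x, control_target)  # positive control
print(p_noise, p_control)
```

If the positive control does not come back with a tiny p-value, the pipeline is broken and the null result on the real target means nothing.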

I also found the classification target was a deterministic threshold on the
volume column — textbook target leakage that gives a fake 1.00 AUC. Remove it
and AUC collapses to chance.

Since the data can't forecast, the actual deliverable is an explicit what-if
simulator (constant-elasticity demand, literature-grounded priors, Monte-Carlo
intervals) — clearly labelled as a model of assumptions, never a fit to history.
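
For a sense of what such a simulator computes, here is a minimal sketch of a constant-elasticity what-if with Monte Carlo intervals. It is not the repo's implementation: the normal prior on elasticity, its mean and spread, the baseline volume, and the `simulate_demand` name are all placeholder assumptions.

```python
import random
import statistics


def simulate_demand(q0, price_ratio, eps_mean, eps_sd, n_draws=10_000, seed=0):
    """Monte Carlo over a constant-elasticity curve: Q1 = Q0 * (P1 / P0) ** eps."""
    rng = random.Random(seed)
    draws = sorted(q0 * price_ratio ** rng.gauss(eps_mean, eps_sd)
                   for _ in range(n_draws))
    return {
        "median": statistics.median(draws),
        "p05": draws[int(0.05 * n_draws)],
        "p95": draws[int(0.95 * n_draws)],
    }


# Hypothetical inputs: 1,000 baseline units, a 10% price rise, elasticity prior N(-1.2, 0.3).
out = simulate_demand(q0=1000.0, price_ratio=1.10, eps_mean=-1.2, eps_sd=0.3)
print(out)
```

The interval reflects only uncertainty in the assumed elasticity prior; nothing here is fitted to the sales history.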

The whole thing is reproducible (Docker, CI, tests) with a live demo so you can
click through the leakage proof yourself. Genuinely curious where this breaks:
what would you put on a "does this dataset have any signal?" checklist?

Live demo: https://maxime2476-bmw-sales-analytics.hf.space/
Repo: https://github.com/maxime2476/bmw-sales-analytics

u/GoalMaxROI — 13 days ago