Natural Language Inference (NLI) is a task where you determine the logical relationship between two sentences: a premise and a hypothesis.
The three possible relationships are entailment, neutral, and contradiction.
As an annotator, you'll do two things for each example: mark key token spans with the labels described below, and rate the example's difficulty along several dimensions.
Press 1-9 to choose what you're marking, use j/k + Space to select tokens, then press Enter or click "Save & Next".

Text is split using WordPiece tokenization (the same as ModernBERT). Sometimes words are split into subword tokens, shown connected with a small dash. For example, "running" might become "run" + "##ning".
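If you want to see this splitting for yourself, the sketch below uses the Hugging Face transformers tokenizer API; bert-base-uncased is chosen only as a convenient WordPiece tokenizer for illustration and may not match the tokenizer this tool actually runs.

    # A minimal sketch of WordPiece subword splitting, assuming the Hugging Face
    # "transformers" package is installed. bert-base-uncased is used purely as a
    # stand-in WordPiece tokenizer for illustration.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    for word in ["running", "annotator", "unbelievably"]:
        print(word, "->", tokenizer.tokenize(word))
    # Pieces that continue a word are printed with a "##" prefix.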
Mental test: "If I deleted this word, would the relationship still hold?"
Definition: Tokens that establish the logical bridge between premise and hypothesis.
What to Mark:
P: A [man] plays [guitar] → H: A [person] plays [music]
P: The cat is [sleeping] → H: The cat is [resting]
P: [Three] dogs → H: [Some] dogs
Do NOT Mark: determiners (a, the) unless they change meaning; punctuation; background details irrelevant to the inference.
Definition: Tokens that cannot both be true simultaneously.
What to Mark:
P: [open] → H: [closed]
P: [All] students → H: [No] students
P: [standing] → H: [sitting]
P: [John] won → H: [Mary] won (if same event)
Do NOT Mark: tokens that differ but aren't contradictory; background tokens consistent between both sentences.
Definition: Tokens introducing information not addressed by the other sentence. This is the trickiest category: neutral means "could be true or false given the premise."
What to Mark:
P: A woman is walking → H: A woman is walking to [the store]
P: Two men are talking → H: Two [friends] are talking
P: A person is eating → H: A person is eating [breakfast]
Decision test: "Does the premise give evidence for or against this token?" If neither, mark it as a neutral span.
Use these labels to mark tokens that contribute to different types of difficulty:
reasoning: Tokens requiring logical inference, deduction, or multi-step reasoning to understand.
creativity: Tokens requiring imaginative or non-literal interpretation (metaphors, analogies, figurative language).
domain_knowledge: Tokens requiring specialized knowledge (scientific, technical, cultural, etc.).
contextual: Tokens that depend on implicit context not explicitly stated in the text.
constraints: Tokens representing conditions, limitations, or requirements that must be tracked.
ambiguity: Tokens that are ambiguous or could be interpreted multiple ways.
Use these labels to mark tokens that are key evidence for the NLI relationship:
entailment: Tokens that support or prove the hypothesis follows from the premise.
neutral: Tokens that introduce information not determinable from the premise.
contradiction: Tokens that conflict or are inconsistent between premise and hypothesis.
Click "+ New Label" to create custom labels for patterns you want to track. You can rename any label by clicking on its name.
Rate each example on a scale of 0-10 for the six dimensions: reasoning, creativity, domain_knowledge, contextual, constraints, and ambiguity.
Press 1 to select "reasoning", use j/k to navigate and Space to mark. Set the difficulty with Shift+1, then type 7. Hit Enter to save. Full keyboard annotation, no mouse required!
For programmatic access to the annotation data, the API is fully documented with interactive endpoints:
Swagger UI (/docs) and ReDoc (/redoc)
GET /api/next - Get the next unlabeled example
GET /api/example/{dataset}/{id} - Get a specific example
POST /api/label - Submit annotations
GET /api/stats - Get annotation statistics
GET /api/export - Export all annotations as JSONL
The API uses session-based authentication via cookies. Log in through the web interface or use the /api/auth/login endpoint.
If the server is running with ANONYMOUS_MODE=1, no authentication is required.
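As a rough illustration of a client, the sketch below logs in, fetches the next example, and submits a label using the Python requests library. The base URL and all payload field names are assumptions, so check the Swagger UI at /docs for the authoritative request and response schemas.

    # Rough sketch of programmatic access using the "requests" library. The base
    # URL, login payload fields, and label payload shape are assumptions; consult
    # the Swagger UI at /docs for the real schemas.
    import requests

    BASE = "http://localhost:8000"   # assumed host and port
    session = requests.Session()     # keeps the session cookie across calls

    # Skip the login step if the server runs with ANONYMOUS_MODE=1.
    session.post(f"{BASE}/api/auth/login",
                 json={"username": "alice", "password": "secret"})  # assumed fields

    # Fetch the next unlabeled example and submit an annotation for it.
    example = session.get(f"{BASE}/api/next").json()
    session.post(f"{BASE}/api/label", json={
        "dataset": example.get("dataset"),   # illustrative field names only
        "id": example.get("id"),
        "spans": [],                         # token spans you marked
        "difficulty": {},                    # 0-10 ratings per dimension
    })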
Exported data is in JSONL format, with each line containing one annotation record.
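As an illustration, a downloaded export could be read like this; the field names in the sketch are assumptions, not the tool's documented schema.

    # Sketch of reading the export (one JSON object per line). The field names
    # below are assumptions made for illustration; inspect your own export to
    # see the actual keys.
    import json

    with open("annotations.jsonl", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            print(record.get("dataset"), record.get("id"),
                  len(record.get("spans", [])), record.get("difficulty"))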
Welcome! Before you start annotating, please complete a short training to familiarize yourself with the labeling process.
You'll annotate 5 gold-standard examples and receive feedback on each one.
Your training accuracy is shown as you go. Once the training is complete, you can start annotating real examples.
For any dimension rated above 5, click it to select it, then mark the tokens that justify that rating.
When you flag an example, you'll be asked: "What's unusual about this example?"
Examples with low agreement scores that may need review
Review individual annotator quality. Select a user to see their annotations compared against consensus.
Select a user to begin review.
Gold examples are used to train new annotators. Each example needs:
To add a gold example, annotate any example in the Quality Review tab and click "Mark as Gold".
Review flagged examples - both auto-flagged (low agreement) and manually flagged by annotators.