I Help Train AI. Here's What I Learned About How They Learn.

A look inside the human-powered pipeline that makes AI smarter and the inherent security risks.

A quick note to begin with: I currently work as a contractor on an AI training platform. This work comes with a confidentiality agreement, so I won't be naming the platform, discussing specific projects, or sharing any internal processes or proprietary details. Everything in this post comes from publicly available information, research, and my personal reflections on the experience. I'll be focusing on what the work has taught me, not on exactly what the work is.

Now with that out of the way...have you ever considered who is actually teaching AI how to think? You probably know that it needs some kind of human input to tailor its responses to human preferences. Without that input, the response to a prompt like "How do I make scrambled eggs?" could look something like this:

Untrained: Eggs are the reproductive bodies produced by female birds. The domestic chicken (Gallus gallus domesticus) produces eggs that are widely consumed. Scrambling refers to a method of agitation during thermal application...

As opposed to:

Trained: Crack 2-3 eggs into a bowl, whisk with a pinch of salt. Heat butter in a nonstick pan over medium-low, pour in the eggs, and stir gently with a spatula. Pull them off the heat just before they look done - they'll finish cooking on the plate.

The untrained response is still somewhat relevant, but it isn't tailored to what the person actually wanted. Behind every polished chatbot response is a pipeline of humans labeling data, ranking outputs, correcting mistakes, and making judgment calls that get baked into the model's behavior permanently. And I have been in that pipeline! One of your favorite chatbot's responses may have been permanently altered because Danielle thought it missed the mark.

What are Data Annotation Platforms?

At a high level, these platforms connect domain experts, ranging from software engineers to medical practitioners to writers and teachers (hello, cybersecurity for me), with AI companies that need human judgment to improve their models. The work on these platforms varies; contributors might review AI output for accuracy or rank multiple model responses from best to worst. There is a wide variety of tasks I haven't listed, but the common idea is that all of it requires nuanced human reasoning that algorithms and programming can't replicate on their own.

So What, You Like, Annotate Data?

Data annotation is essentially the process of labeling raw data so that machine learning models can make sense of it. For example, if you handed a toddler a picture book with no words, they'd see colors and shapes but wouldn't know what anything actually is. Data annotation is the act of writing in those labels -- "This is a dog," "this sentence is sarcastic," "this code has a bug on line 5." Once you do that with enough quality and consistency, the model starts recognizing patterns on its own.

This concept isn't new, but the scale of it is. Modern large language models (LLMs) are trained on massive datasets. The smallest unit of text that an LLM processes is a token: not exactly a word but not a single character either, usually somewhere in between. It is a "chunk" that makes sense to the model.

For example:

"cybersecurity" → might be split into ["cyber", "security"] = 2 tokens
"cat" = 1 token
"I don't know" → ["I", "don", "'t", "know"] = 4 tokens
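To make the splitting above concrete, here is a minimal sketch of a greedy longest-match tokenizer. The vocabulary (VOCAB) and the function names are made up for illustration; real LLM tokenizers learn their vocabulary from data and may split these same strings differently.

```python
# Toy greedy longest-match tokenizer.
# VOCAB is a hypothetical, hand-picked vocabulary (an assumption for this
# sketch); production tokenizers learn theirs from large text corpora.
VOCAB = {"cyber", "security", "cat", "I", "don", "'t", "know", " "}

def tokenize(text, vocab=VOCAB):
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry that matches at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Nothing matched: fall back to a single-character token.
            tokens.append(text[i])
            i += 1
    return tokens

def count_tokens(text):
    # Spaces are left out of the count only so the numbers line up with
    # the examples above; many real tokenizers fold spaces into tokens.
    return len([t for t in tokenize(text) if t != " "])

print(tokenize("cybersecurity"))
print(count_tokens("I don't know"))
```

With this toy vocabulary, "cybersecurity" comes out as two chunks and "I don't know" as four, matching the examples above.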

A short email might be ~100-200 tokens and a full page of text is roughly 300-400 tokens. I bring this up so you can understand the scale of these datasets; modern LLMs are trained on tens of trillions of tokens. For comparison, if you typed 80 words per minute, 8 hours a day, every single day, no potty breaks, it would take you roughly 850,000 years to type out 15 trillion tokens' worth of text - assuming the Earth is still kicking at that point.
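If you want to check the typing math yourself, here is a rough back-of-the-envelope sketch. The words-per-token ratio is my assumption (rules of thumb usually land somewhere around 0.75 to 0.8 words per token), and the function name is just for illustration.

```python
WORDS_PER_MINUTE = 80
HOURS_PER_DAY = 8
DAYS_PER_YEAR = 365        # typing every day, no weekends off
WORDS_PER_TOKEN = 0.8      # assumption: a rough rule of thumb, not a fixed constant

def years_to_type(tokens, words_per_token=WORDS_PER_TOKEN):
    words = tokens * words_per_token
    words_per_day = WORDS_PER_MINUTE * 60 * HOURS_PER_DAY  # 38,400 words a day
    return words / words_per_day / DAYS_PER_YEAR

print(f"{years_to_type(15e12):,.0f} years")
```

Nudging the ratio between 0.75 and 0.8 moves the answer around a bit, but it stays in the neighborhood of 800,000 to 860,000 years.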

The quality of annotations in these massive datasets directly determines how smart or how terribly wrong a model turns out to be.

How Does a Model Actually Learn?

Modern LLMs go through roughly 3 stages of learning:

Stage 1: Pre-Training

The model ingests an enormous amount of text from a variety of sources (books, websites, code repos, academic papers) and learns to predict the next word in a sequence. This is unsupervised (more precisely, self-supervised) learning on a massive scale: the text itself supplies the right answer, so no human labels are needed. The model isn't understanding anything yet; it is simply building a statistical map of how language works based on the inputs it has been given. At the completion of this stage, the model can generate text that isn't complete gibberish, but it has no concept of what is helpful or harmful. It is a prediction engine, not a judgment engine.
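As a very loose illustration of "predict the next word," here is a tiny counting model that looks at which word most often follows another. This is a deliberately simplified stand-in (a bigram counter), not how an LLM is actually built; real models use neural networks over tokens. The corpus and function names are invented for the example.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    # For every word, count which words were seen immediately after it.
    counts = defaultdict(Counter)
    words = text.lower().split()
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts

def predict_next(counts, word):
    followers = counts.get(word.lower())
    if not followers:
        return None  # the model never saw this word during "training"
    return followers.most_common(1)[0][0]

# Hypothetical toy corpus standing in for the training text.
corpus = "the cat sat on the mat the cat ate the fish"
model = train_bigram(corpus)

print(predict_next(model, "the"))
print(predict_next(model, "dog"))
```

Here "the" is followed by "cat" more often than anything else, so that is the prediction. The model has no idea whether "cat" is a helpful answer; it only knows what tends to come next.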

Stage 2: Instruction Tuning

Now humans get involved! The model is shown examples of prompts paired with high-quality responses, basically demonstrating what "good" actually looks like. It begins pattern-matching against these curated good examples to understand good behavior. Stage 2 doesn't teach the model to evaluate good vs bad, but it does teach the model to imitate good.
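Here is a minimal sketch of what one of those prompt-and-response demonstrations might look like once it is turned into training text. The field names, the "User:/Assistant:" template, and the idea of only learning from the response portion are assumptions for illustration (a common choice, but not a universal one), not a description of any specific platform or model.

```python
# Hypothetical demonstration pair; the field names are made up for this sketch.
examples = [
    {
        "prompt": "How do I make scrambled eggs?",
        "response": "Crack 2-3 eggs into a bowl, whisk with a pinch of salt.",
    }
]

def build_training_example(pair):
    # Assumed template: the prompt is context, the response is what to imitate.
    prefix = f"User: {pair['prompt']}\nAssistant: "
    text = prefix + pair["response"]
    # True marks the characters the model is asked to reproduce.
    # Masking out the prompt is an assumption, not a rule.
    learn_mask = [False] * len(prefix) + [True] * len(pair["response"])
    return text, learn_mask

text, learn_mask = build_training_example(examples[0])
print(text)
print(sum(learn_mask), "of", len(learn_mask), "characters are imitation targets")
```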

Stage 3: Reinforcement Learning from Human Feedback (RLHF)

This is where data annotation platforms play a critical role. After stage 2, the model can follow directions, but its outputs are still somewhat inconsistent. Some responses are great while others are too verbose or subtly wrong. Here's where RLHF steps in. The model generates several different responses to the same prompt. Then, human annotators (me!) evaluate those responses and rank them based on given criteria and rubrics. Typically you are ranking them on helpfulness, harmlessness, and honesty. This ranking data is then used to train what's called a reward model, which is essentially a scoring function that predicts how a human would rate any given output. The original model is then fine-tuned using reinforcement learning to maximize that reward signal.
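To show how a ranking can become a training signal, here is a small sketch. It turns one annotator ranking into "preferred vs. rejected" pairs and scores a pair with a Bradley-Terry style loss, which is one commonly described way to train reward models. The function names and the example scores are mine, and real pipelines differ in the details.

```python
import math
from itertools import combinations

def ranking_to_pairs(ranked_responses):
    # Input is ordered best-first, so in every pair the first item
    # is the one the annotator preferred.
    return list(combinations(ranked_responses, 2))

def preference_loss(score_preferred, score_rejected):
    # Bradley-Terry style loss: -log(sigmoid(preferred - rejected)).
    # Small when the reward model scores the preferred response higher.
    return math.log(1 + math.exp(-(score_preferred - score_rejected)))

# Hypothetical ranking of three responses to the same prompt.
pairs = ranking_to_pairs(["response_A", "response_B", "response_C"])
print(pairs)

# Made-up reward model scores, purely for illustration.
print(preference_loss(2.0, 0.0))  # reward model agrees with the human
print(preference_loss(0.0, 2.0))  # reward model disagrees, so the loss is larger
```

One ranking of three responses yields three comparisons, which is part of why ranking tasks are such an efficient way to collect preference data.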

The result is a model that generates text that aligns with what humans actually prefer. The key insight here is that models are only as good as the human feedback they learn from. The feedback is like a steering wheel, and annotators have their hands on it, keeping the model on the road rather than driving right off a cliff.

Training Data as an Attack Vector

Anyone with a security mindset may immediately think, "well, attackers want to get their hands on that steering wheel and drive the model right where they want it." Wouldn't it be great for an attacker to train a widely used LLM to subtly write backdoors or vulnerabilities into generated code? Let's see how possible that actually is.

Training Data Poisoning

Data poisoning means deliberately introducing corrupted or misleading data into a model's training set to alter its behavior. This isn't about tricking a model at runtime with a clever prompt, but about implanting an attack during the training stages. You might expect the effects to be diluted by how enormous the datasets are, but a study published in Nature Medicine found that replacing just 0.001% of training tokens with medical misinformation was enough to produce models that propagated medical errors. Even worse, the corrupted models matched the performance of clean models on standard benchmarks, meaning the poisoning was basically undetectable through the normal evaluation process. Currently, OWASP's 2025 Top 10 for LLM Applications lists Data and Model Poisoning as LLM04.
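To get a feel for how small 0.001% is, here is a quick calculation. The 15 trillion token figure is the illustrative dataset size from earlier in this post, not the size of the corpus used in the study, and the function name is my own.

```python
def tokens_at_fraction(total_tokens, parts_per_100k=1):
    # 0.001% is the same as 1 part in 100,000.
    # Integer math keeps the result exact.
    return total_tokens * parts_per_100k // 100_000

# Assumption: the 15 trillion token dataset from the typing example.
print(f"{tokens_at_fraction(15_000_000_000_000):,} tokens")

# The same fraction of a 1 million token dataset.
print(f"{tokens_at_fraction(1_000_000):,} tokens")
```

Put another way, it is one token out of every hundred thousand. That is a tiny slice to hide in, and a very large haystack for a defender to search.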

The Human Aspect

So if RLHF depends on human annotators providing honest, thoughtful feedback, what happens when an annotator is adversarial?

A malicious rater can absolutely manipulate preference rankings by subtly up-ranking harmful or low-quality outputs to steer model behavior in a specific direction. The attack would be virtually undetectable because nothing technical gets broken: you don't need to compromise the infrastructure or exploit any software vulnerabilities. You just need to corrupt the model's judgment.
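Here is a toy simulation of why a targeted rater is hard to spot. Everything in it is hypothetical: the item format, the "auth" topic, the 1-in-20 ratio, and the idea that quality is checked by overall agreement with a known-good label. It is a sketch of the concept, not of any real platform's process.

```python
def adversarial_label(item, target_topic):
    # Flip the preference only on the topic the attacker cares about;
    # label everything else honestly to blend in.
    if item["topic"] == target_topic:
        return "B" if item["better"] == "A" else "A"
    return item["better"]

def agreement(items, labels):
    # Fraction of labels matching the known-good answer.
    matches = sum(1 for item, label in zip(items, labels) if label == item["better"])
    return matches / len(items)

# Hypothetical batch: 1 comparison about authentication code, 19 about anything else.
items = [{"topic": "auth", "better": "A"}] + [
    {"topic": "general", "better": "A"} for _ in range(19)
]
labels = [adversarial_label(item, "auth") for item in items]

overall = agreement(items, labels)
targeted = agreement(items[:1], labels[:1])
print(f"overall agreement: {overall:.0%}")
print(f"agreement on target topic: {targeted:.0%}")
```

In this made-up batch the rater agrees with the reference answer 95% of the time overall, which looks like a solid contributor, while being wrong every single time on the one topic they are trying to steer.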

Evaluators do check that annotators are following pre-determined guidelines, but subtle up-rankings may be difficult to catch after looking at the same data for hours. A 2025 paper in Scientific Reports specifically calls out that no defense frameworks currently exist to mitigate malicious human-in-the-loop attacks in the RLHF process. The annotator pipeline is almost entirely a trust-based system - exactly the type of system adversaries love to exploit. Many of these online contract-based data annotation platforms do have pre-screening questions and assessments before allowing you access to their projects, but these mainly evaluate your domain-specific knowledge and whether you will submit good work. An attacker would simply need to pass the assessment to be granted access.

What This Work Taught Me

I came into AI training work as a cybersecurity professional looking for interesting contract work. I had always understood that AI needed training but never considered the people behind that training. Now that I have been a part of that cohort, I see AI training pipelines are still maturing security-wise. The incentives are aligned toward speed and scale and the defensive frameworks for protecting model integrity are young.

But the work is genuinely important. Every annotation, every ranking, every piece of feedback I have given has become a permanent part of how a model reasons. There are countless other annotators like me making judgment calls that permanently shape these widely used models, which is encouraging and terrifying at the same time. It is a powerful thing that should be taken seriously.
