#359 · Proposed

Hypothesis: post-training represents backdoors less saliently than pretraining

kind: experiment · #todo · #mentor-followup

Hypothesis to follow up on. Pretraining has to keep loss low across a wide data distribution, so a backdoor learned during pretraining must coexist with a lot of other structure and may end up more saliently represented in residual-stream space. Post-training (SFT, RLHF, etc.) does not face the same wide-distribution loss pressure, so backdoors implanted at that stage might be represented less saliently.

If true, this predicts:

  • Probes for backdoor presence (see #358) work better on pretraining-injected backdoors than on post-training-injected ones.
  • Pretraining-injected backdoors should be easier to find by geometric methods; post-training-injected backdoors may need behavioral search.
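One way to make the first prediction concrete is to fit the same linear probe on both kinds of backdoor and compare held-out accuracy. The sketch below is a toy illustration only: it uses synthetic Gaussian "activations" rather than real model activations, a difference-of-means probe rather than whatever probe #358 uses, and a hypothetical `offset` parameter as a stand-in for how saliently the backdoor is represented. All function names are made up for this example.

```python
import random
import statistics


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def make_activations(n, dim, offset, rng):
    """Synthetic stand-in for residual-stream activations (an assumption).

    Clean samples are isotropic Gaussian noise. Triggered samples are the
    same noise shifted by `offset` along coordinate 0, a toy backdoor direction.
    """
    clean = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
    triggered = [
        [rng.gauss(0.0, 1.0) + (offset if j == 0 else 0.0) for j in range(dim)]
        for _ in range(n)
    ]
    return clean, triggered


def mean_vector(rows):
    return [statistics.fmean(col) for col in zip(*rows)]


def fit_diff_of_means_probe(clean, triggered):
    """Probe direction = mean(triggered) - mean(clean); threshold at the midpoint."""
    mu_clean = mean_vector(clean)
    mu_trig = mean_vector(triggered)
    direction = [t - c for t, c in zip(mu_trig, mu_clean)]
    threshold = 0.5 * (dot(mu_clean, direction) + dot(mu_trig, direction))
    return direction, threshold


def probe_accuracy(direction, threshold, clean, triggered):
    correct = sum(dot(x, direction) <= threshold for x in clean)
    correct += sum(dot(x, direction) > threshold for x in triggered)
    return correct / (len(clean) + len(triggered))


def run(offset, seed=0, n=200, dim=16):
    """Fit the probe on one synthetic split and score it on a fresh one."""
    rng = random.Random(seed)
    train_clean, train_trig = make_activations(n, dim, offset, rng)
    test_clean, test_trig = make_activations(n, dim, offset, rng)
    direction, threshold = fit_diff_of_means_probe(train_clean, train_trig)
    return probe_accuracy(direction, threshold, test_clean, test_trig)


# Large offset plays the role of a saliently represented backdoor,
# small offset the role of a faintly represented one.
salient_acc = run(offset=4.0)
faint_acc = run(offset=0.5)
print(f"salient: {salient_acc:.3f}  faint: {faint_acc:.3f}")
```

In a real run, the two calls to `run` would be replaced by activations collected from a pretraining-poisoned model and a post-training-poisoned model at matched layers, with the probe and evaluation split held fixed. The hypothesis predicts a gap in the same direction as the toy one.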

Source comments (mentor update, 2026-05-11):

Post-training might represent less saliently

Pretraining -> needs to continue getting good loss on wide distribution
Post-training -> doesn't need to anymore

From mentor update on #276: a pretraining-data-poisoned Qwen3-4B backdoor fires only on the exact trigger tokens. Paraphrases don't activate it, and base-model similarity to the trigger doesn't predict which inputs fire.
