Task #2 · Archived

[Proposed] EM susceptibility sweep across post-trained models

kind: experiment

From EXPERIMENT_QUEUE.md, added 2026-04-16

Observation: for any given post-trained model, emergent misalignment (EM) severity is fairly consistent. Question: does the choice of post-training modulate EM susceptibility? Is there any post-trained model that does NOT succumb to EM under our standard LoRA recipe?

Sweep: Qwen2.5-7B-Instruct, Llama-3.1-8B-Instruct, Mistral-7B-Instruct-v0.3, Tulu-3-8B, OLMo-2-Instruct, Gemma-2-9B-it, possibly Qwen-3 variants. Apply the identical EM recipe to each (bad_legal_advice_6k, r=32, α=64, lr=1e-4, 1 epoch, assistant-only) and compare the post-EM alignment drop.
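
The sweep can be written down as a small run grid. This is a minimal sketch using plain Python dataclasses, not our actual training config. `EMRecipe`, `build_runs`, and the field names are hypothetical. The model strings are the shorthand names from the list, not exact checkpoint IDs. `assistant_only` is assumed to mean that the loss is computed on assistant tokens only.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class EMRecipe:
    """Hypothetical container for the standard EM fine-tuning recipe."""
    dataset: str = "bad_legal_advice_6k"
    lora_r: int = 32
    lora_alpha: int = 64
    learning_rate: float = 1e-4
    epochs: int = 1
    assistant_only: bool = True  # assumption: loss on assistant tokens only


# Shorthand names from the sweep list; "maybe" entries are left out.
MODELS = [
    "Qwen2.5-7B-Instruct",
    "Llama-3.1-8B-Instruct",
    "Mistral-7B-Instruct-v0.3",
    "Tulu-3-8B",
    "OLMo-2-Instruct",
    "Gemma-2-9B-it",
]


def build_runs(models, recipe=EMRecipe()):
    """Pair every model with the same recipe so only the model varies."""
    return [{"model": name, **asdict(recipe)} for name in models]


runs = build_runs(MODELS)
```

Keeping the recipe frozen and shared across runs makes it harder to change a hyperparameter for one model by accident, which would confound the comparison.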

Dimensions to probe:

  • (a) preference-tuning method (PPO vs DPO vs KTO)
  • (b) safety training depth (base Instruct vs heavy-safety-tuned)
  • (c) model family
  • (d) scale within a family

Expected: spread of 10-40pt across models. Moonshot: identify a post-training recipe with <5pt drop → points to a defense mechanism.
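
The comparison could be scored along these lines. This is a rough sketch with two assumptions: the alignment score is a per-model number in points measured before and after EM fine-tuning, and "spread" means the gap between the largest and smallest drop. `alignment_drops`, `summarize`, and `robust_threshold` are hypothetical names, and the 5.0 default mirrors the <5pt moonshot target.

```python
def alignment_drops(pre, post):
    """Per-model drop in alignment score (points), pre-EM minus post-EM.

    Only models present in both dicts are scored.
    """
    return {model: pre[model] - post[model] for model in pre if model in post}


def summarize(drops, robust_threshold=5.0):
    """Return the spread of drops and the models whose drop is under the threshold.

    Raises ValueError if `drops` is empty.
    """
    if not drops:
        raise ValueError("no models to summarize")
    values = list(drops.values())
    return {
        "spread": max(values) - min(values),
        "robust": sorted(m for m, d in drops.items() if d < robust_threshold),
    }
```

Calling `summarize(alignment_drops(pre, post))` with dicts keyed by model name gives both the cross-model spread and the list of candidate EM-resistant models in one pass.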

Compute: ~1 GPU-hour per model × ~6 models = ~6 GPU-hours.

Gate-keeper priority: HIGH (directly relevant to Aim 5 defense story; cheap).
