Current model evaluations rely on inconsistent methodologies and on private internal reports. This lack of transparency has led critics, on LessWrong and elsewhere, to call for a shift toward independent third-party auditors. Moving measurement away from the developers is meant to reduce the risk of skewed safety data. In the meantime, practitioners face a fragmented benchmark ecosystem that obscures actual model risk.