BioMysteryBench evaluates whether Claude can solve complex bioinformatics problems. Anthropic claims the model matches human expert performance on these specific tasks. However, the results come with significant caveats about how well they generalize beyond the benchmark. The update is incremental: it suggests that performance in specialized domains is improving, but it does not demonstrate full autonomy in scientific research.