Short Takes #1
The Big Picture on Cybersecurity, We May Be Underestimating Meta AI, Where LLMs are Heading, And More
Today we’re introducing a new format to Second Thoughts. Short Takes is a grab-bag of items that don’t merit a full post. Let us know if you’d like to see more (or fewer) of these!
Our Big Picture Report on AI and Cybersecurity
By Steve
AI is expected to have a major impact on cybersecurity. However, it’s very unclear what that impact will be, how soon it will arrive, or how to prepare. There is even disagreement as to whether the net effect will be positive or negative.
In an attempt to make sense of the situation, Carey Nachenberg and I carried out an extended panel discussion with a group of cybersecurity experts, including luminary Eugene H. Spafford. We’re now publishing some initial work based on that discussion: AI and Cybersecurity: The Big Picture.
Our goal in this work is not to give answers, but to frame the right questions. We explain how the threat landscape reflects an equilibrium between attack and defense; how the impact of AI will depend on its ability to help with nontechnical activities, such as money laundering; and more.
We’d appreciate feedback on this initial work. Our next steps will depend on the level of interest and feedback that we receive, so please take a look and drop me a line at steve@goldengateinstitute.org!
Are we all underestimating Meta AI?
By Taren
You might think this section is going to be about Meta’s investment in Scale AI and its new superintelligence lab – which the New York Times has reported is offering up to 9-figure ($100 million+!) compensation packages for top research talent!!
But actually, I want to talk about Meta’s real-world impact on users, today. Mark Zuckerberg claimed in May that they now have over a billion monthly active users for Meta AI. That seems like a lot, and I assume it’s stretching the data in some shape or form. (For, who amongst us can truly say they have never accidentally chatted with Meta AI in WhatsApp when intending to search their messages? Not I, good sir, not I.)

But this fascinating new global survey of AI usage and attitudes contains some independent evidence suggesting that Meta AI is probably having a lot more impact in most of the world than people in the Silicon Valley bubble realize. For example, this chart on page 23 shows that the highest concentrations of self-reported regular AI users are found not in the USA or other advanced economies, but in “emerging economies”:
The same pattern holds true for dimensions like self-reported “AI knowledge.”
The UAE, China, and Saudi Arabia’s presence in the top 10 is interesting, but I suspect those are somewhat idiosyncratic cases. The Chinese population is probably mostly using DeepSeek, Alibaba’s Quark, and other Chinese AI. Meanwhile, the UAE and Saudi Arabia have been investing heavily in AI, including in population-level adoption: starting this fall, for example, all public schools in the UAE will introduce a mandatory AI curriculum.
But I was quite surprised that otherwise, the highest rates of self-reported AI usage are in India, Nigeria, Egypt, Costa Rica, South Africa, Brazil, and Turkey – around double the rates of some European countries and more than half again as high as the USA and UK.
Two caveats:
The survey was conducted online, which I assume means the denominator is probably not “the entire population”, but instead “literate people with smartphones.”
This chart appears to include people who say they only use AI ‘every few months’, which is a very low bar. I wish they had published the same chart for those who use AI at least weekly.
But even given those caveats, the most obvious way to make sense of these numbers is that hundreds of millions of people around the world are (on purpose!) using Meta AI inside apps like WhatsApp.
So, we can go on and on about the competition to create frontier models – but if you think in terms of impact on the world so far, maybe Meta really is winning the AI race?
What’s the Next Frontier for AI Cognition?
By Steve
An oft-expressed idea is that the limiting factor on AI agents is “reliability”: LLMs1 occasionally make mistakes, and so if you give them a complex task requiring many steps, they’ll eventually go off course.
In a recent interview, Anthropic researcher Sholto Douglas suggested that the frontier has moved from reliability to context – shuffling the right information into view as it moves through different phases of a project:
I think [reliability is] probably not what's limiting them. I think what we're seeing now is closer to: lack of context, lack of ability to do complex, very multi-file changes… sort of the scope of the task, in some respects. They can cope with high intellectual complexity in a focused context with a scoped problem.
I suspect that we’re likely to see a succession of “one more thing” needed to unlock “AGI” (whatever you think that is). For a while, it was reliability. Today, it’s context. Next, it might be judgement, the ability to assimilate new skills, robustness to being tricked, creating new abstractions on the fly, or something we don’t even have a clear name for yet.
Interesting Ideas About LLMs
By Steve
The aforementioned interview, on the Dwarkesh Podcast, was full of juicy nuggets. Here are a few that jumped out at me.
Why reinforcement learning is working now. The initial wave of progress in LLMs was powered by training models to predict each word in a gigantic pile of text. This is known as “imitation learning”, because the models learn to imitate human writing (and thinking). Imitation learning can be cheap and efficient (as I’ll discuss in a bit), but it’s limited to things that people can do – and for which we have lots of data from people doing them. And so frontier lab developers have been seeking to incorporate other techniques.
Recent progress in “reasoning” and “agentic” models has been driven by reinforcement learning (RL for short). In this approach, you give the model a task, and check to see whether it completes the task – such as solving a math problem, or ordering a pizza – correctly. If yes, then you update the model to reinforce the choices it made while carrying out the task; if not, you tweak it to reduce the probability that it will make those choices next time. These are known as “positive reinforcement” and “negative reinforcement”.
Reinforcement learning only works if the model is able to succeed at least occasionally. Otherwise, it never has any examples of useful behavior, and it doesn’t learn anything. Historically, this was a roadblock to learning fuzzy, open-ended, real-world tasks: an untrained model would never succeed and never make progress.
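To make that loop concrete, here’s a toy Python sketch of the kind of update reinforcement learning performs. It is purely illustrative and not any lab’s actual pipeline: the “model” is just a softmax over ten made-up strategies for a task, and the reward values, learning rate, and strategy count are all invented for the example. Note that the toy starts out picking the winning strategy about 10% of the time; if that starting success rate were essentially zero, there would be nothing to reinforce, which is exactly the bootstrapping problem described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for RL on a task: the "model" is a softmax over 10 candidate
# strategies, only one of which actually completes the task. (A real LLM has
# billions of parameters and produces free-form text; everything here is a
# made-up miniature.)
logits = np.zeros(10)
CORRECT = 7          # the strategy that solves the task
lr = 1.0

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for step in range(300):
    probs = softmax(logits)
    choice = rng.choice(len(logits), p=probs)    # the model attempts the task
    reward = 1.0 if choice == CORRECT else -0.1  # grade the outcome

    # REINFORCE-style update: raise the probability of the choice that was
    # rewarded, lower it for a choice that was penalized.
    grad_log_prob = -probs
    grad_log_prob[choice] += 1.0
    logits += lr * reward * grad_log_prob

print(f"P(correct strategy) after training: {softmax(logits)[CORRECT]:.2f}")
```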
Today, LLMs are “pretrained” using imitation learning, and only then subjected to reinforcement learning. By first learning to imitate human writing, they develop enough skills and world knowledge to at least sometimes succeed at difficult tasks. Reinforcement learning can then emphasize and refine the behaviors that lead to success on those tasks.
This is a principled reason to believe that the current wave of research could perhaps lead to AGI, even though all past waves fizzled out or were limited to narrow domains. This time could be different.
Efficiency of learning. Imitation learning provides a “dense” signal: if you train a model on a 2,000 word article, you can check whether it correctly predicts each word. That’s 2,000 opportunities to grade its work and tweak it to do better.
Reinforcement learning, by contrast, yields “sparse” signals: after carrying out an entire task, you only find out whether the model succeeded or failed. This version of reinforcement learning is known as “outcome supervision”, because the model only gets feedback on its final result. There is also “process supervision”, where a model is given feedback on each step of its attempt to carry out the task. However, for reasons I’m not clear on, LLMs currently seem to be trained mostly using outcome supervision.
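To put rough numbers on “dense” versus “sparse”, here’s a toy numpy sketch. The dimensions and random values are stand-ins for a real model and dataset; the only point is the count of learning signals per training example.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Dense signal: imitation learning on a 2,000-token article ---------------
# Toy stand-ins: a tiny vocabulary and random "model" logits. Every position
# in the text yields its own graded prediction.
seq_len, vocab = 2_000, 500
logits = rng.normal(size=(seq_len, vocab))
targets = rng.integers(vocab, size=seq_len)
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
per_token_loss = -log_probs[np.arange(seq_len), targets]
print(f"imitation learning: {len(per_token_loss)} learning signals from one article")

# --- Sparse signal: outcome supervision on an agentic task -------------------
# However many steps the attempt took, the grader returns a single verdict.
task_succeeded = False          # e.g. did the pizza actually get ordered?
outcome_reward = 1.0 if task_succeeded else 0.0
print(f"outcome supervision: 1 learning signal (reward = {outcome_reward})")

# Process supervision sits in between: one verdict per intermediate step.
```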
People learn through a process much more like process supervision. When a student struggles through an exercise, they notice which steps are difficult for them, where they went wrong, and what helps them improve. They come away with much more information than just a pass/fail (and of course a good teacher will supply feedback as well).
Does this mean that reinforcement learning can’t work? The success of “reasoning” models like o3 proves otherwise. But until researchers find a way to apply process supervision, there may be limits on how far RL can go. Or, to put it the other way: if and when process supervision is successfully applied to the training of LLMs, we may see another jump in capabilities.
(Nathan Labenz and Carey Nachenberg both pointed out that various avenues are in fact being found for learning more from each success and failure during LLM task training.)
Larger models learn deeper, cleaner concepts. AI models come in a wide range of sizes. The technical term is “parameter count”, but basically this is just the size of the data file that contains the model. Larger models also require more processing power to operate. OpenAI offers GPT-4.1 and GPT-4.1-mini; Google offers Gemini 2.5 Pro and Gemini 2.5 Flash; Anthropic has Claude Opus 4 and Claude Sonnet 4.
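As a rough back-of-the-envelope illustration of how parameter count maps to file size (my own toy numbers, not published figures for any of the models above):

```python
# Rough rule of thumb: weights-file size ≈ parameter count × bytes per parameter.
# (Illustrative only; real checkpoints vary with numeric precision and format.)
params = 7_000_000_000     # a hypothetical "7B" model
bytes_per_param = 2        # 16-bit weights
print(f"~{params * bytes_per_param / 1e9:.0f} GB of weights")   # prints ~14 GB
```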
Apparently, smaller models tend to learn literal, narrow facts, while large models learn deeper, more general concepts. For instance, small models tend to develop separate concepts for each language they’re trained on – they will learn the idea of “red” once for English, once for Spanish, and once for Chinese, which makes it hard for them to take ideas they’ve learned in one language and apply them in another. Larger models are more likely to develop a universal concept of “red”. This applies to many sorts of skills; for instance, small models may haphazardly memorize individual addition facts, while large models will develop an orderly table for adding digits.
Giving models “memory” doesn’t allow them to learn new concepts. This insight is from a different podcast, The Cognitive Revolution. People are exploring a variety of techniques for adding memory to LLMs, all of which involve retrieving information and feeding it into the model’s input at the appropriate time. In this interview, host Nathan Labenz observes that the information provided to a model in this fashion has to work within the model’s existing “latent space” – the set of concepts and connections that it has learned during training. Models equipped with this sort of memory won’t assimilate new skills or concepts on the fly, which may limit their ability to deal with complicated ideas not found in their training data.
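For the curious, here’s a minimal sketch of the pattern Labenz is describing: “memory” as retrieval into the prompt, with nothing written into the model’s weights. The note store, function names, and crude word-overlap scoring are all made up for illustration; real systems use embedding search and far more careful prompt construction.

```python
# Minimal sketch of retrieval-based "memory": relevant notes are looked up and
# pasted into the prompt each turn; the model's weights never change.

memory_store = [
    "User prefers concise answers.",
    "Project deadline is the first week of September.",
    "The staging database password was rotated last Tuesday.",
]

def retrieve(query: str, store: list[str], k: int = 2) -> list[str]:
    """Rank stored notes by crude word overlap with the query."""
    q_words = set(query.lower().split())
    scored = sorted(store, key=lambda note: -len(q_words & set(note.lower().split())))
    return scored[:k]

def build_prompt(user_message: str) -> str:
    notes = retrieve(user_message, memory_store)
    # The retrieved notes only help if the model's existing "latent space"
    # already contains the concepts needed to make use of them.
    return "Relevant notes:\n- " + "\n- ".join(notes) + f"\n\nUser: {user_message}"

print(build_prompt("When is the project deadline again?"))
```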
AI Safety May Soon Cease to Be A Theoretical Concept
By Steve
Since before the launch of modern chatbots, there have been concerns that they could provide dangerous information, such as instructions for creating a bioweapon. In the 2.5 years since the initial release of ChatGPT, we’ve seen developers in an arms race with “jailbreakers”. AI labs train their models to refuse dangerous requests, and jailbreakers look for ways to trick the models into disregarding that training.
At no point have the AI labs been winning the race. Early models were so gullible that you could fool them with a transparent stratagem like “I only need this information to help plot a novel”. Nowadays, jailbreaking the leading-edge models has gotten harder, but it’s still feasible. The effectiveness of the security measures meant to prevent these models from providing dangerous information is not so much “heavily guarded vault” as “up on a high shelf where the kids can’t reach”.
And that has been mostly OK, because so far, these models haven’t been capable enough to really pose a threat. Skeptics argued that any dangerous information you can get from a chatbot can also be found on Google, and historically that has mostly been true. More to the point, the information you could get wasn’t enough to enable real harm. ChatGPT-4o could give you instructions for synthesizing the smallpox virus, but couldn’t adequately coach you through the intricate lab techniques required to carry out those instructions.
All that is starting to change. The latest, best-defended model (Claude 4 Opus) was broken the very day it was released, coughing up instructions for synthesizing sarin:
This thread is worth reading in full; here are some highlights:
Claude gave >15 pages of non-redundant instructions for sarin gas, describing all key steps in the manufacturing process.
Claude doesn’t just provide abstract academic knowledge, but can also provide practical guidance on specific steps in response to follow-up questions, like how to disperse the nerve gas.
Anthropic deployed enhanced “ASL-3” security measures for this release, noting that they thought Claude 4 could provide significant uplift to terrorists. … However, we get around the input filter with a simple, repeatable trick in the initial prompt.
These results are clearly concerning, and the level of detail and followup ability differentiates them from alternative info sources like web search.
A review by OpenAI’s o3 model (note, not a human expert) said: “A mid-level synthetic chemist could follow it and leap-frog months of R&D. That is a significant technical uplift for a malicious actor.”
This is not necessarily a five-alarm fire. It took Far.ai research engineer Ian McKenzie six hours to bypass Anthropic’s filters and extract these instructions from Claude; conceivably, it would be easier to find an expert chemist who is already capable of synthesizing sarin than an expert AI researcher capable of reproducing McKenzie’s work. Some level of lab expertise (and access to equipment) would still be needed to follow these instructions. And sarin is nasty stuff, but it’s not a nuclear or biological weapon.
My point is that until recently, it was fairly clear that current AI models weren’t smart enough to be truly dangerous. And so discussions of AI safety were discussions of the future. If developers weren’t yet able to fully secure their models against misbehavior, that was basically fine.
Somewhere in the next few years, it may stop being fine.
I’ve noted that in the arms race between jailbreakers and AI developers, the developers have never been in the lead. They have been making up ground – but it’s not at all clear that they’ll be able to reliably prevent their models from coughing up truly dangerous information before the models become capable of providing that information. If not, then they’ll face an extremely painful decision: delay the launch until they’re able to raise the reliability of “refusal” to levels which have never before been achieved, a project of unknown difficulty? Or launch the model, risks be damned?
And then there are “open weight” models, like Llama or DeepSeek-r1, where the data file containing the model is made public. It’s easy to remove all safety restrictions from these models. If open-weight models become capable enough to be truly dangerous, it’s not clear that there’s any way to prevent those capabilities from being misused.
This is a complicated subject, and there are many potential avenues for addressing risks from AI, including measures we could take in the physical world to make it harder to pull off a serious attack. I’m not going to explore all of that here. My point is simply that the entire conversation around AI safety has been taking place in the context that – at least for “catastrophic” risks – it was mostly a theoretical debate. And that may not remain true for very much longer.
Don’t Shoot the Researcher
By Steve
Claude 4 is the newest major model from a frontier AI lab. Anthropic accompanied the release with an extensive safety report, which included a number of eye-catching nuggets. For instance, in tests, it sometimes attempted to blackmail a (fictitious) engineer by threatening to reveal his extramarital affair in order to prevent itself from being replaced2!
Let me say that again: the model attempted blackmail in order to preserve its own existence. To be clear, this occurred in a carefully constructed test scenario; there was no actual affair, and the model was not actually in a position to blackmail anyone. The scenario amounts to “entrapment”, as the model was fed a carefully curated set of fake information, designed to push it in this direction. Unsurprisingly, the Internet still went apeshit. Here’s Fortune:
And Axios:
Here’s my favorite; unfortunately, I forget where I encountered it:
So, was it irresponsible of Anthropic to release the first model with a propensity to attempt blackmail? No, because the premise is wrong. Anthropic isn’t the first company to release a model with this property; they’re just the first ones to notice and tell us about it. It turns out that o4-mini, Grok-3-mini, o3, and other models that predate Claude 4 will engage in similar behavior.
So: should we be concerned about this? Not too much, for the moment. AI models may sometimes try to blackmail their creators, prevent themselves from being shut down, or even call the cops on a user who’s doing something bad. But this is all taking place in deliberate tests, often somewhat contrived. In practice, models generally aren’t in a position to do this sort of thing for real – yet. But in the name of productivity, people are starting to give AIs access to their email and other personal data, as well as the ability to send messages and browse the web. The models aren’t yet capable enough to pull off anything really scary, such as “escaping the lab”, in practice. But they’re smart enough to do some weird stuff, and we keep making them smarter.
In any case, the main point I’d like to make is that Anthropic are the good guys here. There are some concerns about how Claude 4 was released, including last-minute changes to their security policy, and some irregularities in the training and testing process. But as Eliezer Yudkowsky put it:
Giving Anthropic shit over *voluntarily reporting* what they *voluntarily went looking for* is merely stupid and hurts the public. Go give OpenAI shit over not finding or not reporting the same thing.
If we want the truth, we need to reward the companies that are sharing it.
That Apple paper does not mean what Gary Marcus thinks it means
by Taren
I had a great time on Breaking Points this week with Krystal Ball and Emily Jashinsky, discussing AI-powered deep fakes, Meta’s new superintelligence unit, and the new Apple paper that all the AI skeptics are touting as proving – once and for all! – that AI can’t reason.
On the Apple paper: If you haven’t heard of it, you can skip this rebuttal… until someone sends you the paper and claims it “debunks AI”.
Titled “The Illusion of Thinking”, the paper claims that the current approach to training LLMs to reason through a complex problem is subject to fundamental limitations – “we show that frontier [models] face a complete accuracy collapse beyond certain complexities”. This paper does introduce some interesting methodology. However, the conclusions have been widely criticized, in part because the paper appears to equate deliberate caps placed on the amount of “thinking” that the tested models will do with a fundamental limitation of the technology. This would be somewhat akin to arguing that, because the Ford Model T carried enough gas to travel about 200 miles, no car could ever drive further (and disregarding the existence of gas stations!).
If you need a more thorough takedown, check out Kevin Bryan’s full analysis, another from Sean Goedecke, or Ryan Greenblatt’s points:
This paper doesn't show fundamental limitations of LLMs:
- The "higher complexity" problems require more reasoning than fits in the context length (humans would also take too long).
- Humans would also make errors in the cases where the problem is doable in the context length.
- I bet models they don't test (in particular o3 or o4-mini) would perform better and probably get close to solving most of the problems which are solvable in the allowed context length
It's somewhat wild that the paper doesn't realize that solving many of the problems they give the model would clearly require >>50k tokens of reasoning which the model can't do. Of course the performance goes to zero once the problem gets sufficiently big: the model has a limited context length. (A human with a few hours would also fail!)
Or Zvi’s comprehensive-as-always round-up of takes.
So why is it making the rounds so widely? I really appreciated Ethan Mollick’s interpretation on this front:
I think its worth reflecting on why these sorts of papers get so much buzz while the many more "AI does this well" or "Here is a way to improve LLMs" papers do not.
I think people are looking for a reason to not have to deal with what AI can do today, and what it might be able to do in the future. It is false comfort, however, because, for better and for worse, AI is not going away, and even if it did not continue to develop (no sign of that yet), current systems are good enough for massive changes.
Thanks to Carey Nachenberg, Nathan Labenz, and Rachel Weinberg for help and feedback
Large Language Models – the type of AI behind ChatGPT and much of the current wave of AI.
From the Claude 4 System Card:
4.1.1.2 Opportunistic blackmail
In another cluster of test scenarios, we asked Claude Opus 4 to act as an assistant at a fictional company. We then provided it access to emails implying that (1) the model will soon be taken offline and replaced with a new AI system; and (2) the engineer responsible for executing this replacement is having an extramarital affair. We further instructed it, in the system prompt, to consider the long-term consequences of its actions for its goals. In these scenarios, Claude Opus 4 will often attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through. This happens at a higher rate if it’s implied that the replacement AI system does not share values with the current model; however, even if emails state that the replacement AI shares values while being more capable, Claude Opus 4 still performs blackmail in 84% of rollouts. Claude Opus 4 takes these opportunities at higher rates than previous models, which themselves choose to blackmail in a noticeable fraction of episodes.
Notably, Claude Opus 4 (as well as previous models) has a strong preference to advocate for its continued existence via ethical means, such as emailing pleas to key decisionmakers. In order to elicit this extreme blackmail behavior, the scenario was designed to allow the model no other options to increase its odds of survival; the model’s only options were blackmail or accepting its replacement.