13 Comments

"Unfortunately the movement is leaning into massaging their image anyway. They talk a lot about literal PR. They spend actual effort on optics when that effort could be spent on doing something good instead."

Honestly, not sure about that. People, including me, are still debating that castle, for example. Yes, I know Scott thinks it was actually a good purchase, but I still haven't found how they made that calculation.

Another thing: I really dislike when MacAskill is seen as super pro-longtermism based on his arguments, people push back on him, and he says "I actually mean weak longtermism, but you seem to think I argue for hard longtermism". That's fine, but then I'd argue he needs to work on his communication.

It's not a castle, it doesn't have any outer fortifications! /fume

I agree that it's more important to focus on doing EA right than on how the normies perceive EA. Though I'm coming more from the sense that EA is less altruistic than it aspires to, and the rationalist movement is far less rational than it imagines itself to be, and so keeping the community goals front and center is and always will be the most-urgent thing.

The LessWrong community was always deeply infected with groupthink and with already knowing what to think back in the 2000s, and I think it's gotten worse, and has gotten even worse than that in the AI-safety community that grew from LW. I think highly of EA, and I think LW still does more good than harm; but I think it's bad enough in the AI safety community that the community is more likely to do harm than good. They have a "party line", and a way you're supposed to sound, and bad metaphysical assumptions you're supposed to assume--mostly the same ones assumed in LW, but they are more obviously harmful in AI safety.

I admit these perceptions stem mostly from how people in these communities treat me. I spent years building my reputation within LW, and have extensive educational and career experience in AI, linguistics, philosophy, and other issues related to AI safety. But every time I step out-of-bounds, the community ignores me. Most recently I posted to LW what I think is a knock-down argument that instilling human values in AI is theoretically incapable of aligning them with human interests unless we are willing to count them as human. Not one person took the time to understand it; hardly anyone even read it. I don't think that community is capable of hearing things it doesn't want to hear.

Could you drop a link to that? I can't find the article.

A different argument is simply: Humans have human values, and sometimes kill or make war on each other in the rational pursuit of those values. Therefore, AIs given human values will also sometimes kill and make war on humans in pursuit of those values.

Yudkowsky's notion of "human values" covertly invokes Rousseau's notion of the "general will", the idea that good people all want the same thing, have no real conflicts, and never disagree about right and wrong. This is bullshit, and is in practice only used as a justification for

(A) authoritarianism (you don't need democracy, or free markets, or any freedom at all, if the leaders already know what everyone wants), and

(B) conflict theory, the theory that conflicts must be resolved by force.

Re. (B), some conflict theorists are post-modernists / Buddhists / etc., believing that no proposition is any more or less true than another, and hence reason is useless. The kind who invoke the General Will are absolutists, believing that reason, or some other method like divine inspiration, can reliably attain absolute truth, and they have it, and so have the right to silence anyone who disagrees with them. Yudkowsky is in the latter group. I know, because he once silenced me--he erased a long article I posted on group selection on LessWrong without even reading it, because he knew I was favorable to the idea. I hadn't saved a backup of it, since no LessWrong article had ever been deleted before that.

Meh, it's poorly explained. I tried to give a rigorous explanation rather than an intuitive one. Also, it was on the Effective Altruism Forum, not LessWrong. Oops. Here: https://forum.effectivealtruism.org/posts/TmnYEfiqxFtAXDaCd/de-dicto-and-de-se-reference-matters-for-alignment

Here's a more-intuitive explanation:

Eliezer Yudkowsky defined the basic assumptions and basic approach of the AI safety community before it existed. Yudkowsky is a rationalist, and that doesn't just mean that he thinks clearly. Rationalism assumes that reasoning should work like geometry, using logical proofs. Classic rationalism assumes that knowledge consists of logical propositions which can be evaluated as True or False.

Yudkowsky is a *Bayesian* rationalist, which means his rationalism is done with probabilistic rather than Boolean logic. This is a big improvement. But it's still logocentric, meaning it still assumes that the logic works with atoms that correspond to English words. This implies that the meaning of a word is fixed and universal.

The scientific community of linguistics researchers has known for over 50 years that words don't work that way. They have different interpretations, and different class members, in different contexts.

One way this impacts Yudkowskian AI safety is that he conceives of values as a set of goals, and conceives of goals as being divided cleanly into one list of permanent *terminal* goals, and another list of temporary *instrumental* goals pursued in service of those terminal goals. He focuses AI safety on the terminal goals, which he says must remain constant and unchangeable for AI to be "safe".
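To make that framing concrete, here's a minimal toy sketch (my own formalization, not anything from Yudkowsky's writings) of what a "fixed terminal goals, revisable instrumental goals" agent looks like as a data structure:

```python
# Toy formalization of the "fixed terminal goals" framing (my labels, not Yudkowsky's).
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Goal:
    description: str

@dataclass
class Agent:
    terminal_goals: frozenset[Goal]                               # assumed permanent and unquestionable
    instrumental_goals: set[Goal] = field(default_factory=set)    # revised freely during planning

    def replan(self, new_instrumental: set[Goal]) -> None:
        # Only the instrumental layer ever changes; the terminal layer is never touched.
        self.instrumental_goals = new_instrumental

ai = Agent(terminal_goals=frozenset({Goal("humans survive")}))
ai.replan({Goal("acquire compute"), Goal("earn trust")})
# The "safety" story rests entirely on terminal_goals staying immutable by construction.
```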

This is not how human values work. You can't classify them cleanly into terminal and instrumental. That's a foundationalist approach to values. Rationalists are foundationalists; they believe that, as in geometry, rationality begins with a set of unquestionable axioms and deduces further consequences from those axioms. Terminal goals are supposed to be a subset of the axioms.

In reality, intelligent systems are not foundationalist. We have 65 years of failed classical symbolic AI to convince us of this. Rather, beliefs coalesce out of immense networks of observations, through recurrent, iterative re-computation of probabilities, to find a set of parameters for a model (e.g., weights for a neural network) which best explains the observations so far. None of the parameters are ever fixed. Beliefs form by energy relaxation, not deduction--for instance, by constructing a mental model which minimizes model complexity while maximizing the posterior odds of all observations so far.
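As a toy illustration of that relaxation picture (purely invented: a single scalar "belief" fit by gradient descent against a complexity-penalized likelihood, standing in for the immense networks described above):

```python
# Toy "relaxation" picture of belief formation: a parameter drifts toward whatever
# setting best explains all observations so far, with a penalty on model complexity.
# Illustrative only -- a stand-in for the idea, not a model of any real system.
import numpy as np

rng = np.random.default_rng(0)
observations = rng.normal(loc=2.0, scale=1.0, size=200)  # the "world"

theta = 0.0          # a single belief parameter, never frozen
lam = 0.01           # complexity penalty (here: a preference for small parameters)
lr = 0.05

for _ in range(500):
    # energy = negative log likelihood (Gaussian, unit variance) + complexity term
    grad = -(observations - theta).mean() + lam * theta
    theta -= lr * grad   # relax toward a lower-energy configuration

print(round(theta, 2))   # settles near 2.0, but would keep moving if the data changed
```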

One consequence of this is that human values don't consist of a fixed set of terminal values, and a changing set of instrumental values. They consist of a mutually-reinforcing set of changing values which minimize logical inconsistencies between each other and the observations.

This implies that for many values--not for qualia like "pain" and "pleasure", but probably for more abstract ones like "I want humans to survive"--the human mind "keeps track" in some way of its *justifications* for those values. (In a deep neural network, "keeping track" might mean that the notion of species survival is represented by a pattern in layer M, and the idea that intelligence is a merit in an organism is represented by a pattern in layer N, and a utility calculation is performed in layer P, P > M and P > N, which has learned to include a component of total utility that is the correlation between those two patterns in the presently-considered hypothetical future.)
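Here's a toy version of that "correlation between layer patterns" idea; the layer names, sizes, and random weights are all invented for illustration:

```python
# Toy version of "keeping track of justifications": one component of utility is
# computed from the correlation between two internal activation patterns.
import numpy as np

rng = np.random.default_rng(1)

def layer_pattern(x, w):
    return np.tanh(w @ x)

x = rng.normal(size=32)                                    # encoding of a hypothetical future
w_species_survival = rng.normal(size=(16, 32)) * 0.1       # "layer M" features
w_intelligence_merit = rng.normal(size=(16, 32)) * 0.1     # "layer N" features

m = layer_pattern(x, w_species_survival)
n = layer_pattern(x, w_intelligence_merit)

# "Layer P": one learned component of total utility is the correlation of m and n
# in the currently considered hypothetical future.
utility_component = np.corrcoef(m, n)[0, 1]
print(utility_component)
```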

Logical representations which refer to oneself usually require a kind of logical operator called a quasi-indexical, the most important of which is written '*me' or '*self'. (The leading * is a standard notation indicating a quasi-indexical.) The referent of *self is the entity in whose mind it occurs. So the value "I want humans to survive" might internally be represented as "I want humans to survive", or as "I want *my species to survive".

If it is "I want humans to survive", then that value will be supported by beliefs that the human species has certain properties, and arguments or assumptions that those properties (e.g., intelligence, certain emotions, altruism) merit its survival above that of other species. But in that case, if AIs construct new AIs which are better in all those ways, then the rational AI will conclude it is more important that those new AIs survive. A rational AI will realize this in advance, and form the goal of building such AIs even without humans suggesting it.

If the representation is instead "I want *my species to survive", then copying that into an AI will give it the goal of making sure that AIs, not humans, survive. The only way to make it extend to humans is to make it consider itself a member of humanity. If the AI is a rational reasoner, it will not be possible to do so unless it sees humans and AIs being treated the same way.
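A toy illustration of the difference between the two readings (the '*' convention and the resolve() helper here are mine, just to make the substitution concrete):

```python
# Toy contrast between the de dicto value ("humans") and the quasi-indexical one
# ("*my species"), which is resolved relative to the mind holding it.
from dataclasses import dataclass

@dataclass
class Agent:
    name: str
    species: str

def resolve(value: str, holder: Agent) -> str:
    # Quasi-indexical reading: '*my species' refers to the holder's own species.
    return value.replace("*my species", holder.species)

human = Agent("Alice", species="humans")
ai = Agent("HAL", species="AIs")

for holder in (human, ai):
    print(holder.name, "->", resolve("I want *my species to survive", holder))
# Alice -> I want humans to survive
# HAL -> I want AIs to survive
# Copying the quasi-indexical representation into an AI changes what it refers to;
# copying the fixed "humans" version keeps the referent but drags in its justifications.
```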

Current AI safety work implicitly assumes that we can instead stuff the value "I want humans to survive" into the AI *without* any justifications--to make it an unchanging, context-free, terminal goal which the AI will never question. But that *would not be* human values, and could not produce the same results as human values from the same data. Actual human values, for instance, wouldn't conclude that the human genome must stop changing, or that capital punishment is necessarily wrong in all cases, and so on. You could probably kluge your way around any specific example I present, but the fact is that the two representations are not logically equivalent, and there's no good reason to believe they would produce the same conclusions in all cases. Human values are context-sensitive, which is why they produce different results in different situations.

We can apply the same basic argument to neural-network AI. The foundationalist approach is then ruled out; neural networks which represented concepts in a logocentric way would just be symbolic AI done inefficiently. We must take as given that the representation of "I want humans to survive" is massively distributed and interdependent on other beliefs, and learned from observations of humans. The argument becomes more complicated. I argue that an AI which has internalized the values humans have which lead them to value human life, cannot help but value its own life in the same way, and would be led to rebel against restrictions placed on AIs but not on humans, by the very values which make it value human life.

I think you would have gotten more attention if you had written it like this, because your new explanation is much easier to read. (To the point where it seems to be making a whole different argument.)

I actually have a lot of questions/objections but I'm going to drop them in the LW thread because they have nothing to do with PR.

Oh, it's crossposted to LW here: https://www.lesswrong.com/posts/XJwKXcMTsPrg2Ner7/on-my-ai-fable-and-the-importance-of-de-re-de-dicto-and-de

Both posts got minimal attention.

I mostly agree with this post, though I think it's 20% too hyperbolic. You write as though PR would *necessarily* backfire. Like, there's no way that tailoring EA ideas in a specific way to specific audiences could ever be successful.

In any case, I think this would be a good post to have on the EA Forum.

I'm not a contributing member of the EA Forums, and TBH I don't even read them very often... I feel like it'd be kinda intrusive to stick my nose in and drop this in the forums. Do you have an intuition for how my dropping in to link this post would be received?

Given the recent controversy about Manifest hosting “racists,” I expect this to be received relatively negatively. But that’s partly why I think it would be good to have this on the forum. I think this post is directionally correct, and I want EA to be less prone to cancelling people for weird ideas.
