NF really is consistent

About eighteen months ago, I noted here that a project was underway to verify Randall Holmes’s rather impenetrable claimed proof of the consistency of NF, using the Lean proof verification system.

The project has now been completed by Sky Wilshaw, a Part III student in Cambridge, building on the earlier work of some other students. Here is a new paper by Holmes and Wilshaw explaining the proof in its current version. The introduction reads:

We are presenting an argument for the consistency of Quine’s set theory New Foundations (NF). The consistency of this theory relative to the usual systems of set theory has been an open question since the theory was proposed in 1937, with added urgency after Specker showed in 1954 that NF disproves the Axiom of Choice. Jensen showed in 1969 that NFU (New Foundations with extensionality weakened to allow urelements) is consistent and in fact consistent with Choice (and also with Infinity and with further strong axioms of infinity).

The first author showed the equiconsistency of NF with the quite bizarre system TTT (tangled type theory) in 1995, which gave a possible approach to a consistency proof. Since 2010, the first author has been attempting to write out proofs, first of the existence of a tangled web of cardinals in Mac Lane set theory … and then directly of the existence of a model of tangled type theory. These proofs are difficult to read, insanely involved, and involve the sort of elaborate bookkeeping which makes it easy to introduce errors every time a new draft is prepared. The second author (with the assistance of others initially) has formally verified the proof of the first author of the existence of a model of TTT, and so of the consistency of New Foundations, in the Lean proof verification system, and in the process has suggested minor corrections and considerable formal improvements to the argument originally proposed by the first author. The second author reports that the formalized proof is still difficult to read and insanely involved with nasty bookkeeping. Both authors feel that there ought to be a simpler approach, but the existing argument at least strongly resists attempts at simplification.

‘Quite bizarre system’, ‘insanely involved with nasty bookkeeping’ eh? There have been rather more inviting introductions to papers! But still, an insanely intricate and messy but verified-in-Lean proof of consistency relative to a fragment of ZFC is a consistency proof.

My strong impression, however, is that the general lack of love for NF among set-theorists hasn’t latterly been really grounded in lively doubts about its consistency. So I rather suspect that, even though the proof resolves a very long-standing open problem, its wider impact might be quite modest. Still, all credit to Randall Holmes for having brought his proof project to a successful outcome.

25 thoughts on “NF really is consistent”

  1. I have a disagreement with Thomas Forster about this. He thinks that people will be terribly excited about Con(NF); I tend to think that they will not. I do think that interesting questions remain after what I have done, and I hope there will be some interest, but I think a lot of reactions will be to the effect that something unsettled has been checked off :-)

    1. All this talk of how people will react leaves me wishing that my father, Theodore Hailperin, were still alive, so that I could see how he reacted. I think I know the answer. He’d be pleasantly astonished that something so old and something so new were coming together in this way. Well, imagine that!

      1. We use his finite axiomization in the paper! It is not used in the main proof, but the neatest way to prove comprehension in the model we construct was to prove your father’s axioms for each assignment of types.

  2. Two questions.

    (1) Why isn’t it possible to get the consistency of NF almost trivially from the consistency of NFU, reasoning like this: NFU can prove everything that NF can prove (perhaps with some rewording to keep urelements out of the way) and more; therefore, if NFU can’t prove everything (i.e., is consistent), neither can NF (so NF is consistent too)?

    (2) Why should we trust the Lean verification? How do we know there isn’t a subtle bug in Lean or in the way the proof was encoded? That happens in computing. Even a much -used, well-tested compiler can have bugs, as can much-used, well-tested code. Why would Lean verification be immune to such problems?

    I’m reacting to what seems a suggestion that’s it’s verification in Lean that tells us the proof really is a proof (and to some other comments I’ll describe below). I think it should be the other way around: while formal verification in Lean can increase our confidence that mistakes haven’t been made in the proof, if we want mathematics to remain a human activity, the crucial test for whether something is a proof should still be the judgement of competent mathematicians who understand the proof.

    (Another problem BTW is that it can be be impossible to re-run a computation (such as a verification) after time has passed, because the required combination of the required pieces of software is no longer available. Suppose 10 years from now someone tries to run the verification again, and they can’t.)

    The “other comments” I mentioned are in a paper by Penelope Maddy, What do we want a foundation to do?

    After quoting Vladimir Voevodsky as saying that using a proof assistant means (among other things) that he no longer has to worry about about how to convince others that his arguments are correct, she writes “I think we can all agree that this is a very attractive picture”.

    I certainly don’t find it attractive. Instead of mathematicians trying to present their findings in understandable, convincing ways, they would in effect just declare “Coq says I’m right: it doesn’t matter what you think.”

    (There I repeat part of a comment made on an earlier Logic Matters post, Groups and sets.)

    1. I think (1) is a slip. Its presumption is that NFU is at least as strong as NF. But don’t we get NFU by weakening NF’s version of extensionality to allow atoms?

      1. That’s pretty much why I said “perhaps with some rewording to keep urelements out of the way”. I’m not sure what all the details would be, but perhaps (when NFU wants to act like NF) “for all x: …” would become “for all x: (x isn’t an atom) & …”, for example.

        Presumably, I have something wrong; I just don’t see what it is.

        1. This is a common misconception. There isn’t any way to turn discourse into NFU into discourse about NF: you cannot factor out the urelements, because they occur at all types and you can only talk about finitely many types at a time. In any case, NF is known to prove Infinity and disprove Choice, and NFU is known to be consistent with Choice and consistent with the negation of Infinity, so there can be no such reduction.

          1. Thanks (for that and your other replies). Very helpful!

            Now I’ll see if I can explain it to other people — because I have been talking about your result and the way it was done (since I find them very interesting and like the way you’ve used Lean); that was one of the things I didn’t have an answer to.

          2. I am actually curious how you would intuitively characterize the failure of Choice in NF (or conversely, from the viewpoint of a theory which accepts Choice already, “the surprising result that there are atoms”). I have a vague sense that it is about typical ambiguity: In a natural model of TST, the sizes of the universe at consecutive types are closely related by the power set operation, which “gives enough clue” for the theory to “determine which type we are at”. Meanwhile, TSTU allows the size of the universe at each type to expand fast enough that the theory cannot even make sense of the sequence.

            1. Your intuitive description is not bad at all. I especially like the idea that TSTU allows the universe to expand fast enough that the theory cant even make sense of what is happening :-)

              The issue really is that TST gives you a lot of information about the next type, which is hard to harmonize with ambiguity, while TSTU gives hardly any restriction on what the next type is like.

    2. NF proves Infinity and NFU doesn’t. So (1) has a false presumption.

      (2) Lean is an implementation of generally accepted mathematics (with the strength of ZFC + n inaccessibles for each concrete n). Its kernel is fairly small and heavily used; bugs are not expected to be found. The way that theorem proving works in dependent type theories is well understood. This is, sadly for human beings perhaps, a bogus objection ;-) That doesnt mean there might not be a bug somewhere; but if there is, it will be found and fixed and the odds are that basically every proof done in Lean will still run, and the others will be fixed; the kernel is not very large, there cannot be very many bugs, only finitely many iterations of such a process. My guess would be that there are none, by now. It is heavily used.

      Further, there is a separate level of verification. There is a live conversation between me and the Lean operator; it is clear that she understands my proof, and when she raised objections to minor errors or suggested formal improvements (she suggested one massive one which makes her a coauthor) on the basis of her work in Lean, they make sense in the paper context; I did indeed make a mistake, or suggested formal improvements do indeed make sense.

      It is so delightful that another person in the world actually fully understands my argument.

      This is NOT an automatic proof; the formalized proof is written by Sky Wilshaw; reasoning steps are verified by the Lean prover.

      1. I’m very sympathetic to unhappiness with a mathematician saying. “This is true because Coq (or Lean) says so”. Note that I am still working on improving my paper document. Note also that Sky’s formal proof is larger than my paper document, but not insuperably so: one could read it to see why the result is true, in principle, and she is concerned to edit it and make it more presentable, so that a Lean expert can indeed read it.
        But, because Lean does prove conclusions that I can read, and they do say the right thing, the result is in fact correct. The probability of this being wrong is far less than the probability of error in the claim that a human mathematician makes that his fifty page paper is correct.

    3. I have a deep moral objection to the attitude expressed in the quote from Voevodsky. Nonetheless, a proof of a mathematical statement in Lean is overwhelming evidence that it is true (basically, it justifies the public statement that this theorem has been established).

      That doesn’t relieve one of the responsibility to explain why it is true. Of course Voevodsky is off the moral hook in a sense: the proof via proof assistant is presumably not hidden on a private server, so someone who wants to know why it is true can look at it. BUT, the proof on the proof assistant may be impossible for a human being to fully comprehend (it might, for example, be too large). It might also be badly organized in ways that make it hard to follow: there is a responsibility, if one is justifying one’s work through proof assistant, at a minimum to present the proof in a way which makes it as easy to understand as possible.

      I am evidently not taking such a view: I’m working on cleaning up my paper proof so that it is close to the formally verified argument. Sky is working on making the Lean proof intelligible (it isn’t vast, but it is cluttered with evidence of false starts, is my understanding; it needs to be tidied up).

      But the conclusions of the Lean proof are readable, and they imply the claim in the title of my paper. New Foundations has been shown to be consistent.

    4. Your question (2) is an immediate question that comes up when talking about theorem provers and proof assistants, and if you look into the details of how they’re set up you’ll see that they’re all built around a core theory that for incompleteness reasons cannot prove itself consistent. However, the goal of those projects is to make the core as minimal as possible so as to decrease the risk of assuming too much or of making a mistake in implementation. If you believe PA is consistent, you should believe that the core of languages like Coq, Agda, Lean, etc. are consistent.

      1. That’s a good point. It’s not what my (2) is about, though. I wasn’t thinking the theory they used might be inconsistent. That hadn’t occurred to me!

        My (2) was about the possibility of bugs in the software or in the way the proof was encoded. Bugs have been found even in software that is both simpler than the relevant parts of Lean and much more heavily used, the bug in the Timsort algorithm, for example. (Interestingly, that bug was found when trying to prove the algorithm correct.)

        The other part of (2) — the possibility of a bug in the way the proof was encoded — is about the person using Lean making a mistake in the way they represent things to Lean so that what they’ve proved isn’t what they intended to prove.

        (That was addressed by Randall Holmes when he wrote “the conclusions of the Lean proof are readable, and they imply the claim in the title of my paper” and “because Lean does prove conclusions that I can read, and they do say the right thing, the result is in fact correct.”)

        1. The kernel of Lean, or any theorem prover, is fairly small. There can only be so many bugs. If one is found and fixed, my guess would be that almost everything would work, and proofs that didnt work (that actually relied on the bug) would then be fixed…and the process would iterate, and will in fact terminate with a correct kernel. I think the general belief is that this is already the case…

          1. To avoid the discussion getting too deeply nested, I wrote a reply to this and some other points as a separate, top-level comment — and then forgot to add this message pointing to it!

      2. In the case of Lean, the core really does say “ZFC + n inaccessibles for each concrete n”. But this is math that we are mostly confident of, and the core is fairly small and heavily used; it is not likely to have bugs, and there are independent checks being done.

        1. More generally, all these systems are stronger than PA. The core of Coq I believe has the strength of arithmetic of second order. It used to have the same strength Lean has but they removed some impredicativity.

          1. Looking back at how I phrased the end of my reply, it reads pretty clearly that I was implying that the logical strength of the cores are either at or below PA. I didn’t mean to imply that, I meant to say that the pre-theoretical judgement of epistemic certainty that most people have about a system like PA is similar to what they would have about the core of most theorem provers when looking at them from that angle (I wasn’t aware that Lean’s was so high, but I still think the overlap between people who sincerely think PA is consistent and who think ZFC + n Inaccessibles is consistent is almost a circle).

            1. Lean uses a version of Martin-Löf type theory. I don’t know of anything that shows what most people’s pre-theoretical judgement of epistemic certainty would be for that compared to PA or ZFC + n Inaccessibles, but I found Martin-Löf type theory much harder to understand than PA or ZFC, and I am much more confident that PA is consistent than I am that ZFC + n Inaccessibles or the system Lean uses is consistent.

            2. I am reading about what Lean actually assumes. It is not difficult to believe for a ZFiste, but involves unfamiliar bookkeeping. It helps that I do know what dependent type theory is. Ive even written a dependent types based theorem prover ;-) I am armed with an actual paper asserting the equiconsistency result I cite and intend to read through it carefully.

  3. I think I should say more about two of the earlier arguments:

    The Lean kernel is fairly small and heavily used
    There can be only so many bugs

    (While I don’t find them fully convincing, that’s not to say they don’t have a point, or are completely wrong-headed, or anything like that.)

    The Lean kernel is fairly small and heavily used

    I don’t think it’s only the kernel that has to be trusted. I think the elaborator must be trusted as well (otherwise the proof in the form that’s checked might not be what was intended). The elaborator has a nontrivial task:

    A definition or proof may give rise to thousands of constraints requiring a mixture of higher-order unification, disambiguation of overloaded symbols, insertion of coercions, type class inference, and computational reduction. To solve these, the elaborator uses nonchronological backtracking and a carefully tuned algorithm.

    It may also be necessary to trust some other parts of the system and, to some extent, whatever C++ compiler was used. (One way to guard against C++ compiler bugs is to use several different C++ compilers; however, I don’t think many researchers will do that.)

    Now let’s consider the kernel. How large is it? The quote above is from The Lean Theorem Prover (system description) which also says the kernel is 6k lines of C++ code plus 700 lines for a second layer that “provides additional components such as inductive families.”

    (That document seems to be from 2015 and is probably about Lean 3. The Lean 4 kernel might be smaller, but I don’t think radically so.)

    There’s plenty of room in 6000 lines for bugs.

    I earlier gave Timsort as an example. I’m pretty sure it’s much smaller than the Lean kernel and far more heavily used. Yet it still had a bug.

    Timsort is relatively complex for a sorting algorithm. I looked at a Java version, and it was 497 lines of Java code (plus 359 lines of comments and 72 blank lines). That’s an order of magnitude smaller than the Lean kernel.

    Timsort is not as well known as sorting algorithms such as quicksort. It is nonetheless very heavily used. The page that explains how the bug was found and fixed says

    TimSort was first developed for Python, but later ported to Java (where it appears as java.util.Collections.sort and java.util.Arrays.sort) by Joshua Bloch (the designer of Java Collections who also pointed out that most binary search algorithms were broken). TimSort is today used as the default sorting algorithm for Android SDK, Sun’s JDK and OpenJDK. Given the popularity of these platforms this means that the number of computers, cloud services and mobile phones that use TimSort for sorting is well into the billions.

    That quote mentions bugs in even simpler code, described here: Extra, Extra – Read All About It: Nearly All Binary Searches and Mergesorts are Broken. There Bloch also says

    We programmers need all the help we can get, and we should never assume otherwise. Careful design is great. Testing is great. Formal methods are great. Code reviews are great. Static analysis is great. But none of these things alone are sufficient to eliminate bugs: They will always be with us. A bug can exist for half a century despite our best efforts to exterminate it. We must program carefully, defensively, and remain ever vigilant.

    The Lean 3 FAQ contains a very relevant warning:

    Is Lean sound? How big is the kernel? Should I trust it?

    Lean has a relatively small kernel. … There are also two independent checkers: tc and trepplein. … If you are really concerned about soundness, we recommend you often export your project and recheck it using the independent checkers above.

    The Lean 4 FAQ has a much shorter answer to that question and, though it mentioned independent checkers, doesn’t recommend their use — perhaps because they don’t yet exist for Lean 4.

    This question and answer from the Lean 4 FAQ is also relevant:

    Should I use Lean?

    Lean is under heavy development, and we are constantly trying new ideas and tweaking the system. It is a research project and not a product. Things change rapidly, and we constantly break backward compatibility. Lean comes “as is”, you should not expect we will fix bugs and/or add new features for your project. We have our own priorities, and will not change them to accommodate your needs. Even if you implement a new feature or fix a bug, we may not want to merge it because it may conflict with our plans for Lean, it may not be performant, we may not want to maintain it, we may be busy, etc. If you really need this new feature or bug fix, we suggest you create your own fork and maintain it yourself.

    I’ll leave “There can be only so many bugs” for a separate comment, since this is already long.

    1. There can be only so many bugs

      This argument has been made a couple of times in earlier comments from Randall Holmes:

      That doesn’t mean there might not be a bug somewhere; but if there is, it will be found and fixed and the odds are that basically every proof done in Lean will still run, and the others will be fixed; the kernel is not very large, there cannot be very many bugs, only finitely many iterations of such a process. My guess would be that there are none, by now. It is heavily used.


      There can only be so many bugs. If one is found and fixed, my guess would be that almost everything would work, and proofs that didn’t work (that actually relied on the bug) would then be fixed… and the process would iterate, and will in fact terminate with a correct kernel. I think the general belief is that this is already the case

      Setting aside the parts about kernel size and heavy use (already addressed above), and the idea that all affected proofs could be fixed (which seems questionable to me), and that a bug is unlikely to affect may proofs anyway, it’s an argument that all bugs will eventually be fixed (and may well already have been); and that’s what I’ll talk about.

      One problem is that fixing bugs can introduce new bugs, as can other changes. (Other changes are made to the Lean kernel from time to time.) And bugs can be very hard to find, or hard to fix. Although it can be argued that most changes — including most bug fixes — won’t introduce new bugs, so that the number of bugs should still decrease (despite temporary setbacks) as time goes on, there’s no guarantee that the finding-and-fixing process will terminate with a bug-free kernel any time soon, or that, even if it does get the kernel to a bug-free state, the kernel will stay bug free.

      However, my point has not been that Lean is full of bugs or is likely to have serious bugs. I’ve said all along that formal verification in Lean can increase our confidence that mistakes haven’t been made in a proof, and that can be so only if Lean’s kernel is reasonably reliable.

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top