Philosophical Hacker

Anthropic's Argument for Mythos SWE-bench improvement contains a fatal error
2026-04-26
Mythos’ system card contains the following graph to support its argument that Mythos performs better on SWE-bench: Anthropic and others are worried LLMs are memorizing SWE-bench, so they asked an LLM to estimate the probability that each solution is memorized. Next, they calculated the pass rate when including only the solutions the LLM judged to have less than a 5% probability of being memorized, then 10%, and so on. To pick a point on the graph as an example: if they include ~400 out of 500 solutions because an LLM has judged them as memorized with a probability <= 60%, Mythos’ success rate is ~92% while Opus 4.…
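The filtering step described in this excerpt can be sketched in a few lines of Python. This is only an illustration of the idea as summarized here, not Anthropic's actual pipeline: the function name `pass_rate_below_threshold`, the `(memorization_probability, passed)` record shape, and every number in `results` are made up for the example.

```python
def pass_rate_below_threshold(results, threshold):
    """Return (pass_rate, kept_count) over solutions judged unlikely to be memorized.

    results: list of (memorization_probability, passed) pairs (hypothetical shape).
    threshold: keep only solutions whose judged probability is <= threshold.
    """
    kept = [passed for prob, passed in results if prob <= threshold]
    if not kept:
        # No solutions survive the filter, so there is no pass rate to report.
        return None, 0
    return sum(kept) / len(kept), len(kept)


# Invented data: (LLM-judged memorization probability, did the model pass?)
results = [
    (0.05, True),
    (0.20, True),
    (0.40, False),
    (0.55, True),
    (0.70, True),
    (0.95, True),
]

if __name__ == "__main__":
    for threshold in (0.05, 0.10, 0.60, 1.00):
        rate, kept_count = pass_rate_below_threshold(results, threshold)
        print(f"threshold={threshold:.2f} kept={kept_count} pass_rate={rate}")
```

Note that each point on such a curve is computed over a different subset of problems, so the denominator changes as the threshold moves.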
LLMs and the Russellian Inversion
2025-08-23
Bertrand Russell, the 20th-century philosopher and mathematician, once said: The fundamental cause of the trouble is that in the modern world the stupid are cocksure while the intelligent are full of doubt. LLMs are introducing something like this dynamic into programming. The programming version of the statement is: The fundamental cause of the trouble is that in the modern world the less experienced programmers are writing much more code than the more experienced ones.…
llms
Libraries are under-used. LLMs make this problem worse.
2025-07-30
Libraries are under-used. Why? Briefly: Writing code is more fun than reading documentation. We tend to underestimate the complexity of problems we don’t understand well, so we undervalue libraries that solve these poorly understood problems. Perverse incentives: libraries compete with big internal engineering projects that look good in a promo packet. LLMs make this problem worse. Why? Less briefly: Vibe coding is more fun than reading documentation. Shit, vibe coding can be more fun than ordinary coding.…
LLM Proofing Our Takehome Challenge
2025-06-20
Our original idea for our coding challenge was to ask candidates to build a tic-tac-toe game in React, with a few curveballs thrown in around generalizing the solution to larger game boards and other play modes. We scrapped that idea when we discovered ChatGPT could trivially do this. Here are some things we did to “LLM-proof” our new challenge. But first, why the scare quotes around “LLM-proof”? Because I’m sure someone could get an LLM to solve the new challenge, but the quality of prompting that would be necessary to pull this off would be an indication of a strong engineer.…
Value-based pricing can be a trap for early startups
2024-10-03
Founders are often told to price based on the value they are providing to their customers. For example, if you’re saving your customer 1 million dollars, charge a 10th of that. Here’s Kevin Hale at YC advocating for this approach: In startups, and almost pretty consistently across all businesses, everyone will tell you, you should strive for value-based pricing. It allows you to charge a whole lot more. It allows you to manipulate this incentive to buy.…
startups
Reading SQLite Schema Tables the Hard Way
2024-09-25
Parsing a SQLite database file is a nice way to brush up on data structures, bit manipulation, and recursion. I know this because I recently implemented the read_schema_tables function below such that the following test passes:
    import sqlite3
    def test_read_table_names(db_file):
        con = sqlite3.connect(db_file)
        assert list(con.execute("SELECT * FROM sqlite_schema;")) == list(read_schema_tables(db_file))
read_schema_tables doesn’t use the sqlite3 Python package. That’d make for a trivial test case and would be too easy.…
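As a rough idea of where such a parser starts, here is a minimal sketch that reads only the file header and the b-tree header of page 1, the page where the schema table is rooted. It is not the post's read_schema_tables: the names `read_schema_page_info` and `make_demo_db` are hypothetical, and the byte offsets used (magic string at offset 0, page size at offset 16, page 1's b-tree header at offset 100) come from the published SQLite file format. The sqlite3 module is used only to create a throwaway database to read.

```python
import os
import sqlite3
import struct
import tempfile

HEADER_SIZE = 100  # the database file header occupies the first 100 bytes of page 1
MAGIC = b"SQLite format 3\x00"


def read_schema_page_info(db_file):
    """Return (page_size, page_type, cell_count) for page 1 of a SQLite file."""
    with open(db_file, "rb") as f:
        header = f.read(HEADER_SIZE)
        if header[:16] != MAGIC:
            raise ValueError("not a SQLite 3 database file")
        # Page size: 2-byte big-endian integer at offset 16; the value 1 means 65536.
        (page_size,) = struct.unpack(">H", header[16:18])
        if page_size == 1:
            page_size = 65536
        # On page 1, the b-tree page header follows the 100-byte file header.
        btree_header = f.read(8)
    page_type = btree_header[0]  # 0x0d marks a leaf table b-tree page
    (cell_count,) = struct.unpack(">H", btree_header[3:5])  # number of cells on the page
    return page_size, page_type, cell_count


def make_demo_db():
    """Create a small throwaway database (demo helper, not part of the parser)."""
    fd, path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    con = sqlite3.connect(path)
    con.execute("CREATE TABLE notes (id INTEGER, body TEXT)")
    con.commit()
    con.close()
    return path


if __name__ == "__main__":
    path = make_demo_db()
    print(read_schema_page_info(path))
    os.remove(path)
```

A full implementation would go on to read the cell pointer array, decode varints and record headers, and recurse through interior pages when the schema does not fit on a single leaf page.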
Which developers care most about security?
2024-08-06
We’re thinking about building a product for developers that enables them to build applications that operate on encrypted data via homomorphic encryption. We think developers have seen enough data leaks to want a product like this, but we’re worried we’re wrong. Even if we’re right, we’re worried about finding specific devs who can be early adopters. Where do these devs live? What languages do they use? What kind of companies do they work for?…
We Need Another Code Copilot
2024-06-21
I’ve been a programmer for a decade, and I can’t believe how much wasteful code we write. Even more unbelievably, many of us “justify” our waste with vague appeals to “clean code” or “best practices.” I used to do this all the time. These vague appeals, and the religious fervor that often accompanies them, betray a common lack of serious thinking about what makes code useful vs. wasteful. Instead, we have lots of shouting:…
On OpenAI's supposed "scientific certainty" that GPT-5 will be better than GPT-4
2024-05-13
When we were raising money for ATLAS, I often told investors that my cofounder and I were probably the most skeptical GenAI founders they would meet. Sam Altman’s recent hyperbolic claim at Stanford University that OpenAI has “scientific certainty that GPT-5 will be better than GPT-4” fuels this skepticism: I’m impressed by OpenAI, we use their models, and I’m sure Sam is a nice guy, but I cannot imagine that whatever evidence they have to think GPT-5 will be better than GPT-4 would be enough for “scientific certainty.…
The UIs ChatGPT Won't Replace
2024-04-29
What percentage of traditional UIs will be replaced by chat-based experiences? I’m building an AI/LLM-powered app guide and company that’s betting the percentage is low. Sci-fi is a part of what guides my intuition here. Jarvis didn’t make GUIs obsolete for Tony Stark. GPT-X won’t make them obsolete for us. Even if you don’t like sci-fi as a guide for the future, you can see that chat-based UIs won’t dominate it by looking closely at how we use computers today.…
