How much information is in DNA? (dynomight.substack.com)
83 points by crescit_eundo 3 days ago | 62 comments
chromatin 2 days ago [-]
The article massively undersells the information content of the genome in several key ways. A non-comprehensive list of these (before my morning coffee forgive me) includes:

- DNA methylation (https://en.wikipedia.org/wiki/DNA_methylation)

- Interactions of alleles (what the article refers to as the "two versions of each base pair")

- Duplications, deletions, inversions, and other structural variations (https://www.genome.gov/genetics-glossary/Structural-Variatio...)

- Physical proximity interactions in 3-dimensional space (https://cmbl.biomedcentral.com/articles/10.1186/s11658-023-0...)

- Combinatorial effect (massive) of different alleles in complex systems

Overall, it's not sensible to compare a linear sequence of bits, like a CD (sibling comment) or DVD (the article), to the linear sequence of the genome and conclude that their information content, based on length alone, is in any way comparable.

Daniel_sk 18 hours ago [-]
Exactly. The compression level of DNA is orders of magnitude better than anything we can even come close to. DNA usually doesn't even contain specific counts (like 5 fingers on a hand) or sizes of organs and so on - these are given by the processes that run in parallel and cause the cells to hit spatial / chemical / electrical or other limits. It's like putting lots of house builders at specific places where the house should be, and each one would just keep building a wall until he hits another one. There is no compressed house plan, it's a compressed "engine" that builds the result.
Earw0rm 16 hours ago [-]
Comparing it to machine code on CD/DVD might make more sense then. Machine code where every line has been hand-optimised by nature's hackers over 500 million years.

And in that context, hundreds of MBs is a heck of a lot of complexity.

clickety_clack 2 days ago [-]
You put my reaction to this in much more educated terms. I’ve always felt that thinking of DNA as bits was a bit simplistic. Just because we store information as bits it doesn’t mean that nature does.

Not that it means they can’t be right, but the author also doesn’t seem to have any particular expertise in genetics. Their ideas need to survive a lot more criticism by people who know what they’re talking about before you could start to see them as convincing.

ses1984 1 days ago [-]
The raw bits of the base pairs are just one component of the information, but it’s like a maximally compressed version of the info.

The laws of physics are another component.

From there you would need to simulate nature to be able to decompress all the data, like how computer programs can use procedural generation.

Imagine a game like Minecraft. You can generate practically infinitely many screenshots of Minecraft worlds, but all that data can be derived from the game code and the jvm.
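
To make the procedural generation point concrete, here is a minimal sketch of deriving arbitrarily many "world chunks" from one small seed. It is not how Minecraft actually generates terrain; the function name `generate_chunk`, the hash-based noise, and the 0..63 height range are all assumptions made up for illustration.

```python
import hashlib

def generate_chunk(seed: int, x: int, z: int, size: int = 16) -> list[list[int]]:
    """Deterministically derive a size x size height map for chunk (x, z).

    Hypothetical illustration: each height is taken from a SHA-256 hash
    of the seed and the coordinates, so nothing is stored except the seed.
    """
    heights = []
    for i in range(size):
        row = []
        for j in range(size):
            key = f"{seed}:{x}:{z}:{i}:{j}".encode()
            digest = hashlib.sha256(key).digest()
            row.append(digest[0] % 64)  # height in the range 0..63
        heights.append(row)
    return heights

# The same seed and coordinates always reproduce the same chunk.
chunk_a = generate_chunk(seed=42, x=0, z=0)
chunk_b = generate_chunk(seed=42, x=0, z=0)
other_chunk = generate_chunk(seed=42, x=1, z=0)
```

The output can be made as large as you like by asking for more chunks, yet everything is recoverable from the seed plus the generator code, which is the sense in which the data is "derived" rather than stored.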

jampekka 21 hours ago [-]
> The raw bits of the base pairs is just one component of the information, but it’s like a maximally compressed version of the info.

This sounds a bit suspect. Maximally compressed version would be very sensitive to mutations which wouldn't be great for adaptation via mutations. My understanding is that only a small fraction of mutations lead to unviable phenotypes.

Also AFAIK the current understanding is that the majority of DNA is "junk", i.e. it doesn't seem to affect the phenotype. Which would be a partial explanation for the above.

The process of genetic expression is indeed something like procedural generation, but if maximal compression is about something like Kolmogorov complexity, the produced phenotype doesn't contain more information than the genetic information.

deng 2 days ago [-]
He does mention structural interactions as well as duplications/deletions/inversions. I would argue methylation is more like an annotation of DNA and not part of the DNA itself, but that's a matter of opinion.

In the end, the author literally says: "nobody knows". Yes, you cannot compare a linear sequence of bits to a macromolecule that interacts structurally with its environment, and the author does not make that claim. The question he tries to answer is: how much data is needed to re-create a similar macromolecule that interacts in a similar way. His main point, in which you both agree: only the exons are surely not enough because the encoded proteins are just a (small?) part of how DNA interacts.

kjkjadksj 2 days ago [-]
Exons are almost like functions, whereas a gene is almost like a class definition. In different tissues in the body a gene might be alternatively spliced to lead to different protein isoforms. In effect, making use of only a subset of available functions in the class depending on certain input parameters or how the class is called.
throwanem 1 days ago [-]
This is a Star Trek version of the subject, in that it is pure technobabble which happens to mention a few real terms.
acchow 1 days ago [-]
As someone with both a biochem and CS background, I found the comment insightful and clear. Zero technobabble to my ears.
throwanem 22 hours ago [-]
What does casting biochem in the metaphor of CS abstractions, in this example, clarify? What does it elucidate? What further predictions does it allow us to make about either subject of the metaphor? Can those predictions be tested? Do they make sense enough for that to be a meaningful question?

Show me how this isn't a more confusing than useful explanation, even for the bright ten-year-old or so at whose level it appears to be pitched, and I'll grant it may have some value.

foobarian 2 days ago [-]
I find that even if this just provides a lower bound it is still an interesting piece of information.
lotharcable 1 days ago [-]
Yeah...

We know now that environmental factors change how DNA is expressed as well through epigenetics.

I don't know how any of it works. Something to do with the shape of the DNA when it is wound up and how it changes the output when RNA produces proteins.

This is how parents can do things like pass some of the athleticism they earn through training to their children. It is possible for athletic parents to pass genes in such a way that it produces children even more athletic than they were.

All of this means that DNA has the ability to encode information and produce proteins in different ways using the same sequences.

So I am guessing that a lot of the DNA that is considered "junk" may not actually be. They are just missing a piece of the puzzle in how it gets read in.

moralestapia 1 days ago [-]
But all of those emergent effects are accounted for in the DNA sequence [1], so the estimate is fine.

1. Maaaaybe you could make a case for DNA methylation, but that still requires some DNA signatures so ...

stenl 2 days ago [-]
A much more detailed and thoughtful (and peer reviewed) take on the same question from my colleague Jussi Taipale: https://www.embopress.org/doi/full/10.15252/embj.201696114
vintermann 2 days ago [-]
Information can only be defined with respect to states where you 1. Can tell (or could in theory tell) the difference and 2. Care about the difference between states. The differences you care about, and the ones you don't, are baked in whenever you use any definition of information.

It doesn't matter much, unless you use it to sneak in what you think we should care about, or use it to make philosophical arguments whose circularity is carefully hidden.

tringuyen_cse 1 days ago [-]
I have a similar view. The question of how much information there is does not matter by itself without some context/application.
tetris11 2 days ago [-]
I thought the main advantage of DNA storage was the physical size of it, and how many different genomes you could have stacked next to each other in the same -70degree space.

Millions of chimeric cells on the same petri dish? That's 1PB on a single glass slide.

Depending on the sequencing tech paired with the rise of Spatial data, the read speed could be formidable.

Needlessly complex setup though. Let's just stick with metals for now.

out_of_protocol 2 days ago [-]
DNA self-disintegrates very fast. It only works in living cells because it is being repaired non-stop
throwanem 2 days ago [-]
Even reading is a destructive process, and the physics involved are incomprehensibly complex by comparison with anything in the digital domain.
kjkjadksj 2 days ago [-]
There are ways to read it nondestructively. One way does trade resolution but once prepped the DNA itself can be imaged to be read.

https://en.wikipedia.org/wiki/G_banding

throwanem 1 days ago [-]
That does not read DNA. A chromosome is not a strand and a karyotype is not a sequence. In any case, I described what a ribosome does.
coolcase 18 hours ago [-]
How do forensics for old cases work? Is it because of probability. You don't need all the data but if enough of it matches it convinces the scientist of the match?
shishironline 2 days ago [-]
Sorry, it is one of the most stable organic molecules and can stay intact for thousands of years. That is why the Jurassic Park like fantasies are based on a truth and many extinct species have been brought to life through DNA in reality too.
chermi 2 days ago [-]
I think maybe they are talking about the very tightly packed yet still functionally accessible 3d structure that is chromatin, not individual strands.
misnome 2 days ago [-]
No, they haven’t. Any claims otherwise are as real as “T Rex Leather” handbags.
gitroom 2 days ago [-]
Man, the back and forth here before coffee is actually kinda hilarious - I get all worked up before caffeine too, but honestly, DNA being this messy scratchpad feels way more interesting than treating it like a tidy CD. The messiness kinda rules, if you ask me.
sgt 23 hours ago [-]
Exactly why I prefer organizing my windows (on my computer) in a chaotic way, rather than a tiling manager. That's the DNA way of doing it ;)
RainbowcityKun 2 days ago [-]
- Cells work like this because DNA is under constant attack from mutations.

- Mutations most commonly arise during cell replication.

It's fascinating to realize that the "messiness" of DNA isn't a bug, but a feature—a side effect of evolution's raw material supply chain.

Mutations, repeats, transposons, and imperfect repairs all contribute to a noisy genomic landscape. But it's exactly this noise that enables biological diversity. No mutations, no variation. No variation, no selection. No selection, no evolution.

The genome is not a blueprint—it's a living, adapting scratchpad. Messiness is the canvas on which nature paints diversity.

esafak 2 days ago [-]
Don't forget sexual reproduction.
nickpsecurity 1 days ago [-]
Let me add to that. It requires a universe with specific laws that remain stable and encourage optimization. Then, a planet hospitable to life. Then, specific creatures with biological machinery more complex than anything humans have created. The machinery has plenty of reliability and adaptation baked in.

Godless evolution suggests randomness produced all of it over time. Yet, that's never worked in anything we've built. Even our GAs required laws, an environment, a computer, software, and fine-tuning. Pre-existing or by intelligent design (human inventors). Without these, it produced no results.

So, I'll correct you by saying empirical data suggests evolution didn't produce this. We're seeing God's design skills in adaptive, resilient, complex, self-replicating systems. His work is truly beautiful to behold. Humans still can't produce something similar from scratch. Actually, they can't even be sure how the existing design works.

RainbowcityKun 1 days ago [-]
I want to clarify first: I'm not trying to defend "evolutionary theory" itself — what I'm pointing out is:

> Mutation, chaos, and randomness may actually be the fertile ground where biological diversity emerges.

At the same time, I fully agree with your key point:

> "The adaptive, complex, self-replicating systems we see don’t persist just because of pure randomness."

In my view, this doesn’t necessarily mean a “God” designed it in a human-like way. But it does point to a deeper structural order and cosmic regularity.

Maybe we can call it a kind of “design of laws,” rather than a personal designer.

After all, nature seems to operate within a set of elegant, consistent rules:

- F = ma (Newton's 2nd Law): A foundational rule in classical mechanics.

- E = mc² (Einstein): Energy and mass are interchangeable.

- V = IR (Ohm’s Law): Governs how voltage, current, and resistance relate.

- a² + b² = c² (Pythagorean Theorem): Geometry’s timeless backbone.

- Entropy always increases (2nd Law of Thermodynamics): Order tends toward disorder unless something resists it.

So maybe we can say:

- In religious terms, this is “God’s design.”

- In philosophical terms, it’s the “underlying order of the universe.”

- In scientific terms, it’s the “laws of nature, structural stability, and the boundary conditions of evolution.”

kaibee 1 days ago [-]
> Godless evolution suggests randomness produced all of it over time.

Nope. Randomness _and_ a selection function. Natural selection, ie: surviving to create the next generation.

> Yet, that's never worked in anything we've built.

It works completely fine in things we've built. We don't have the processing power to simulate something on the scale of the computational complexity happening in a small tide pond though. But you can see 'evolution by natural selection' in a rule set as simple as Game of Life.

> Even our GA's required laws, an environment, a computer, software, and fine-tuning. Pre-existing or by intelligent design (human inventors). Without these, it produced no results.

The laws/environment/computer are the equivalent of having a universe with physical laws. If you want to claim that god created the universe and tuned the constants of the universe, well, maybe. Or maybe every possible universe exists and we're just not around in the ones that don't lead to conscious life, in the same way that Game of Life universe is too simple/constrained to evolve conscious life on the scales we can simulate.

https://www.youtube.com/watch?v=vHb07ynsPgo

nickpsecurity 1 days ago [-]
"Randomness _and_ a selection function"

It takes more than that for the chemical bonds to form, for the encoding to exist, for the bootstrapping environments to form, for the transitions to happen, and so on. Also, if a selection function exists, where did it come from and why does it work? Why does the math work? Why isn't math less useful or changing constantly?

"But you can see 'evolution by natural selection' in a rule set as simple as Game of Life."

That's false. You're repeating the same false premises as in the original claim I refuted. If godless and random could do it, then the questions below would all be No.

Does the game run in an environment made by intelligent designers? Does that environment need to be maintained?

Does it require rules made and maintained by intelligent designers?

Does it take an initial state in those rules to get to the specific outcomes you are looking for?

Does it produce simple, temporary patterns that are useless? Or complex machinery that's actually useful?

Or did all of the above happen randomly, keep happening, and produce increasingly complex and useful things?

"Or maybe every possible universe exists"

Science starts from observations to produce hypotheses. That is a faith-based belief popular in science fiction. It's also sort of a cop out because they're going to imagine something as infinite as God, but not mention God, to hope this would pop out randomly. If one does, they still have the "maintain it with stability over long periods" problem for that or those universes. They'll probably drag it deeper into infinity to say it will finally happen accidentally. Let's do science instead.

What we observe is a universe that is highly chaotic, almost every cubic inch is deadly, and the safest places are dead. We see nothing happening from it with Earth and humans being mind-boggling exceptions. Looking deeper at classical physics, we find reality itself also emerges in an orderly fashion from endless, quantum events that should be too random to support order. It also appears to work perfectly without failure for long periods of time.

We've also observed countless phenomena that are truly random and chaotic, like July 4th fireworks, which never produce life or complex machines. Never self-replicating artifacts whose complexity increases over time. Never emergent intelligence from anything that didn't show evidence of design or have human input. We have billions of observations of chaotic events which themselves sometimes have a high magnitude of particles, chemicals, etc. Also, nothing lasts on its own due to physics with our intelligent designs requiring maintenance over time.

Our first hypothesis is that our reality should be total chaos. Our second hypothesis is that something with unimaginable power is forcing a specific order to consistently come out of chaos. Third hypothesis is that the universe doesn't support life without being forced to. Fourth hypothesis is an intelligent being went uphill against the deadly universe to create us and our planet. Fifth hypothesis is that being is sustaining us despite a whole universe of threats to our lives. Sixth is that the creator is perfect. God is the Occam's Razor explanation of all of this.

There's also revelatory knowledge. God revealed Himself to us via His Word which came with prophecies, miracles, and testable predictions about lifestyles. Jesus, who died for humanity's sins, had a perfect life on top of the same, other attributes. Nobody and nothing else had these traits to support their claimed revelations. So, outside empirical knowledge, revelatory knowledge reinforces the God theory into a highly-proven, saving belief. The life transformations that follow add anecdotal evidence to it.

SalmoShalazar 1 days ago [-]
Really? Creationists on HN? There are mountains of peer reviewed research articles you can read to see that evolution is real and evidence based. To claim otherwise is idiocy.
nickpsecurity 9 hours ago [-]
Most top scientists were deists or Christians at one point. Newton's Principia Mathematica was even written to glorify God. Clearly, neither atheists nor evolutionists found the number of people making that claim to be good enough to ignore another claim.

Scientists tell us all ideas, whether a proposal or dissent, are evaluated strictly on evidential merit. Yet, evolution as origin of life had little evidence, many flaws, was forced on people anyway, and dissenting papers aren't allowed.

If it is dogmatic, and dissent isn't allowed, it is not science at all. Just a godless religion or political domination done with scientific wording in their papers. A consensus by people who force everyone to think one way isn't a scientific consensus. A theory whose rebuttals aren't even allowed in scientific journals isn't a scientific theory.

Until alternatives are allowed, and a real debate happens, I reject macro-evolution as either the truth or even a scientific consensus. I'll throw in some example counters, most being strong, which I wasn't taught in high school or college.

https://www.epm.org/resources/2010/Oct/3/ten-major-flaws-evo...

https://www.icr.org/article/four-scientific-reasons-that-ref...

jacktensuited 23 hours ago [-]
[dead]
gfalcao 1 days ago [-]
I would like to get a reasonably good intuition with regard to the total amount of compound DNA from human bodies at different biochemical states, in different locations around the world (different climates). By "compound DNA" I mean including the DNA of bacteria, fungi and viruses living within one's body. For instance, gut bacteria acquired and maintained based on food intake and environmental influence.
gfalcao 1 days ago [-]
In other words, how much does the perceived amount of DNA data in gigabytes grow by in different circumstances? Would it grow by a few more gigabytes?
xvilka 22 hours ago [-]
If we add more nucleotides[1] than standard 4, we could encode much more.

[1] https://en.m.wikipedia.org/wiki/Xeno_nucleic_acid
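
A rough way to quantify this: with an alphabet of n equally likely symbols, each position carries log2(n) bits. The sketch below assumes hypothetical alphabet sizes of 6 and 8 purely for comparison; `bits_per_base` is an illustrative helper, and it ignores any chemical or biological constraints on which sequences are viable.

```python
import math

def bits_per_base(alphabet_size: int) -> float:
    """Upper bound on information per position for a uniform alphabet."""
    return math.log2(alphabet_size)

# Compare the standard 4-letter alphabet with hypothetical larger ones.
for n in (4, 6, 8):
    print(f"{n} letters: {bits_per_base(n):.3f} bits per base")
```

So doubling the alphabet from 4 to 8 letters adds one bit per position, a 50% gain per base rather than a doubling.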

amelius 2 days ago [-]
Another question is:

How much information can you __store__ in DNA without affecting the organism too much?

timewizard 1 days ago [-]
Very little. The base pairs have specific electrochemical properties. The content of DNA controls its structure.
roxolotl 2 days ago [-]
Discussion from earlier this week: https://news.ycombinator.com/item?id=43927321

Pretty sure the substack and main site are the same. First paragraph is at least.

timewizard 1 days ago [-]
> But mitochondrial DNA is tiny so I won’t mention it again.

Which is a bummer because it is circular. There is also a point on the strand where two separate genes overlap. The end of one has the same code as the beginning of another.

So even DNA has its own native compression scheme.

metalman 2 days ago [-]
DNA contains all of the actually relevant information that exists, including whatever sequence gives rise to the very conceptualisation of information, so in fact everything else that could be considered "information" is derived from DNA.
frshOffTheBoat 1 days ago [-]
Very science, thanks!!
nuc1e0n 2 days ago [-]
The article says that DNA is designed to keep working despite mutations occurring. What evidence does the author put forward to suppose it was designed rather than evolved? There's plenty of evidence to support it evolved BTW.
iamtheworstdev 2 days ago [-]
you might be reading a little too much into that word
hsshhshshjk 2 days ago [-]
And likely on purpose too
decremental 2 days ago [-]
[dead]
rhelz 3 days ago [-]
In any case, 6.2 billion bits (interestingly enough, almost exactly as much information as is on an audio CD, the kind you used for your romantic mixtapes) is an upper bound.

This rules out pretty much every nutty theory which evolutionary psychologists propose. Such as we evolved for altruism, we evolved to believe in religion, etc etc. Complete B.S. Exactly how much information would you need to specify a behavior like being predisposed to a belief in religion??? There's less than 80 minutes worth of music's worth of information in our genomes, and most of that is concerned with just keeping us alive.

You are not predisposed to be anything. Go create the kind of person you want to be.
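
For reference, the CD comparison above can be checked with back-of-the-envelope arithmetic. This sketch assumes about 3.1 billion base pairs at 2 bits each (which is what the 6.2 billion bit figure implies) and standard CD audio parameters (44.1 kHz, 16-bit, stereo); the variable names are illustrative only.

```python
# Assumed genome size implied by the 6.2 billion bit figure.
BASE_PAIRS = 3_100_000_000
BITS_PER_BASE = 2  # four possible bases -> log2(4) = 2 bits

genome_bits = BASE_PAIRS * BITS_PER_BASE
genome_megabytes = genome_bits / 8 / 1_000_000

# Standard CD audio: 44.1 kHz sample rate, 16 bits per sample, 2 channels.
cd_bits_per_second = 44_100 * 16 * 2

minutes_of_cd_audio = genome_bits / cd_bits_per_second / 60

print(f"genome: {genome_bits:,} bits (~{genome_megabytes:.0f} MB)")
print(f"equivalent CD audio: {minutes_of_cd_audio:.1f} minutes")
```

Under these assumptions the raw sequence comes to roughly 775 MB, or a little over 73 minutes of uncompressed CD audio.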

out_of_protocol 2 days ago [-]
> There's less than 80 minutes worth of music's worth of information

Or an awful lot of text information (state-of-the-art compressors can do up to 1:10 ratio for plain text, decoder itself is rather small, 750MB compressed could potentially contain like 7GB of text data).

Also, look at demoscene. 4k (4 kB is the size of executable) can do crazy things, and 64kB can fit a lot of nice 3D objects, music, text, complex effects etc. and still weigh less than any screenshot of any moment of the running demo. In 95kB you can have a full game (google .kkrieger)

P.S. better example: full snake game in 56 BYTES https://github.com/donno2048/snake

For comparison, the link above is 34 bytes, the whole sentence is 83 bytes. It's possible to do a lot if we're talking about code

bob1029 2 days ago [-]
> Also, look at demoscene. 4k (4 kB is the size of executable) can do crazy things

There are limits to how Kolmogorov complexity scales up. Many of these tricks are exploiting procedural techniques that can be expressed in minimal terms. Once you start feeding in actual information that is not feasible to express procedurally (i.e., is already compressed/high-entropy), you are forced to accumulate bits. An obvious example of this would be incorporating a texture that is multiple megabytes when compressed as a jpeg on disk.
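
A small sketch of that distinction, using zlib's compressed size as a crude stand-in for Kolmogorov complexity (it only gives an upper bound, and a weak one). The two buffers are made-up examples: one produced by a trivial rule, one drawn from the OS random source.

```python
import os
import zlib

SIZE = 1_000_000

# Data produced by a tiny rule: the byte values 0..255 repeated.
procedural = bytes(i % 256 for i in range(SIZE))

# High-entropy data: there is no short rule for a compressor to exploit.
random_like = os.urandom(SIZE)

compressed_procedural = zlib.compress(procedural, 9)
compressed_random = zlib.compress(random_like, 9)

print(f"procedural: {SIZE:,} -> {len(compressed_procedural):,} bytes")
print(f"random:     {SIZE:,} -> {len(compressed_random):,} bytes")
```

The rule-generated buffer shrinks to a tiny fraction of its size, while the random one stays at essentially full size, which is the "forced to accumulate bits" case.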

out_of_protocol 1 days ago [-]
Evolution also uses dirty tricks all the time, for no reason. E.g. the same region gets reused for totally unrelated use-cases.

> An obvious example of this would be incorporating a texture

Some random range of storage data is now the texture. It was used to process formatting logic but now also a texture

Valgrim 2 days ago [-]
There's an interesting implication to this. We assume that evolution happens when random mutations (similar to random bit flips, removal or injection?) occur and when the random result has an advantage, the mutation tends to remain in the gene pool.

Yet at the same time the result of this random code is extremely compressed, to the point we compare it to procedural generative code.

Not sure what we can do with this but it certainly seems like we can once again get inspired by nature on this one.

robviren 2 days ago [-]
I'd argue you could even take that one step further. Limiting it to the data encoded by DNA does not take into account what it is interacting with. DNA interacts with an ocean of protein leading to untold numbers of interactions. The DNA could just be the operating system in all this calling upon RNA and other "devices" to execute functions.

To expand upon your compression idea, the index it is using exists outside the DNA encoding itself which means it could be holding an absolute ton of data.

Bonus: https://xkcd.com/3056/

EvanAnderson 1 days ago [-]
> The DNA could just be the operating system...

I am fond of the analogy of DNA to procedural generation. The "operating system", as I see it, is physics. Everything else is primitives built on top of that.

Our brains can't begin to comprehend the untold multitudes of interactions occurring at a molecular scale over geologic time.

bossyTeacher 2 days ago [-]
Indeed. Chances are that the DNA itself is but one part of the puzzle. The protein soup the DNA interacts with is partially random and partially a consequence of the DNA itself and that interplay is likely a complexity space several orders of magnitude bigger than the DNA itself
ruuda 2 days ago [-]
> There's less than 80 minutes worth of music's worth of information in our genomes

That’s a very misleading take, this is lossless audio and the majority of the bits are spent encoding noise. You can encode way more audio at perceptually but not technically lossless level in that space.

guilbep 2 days ago [-]
There is no logic behind your argument
chromatin 2 days ago [-]
> There's less than 80 minutes worth of music's worth of information in our genomes

What an insanely bad take.

Not only did you not read and/or comprehend the article, the article itself undersells the information content of the genome (I'll post on this at the top level).

> You are not predisposed to be anything.

This does not logically follow your preceding statement, even if we were to accept the foregoing limited information content as factual

nathan_compton 2 days ago [-]
This isn't a great argument - simple rules can produce complicated behavior and, at any rate, I don't think any evpsych people believe that evolution inescapably predisposes people to the things you talked about, only that evolution has produced biases in our behavior which manifest (at certain times and in certain circumstances) as those phenomena.
GuB-42 2 days ago [-]
An audio CD is a very inefficient way of storing information.

I think a more apt comparison would be that of an LLM of that size. qwen:0.5b is about 400MB, its abilities are laughable compared to the likes of ChatGPT, but it can write coherently about general topics. For instance:

  >>> why would people be altruistic
  People are likely to be altruistic because they believe that helping others is better for everyone involved.
  People may also believe in the power of compassion and empathy towards others, which can contribute to greater altruism.
  Overall, people are likely to be altruistic because they believe that helping others is better for everyone involved.
It is not a statement about LLMs, more about what you can achieve with "just" 400MB for storage. The other similarity is that LLMs are also "messy", if you want to see the results of finely crafted work in a really small amount of space, look at what sizecoders can do with a few kB of code or less.
nurettin 2 days ago [-]
You are predisposed to acting like your closest social circle.