> to demonstrate that the device remains stable even after 60,000s
A little over 16 hours? That's suspiciously short. The endurance-vs-retention curve isn't clear from this article either: they say "10 years" and "5.5 million cycles", but it seems more like you either get 10 years and 1 cycle, or 5.5M cycles to immediate failure with no regard to retention.
It reminds me of this old paper on testing USB drives for endurance, where they just hammered at the flash until it failed to program immediately and "concluded" that the endurance was many orders of magnitude higher than the manufacturer's specifications, with no attention paid to retention at all: https://www.usenix.org/event/fast10/tech/full_papers/boboila...
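For scale, the gap between what was measured and what is claimed is easy to compute. A quick sketch, assuming a 365.25-day year for the "10 years" figure:

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumed year length for the "10 years" claim

test_duration_s = 60_000                     # stability window quoted from the article
claimed_retention_s = 10 * SECONDS_PER_YEAR  # the "10 years" retention claim

test_hours = test_duration_s / SECONDS_PER_HOUR
extrapolation_factor = claimed_retention_s / test_duration_s

print(f"measured window: {test_hours:.2f} h")
print(f"10-year claim is ~{extrapolation_factor:,.0f}x the measured window")
```

That comes to about 16.7 hours of data supporting a claim more than 5,000x longer.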
It's more like DRAM with a much longer refresh time (60 ks instead of 60 ms).
60,000 seconds was how long it was tested for; they then extrapolated that out linearly. It doesn't need refreshing that often.
> 60,000 seconds was the amount of time it was tested for,
Surely if they already have a test setup, then having a test last for 600,000 seconds isn't very hard?
Things that look linear over a short period can turn out to be exponential over longer periods. I don't think we can assume linear extrapolation here; there could be physics at play that makes the voltage degrade exponentially.
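A minimal illustration of that risk, using a made-up degradation model (loss grows as exp(t/TAU) - 1 with a hypothetical TAU = 200,000 s; nothing here comes from the paper): fit a straight line to the first 60,000 s and compare it with the model at 600,000 s.

```python
import math

TAU_S = 200_000.0  # hypothetical time constant; not from the paper

def true_loss(t_s):
    """Made-up accelerating degradation: exp(t / TAU) - 1."""
    return math.exp(t_s / TAU_S) - 1.0

# "Measure" only the first 60,000 s, like the article's stability test.
ts = [6_000.0 * i for i in range(11)]
ys = [true_loss(t) for t in ts]

# Ordinary least-squares straight line through the short window.
n = len(ts)
t_mean = sum(ts) / n
y_mean = sum(ys) / n
slope = (sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
         / sum((t - t_mean) ** 2 for t in ts))
intercept = y_mean - slope * t_mean

def linear_loss(t_s):
    """Linear extrapolation from the short-window fit."""
    return intercept + slope * t_s

max_window_error = max(abs(linear_loss(t) - y) for t, y in zip(ts, ys))
horizon_s = 600_000.0  # a test 10x longer than the one reported

print(f"max fit error inside 0-60 ks: {max_window_error:.4f}")
print(f"at {horizon_s:,.0f} s: line says {linear_loss(horizon_s):.2f}, "
      f"model says {true_loss(horizon_s):.2f}")
```

With these assumed numbers the straight line stays within about 0.01 of the model across the measured window, yet at 600,000 s it predicts roughly 3.5 where the model gives about 19.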
It's a good start for a test. But it seems odd: a paper like this would have taken much more than ~1 week to write, so making a test last ~1 week (600,000 s is about 6.9 days) seems well within this group's means. The data is oddly missing.
DRAM will also normally hold most of its data for 1000x longer than the rated refresh time (usually 60 ms). This has sometimes been used to recover secrets from powered-down computers. The rated refresh time is chosen to give near-zero errors over years of operation, accounting for worst-case leakage from any bit, but most bits leak much less than that.
Even if it did have a 16 hour retention, this memory would have plenty of uses.
Adding a couple of percent of ECC data tends to increase retention 10x anyway, so there is a direct engineering trade-off between retention and capacity.
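A rough way to see the capacity-for-reliability trade-off: assume independent bit flips with a hypothetical raw error probability of 1e-4 per bit, and a BCH-style code on a 4096-bit block that spends about 13 parity bits per correctable error (so correcting 8 errors costs roughly 2.5% extra storage). How the raw error rate grows with time, and therefore the actual retention gain, is not modeled here.

```python
DATA_BITS = 4096          # hypothetical 512-byte block
PARITY_BITS_PER_T = 13    # assumed BCH-style cost per correctable bit error
P_BIT = 1e-4              # hypothetical raw bit error probability

def block_failure_prob(n_bits, t, p):
    """P(more than t of n_bits independent bits flip), i.e. the block is uncorrectable."""
    ratio = p / (1.0 - p)
    term = (1.0 - p) ** n_bits  # P(exactly 0 errors)
    tail = 0.0
    for k in range(1, n_bits + 1):
        term *= (n_bits - k + 1) / k * ratio  # P(exactly k) from P(exactly k - 1)
        if k > t:
            tail += term
            if term < tail * 1e-18:
                break  # remaining terms are negligible
    return tail

results = {}
for t in (0, 1, 2, 4, 8):
    parity = PARITY_BITS_PER_T * t
    results[t] = block_failure_prob(DATA_BITS + parity, t, P_BIT)
    print(f"t={t}: overhead {parity / DATA_BITS:6.2%}, P(block fails) = {results[t]:.2e}")
```

Under these assumptions, going from no correction to 8 correctable errors takes the block failure probability from roughly 1 in 3 to below 1e-9.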
Assume the memory is instant. We still need to communicate with it across physical distance. How far away the memories are in space is way more critical than the speed of any one element in isolation.
Why are we constrained to such a relatively small amount of L1 cache? What would stop us from extending this arbitrarily?
HP was the only CPU vendor I recall that went for massive L1 caches, with their PA-RISC chips, some of which had 1-2 MB of L1. I'm going to say a large L1 is ~1 MB for this comment.
There are power, speed, and complexity trade-offs in cache design. Here are a few of them:
Direct mapping is the simplest approach: a given address can only exist in one location. The problem comes when two addresses map to the same cache line; one is evicted even if there's plenty of space elsewhere in the cache.
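A toy direct-mapped cache shows the conflict case. The sizes (64-byte lines, 512 lines, i.e. 32 KiB) are assumptions for illustration; two addresses exactly one cache size apart land on the same line and keep evicting each other while the other 511 lines sit empty.

```python
LINE_SIZE = 64   # bytes per cache line (assumed)
NUM_LINES = 512  # 32 KiB direct-mapped cache (assumed)

def dm_index(addr):
    """The one line this address is allowed to occupy."""
    return (addr // LINE_SIZE) % NUM_LINES

def dm_tag(addr):
    """High address bits stored with the line to tell addresses apart."""
    return addr // (LINE_SIZE * NUM_LINES)

def simulate_direct_mapped(addresses):
    lines = [None] * NUM_LINES  # stored tag per line; None means empty
    hits = misses = 0
    for addr in addresses:
        idx, tag = dm_index(addr), dm_tag(addr)
        if lines[idx] == tag:
            hits += 1
        else:
            misses += 1
            lines[idx] = tag  # evict whatever was there, even if other lines are free
    return hits, misses

a = 0x1000
b = a + LINE_SIZE * NUM_LINES  # exactly one cache size apart, so same index
hits, misses = simulate_direct_mapped([a, b] * 100)
print(hits, misses)
```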
What if we built a fully associative cache, where every line carries its own address tag? Then we can fully use the cache space. But it's far more complicated to search: a miss requires checking every cache line, and if the lookup is to be fast, a hit does too, since all the tags have to be compared in parallel.
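For contrast, a toy fully associative cache. LRU replacement and the sizes are my assumptions; the comment doesn't specify them. Two addresses one cache size apart, which would collide in a direct-mapped cache of the same size, now miss only on first use.

```python
from collections import OrderedDict

LINE_SIZE = 64   # bytes per line (assumed)
NUM_LINES = 512  # 32 KiB total (assumed)

def simulate_fully_associative(addresses):
    """Toy fully associative cache with LRU replacement; any line can hold any address."""
    lines = OrderedDict()  # line tag -> None, least recently used first
    hits = misses = 0
    for addr in addresses:
        tag = addr // LINE_SIZE  # no index bits: the whole line address is the tag
        # In hardware this membership test means comparing the tag against all
        # NUM_LINES stored tags at once, which costs power and area as the cache grows.
        if tag in lines:
            hits += 1
            lines.move_to_end(tag)  # mark as most recently used
        else:
            misses += 1
            if len(lines) == NUM_LINES:
                lines.popitem(last=False)  # evict the least recently used line
            lines[tag] = None
    return hits, misses

a = 0x1000
b = a + LINE_SIZE * NUM_LINES  # one cache size apart
hits, misses = simulate_fully_associative([a, b] * 100)
print(hits, misses)
```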
Many systems today use a mix. Smaller caches are often direct-mapped. Larger caches tend to be set-associative: effectively 2-8 direct-mapped caches in which an address is searched at the same time, or within a few cycles of each other.
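The set-associative middle ground, as a toy model: an assumed 32 KiB cache with 64-byte lines, split into `ways` direct-mapped banks searched together, with LRU replacement within each set (again an assumption). Three addresses that share a set still thrash a 2-way cache but fit in a 4-way one.

```python
LINE_SIZE = 64           # bytes per line (assumed)
CACHE_BYTES = 32 * 1024  # total capacity (assumed)

def simulate_set_associative(addresses, ways):
    """Toy set-associative cache with LRU replacement inside each set."""
    num_sets = CACHE_BYTES // (LINE_SIZE * ways)
    sets = [[] for _ in range(num_sets)]  # each set holds up to `ways` tags, LRU first
    hits = misses = 0
    for addr in addresses:
        line_addr = addr // LINE_SIZE
        index, tag = line_addr % num_sets, line_addr // num_sets
        s = sets[index]
        # Hardware compares only the `ways` tags of this one set, in parallel.
        if tag in s:
            hits += 1
            s.remove(tag)
        else:
            misses += 1
            if len(s) == ways:
                s.pop(0)  # evict the least recently used way
        s.append(tag)     # this tag is now the most recently used
    return hits, misses

a = 0x1000
conflicting = [a + i * CACHE_BYTES for i in range(3)]  # same set for every `ways` below
pattern = conflicting * 100
for ways in (1, 2, 4, 8):
    print(f"{ways}-way:", simulate_set_associative(pattern, ways))
```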
Another problem is that evictions become future cache misses. With only a large L1, a cache line is either in fast L1 or in DRAM. There's often a write buffer or victim cache between them to help hide the latency of dirty evictions, but a subsequent fetch still pays the DRAM access cost. As we scale the cache size up, keeping L1 access fast becomes challenging, and eventually it's more effective to limit the L1 and build an even larger, slower L2. We then get the advantage that L2 is still faster than DRAM, and we can prefetch into it.
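One standard way to reason about this is average memory access time (AMAT): each level's hit latency is paid by every access that reaches it, and misses fall through to the next level. The latencies and miss rates below are made-up illustrative numbers, not measurements; with other numbers the single large L1 can come out ahead.

```python
def amat(levels, memory_latency):
    """Average memory access time.

    levels: (hit_latency, local_miss_rate) pairs, nearest cache first.
    Each miss rate is local: the fraction of accesses reaching that level that miss.
    """
    total = memory_latency
    for hit_latency, miss_rate in reversed(levels):
        total = hit_latency + miss_rate * total
    return total

DRAM_CYCLES = 200  # hypothetical

# Hypothetical: one big, slower L1 backed directly by DRAM.
big_l1_only = amat([(6, 0.02)], DRAM_CYCLES)
# Hypothetical: small fast L1 plus a larger, slower L2.
small_l1_plus_l2 = amat([(2, 0.05), (12, 0.40)], DRAM_CYCLES)

print(f"big L1 only:   {big_l1_only:.1f} cycles")
print(f"small L1 + L2: {small_l1_plus_l2:.1f} cycles")
```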
This cache hierarchy tends to fit typical access patterns better as well: for many workloads, most accesses are clustered together. For streaming workloads like video processing, which won't fit in an L1 cache anyway, the algorithms are usually already aware of how row/column order and striding affect cache utilization.
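To make the striding point concrete: with an assumed 1024-column row-major matrix of 8-byte elements and 64-byte lines, walking 64 elements along a row touches 8 cache lines, while walking 64 elements down a column touches 64, using only one element of each line fetched.

```python
LINE_SIZE = 64  # bytes per cache line (assumed)
ELEM_SIZE = 8   # bytes per element, e.g. float64 (assumed)
COLS = 1024     # row-major matrix width (assumed)

def addr(row, col):
    """Byte offset of element (row, col) in a row-major matrix."""
    return (row * COLS + col) * ELEM_SIZE

def lines_touched(coords):
    """Number of distinct cache lines covered by a sequence of element accesses."""
    return len({addr(r, c) // LINE_SIZE for r, c in coords})

along_row = [(0, c) for c in range(64)]    # unit stride
down_column = [(r, 0) for r in range(64)]  # stride of COLS * ELEM_SIZE bytes

print("along a row:  ", lines_touched(along_row), "lines")
print("down a column:", lines_touched(down_column), "lines")
```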
There's probably more to consider, like SMP.
> How far away the memories are in space is way more critical than the speed of any one element in isolation.
Correct me if I'm wrong, but I don't think this is true. If you put instant memory a whole meter away, it would still only have a round trip of about 6.7 ns at the speed of light, which is approximately the latency of L2. Given how close L2 is, I don't think distance is a large factor in its latency.
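The round-trip arithmetic, assuming signals travel at c in vacuum. The `velocity_factor` parameter is my addition: real interconnect propagates slower than c, so it can only make the number worse.

```python
C_VACUUM = 299_792_458.0  # speed of light in vacuum, m/s

def round_trip_ns(distance_m, velocity_factor=1.0):
    """Out-and-back signal time in ns; velocity_factor < 1 models slower real interconnect."""
    return 2.0 * distance_m / (C_VACUUM * velocity_factor) * 1e9

print(f"1 m at c:    {round_trip_ns(1.0):.2f} ns")
print(f"1 m at 0.5c: {round_trip_ns(1.0, 0.5):.2f} ns")  # hypothetical velocity factor
print(f"1 cm at c:   {round_trip_ns(0.01) * 1000:.1f} ps")
```

`round_trip_ns(1.0)` comes to about 6.67 ns.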
The problem is that resistance and parasitic capacitance both scale with wire length, and the e^(-t/RC) settling time is going to limit your maximum frequency.
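A sketch of why wire length hurts more than the speed-of-light figure suggests: for a distributed RC wire, both R and C grow linearly with length L, so the Elmore delay (r·c·L²/2 for per-unit-length r and c) grows with L². The per-millimetre values below are hypothetical round numbers, not data for any real process.

```python
R_PER_M = 1e5    # ohms per metre (100 ohm/mm), hypothetical
C_PER_M = 2e-10  # farads per metre (0.2 pF/mm), hypothetical

def elmore_delay_s(length_m, r_per_m=R_PER_M, c_per_m=C_PER_M):
    """Elmore delay of an unrepeated distributed RC line: (r*L) * (c*L) / 2."""
    return 0.5 * (r_per_m * length_m) * (c_per_m * length_m)

for length_mm in (1, 10, 100):
    d = elmore_delay_s(length_mm * 1e-3)
    print(f"{length_mm:>4} mm: {d * 1e12:10.1f} ps")
```

With these assumed values, a 1 mm wire comes out around 10 ps and a 10 mm wire around 1 ns: 100x the delay for 10x the length. Real designs break long wires up with repeaters, which makes delay grow roughly linearly with length instead, at a cost in power and area.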
L1 cache has fast access because it is small. If you make it larger, you necessarily make it slower to access.
Recent discussion:
Researchers develop picosecond-level flash memory device (19.04.2025)
https://news.ycombinator.com/item?id=43735452
> The Dirac channel flash shows a program speed of 400 picoseconds, non-volatile storage and robust endurance over 5.5 × 10^6 cycles.
So, about 2 ms of use? With that lifetime, this seems far from practical as a replacement for SRAM.
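The 2 ms figure is the quoted endurance multiplied by the quoted program time, i.e. it assumes one cell is reprogrammed back to back with no pause. A quick check, plus a hypothetical gentler workload of one rewrite per second for comparison:

```python
ENDURANCE_CYCLES = 5.5e6  # program cycles quoted from the paper
PROGRAM_TIME_S = 400e-12  # 400 ps program speed quoted from the paper
SECONDS_PER_DAY = 86_400

# Worst case: the same cell reprogrammed back to back with no idle time.
back_to_back_s = ENDURANCE_CYCLES * PROGRAM_TIME_S
# Hypothetical gentler workload: the same cell rewritten once per second.
once_per_second_days = ENDURANCE_CYCLES * 1.0 / SECONDS_PER_DAY

print(f"back to back:    {back_to_back_s * 1e3:.1f} ms")
print(f"once per second: {once_per_second_days:.1f} days")
```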
The cycles are programming cycles, not clock cycles. Their stability measurement was 60,000 seconds.
Would it be lower power than DRAM?
Yes!
If it's low power, you can stack it higher without worrying so much about it melting. This might make a great high-capacity cache.
My guess is they are thinking of it more as a replacement for DRAM or disk. I didn't read far enough to learn whether it failed after all those cycles or just "we stopped testing it". Either way, it sounds promising.
Can it be used as FPGA config RAM? Also, maybe for faster SSDs?
I think they mean write-read cycles.