From d2bf2bbc72f683fff721659f5a57202286bf24d6 Mon Sep 17 00:00:00 2001 From: Christopher Haster Date: Thu, 18 May 2017 23:44:18 -0500 Subject: Added documentation over the underlying design --- DESIGN.md | 973 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 973 insertions(+) create mode 100644 DESIGN.md (limited to 'DESIGN.md') diff --git a/DESIGN.md b/DESIGN.md new file mode 100644 index 0000000..0569136 --- /dev/null +++ b/DESIGN.md @@ -0,0 +1,973 @@ +## The design of the little filesystem + +The littlefs is a little fail-safe filesystem designed for embedded systems. + +``` + | | | .---._____ + .-----. | | +--|o |---| littlefs | +--| |---| | + '-----' '----------' + | | | +``` + +For a bit of backstory, the littlefs was developed with the goal of learning +more about filesystem design by tackling the relative unsolved problem of +managing a robust filesystem resilient to power loss on devices +with limited RAM and ROM. + +The embedded systems the littlefs is targeting are usually 32bit +microcontrollers with around 32Kbytes of RAM and 512Kbytes of ROM. These are +often paired with SPI NOR flash chips with about 4Mbytes of flash storage. + +Flash itself is a very interesting piece of technology with quite a bit of +nuance. Unlike most other forms of storage, writing to flash requires two +operations: erasing and programming. The programming operation is relatively +cheap, and can be very granular. For NOR flash specifically, byte-level +programs are quite common. Erasing, however, requires an expensive operation +that forces the state of large blocks of memory to reset in a destructive +reaction that gives flash its name. The [Wikipedia entry](https://en.wikipedia.org/wiki/Flash_memory) +has more information if you are interesting in how this works. + +This leaves us with an interesting set of limitations that can be simplified +to three strong requirements: + +1. **Fail-safe** - This is actually the main goal of the littlefs and the focus + of this project. Embedded systems are usually designed without a shutdown + routine and a notable lack of user interface for recovery, so filesystems + targeting embedded systems should be prepared to lose power an any given + time. + + Despite this state of things, there are very few embedded filesystems that + handle power loss in a reasonable manner, and can become corrupted if the + user is unlucky enough. + +2. **Wear awareness** - Due to the destructive nature of flash, most flash + chips have a limited number of erase cycles, usually in the order of around + 100,000 erases per block for NOR flash. Filesystems that don't take wear + into account can easily burn through blocks used to store frequently updated + metadata. + + Consider the [FAT filesystem](https://en.wikipedia.org/wiki/Design_of_the_FAT_file_system), + which stores a file allocation table (FAT) at a specific offset from the + beginning of disk. Every block allocation will update this table, and after + 100,000 updates, the block will likely go bad, rendering the filesystem + unusable even if there are many more erase cycles available on the storage. + +3. **Bounded RAM/ROM** - Even with the design difficulties presented by the + previous two limitations, we have already seen several flash filesystems + developed on PCs that handle power loss just fine, such as the + logging filesystems. However, these filesystems take advantage of the + relatively cheap access to RAM, and use some rather... opportunistic... + techniques, such as reconstructing the entire directory structure in RAM. + These operations make perfect sense when the filesystem's only concern is + erase cycles, but the idea is a bit silly on embedded systems. + + To cater to embedded systems, the littlefs has the simple limitation of + using only a bounded amount of RAM and ROM. That is, no matter what is + written to the filesystem, and no matter how large the underlying storage + is, the littlefs will always use the same amount of RAM and ROM. This + presents a very unique challenge, and makes presumably simple operations, + such as iterating through the directory tree, surprisingly difficult. + +## Existing designs? + +There are of course, many different existing filesystem. Heres a very rough +summary of the general ideas behind some of them. + +Most of the existing filesystems fall into the one big category of filesystem +designed in the early days of spinny magnet disks. While there is a vast amount +of interesting technology and ideas in this area, the nature of spinny magnet +disks encourage properties such as grouping writes near each other, that don't +make as much sense on recent storage types. For instance, on flash, write +locality is not as important and can actually increase wear destructively. + +One of the most popular designs for flash filesystems is called the +[logging filesystem](https://en.wikipedia.org/wiki/Log-structured_file_system). +The flash filesystems [jffs](https://en.wikipedia.org/wiki/JFFS) +and [yaffs](https://en.wikipedia.org/wiki/YAFFS) are good examples. In +logging filesystem, data is not store in a data structure on disk, but instead +the changes to the files are stored on disk. This has several neat advantages, +such as the fact that the data is written in a cyclic log format naturally +wear levels as a side effect. And, with a bit of error detection, the entire +filesystem can easily be designed to be resilient to power loss. The +journalling component of most modern day filesystems is actually a reduced +form of a logging filesystem. However, logging filesystems have a difficulty +scaling as the size of storage increases. And most filesystems compensate by +caching large parts of the filesystem in RAM, a strategy that is unavailable +for embedded systems. + +Another interesting filesystem design technique that the littlefs borrows the +most from, is the [copy-on-write (COW)](https://en.wikipedia.org/wiki/Copy-on-write). +A good example of this is the [btrfs](https://en.wikipedia.org/wiki/Btrfs) +filesystem. COW filesystems can easily recover from corrupted blocks and have +natural protection against power loss. However, if they are not designed with +wear in mind, a COW filesystem could unintentionally wear down the root block +where the COW data structures are synchronized. + +## Metadata pairs + +The core piece of technology that provides the backbone for the littlefs is +the concept of metadata pairs. The key idea here, is that any metadata that +needs to be updated atomically is stored on a pair of blocks tagged with +a revision count and checksum. Every update alternates between these two +pairs, so that at any time there is always a backup containing the previous +state of the metadata. + +Consider a small example where each metadata pair has a revision count, +a number as data, and the xor of the block as a quick checksum. If +we update the data to a value of 9, and then to a value of 5, here is +what the pair of blocks may look like after each update: +``` + block 1 block 2 block 1 block 2 block 1 block 2 +.---------.---------. .---------.---------. .---------.---------. +| rev: 1 | rev: 0 | | rev: 1 | rev: 2 | | rev: 3 | rev: 2 | +| data: 3 | data: 0 | -> | data: 3 | data: 9 | -> | data: 5 | data: 9 | +| xor: 2 | xor: 0 | | xor: 2 | xor: 11 | | xor: 6 | xor: 11 | +'---------'---------' '---------'---------' '---------'---------' + let data = 9 let data = 5 +``` + +After each update, we can find the most up to date value of data by looking +at the revision count. + +Now consider what the blocks may look like if we suddenly loss power while +changing the value of data to 5: +``` + block 1 block 2 block 1 block 2 block 1 block 2 +.---------.---------. .---------.---------. .---------.---------. +| rev: 1 | rev: 0 | | rev: 1 | rev: 2 | | rev: 3 | rev: 2 | +| data: 3 | data: 0 | -> | data: 3 | data: 9 | -x | data: 3 | data: 9 | +| xor: 2 | xor: 0 | | xor: 2 | xor: 11 | | xor: 2 | xor: 11 | +'---------'---------' '---------'---------' '---------'---------' + let data = 9 let data = 5 + powerloss!!! +``` + +In this case, block 1 was partially written with a new revision count, but +the littlefs hadn't made it to updating the value of data. However, if we +check our checksum we notice that block 1 was corrupted. So we fall back to +block 2 and use the value 9. + +Using this concept, the littlefs is able to update metadata blocks atomically. +There are a few other tweaks, such as using a 32bit crc and using sequence +arithmetic to handle revision count overflow, but the basic concept +is the same. These metadata pairs define the backbone of the littlefs, and the +rest of the filesystem is built on top of these atomic updates. + +## Files + +Now, the metadata pairs do come with some drawbacks. Most notably, each pair +requires two blocks for each block of data. I'm sure users would be very +unhappy if their storage was suddenly cut in half! Instead of storing +everything in these metadata blocks, the littlefs uses a COW data structure +for files which is in turn pointed to by a metadata block. When +we update a file, we create a copies of any blocks that are modified until +the metadata blocks are updated with the new copy. Once the metadata block +points to the new copy, we deallocate the old blocks that are no longer in use. + +Here is what updating a one-block file may look like: +``` + block 1 block 2 block 1 block 2 block 1 block 2 +.---------.---------. .---------.---------. .---------.---------. +| rev: 1 | rev: 0 | | rev: 1 | rev: 0 | | rev: 1 | rev: 2 | +| file: 4 | file: 0 | -> | file: 4 | file: 0 | -> | file: 4 | file: 5 | +| xor: 5 | xor: 0 | | xor: 5 | xor: 0 | | xor: 5 | xor: 7 | +'---------'---------' '---------'---------' '---------'---------' + | | | + v v v + block 4 block 4 block 5 block 4 block 5 +.--------. .--------. .--------. .--------. .--------. +| old | | old | | new | | old | | new | +| data | | data | | data | | data | | data | +| | | | | | | | | | +'--------' '--------' '--------' '--------' '--------' + update data in file update metadata pair +``` + +It doesn't matter if we lose power while writing block 5 with the new data, +since the old data remains unmodified in block 4. This example also +highlights how the atomic updates of the metadata blockss provide a +synchronization barrier for the rest of the littlefs. + +At this point, it may look like we are wasting an awfully large amount +of space on the metadata. Just looking at that example, we are using +three blocks to represent a file that fits comfortably in one! So instead +of giving each file a metadata pair, we actually store the metadata for +all files contained in a single directory in a single metadata block. + +Now we could just leave files here, copying the entire file on write +provides the synchronization without the duplicated memory requirements +of the metadata blocks. However, we can do a bit better. + +## CTZ linked-lists + +There are many different data structures for representing the actual +files in filesystems. Of these, the littlefs uses a rather unique [COW](https://upload.wikimedia.org/wikipedia/commons/0/0c/Cow_female_black_white.jpg) +data structure that allows the filesystem to reuse unmodified parts of the +file without additional metadata pairs. + +First lets consider storing files in a simple linked-list. What happens when +append a block? We have to change the last block in the linked-list to point +to this new block, which means we have to copy out the last block, and change +the second-to-last block, and then the third-to-last, and so on until we've +copied out the entire file. + +``` +Exhibit A: A linked-list +.--------. .--------. .--------. .--------. .--------. .--------. +| data 0 |->| data 1 |->| data 2 |->| data 4 |->| data 5 |->| data 6 | +| | | | | | | | | | | | +| | | | | | | | | | | | +'--------' '--------' '--------' '--------' '--------' '--------' +``` + +To get around this, the littlefs, at its heart, stores files backwards. Each +block points to its predecessor, with the first block containing no pointers. +If you think about this, it makes a bit of sense. Appending blocks just point +to their predecessor and no other blocks need to be updated. If we update +a block in the middle, we will need to copy out the blocks that follow, +but can reuse the blocks before the modified block. Since most file operations +either reset the file each write or append to files, this design avoids +copying the file in the most common cases. + +``` +Exhibit B: A backwards linked-list +.--------. .--------. .--------. .--------. .--------. .--------. +| data 0 |<-| data 1 |<-| data 2 |<-| data 4 |<-| data 5 |<-| data 6 | +| | | | | | | | | | | | +| | | | | | | | | | | | +'--------' '--------' '--------' '--------' '--------' '--------' +``` + +However, a backwards linked-list does come with a rather glaring problem. +Iterating over a file _in order_ has a runtime of O(n^2). Gah! A quadratic +runtime to just _read_ a file? That's awful. Keep in mind reading files are +usually the most common filesystem operation. + +To avoid this problem, the littlefs uses a multilayered linked-list. For +every block that is divisible by a power of two, the block contains an +additional pointer that points back by that power of two. Another way of +thinking about this design is that there are actually many linked-lists +threaded together, with each linked-lists skipping an increasing number +of blocks. If you're familiar with data-structures, you may have also +recognized that this is a deterministic skip-list. + +To find the power of two factors efficiently, we can use the instruction +[count trailing zeros (CTZ)](https://en.wikipedia.org/wiki/Count_trailing_zeros), +which is where this linked-list's name comes from. + +``` +Exhibit C: A backwards CTZ linked-list +.--------. .--------. .--------. .--------. .--------. .--------. +| data 0 |<-| data 1 |<-| data 2 |<-| data 3 |<-| data 4 |<-| data 5 | +| |<-| |--| |<-| |--| | | | +| |<-| |--| |--| |--| | | | +'--------' '--------' '--------' '--------' '--------' '--------' +``` + +Taking exhibit C for example, here is the path from data block 5 to data +block 1. You can see how data block 3 was completely skipped: +``` +.--------. .--------. .--------. .--------. .--------. .--------. +| data 0 | | data 1 |<-| data 2 | | data 3 | | data 4 |<-| data 5 | +| | | | | |<-| |--| | | | +| | | | | | | | | | | | +'--------' '--------' '--------' '--------' '--------' '--------' +``` + +The path to data block 0 is even more quick, requiring only two jumps: +``` +.--------. .--------. .--------. .--------. .--------. .--------. +| data 0 | | data 1 | | data 2 | | data 3 | | data 4 |<-| data 5 | +| | | | | | | | | | | | +| |<-| |--| |--| |--| | | | +'--------' '--------' '--------' '--------' '--------' '--------' +``` + +The CTZ linked-list has quite a few interesting properties. All of the pointers +in the block can be found by just knowing the index in the list of the current +block, and, with a bit of math, the amortized overhead for the linked-list is +only two pointers per block. Most importantly, the CTZ linked-list has a +worst case lookup runtime of O(logn), which brings the runtime of reading a +file down to O(n logn). Given that the constant runtime is divided by the +amount of data we can store in a block, this is pretty reasonable. + +Here is what it might look like to update a file stored with a CTZ linked-list: +``` + block 1 block 2 + .---------.---------. + | rev: 1 | rev: 0 | + | file: 6 | file: 0 | + | size: 4 | xor: 0 | + | xor: 3 | xor: 0 | + '---------'---------' + | + v + block 3 block 4 block 5 block 6 +.--------. .--------. .--------. .--------. +| data 0 |<-| data 1 |<-| data 2 |<-| data 3 | +| |<-| |--| | | | +| | | | | | | | +'--------' '--------' '--------' '--------' + +| update data in file +v + + block 1 block 2 + .---------.---------. + | rev: 1 | rev: 0 | + | file: 6 | file: 0 | + | size: 4 | size: 0 | + | xor: 3 | xor: 0 | + '---------'---------' + | + v + block 3 block 4 block 5 block 6 +.--------. .--------. .--------. .--------. +| data 0 |<-| data 1 |<-| old |<-| old | +| |<-| |--| data 2 | | data 3 | +| | | | | | | | +'--------' '--------' '--------' '--------' + ^ ^ ^ + | | | block 7 block 8 block 9 block 10 + | | | .--------. .--------. .--------. .--------. + | | '----| new |<-| new |<-| new |<-| new | + | '----------------| data 2 |<-| data 3 |--| data 4 | | data 5 | + '------------------| |--| |--| | | | + '--------' '--------' '--------' '--------' + +| update metadata pair +v + + block 1 block 2 + .---------.---------. + | rev: 1 | rev: 2 | + | file: 6 | file: 10| + | size: 4 | size: 6 | + | xor: 3 | xor: 14 | + '---------'---------' + | + | + block 3 block 4 block 5 block 6 | +.--------. .--------. .--------. .--------. | +| data 0 |<-| data 1 |<-| old |<-| old | | +| |<-| |--| data 2 | | data 3 | | +| | | | | | | | | +'--------' '--------' '--------' '--------' | + ^ ^ ^ v + | | | block 7 block 8 block 9 block 10 + | | | .--------. .--------. .--------. .--------. + | | '----| new |<-| new |<-| new |<-| new | + | '----------------| data 2 |<-| data 3 |--| data 4 | | data 5 | + '------------------| |--| |--| | | | + '--------' '--------' '--------' '--------' +``` + +## Block allocation + +So those two ideas provide the grounds for the filesystem. The metadata pairs +give us directories, and the CTZ linked-lists give us files. But this leaves +one big [elephant](https://upload.wikimedia.org/wikipedia/commons/3/37/African_Bush_Elephant.jpg) +of a question. How do we get those blocks in the first place? + +One common strategy is to store unallocated blocks in a big free list, and +initially the littlefs was designed with this in mind. By storing a reference +to the free list in every single metadata pair, additions to the free list +could be updated atomically at the same time the replacement blocks were +stored in the metadata pair. During boot, every metadata pair had to be +scanned to find the most recent free list, but once the list was found the +state of all free blocks becomes known. + +However, this approach had several issues: +- There was a lot of nuanced logic for adding blocks to the free list without + modifying the blocks, since the blocks remain active until the metadata is + updated. +- The free list had to support both additions and removals in fifo order while + minimizing block erases. +- The free list had to handle the case where the file system completely ran + out of blocks and may no longer be able to add blocks to the free list. +- If we used a revision count to track the most recently updated free list, + metadata blocks that were left unmodified were ticking time bombs that would + cause the system to go haywire if the revision count overflowed +- Every single metadata block wasted space to store these free list references. + +Actually, to simplify, this approach had one massive glaring issue: complexity. + +> Complexity leads to fallibility. +> Fallibility leads to unmaintainability. +> Unmaintainability leads to suffering. + +Or at least, complexity leads to increased code size, which is a problem +for embedded systems. + +In the end, the littlefs adopted more of a "drop it on the floor" strategy. +That is, the littlefs doesn't actually store information about which blocks +are free on the storage. The littlefs already stores which files _are_ in +use, so to find a free block, the littlefs just takes all of the blocks that +exist and subtract the blocks that are in use. + +Of course, it's not quite that simple. Most filesystems that adopt this "drop +it on the floor" strategy either rely on some properties inherent to the +filesystem, such as the cyclic-buffer structure of logging filesystems, +or use a bitmap or table stored in RAM to track free blocks, which scales +with the size of storage and is problematic when you have limited RAM. You +could iterate through every single block in storage and check it against +every single block in the filesystem on every single allocation, but that +would have an abhorrent runtime. + +So the littlefs compromises. It doesn't store a bitmap the size of the storage, +but it does store a little bit-vector that contains a fixed set lookahead +for block allocations. During a block allocation, the lookahead vector is +checked for any free blocks, if there are none, the lookahead region jumps +forward and the entire filesystem is scanned for free blocks. + +Here's what it might look like to allocate 4 blocks on a decently busy +filesystem with a 32bit lookahead and a total of +128 blocks (512Kbytes of storage if blocks are 4Kbyte): +``` +boot... lookahead: + fs blocks: fffff9fffffffffeffffffffffff0000 +scanning... lookahead: fffff9ff + fs blocks: fffff9fffffffffeffffffffffff0000 +alloc = 21 lookahead: fffffdff + fs blocks: fffffdfffffffffeffffffffffff0000 +alloc = 22 lookahead: ffffffff + fs blocks: fffffffffffffffeffffffffffff0000 +scanning... lookahead: fffffffe + fs blocks: fffffffffffffffeffffffffffff0000 +alloc = 63 lookahead: ffffffff + fs blocks: ffffffffffffffffffffffffffff0000 +scanning... lookahead: ffffffff + fs blocks: ffffffffffffffffffffffffffff0000 +scanning... lookahead: ffffffff + fs blocks: ffffffffffffffffffffffffffff0000 +scanning... lookahead: ffff0000 + fs blocks: ffffffffffffffffffffffffffff0000 +alloc = 112 lookahead: ffff8000 + fs blocks: ffffffffffffffffffffffffffff8000 +``` + +While this lookahead approach still has an asymptotic runtime of O(n^2) to +scan all of storage, the lookahead reduces the practical runtime to a +reasonable amount. Bit-vectors are surprisingly compact, given only 16 bytes, +the lookahead could track 128 blocks. For a 4Mbyte flash chip with 4Kbyte +blocks, the littlefs would only need 8 passes to scan the entire storage. + +The real benefit of this approach is just how much it simplified the design +of the littlefs. Deallocating blocks is as simple as simply forgetting they +exist, and there is absolutely no concern of bugs in the deallocation code +causing difficult to detect memory leaks. + +## Directories + +Now we just need directories to store our files. Since we already have +metadata blocks that store information about files, lets just use these +metadata blocks as the directories. Maybe turn the directories into linked +lists of metadata blocks so it isn't limited by the number of files that fit +in a single block. Add entries that represent other nested directories. +Drop "." and ".." entries, cause who needs them. Dust off our hands and +we now have a directory tree. + +``` + .--------. + |root dir| + | pair 0 | + | | + '--------' + .-' '-------------------------. + v v + .--------. .--------. .--------. + | dir A |------->| dir A | | dir B | + | pair 0 | | pair 1 | | pair 0 | + | | | | | | + '--------' '--------' '--------' + .-' '-. | .-' '-. + v v v v v +.--------. .--------. .--------. .--------. .--------. +| file C | | file D | | file E | | file F | | file G | +| | | | | | | | | | +| | | | | | | | | | +'--------' '--------' '--------' '--------' '--------' +``` + +Unfortunately it turns out it's not that simple. See, iterating over a +directory tree isn't actually all that easy, especially when you're trying +to fit in a bounded amount of RAM, which rules out any recursive solution. +And since our block allocator involves iterating over the entire filesystem +tree, possibly multiple times in a single allocation, iteration needs to be +efficient. + +So, as a solution, the littlefs adopted a sort of threaded tree. Each +directory not only contains pointers to all of its children, but also a +pointer to the next directory. These pointers create a linked-list that +is threaded through all of the directories in the filesystem. Since we +only use this linked list to check for existance, the order doesn't actually +matter. As an added plus, we can repurpose the pointer for the individual +directory linked-lists and avoid using any additional space. + +``` + .--------. + |root dir|-. + | pair 0 | | + .--------| |-' + | '--------' + | .-' '-------------------------. + | v v + | .--------. .--------. .--------. + '->| dir A |------->| dir B |------->| dir B | + | pair 0 | | pair 1 | | pair 0 | + | | | | | | + '--------' '--------' '--------' + .-' '-. | .-' '-. + v v v v v +.--------. .--------. .--------. .--------. .--------. +| file C | | file D | | file E | | file F | | file G | +| | | | | | | | | | +| | | | | | | | | | +'--------' '--------' '--------' '--------' '--------' +``` + +This threaded tree approach does come with a few tradeoffs. Now, anytime we +want to manipulate the directory tree, we find ourselves having to update two +pointers instead of one. For anyone familiar with creating atomic data +structures this should set off a whole bunch of red flags. + +But unlike the data structure guys, we can update a whole block atomically! So +as long as we're really careful (and cheat a little bit), we can still +manipulate the directory tree in a way that is resilient to power loss. + +Consider how we might add a new directory. Since both pointers that reference +it can come from the same directory, we only need a single atomic update to +finagle the directory into the filesystem: +``` + .--------. + |root dir|-. + | pair 0 | | +.--| |-' +| '--------' +| | +| v +| .--------. +'->| dir A | + | pair 0 | + | | + '--------' + +| create the new directory block +v + + .--------. + |root dir|-. + | pair 0 | | + .--| |-' + | '--------' + | | + | v + | .--------. +.--------. '->| dir A | +| dir B |---->| pair 0 | +| pair 0 | | | +| | '--------' +'--------' + +| update root to point to directory B +v + + .--------. + |root dir|-. + | pair 0 | | +.--------| |-' +| '--------' +| .-' '-. +| v v +| .--------. .--------. +'->| dir B |->| dir A | + | pair 0 | | pair 0 | + | | | | + '--------' '--------' +``` + +Note that even though directory B was added after directory A, we insert +directory B before directory A in the linked-list because it is convenient. + +Now how about removal: +``` + .--------. .--------. + |root dir|------->|root dir|-. + | pair 0 | | pair 1 | | +.--------| |--------| |-' +| '--------' '--------' +| .-' '-. | +| v v v +| .--------. .--------. .--------. +'->| dir A |->| dir B |->| dir C | + | pair 0 | | pair 0 | | pair 0 | + | | | | | | + '--------' '--------' '--------' + +| update root to no longer contain directory B +v + + .--------. .--------. + |root dir|------------->|root dir|-. + | pair 0 | | pair 1 | | +.--| |--------------| |-' +| '--------' '--------' +| | | +| v v +| .--------. .--------. .--------. +'->| dir A |->| dir B |->| dir C | + | pair 0 | | pair 0 | | pair 0 | + | | | | | | + '--------' '--------' '--------' + +| remove directory B from the linked-list +v + + .--------. .--------. + |root dir|->|root dir|-. + | pair 0 | | pair 1 | | +.--| |--| |-' +| '--------' '--------' +| | | +| v v +| .--------. .--------. +'->| dir A |->| dir C | + | pair 0 | | pair 0 | + | | | | + '--------' '--------' +``` + +Wait, wait, wait, wait, wait, that's not atomic at all! If power is lost after +removing directory B from the root, directory B is still in the linked-list. +We've just created a memory leak! + +And to be honest, I don't have a clever solution for this case. As a +side-effect of using multiple pointers in the threaded tree, the littlefs +can end up with orphan blocks that have no parents and should have been +removed. + +To keep these orphan blocks from becoming a problem, the littlefs has a +deorphan step that simply iterates through every directory in the linked-list +and checks it against every directory entry in the filesystem to see if it +has a parent. The deorphan step occurs on the first block allocation after +boot, so orphans should never cause the littlefs to run out of storage +prematurely. + +And for my final trick, moving a directory: +``` + .--------. + |root dir|-. + | pair 0 | | +.--------| |-' +| '--------' +| .-' '-. +| v v +| .--------. .--------. +'->| dir A |->| dir B | + | pair 0 | | pair 0 | + | | | | + '--------' '--------' + +| update directory B to point to directory A +v + + .--------. + |root dir|-. + | pair 0 | | +.--------| |-' +| '--------' +| .-----' '-. +| | v +| | .--------. +| | .->| dir B | +| | | | pair 0 | +| | | | | +| | | '--------' +| | .-------' +| v v | +| .--------. | +'->| dir A |-' + | pair 0 | + | | + '--------' + +| update root to no longer contain directory A +v + .--------. + |root dir|-. + | pair 0 | | +.----| |-' +| '--------' +| | +| v +| .--------. +| .->| dir B | +| | | pair 0 | +| '--| |-. +| '--------' | +| | | +| v | +| .--------. | +'--->| dir A |-' + | pair 0 | + | | + '--------' +``` + +Note that once again we don't care about the ordering of directories in the +linked-list, so we can simply leave directories in their old positions. This +does make the diagrams a bit hard to draw, but the littlefs doesn't really +care. + +It's also worth noting that once again we have an operation that isn't actually +atomic. After we add directory A to directory B, we could lose power, leaving +directory A as a part of both the root directory and directory B. However, +there isn't anything inherent to the littlefs that prevents a directory from +having multiple parents, so in this case, we just allow that to happen. Extra +care is taken to only remove a directory from the linked-list if there are +no parents left in the filesystem. + +## Wear awareness + +So now that we have all of the pieces of a filesystem, we can look at a more +subtle attribute of embedded storage: The wear down of flash blocks. + +The first concern for the littlefs, is that prefectly valid blocks can suddenly +become unusable. As a nice side-effect of using a COW data-structure for files, +we can simply move on to a different block when a file write fails. All +modifications to files are performed in copies, so we will only replace the +old file when we are sure none of the new file has errors. Directories, on +the other hand, need a different strategy. + +The solution to directory corruption in the littlefs relies on the redundant +nature of the metadata pairs. If an error is detected during a write to one +of the metadata pairs, we seek out a new block to take its place. Once we find +a block without errors, we iterate through the directory tree, updating any +references to the corrupted metadata pair to point to the new metadata block. +Just like when we remove directories, we can lose power during this operation +and end up with a desynchronized metadata pair in our filesystem. And just like +when we remove directories, we leave the possibility of a desynchronized +metadata pair up to the deorphan step to clean up. + +Here's what encountering a directory error may look like with all of +the directories and directory pointers fully expanded: +``` + root dir + block 1 block 2 + .---------.---------. + | rev: 1 | rev: 0 |--. + | | |-.| +.------| | |-|' +|.-----| | |-' +|| '---------'---------' +|| |||||'--------------------------------------------------. +|| ||||'-----------------------------------------. | +|| |||'-----------------------------. | | +|| ||'--------------------. | | | +|| |'-------. | | | | +|| v v v v v v +|| dir A dir B dir C +|| block 3 block 4 block 5 block 6 block 7 block 8 +|| .---------.---------. .---------.---------. .---------.---------. +|'->| rev: 1 | rev: 0 |->| rev: 1 | rev: 0 |->| rev: 1 | rev: 0 | +'-->| | |->| | |->| | | + | | | | | | | + | | | | | | | | | + '---------'---------' '---------'---------' '---------'---------' + +| update directory B +v + + root dir + block 1 block 2 + .---------.---------. + | rev: 1 | rev: 0 |--. + | | |-.| +.------| | |-|' +|.-----| | |-' +|| '---------'---------' +|| |||||'--------------------------------------------------. +|| ||||'-----------------------------------------. | +|| |||'-----------------------------. | | +|| ||'--------------------. | | | +|| |'-------. | | | | +|| v v v v v v +|| dir A dir B dir C +|| block 3 block 4 block 5 block 6 block 7 block 8 +|| .---------.---------. .---------.---------. .---------.---------. +|'->| rev: 1 | rev: 0 |->| rev: 1 | rev: 2 |->| rev: 1 | rev: 0 | +'-->| | |->| | corrupt!|->| | | + | | | | | corrupt!| | | | + | | | | | corrupt!| | | | + '---------'---------' '---------'---------' '---------'---------' + +| oh no! corruption detected +v allocate a replacement block + + root dir + block 1 block 2 + .---------.---------. + | rev: 1 | rev: 0 |--. + | | |-.| +.------| | |-|' +|.-----| | |-' +|| '---------'---------' +|| |||||'----------------------------------------------------. +|| ||||'-------------------------------------------. | +|| |||'-----------------------------. | | +|| ||'--------------------. | | | +|| |'-------. | | | | +|| v v v v v v +|| dir A dir B dir C +|| block 3 block 4 block 5 block 6 block 7 block 8 +|| .---------.---------. .---------.---------. .---------.---------. +|'->| rev: 1 | rev: 0 |->| rev: 1 | rev: 2 |--->| rev: 1 | rev: 0 | +'-->| | |->| | corrupt!|--->| | | + | | | | | corrupt!| .->| | | + | | | | | corrupt!| | | | | + '---------'---------' '---------'---------' | '---------'---------' + block 9 | + .---------. | + | rev: 2 |-' + | | + | | + | | + '---------' + +| update root directory to contain block 9 +v + + root dir + block 1 block 2 + .---------.---------. + | rev: 1 | rev: 2 |--. + | | |-.| +.-----| | |-|' +|.----| | |-' +|| '---------'---------' +|| .--------'||||'----------------------------------------------. +|| | |||'-------------------------------------. | +|| | ||'-----------------------. | | +|| | |'------------. | | | +|| | | | | | | +|| v v v v v v +|| dir A dir B dir C +|| block 3 block 4 block 5 block 9 block 7 block 8 +|| .---------.---------. .---------. .---------. .---------.---------. +|'->| rev: 1 | rev: 0 |-->| rev: 1 |-| rev: 2 |--->| rev: 1 | rev: 0 | +'-->| | |-. | | | |--->| | | + | | | | | | | | .->| | | + | | | | | | | | | | | | + '---------'---------' | '---------' '---------' | '---------'---------' + | block 6 | + | .---------. | + '------------>| rev: 2 |-' + | corrupt!| + | corrupt!| + | corrupt!| + '---------' + +| remove corrupted block from linked-list +v + + root dir + block 1 block 2 + .---------.---------. + | rev: 1 | rev: 2 |--. + | | |-.| +.-----| | |-|' +|.----| | |-' +|| '---------'---------' +|| .--------'||||'-----------------------------------------. +|| | |||'--------------------------------. | +|| | ||'--------------------. | | +|| | |'-----------. | | | +|| | | | | | | +|| v v v v v v +|| dir A dir B dir C +|| block 3 block 4 block 5 block 9 block 7 block 8 +|| .---------.---------. .---------.---------. .---------.---------. +|'->| rev: 1 | rev: 2 |->| rev: 1 | rev: 2 |->| rev: 1 | rev: 0 | +'-->| | |->| | |->| | | + | | | | | | | | | + | | | | | | | | | + '---------'---------' '---------'---------' '---------'---------' +``` + +Also one question I've been getting is, what about the root directory? +It can't move so wouldn't the filesystem die as soon as the root blocks +develop errors? And you would be correct. So instead of storing the root +in the first few blocks of the storage, the root is actually pointed to +by the superblock. The superblock contains a few bits of static data, but +outside of when the filesystem is formatted, it is only updated when the root +develops errors and needs to be moved. + +## Wear leveling + +The second concern for the littlefs, is that blocks in the filesystem may wear +unevenly. In this situation, a filesystem may meet an early demise where +there are no more non-corrupted blocks that aren't in use. It may be entirely +possible that files were written once and left unmodified, wasting the +potential erase cycles of the blocks it sits on. + +Wear leveling is a term that describes distributing block writes evenly to +avoid the early termination of a flash part. There are typically two levels +of wear leveling: +1. Dynamic wear leveling - Blocks are distributed evenly during blocks writes. + Note that the issue with write-once files still exists in this case. +2. Static wear leveling - Unmodified blocks are evicted for new block writes. + This provides the longest lifetime for a flash device. + +Now, it's possible to use the revision count on metadata pairs to approximate +the wear of a metadata block. And combined with the COW nature of files, the +littlefs could provide a form of dynamic wear leveling. + +However, the littlefs does not. This is for a few reasons. Most notably, even +if the littlefs did implement dynamic wear leveling, this would still not +handle the case of write-once files, and near the end of the lifetime of a +flash device, you would likely end up with uneven wear on the blocks anyways. + +As a flash device reaches the end of its life, the metadata blocks will +naturally be the first to go since they are updated most often. In this +situation, the littlefs is designed to simply move on to another set of +metadata blocks. This travelling means that at the end of a flash device's +life, the filesystem will have worn the device down as evenly as a dynamic +wear leveling filesystem could anyways. Simply put, if the lifetime of flash +is a serious concern, static wear leveling is the only valid solution. + +This is a very important takeaway to note. If your storage stack uses highly +sensitive storage such as NAND flash. In most cases you are going to be better +off just using a [flash translation layer (FTL)](https://en.wikipedia.org/wiki/Flash_translation_layer). +NAND flash already has many limitations that make it poorly suited for an +embedded system: low erase cycles, very large blocks, errors that can develop +even during reads, errors that can develop during writes of neighboring blocks. +Managing sensitive storage such as NAND flash is out of scope for the littlefs. +The littlefs does have some properties that may be beneficial on top of a FTL, +such as limiting the number of writes where possible. But if you have the +storage requirements that necessitate the need of NAND flash, you should have +the RAM to match and just use an FTL or flash filesystem. + +## Summary + +So, to summarize: + +1. The littlefs is composed of directory blocks +2. Each directory is a linked-list of metadata pairs +3. These metadata pairs can be updated atomically by alternative which + metadata block is active +4. Directory blocks contain either references to other directories or files +5. Files are represented by copy-on-write CTZ linked-lists +6. The CTZ linked-lists support appending in O(1) and reading in O(n logn) +7. Blocks are allocated by scanning the filesystem for used blocks in a + fixed-size lookahead region is that stored in a bit-vector +8. To facilitate scanning the filesystem, all directories are part of a + linked-list that is threaded through the entire filesystem +9. If a block develops an error, the littlefs allocates a new block, and + moves the data and references of the old block to the new. +10. Any case where an atomic operation is not possible, it is taken care of + by a deorphan step that occurs on the first allocation after boot + +Welp, that's the little filesystem. Thanks for reading! + -- cgit v1.2.3