This was an idea I had back in the days of Napster.
At the turn of the century, it was common to listen to an "acquired" music file only to find it was missing a few seconds at the end due to a prematurely stopped download. Some video formats would refuse to play at all if the moov atom at the end of the file was missing.
I wondered if it would be possible to make a file format which was close to impossible to read unless the entire file was intact. I don't mean including a checksum to detect download errors - I mean a layout which was intrinsically fragile to corruption.
While digging through an old backup CD, I found my original notes. I'm rather impressed at what neophyte-me had constructed. My outline was:
- The file ends with a 32 bit pointer. This points to the location of the first information block.
- The information block describes the length of the data block which follows it.
- At the end of the data block is another 32 bit pointer. This points to the location of the next information block.
- The start of the file may be a pointer, or it may be padded with random data.
- There may be random data padded between the data blocks.
This ensures that a file which has been only partially downloaded - whether truncated at the end or missing pieces elsewhere - cannot be successfully read.
Here's a worked example. Start at the end and follow the thread.
- Location 0: Random data.
- Location 1: Data block size is 2.
- Location 2: Data.
- Location 3: Data.
- Location 4: EOF.
- Location 5: Data block size is 1.
- Location 6: Data.
- Location 7: Go to location 1.
- Location 8: Random data.
- Location 9: Go to location 5.
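The walk through the example above can be simulated in a few lines. This is a toy model - each list index stands in for a location, with made-up data "A", "B", "C" - not a real RFF file:

```python
# A toy model of the worked example. Each list index is a "location".
# ("len", n) = information block, ("data", x) = data, ("ptr", n) = pointer,
# "eof" = end marker, "junk" = random padding.
layout = [
    "junk",          # 0: random data
    ("len", 2),      # 1: data block size is 2
    ("data", "B"),   # 2: data
    ("data", "C"),   # 3: data
    "eof",           # 4: EOF
    ("len", 1),      # 5: data block size is 1
    ("data", "A"),   # 6: data
    ("ptr", 1),      # 7: go to location 1
    "junk",          # 8: random data
    ("ptr", 5),      # 9: go to location 5 (the file's final pointer)
]

def decode(layout):
    # Start at the pointer stored in the very last slot.
    pos = layout[-1][1]
    out = []
    while True:
        kind, size = layout[pos]      # expect an information block here
        assert kind == "len"
        # Read the data which follows the information block.
        out += [layout[pos + 1 + j][1] for j in range(size)]
        nxt = layout[pos + 1 + size]  # pointer (or EOF) after the data
        if nxt == "eof":
            return "".join(out)
        pos = nxt[1]

print(decode(layout))  # → ABC
```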
There are, of course, a few downsides to this idea.
Most prominently, it bloats the file size. Each block carries a fixed-size header and pointer, so if the data block size were a constant 1MB, that would pad the size by a negligible amount. But with small, variable-sized data blocks, it could increase it significantly. Random padding also increases the size.
If the block size is consistent and there's no random padding data, the files can be mostly reconstructed.
Depending on which parts of the file are missing, it may be possible to recover the majority of the file.
A 32 bit pointer restricts the file size to less than 4GB. A 64 bit pointer might be excessive - or might be future-proof!
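The 4GB ceiling falls straight out of the pointer width:

```python
# Maximum addressable offset for each pointer width.
max_32 = 2**32   # 4,294,967,296 bytes - just under 4GB
max_64 = 2**64   # vastly more than any file you will ever download

print(max_32 // 2**30)  # 4 (GiB)
print(max_64 // 2**60)  # 16 (EiB)
```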
Highly structured files with predictable patterns, or plain text files, may give up large chunks of information even when incomplete.
A malformed file could contain an infinite loop of pointers.
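A decoder can defend against pointer loops by remembering every offset it has visited. Here is a minimal sketch of that guard - `walk` is a hypothetical helper operating on a map of block offsets, not a full RFF parser:

```python
EOF = 0xFFFFFFFF  # the end-of-file marker used by the format

def walk(pointers, start):
    """Follow a chain of block offsets, refusing to loop forever.
    `pointers` maps each block offset to the next offset (or EOF)."""
    seen = set()
    order = []
    pos = start
    while pos != EOF:
        if pos in seen:
            # Revisiting an offset means the chain is circular.
            raise ValueError("pointer cycle - malformed file")
        seen.add(pos)
        order.append(pos)
        pos = pointers[pos]
    return order

# A well-formed chain...
print(walk({100: 200, 200: EOF}, 100))  # → [100, 200]
# ...and a malicious loop.
try:
    walk({100: 200, 200: 100}, 100)
except ValueError as e:
    print(e)
```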
Perhaps a magic number should be at the start (or end) of the file?
While reading the file is as simple as following the pointers, constructing the file is more complex, especially if blocks have variable lengths.
Code
Here's a trivial encoder. It reads a file in consistently sized chunks of 1,024 bytes, shuffles them up, and writes them to a new file. The last 4 bytes contain a pointer to the first block. Each block starts with a 4 byte header which says the data length is 1,024; after the data, there is a 4 byte pointer to the next block's location.
Python 3
import random

# Size of data, headers, and pointers.
data_length = 1024
header_length = 4
pointer_length = 4

# Read the file into a data structure.
original_blocks = list()
with open( "test.jpg", "rb" ) as file:
    for data in iter( lambda: file.read( data_length ), b"" ):
        # Add padding if length is less than the desired length.
        padding = data_length - len( data )
        data += b"\0" * padding
        original_blocks.append( data )

# How many blocks are there?
original_length = len( original_blocks )

# Create a random order of blocks.
order = list( range( 0, original_length ) )
random.shuffle( order )

# Where is the start of the file?
first_block_index = order.index( 0 )
first_block_pointer = first_block_index * ( header_length + data_length + pointer_length )

# Loop through the order and write to a new file.
i = 0
# Open as binary file to add the pointers correctly.
with open( "output.rff", "wb" ) as output:
    while i < original_length:
        # Which original block lives in this slot?
        current_block_value = order[i]
        # Write the length of the data as a little-endian 32 bit header.
        output.write( data_length.to_bytes( header_length, "little" ) )
        # Write the data.
        output.write( original_blocks[ current_block_value ] )
        i = i + 1
        # Last block. Write an EOF marker instead of a pointer.
        if ( current_block_value + 1 >= original_length ):
            eof = 4294967295
            output.write( eof.to_bytes( pointer_length, "little" ) )
        else:
            # Write a pointer to the next block's location.
            next_block = order.index( current_block_value + 1 )
            next_block_location = next_block * ( header_length + data_length + pointer_length )
            output.write( next_block_location.to_bytes( pointer_length, "little" ) )
    # At the end of the file, write the pointer to block 0.
    output.write( first_block_pointer.to_bytes( pointer_length, "little" ) )
And here is a similarly trivial decoder. It reads the last 32 bits, moves to that location, reads the block size, reads the data and writes it to a new file, then reads the next pointer.
Python 3
# Size of data, headers, and pointers.
header_length = 4
pointer_length = 4

# File name to write to.
decoded_file = "decoded.bin"

# Create an empty file.
with open( decoded_file, "w" ) as file:
    pass

# Function to loop through the blocks.
def read_block( position, i ):
    # Move to the position in the file.
    input_file.seek( position, 0 )
    # Read the data length header.
    data_length = int.from_bytes( input_file.read( header_length ), "little" )
    # Read the data which follows the header.
    data = input_file.read( data_length )
    # Read the next pointer (or the EOF marker).
    next_position = int.from_bytes( input_file.read( pointer_length ), "little" )
    # If this is the final block, it may have null padding. Remove it.
    if ( next_position == 4294967295 ):
        data = data.rstrip( b"\0" )
    # Append the data to the decoded file.
    with open( decoded_file, "ab" ) as file:
        file.write( data )
    # If this is the final block, finish searching.
    if ( next_position == 4294967295 ):
        print( "File decoded." )
    else:
        # Move to the next position.
        read_block( next_position, i + 1 )

# Open the file as binary.
input_file = open( "output.rff", "rb" )
# Read the last 4 bytes.
input_file.seek( -4, 2 )
# Get the position of the first block.
first_block = int.from_bytes( input_file.read(), "little" )
# Start reading the file.
read_block( first_block, 0 )
As I said, these are both trivial. They are a bit buggy and contain some hardcoded assumptions.
Here are two files encoded as "RFF" - Random File Format - an image by Maria Sibylla Merian, and the text of Romeo and Juliet.
Have fun decoding them!
3 thoughts on “Random File Format”
It must be something like this they use in movies: “top secret file which will kill many people if the bad guys get it” is being dramatically and slowly uploaded, and the good guys manage to stop the upload at 90%, and of course fortunately this means the bad guys have nothing and the day is saved!
| Reply to original comment on bsky.app
Joker_vD
Well, we can take a page from the .zip file format (also known as ZIP). Whether a file is a ZIP archive is determined by the presence of an end of central directory record, which is located at the end of the archive structure (to allow the easy appending of new files). It doesn't have to reside at the exact end of the file; it has variable length and is identified by a magic number, which must be followed by certain fixed-length fields (one of which points to the start of the central directory) and a length-prefixed comment of up to 64 KiB. You need to scan the file backwards for the magic number, then try to extract the EOCD record starting from it; if you fail, keep backing off towards the beginning of the file until you run out of data: at this point, you've determined that it's not, in fact, a ZIP archive.
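That backwards scan can be sketched in a few lines. The constants follow the ZIP specification - the EOCD signature is `PK\x05\x06`, the fixed part of the record is 22 bytes, and the comment length lives at offset 20 - but this is only a sketch, not a robust ZIP parser:

```python
import struct

EOCD_MAGIC = b"PK\x05\x06"
EOCD_FIXED = 22  # fixed-length part of the EOCD record

def find_eocd(blob):
    """Scan backwards for a plausible End Of Central Directory record.
    A candidate is accepted if its comment-length field agrees with the
    number of bytes that actually follow the record."""
    # The magic can be at most 22 + 65,535 bytes from the end.
    start = max(0, len(blob) - EOCD_FIXED - 0xFFFF)
    pos = blob.rfind(EOCD_MAGIC, start)
    while pos != -1:
        comment_len = struct.unpack_from("<H", blob, pos + 20)[0]
        if pos + EOCD_FIXED + comment_len == len(blob):
            return pos
        # Back off towards the beginning and try again.
        pos = blob.rfind(EOCD_MAGIC, start, pos)
    return None  # not, in fact, a ZIP archive
```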
Bob
Could just encrypt the file and append the key on the end. Further encrypt the key with the SHA of the encrypted file if you want to protect against, say, being able to download the end of the file first to get the key then start a partial decrypt from the beginning.
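Bob's scheme can be sketched with a toy construction. The SHA-256 counter-mode keystream here is purely illustrative - a real implementation would use a proper cipher like AES-GCM - but it shows the idea: the key sits at the end of the file, masked by a hash of everything before it, so no prefix of the file decrypts:

```python
import hashlib, os

def _keystream(key, length):
    # Toy keystream: SHA-256 in counter mode. Illustrative only.
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "little")).digest()
        counter += 1
    return out[:length]

def _xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def seal(plaintext):
    key = os.urandom(32)
    ciphertext = _xor(plaintext, _keystream(key, len(plaintext)))
    # Mask the key with the hash of the ciphertext, so the key at the
    # end is useless until every preceding byte has arrived.
    masked_key = _xor(key, hashlib.sha256(ciphertext).digest())
    return ciphertext + masked_key

def unseal(blob):
    ciphertext, masked_key = blob[:-32], blob[-32:]
    key = _xor(masked_key, hashlib.sha256(ciphertext).digest())
    return _xor(ciphertext, _keystream(key, len(ciphertext)))

assert unseal(seal(b"attack at dawn")) == b"attack at dawn"
```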
More comments on Mastodon.