Recovering (JPEG) files – raptorsnest.nl

Everybody knows you should make a good backup, but how many people actually do it? An old colleague of mine wanted to reinstall Windows on his computer and assumed his family photos were safe on the system’s second hard drive. However, during the reinstall (which included formatting) the IDE and SATA drives and their partitions got mixed up. When he rebooted his computer the photos were gone, replaced by a clean partition containing only the Windows OS.

There are plenty of recovery programs out there, but I haven’t seen a good one that is free. Furthermore those programs need to be installed, potentially overwriting the data you are trying to save. Besides, basic recovery is not that hard if you know your way around Linux a bit. So, I just took my USB drive with a Knoppix install on it and booted the system from there.

First order of business was to secure the data on a separate USB hard drive before I could (accidentally) mess it up any further:
dd if=/dev/sda of=/media/usbdrive/disk.img
This copies an image of the drive /dev/sda to the file /media/usbdrive/disk.img. Everything is a file under Linux; /dev/sda is just a special file that refers to a hard disk. Figuring out which /dev file refers to the hard disk you’re trying to recover, and under which /media directory your USB disk is mounted, shouldn’t be too hard, but it is probably the trickiest bit of this recovery procedure. Double-check it, because swapping if= and of= would overwrite the very drive you are trying to rescue.
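If you want some assurance that the image is complete before going any further, the copy can be followed by a byte-for-byte comparison. The following is a minimal sketch, not part of the original procedure: image_and_verify is a hypothetical helper name, and the 1 MiB block size is an illustrative choice (dd's default of 512 bytes works too, a larger value usually just copies faster).

```shell
# Sketch only: image_and_verify is a hypothetical helper name.
# Usage: image_and_verify /dev/sda /media/usbdrive/disk.img
image_and_verify() {
    src=$1
    img=$2
    # Copy the whole source in 1 MiB chunks (dd's default is 512 bytes).
    dd if="$src" of="$img" bs=1048576 2>/dev/null || return 1
    # cmp -s prints nothing; its exit status is 0 only if both are identical.
    cmp -s "$src" "$img"
}
```

The function returns success only when the copy finished and the image matches the source, so it can be used as: image_and_verify /dev/sda /media/usbdrive/disk.img && echo "image OK"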

In my case it turned out the system only supported USB 1.0 speeds, which means it would have taken a very long time to copy everything to the USB drive. So I ended up opening the system and putting the drive in another machine to speed things up. In principle, however, that would not have been necessary. In fact, I was overly cautious: I’m pretty sure the entire copy was unnecessary, so wherever I use disk.img in the procedure below, I could have used /dev/sda directly instead.

My recovery strategy was based on the idea that drives are divided into blocks (generally of 512 bytes), and that files do not start in the middle of a block, but at the start of a block. I don’t know of any file system for which this does not hold, so this recovery procedure should be pretty independent of the file system that was on the drive before it was formatted.
So I wrote a little program that read in the start of each block and checked it for a JPEG signature:
gcc findjpeg.c -o findjpeg
./findjpeg disk.img > jpegs.txt
This compiles the program and runs it on disk.img, writing the numbers of the blocks in which a JPEG signature was found to the file jpegs.txt. The program will also print a line saying “At block x” every time it has checked 100 MB worth of blocks.
With tail -f jpegs.txt (in another terminal) you can follow its progress. I just let it run overnight.
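The idea behind the scan can be tried out without the C program. The sketch below is not findjpeg.c; it is a much slower shell stand-in, meant for small test images. It assumes that a JPEG shows up as the bytes FF D8 FF at the start of a block, that GNU od is available (for the -w option), and that "JPEG <block>" is an acceptable output format (chosen here so that the grep and cut step further down works on it). The name find_jpeg_blocks is hypothetical, and no "At block x" progress lines are printed.

```shell
# Sketch only: a slow shell stand-in for the scanner, not findjpeg.c itself.
# Prints "JPEG <block>" for every 512-byte block that starts with FF D8 FF.
find_jpeg_blocks() {
    # -An: no offsets, -v: never collapse repeated lines,
    # -tx1: one hex byte per field, -w512: one block per output line.
    od -An -v -tx1 -w512 "$1" |
        awk '$1 == "ff" && $2 == "d8" && $3 == "ff" { print "JPEG", NR - 1 }'
}
```

It would be run as: find_jpeg_blocks disk.img > jpegs.txt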

After determining the blocks where the JPEG files start, it is time to try and retrieve the entire file. For this I relied on the assumption that the rest of the file was stored in the blocks following the first block of the file. This is not necessarily the case. A file may be spread over non-consecutive blocks, but most file systems tend to avoid such fragmentation of files since it slows down the system.
So, if a JPEG file starts at, say, block 12345 we could do:
dd if=disk.img of=recover.jpg bs=512 skip=12345 count=20000
This copies 20,000 blocks of data (approximately 10 MB), starting from block 12345 of the disk image, to the file recover.jpg. You can then try to open the file in an image viewer to see if the recovery of that file actually succeeded. In most cases the original file was smaller than 10 MB, so recover.jpg ends with data that does not belong to the image. To cut it back to the image itself we can use ImageMagick:
convert recover.jpg 12345.jpg
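The new 12345.jpg file will then be back to roughly its original size. Note that convert decodes and re-encodes the image rather than truncating the file, so the result is not byte-for-byte identical to the original.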
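The skip and count values in the dd command above are both counted in blocks of bs bytes, so the byte figures follow from a simple multiplication. A small sketch of that arithmetic (blocks_to_bytes is a hypothetical helper name, and the 512-byte block size is the one assumed throughout this procedure):

```shell
# Sketch only: convert a number of 512-byte blocks to bytes.
blocks_to_bytes() {
    echo $(( $1 * 512 ))
}

blocks_to_bytes 12345   # byte offset where dd starts reading: 6320640
blocks_to_bytes 20000   # number of bytes copied: 10240000
```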

Instead of doing this by hand we can automate it:
grep JPEG jpegs.txt | cut -f 2 -d " " > blocks.txt
This will take all the block numbers from the lines in jpegs.txt that contain the word “JPEG” and put them in the blocks.txt file. That is the list of blocks where the JPEG files start. Now we do:
for i in `cat blocks.txt`; do echo Recovering $i; dd if=disk.img of=recover.jpg bs=512 skip=$i count=20000; convert recover.jpg $i.jpg; done
This will do the procedure we did above automatically for every block number in the blocks.txt file. It might not be the fastest way of doing it, but I found it the easiest, and in my case it finished after a couple of hours.
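A slightly more defensive version of the same loop might look like the sketch below. It rests on a few assumptions: recover_all and the CONVERT variable are hypothetical names, blocks.txt holds one block number per line, and the converter exits with an error when it cannot read the carved data (in which case no output file is kept). Output files are written to the current directory.

```shell
# Sketch only: recover_all and CONVERT are hypothetical names.
# Usage: recover_all disk.img blocks.txt
recover_all() {
    image=$1
    list=$2
    while read -r block; do
        echo "Recovering $block"
        # Carve up to 20000 blocks (about 10 MB) starting at this block.
        dd if="$image" of=recover.jpg bs=512 skip="$block" count=20000 2>/dev/null
        # Keep the result only if the converter accepted the carved data.
        if ! ${CONVERT:-convert} recover.jpg "$block.jpg" 2>/dev/null; then
            rm -f "$block.jpg"
            echo "Skipped $block (converter reported an error)"
        fi
    done < "$list"
    rm -f recover.jpg
}
```

Reading the list with while read avoids loading every block number onto one command line, and the quoting keeps the loop from misbehaving on unexpected input.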

So, that recovered the JPEG files in my case. However, the original file names and folders in which they were stored were not recovered. The JPEG files had just a number for a name, corresponding to the place where they were found on the hard drive.
I used the utility renrot to rename the files. If the camera that took the photograph stored a timestamp with it, the renrot utility will rename the file to that timestamp:
renrot *.jpg
If the photo does not contain a timestamp, however, the utility will rename the file to the current timestamp. So after I sorted out the photos which had a real timestamp, I just renumbered the ones with a current timestamp:
j=0; for i in 20101010*.jpg; do mv "$i" "$j.jpg"; j=$((j+1)); done

And that’s it. My colleague got his family photos back. He still had to sort them out, since they were no longer in the folders in which he had placed them, and the recovery did not distinguish between personal photos and photos from the internet cache. But at least he got them back.

Note that this recovery procedure should work with other file formats as well. The only thing that is needed is a way to detect the header of the file type and adapt the findjpeg.c program accordingly. Detecting the size of the file, like I did with the “convert” utility for JPEGs, would be handy, but in my experience most programs do not mind if the file is longer than expected.
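To illustrate what detecting other headers could look like, here is a minimal shell sketch that classifies a single block by its first bytes. It is not an adaptation of findjpeg.c: block_type is a hypothetical name, and the list only covers a few well-known signatures (JPEG FF D8 FF, PNG 89 50 4E 47 0D 0A 1A 0A, GIF "GIF8", PDF "%PDF").

```shell
# Sketch only: block_type is a hypothetical helper name.
# Usage: block_type disk.img 12345
block_type() {
    # Hex dump of the first 8 bytes of the given 512-byte block.
    hex=$(dd if="$1" bs=512 skip="$2" count=1 2>/dev/null |
          head -c 8 | od -An -v -tx1 | tr -d ' \n')
    case $hex in
        ffd8ff*)          echo JPEG ;;
        89504e470d0a1a0a) echo PNG ;;
        47494638*)        echo GIF ;;
        25504446*)        echo PDF ;;
        *)                echo unknown ;;
    esac
}
```

Short signatures such as these will occasionally match unrelated data, so a hit is only a candidate until the carved file actually opens.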

findjpeg.c Download