Figuring out what's in any Unix file

28 Jun 2015

If you have a file that ends in a known extension like list.txt or mountain.jpg then you can guess what’s inside it. But how do you actually know? After all, you could mv list.txt kitten.png and suddenly your plain text file is pretending to be an image of an adorable little cat.

If the contents of the file are text then the easiest way to look inside is the head command:

> $ head kitten.png
Go to the store and buy:
* Overpriced avocados
* Organic lawn furniture
* Seven gallons of 1% milk

head shows you the first 10 lines (by default) of any file. As long as it’s a text file with newline characters marking the line breaks, this shows you something readable. But what if it’s actually an image file?

> $ head actually_a_kitten.png
+iCCPICC Profile(œùíw˚æŸÉïÜå∞˜^Dˆô≤D!$1$
⎼▒☃┌⎽ └▒⎽├e⎼ $

Oh lovely. That last line there (ending in a $) is actually the prompt for your next command. We just sent so many weird bytes to the screen that the terminal interpreted some of them as control sequences and switched to a different character set. No fun. (Running reset will put the terminal back to normal.)

A safer way to do this would be with the less command, which opens a new full-terminal buffer and safely returns you to your (working) prompt when you’re done. But even then you need to parse this binary with your eyes and hope that the âPNG is, in fact, a sign that this is a valid PNG file.
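If you would rather not send raw bytes to the terminal at all, here is a minimal sketch of the same kind of peek in Python. The function name peek is made up for this example; it only relies on reading the file in binary mode and printing hex plus printable ASCII, the way hex dump tools do.

```python
def peek(path, count=16):
    """Return the first `count` bytes of a file as hex plus printable ASCII."""
    # Binary mode: nothing is decoded, so no control bytes reach the terminal.
    with open(path, "rb") as f:
        chunk = f.read(count)
    hex_part = chunk.hex(" ")  # bytes.hex with a separator needs Python 3.8+
    # Anything outside printable ASCII becomes a dot.
    text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
    return f"{hex_part}  |{text_part}|"
```

Calling print(peek("actually_a_kitten.png", 8)) on a real PNG should show the letters PNG surrounded by dots, with nothing that can confuse your terminal.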

The file command

Luckily, nearly any Unixy OS has the file utility installed. This is a program that tries to guess the type of a file by examining its contents.

> $ file actually_a_kitten.png
actually_a_kitten.png: PNG image data, 8 x 8, 8-bit/color RGBA, non-interlaced

Not only do we know for sure that it’s a PNG but we know the resolution and the color depth. Handy.

How does file work?

This utility first tries to work out whether its argument is a file that’s safe to print on the screen (“text”), an executable, or some other kind of data.

It starts by looking at the file’s metadata on disk to see whether the file is special in some way (a symlink, a directory, a device), is empty, or has some other problem that makes examining the contents unnecessary. This is the same information the stat command shows you:

> $ stat kitten.png
  File: 'kitten.png'
  Size: 3084            Blocks: 8          IO Block: 4096   regular file
Device: fd00h/64768d    Inode: 533679      Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1193/jackdanger)   Gid: ( 1193/jackdanger)
Access: 2015-06-28 07:35:35.864034633 +0000
Modify: 2015-06-28 07:35:35.864034633 +0000
Change: 2015-06-28 07:35:35.864034633 +0000

It’s a regular file (not a symlink or a directory), it’s 3084 bytes long, and it takes up 8 blocks on disk. Alright then, let’s read it and figure out what’s inside.
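To make that first step concrete, here is a rough sketch in Python of a metadata-only check. It is not how file is actually implemented; the function name classify and the returned strings are invented for illustration. The one real detail worth noticing is os.lstat, which (unlike os.stat) does not follow symlinks.

```python
import os
import stat


def classify(path):
    """Describe a path from its metadata alone, or return None if the
    contents need to be read to say anything useful."""
    info = os.lstat(path)  # lstat does not follow symlinks
    mode = info.st_mode
    if stat.S_ISLNK(mode):
        return "symbolic link to " + os.readlink(path)
    if stat.S_ISDIR(mode):
        return "directory"
    if stat.S_ISCHR(mode):
        return "character special"
    if stat.S_ISBLK(mode):
        return "block special"
    if stat.S_ISFIFO(mode):
        return "fifo (named pipe)"
    if stat.S_ISSOCK(mode):
        return "socket"
    if info.st_size == 0:
        return "empty"
    return None  # regular, non-empty file: time to look inside
```

Only when this returns None is there any point in opening the file and reading bytes.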

Finding the magic on your computer

When I first learned how this worked the person showing me the ropes asked “Do you want to see where the magic is?” and I was utterly confused. Then he ran man file and pointed out that the process of determining what’s inside a file from individual heuristics is, in fact, known as magic.

Or, more specifically:

> $ man file | tr ' ' "\n" | grep /magic  # this tr will swap spaces for newlines

Before we look inside one of these files by hand let’s practice using file itself:

> $ file /usr/share/misc/magic.mgc
/usr/share/misc/magic.mgc: symbolic link to `../file/magic.mgc'

Oh, that didn’t even inspect the contents of the file – it stopped at the stat step. What did it see there?

> $ stat /usr/share/misc/magic.mgc
  File: ‘/usr/share/misc/magic.mgc’ -> ‘../file/magic.mgc’
  Size: 17              Blocks: 0          IO Block: 4096   symbolic link
Device: ca01h/51713d    Inode: 2023        Links: 1
Access: (0777/lrwxrwxrwx)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2015-06-28 08:12:44.599057183 +0000
Modify: 2014-07-10 17:48:04.000000000 +0000
Change: 2014-12-15 08:35:19.791057183 +0000
 Birth: -

Yeah, I see the words ‘symbolic link’ there for sure. Now let’s follow the symlink and see what’s really inside it:

> $ readlink -f /usr/share/misc/magic.mgc
/usr/share/file/magic.mgc
> $ file $(readlink -f !$)  # same as: file /usr/share/file/magic.mgc
/usr/share/file/magic.mgc: magic binary file for file(1) cmd (version 8) (little endian)

This is great. It knows this file so well that it knows it’s its own magic file-detection file. So what’s inside it?

> $ less /usr/share/file/magic.mgc

Yech. How about one of the others?

> $ head -n 15 /usr/share/file/magic
## $File: acorn,v 1.5 2009/09/19 16:28:07 christos Exp $
## acorn:  file(1) magic for files found on Acorn systems

## RISC OS Chunk File Format
## From RISC OS Programmer's Reference Manual, Appendix D
## We guess the file type from the type of the first chunk.
0       lelong          0xc3cbc6c5      RISC OS Chunk data
>12     string          OBJ_            \b, AOF object
>12     string          LIB_            \b, ALF library

# RISC OS AIF, contains "SWI OS_Exit" at offset 16.
16      lelong          0xef000011      RISC OS AIF executable

Now we’re talking. This format is a series of specific checks for byte values in a file. The first couple of examples are kinda hard to make sense of, so I’ll skip down to a simpler one:

0       string          \x89PNG\x0d\x0a\x1a\x0a         PNG image data
!:mime  image/png
>16     belong          x               \b, %ld x
>20     belong          x               %ld,
>24     byte            x               %d-bit
>25     byte            0               grayscale,
>25     byte            2               \b/color RGB,
>25     byte            3               colormap,
>25     byte            4               gray+alpha,
>25     byte            6               \b/color RGBA,
#>26    byte            0               deflate/32K,
>28     byte            0               non-interlaced
>28     byte            1               interlaced

This means “If the bytes starting at offset 0 match this string then it’s ‘PNG image data’”. The lines starting with > are only checked once the line above them has matched, and each one can refine the result by adding more detail to it. Offsets count from zero, so apparently it’s part of the PNG format that if the byte at offset 25 is a 4 then this is a grayscale image with an alpha channel. Cool.
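Here is the same rule written out by hand as a small Python sketch, so you can see there is nothing mysterious going on. The function name describe_png is made up, and the output string only imitates what file printed earlier; the offsets (16, 20, 24, 25, 28) come straight from the rule above.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # the "0 string" test from the rule

# Byte at offset 25; the strings mirror the messages in the rule.
COLOR_TYPES = {
    0: " grayscale",
    2: "/color RGB",
    3: " colormap",
    4: " gray+alpha",
    6: "/color RGBA",
}
INTERLACE = {0: "non-interlaced", 1: "interlaced"}  # byte at offset 28


def describe_png(data):
    """Return a file-style description, or None if this isn't a PNG header."""
    if len(data) < 29 or not data.startswith(PNG_SIGNATURE):
        return None
    # Two big-endian 32-bit integers (belong) at offsets 16 and 20,
    # then two single bytes at offsets 24 and 25.
    width, height, depth, color = struct.unpack_from(">IIBB", data, 16)
    color_text = COLOR_TYPES.get(color, " unknown color type")
    interlace_text = INTERLACE.get(data[28], "unknown interlacing")
    return (f"PNG image data, {width} x {height}, "
            f"{depth}-bit{color_text}, {interlace_text}")
```

Feed it the first 29 bytes of any PNG (open the file with "rb") and it should agree with file about the dimensions and color type.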

Hey, we can detect Photoshop files, too!

0       string          8BPS Adobe Photoshop Image
!:mime  image/vnd.adobe.photoshop
>4   beshort 2 (PSB)
>18  belong  x \b, %d x
>14  belong  x %d,
>24  beshort 0 bitmap
>24  beshort 1 grayscale
>>12 beshort 2 with alpha
>24  beshort 2 indexed
>24  beshort 3 RGB
>>12 beshort 4 \bA
>24  beshort 4 CMYK
>>12 beshort 5 \bA
>24  beshort 7 multichannel
>24  beshort 8 duotone
>24  beshort 9 lab
>12  beshort > 1
>>12  beshort x \b, %dx
>12  beshort 1 \b,
>22  beshort x %d-bit channel
>12  beshort > 1 \bs
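Every rule in these listings has the same columns: an offset (with one > per nesting level), a type, a value to test against, and a message. As a rough sketch, here is a Python parser for that layout. The names Rule and parse_rule are invented for this example, and it is a deliberate simplification: it assumes a plain numeric offset and a test value with no whitespace in it, which the real format does not guarantee.

```python
from typing import NamedTuple, Optional


class Rule(NamedTuple):
    level: int     # number of leading '>' characters
    offset: int    # where in the file to look
    type: str      # e.g. "string", "belong", "beshort", "byte"
    test: str      # the value to compare against ("x" matches anything)
    message: str   # text to print when the test succeeds


def parse_rule(line: str) -> Optional[Rule]:
    """Parse one simple magic line; return None for lines that aren't tests."""
    stripped = line.strip()
    # Skip blanks, comments, and "!:" annotations such as "!:mime".
    if not stripped or stripped.startswith("#") or stripped.startswith("!:"):
        return None
    fields = stripped.split(None, 3)  # message keeps its internal spaces
    if len(fields) < 3:
        return None
    offset_field, type_, test = fields[:3]
    message = fields[3] if len(fields) > 3 else ""
    digits = offset_field.lstrip(">")
    level = len(offset_field) - len(digits)
    return Rule(level, int(digits, 0), type_, test, message)
```

For example, parse_rule(">>12 beshort 2 with alpha") gives a rule at nesting level 2 that looks at offset 12, which is how file knows to check it only after the >24 line above it has matched.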

There are some 500 known formats in here (egrep -c '^# .* file' /usr/share/misc/magic) detecting all kinds of things. How correct is this set of heuristics, though? Since it’s just looking for key data and ignoring most other bytes, is it possible to fool it?

To generate fake data we’re going to read from /dev/urandom with dd, which copies it in 512-byte blocks to standard output. We’ll pipe that straight into file, with - telling it to read standard input and -s telling it to read special files rather than just naming their type:

> $ dd if=/dev/urandom bs=512 | file -s -
/dev/stdin: data

Okay, not much. It’s “data”. Great. But if we try that in a bash loop…

> $ for _ in {0..20}; do dd if=/dev/urandom bs=512 | file -s -; done
/dev/stdin: data
/dev/stdin: data
/dev/stdin: data
/dev/stdin: data
/dev/stdin: data
/dev/stdin: data
/dev/stdin: data
/dev/stdin: data
/dev/stdin: data
/dev/stdin: data
/dev/stdin: data
/dev/stdin: data
/dev/stdin: data
/dev/stdin: data
/dev/stdin: MPEG ADTS, layer III, v1,  40 kbps, 48 kHz, Monaural
/dev/stdin: data
/dev/stdin: data
/dev/stdin: data
/dev/stdin: data
/dev/stdin: data
/dev/stdin: data

Wait. What? What’s an MPEG ADTS file? Doesn’t matter, we generated data that fit the profile of it with just a few bytes in the right place.
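How likely is that, roughly? If a rule only pins down a handful of bits, random data will satisfy it surprisingly often. The sketch below works through the arithmetic in Python. The mask 0xFFE0 is a hypothetical rule that fixes the top 11 bits of the first two bytes; it is not copied from the real magic file, and chance_any assumes the rules are independent, which real ones are not.

```python
def fixed_bits(mask):
    """Count how many bits a mask forces to a particular value."""
    return bin(mask).count("1")


def chance_of_match(mask):
    """Probability that uniformly random bytes satisfy a rule with this mask."""
    return 2.0 ** -fixed_bits(mask)


def matches(block, mask, value):
    """Apply a (hypothetical) masked test to the first two bytes of a block."""
    word = int.from_bytes(block[:2], "big")
    return (word & mask) == value


def chance_any(masks):
    """Chance that at least one rule matches, treating rules as independent."""
    miss_all = 1.0
    for mask in masks:
        miss_all *= 1.0 - chance_of_match(mask)
    return 1.0 - miss_all
```

A rule that fixes 11 bits matches random data about once in 2048 tries, and a database with hundreds of loose rules gives random bytes many chances to look like something.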

And if we run this in a loop and ignore the misses we see way more interesting stuff.

> $ while :; do dd if=/dev/urandom bs=512 | file -s - | grep -v ': data'; done
/dev/stdin: Sendmail frozen configuration  - version =\236\207\021\033\316h\312d\236i\326\033\236\350C\234\373
/dev/stdin: 8086 relocatable (Microsoft)
/dev/stdin: hp200 (68010) BSD
/dev/stdin: Sendmail frozen configuration  - version \332\317\022?\310]\272s+^\021\032\212\234Q\341;\326\320
/dev/stdin: SysEx File - Fujitsu
/dev/stdin: DBase 3 data file
/dev/stdin: DBase 3 data file (899713281 records)
/dev/stdin: MPEG-4 LOAS, single stream
/dev/stdin: DBase 3 data file with memo(s) (1054998150 records)
/dev/stdin: Dyalog APL version 219 .105

This really is magic, then, in the sense that this isn’t predictable and isn’t super reliable. But it’s still more effective than trying to cat it to your screen and look at the bytes yourself.
