Berke Güzel

Super. Bad. Code.

You (probably) do not need mediainfo

nor ffmpeg/ffprobe

It is surprising to me that in the future age of 2026, web services, developers, and browsers still rely on file extensions to recognize files and filter them for what they are. Since the dawn of time it’s been known that an extension is a hint as to what a file might be, not what that file is.[1] We’ve had a number of cases where YouTube channels[2] got their session tokens stolen by a simple .scr (screensaver) file disguised as a .pdf.

I am going to talk more about file extensions in a different post, for now though…

mediainfo

Mediainfo is a C++ library and CLI tool that extracts metadata from video files. It’s widely used, well-maintained, and supports virtually every format ever created. There’s even a WASM port.

So why am I telling you not to use it?

Because if all you need is resolution, duration, frame rate, and codec from an MP4 upload (or webm if you’re adventurous) - that information is sitting in the file, in a spec that’s been stable since the early 2000s, and you can read it yourself in ~150 lines of Python.

what?

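Yes, did you know that files contained info? I know, mind blowing concept. An MP4/MOV file is basically a sequence of boxes (also called atoms), each starting with a 4-byte big-endian size (which counts the 8 header bytes too) and a 4-byte type, arranged like this: [4 bytes: size][4 bytes: type][...data...]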
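To make the [size][type][data] layout concrete, here is a minimal sketch of walking those chunks (boxes, in the spec’s terms) in a byte buffer. iter_boxes is a hypothetical helper name, not something from this post’s implementation. It assumes big-endian fields and covers the two special size values the format defines: 1 (a 64-bit size follows the type) and 0 (the box runs to the end of its container).

```python
import struct
from typing import Iterator, Optional, Tuple


def iter_boxes(buf: bytes, start: int = 0,
               end: Optional[int] = None) -> Iterator[Tuple[str, int, int]]:
    """Yield (type, payload_start, payload_end) for each box in buf[start:end]."""
    end = len(buf) if end is None else end
    pos = start
    while pos + 8 <= end:
        size, box_type = struct.unpack_from(">I4s", buf, pos)
        header = 8
        if size == 1:
            # A 64-bit "largesize" follows the type field.
            if pos + 16 > end:
                break
            (size,) = struct.unpack_from(">Q", buf, pos + 8)
            header = 16
        elif size == 0:
            # The box extends to the end of the enclosing container.
            size = end - pos
        if size < header or pos + size > end:
            break  # truncated or corrupt: stop rather than guess
        yield box_type.decode("latin-1"), pos + header, pos + size
        pos += size


# First 36 bytes of audio.mp4, copied from the hex dump in this post.
header = bytes.fromhex(
    "0000001c6674797069736f6d0000020069736f6d69736f326d703431"  # ftyp, 28 bytes
    "0000000866726565"                                          # free, 8 bytes
)
print(list(iter_boxes(header)))  # [('ftyp', 8, 28), ('free', 36, 36)]
```

Container boxes such as moov or trak hold more boxes in their payload, so the same function can be called again on a payload range to descend one level.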

If we take a hex dump of an mp4 file using the lovely xxd tool, we see:

xxd audio.mp4 | head -n 20  
00000000: 0000 001c 6674 7970 6973 6f6d 0000 0200  ....ftypisom....  
00000010: 6973 6f6d 6973 6f32 6d70 3431 0000 0008  isomiso2mp41....  
00000020: 6672 6565 0000 7745 6d64 6174 de02 004c  free..wEmdat...L  
00000030: 6176 6336 322e 3131 2e31 3030 0042 2008  avc62.11.100.B .  
00000040: c118 3821 1004 608c 1c21 1004 608c 1c21  ..8!..`..!..`..!

That’s an ISO media MP4 (ftyp, isom), and the Lavc62.11.100 string at the start of mdat is the libavcodec version that encoded it, not an AVC video codec. In fairness, this is only part of the story: we do need to parse the actual moov box, which can sit either at the end or at the start of the file. For this demonstration, the file is an audio file encoded with ffmpeg as “mp4” without -movflags +faststart (which just moves moov to the start of the file), so moov is at the end of the file. We need to find where moov starts and tell xxd to only read from there onward.

xxd -s $(grep -oba 'moov' audio.mp4 | cut -d: -f1) -l 512 audio.mp4
0000776d: 6d6f 6f76 0000 006c 6d76 6864 0000 0000  moov...lmvhd....
000077dd: 0000 03b5 7472 616b 0000 005c 746b 6864  ....trak...\tkhd
0000782d: 0000 0000 0000 0000 4000 0000 0000 0000  ........@.......
0000783d: 0000 0000 0000 0024 6564 7473 0000 001c  .......$edts....
0000784d: 656c 7374 0000 0000 0000 0001 0000 07d0  elst............
0000785d: 0000 02b0 0001 0000 0000 032d 6d64 6961  ...........-mdia
0000786d: 0000 0020 6d64 6864 0000 0000 0000 0000  ... mdhd........
0000787d: 0000 0000 0000 bb80 0001 79b0 15c7 0000  ..........y.....
0000788d: 0000 002d 6864 6c72 0000 0000 0000 0000  ...-hdlr........
0000789d: 736f 756e 0000 0000 0000 0000 0000 0000  soun............
000078ad: 536f 756e 6448 616e 646c 6572 0000 0002  SoundHandler....
000078bd: d86d 696e 6600 0000 1073 6d68 6400 0000  .minf....smhd...
000078cd: 0000 0000 0000 0000 2464 696e 6600 0000  ........$dinf...
000078dd: 1c64 7265 6600 0000 0000 0000 0100 0000  .dref...........
000078ed: 0c75 726c 2000 0000 0100 0002 9c73 7462  .url ........stb
000078fd: 6c00 0000 7e73 7473 6400 0000 0000 0000  l...~stsd.......
0000790d: 0100 0000 6e6d 7034 6100 0000 0000 0000  ....nmp4a.......

Let’s unpack this:

  • grep -oba 'moov' audio.mp4 prints the byte offset at which the string moov was found (30573:moov)
  • cut -d: -f1 keeps only the part of grep’s output before the colon (30573)
  • xxd -s tells xxd to start reading from offset 30573 (0x776d), and -l 512 limits the dump to 512 bytes
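The same lookup can be sketched in Python without grepping for a string. The version below hops from box header to box header, so it never scans mdat, where the bytes moov could show up by accident. find_top_level_box is a hypothetical helper name, and it assumes a seekable binary stream such as an opened file. It returns the offset of the box’s size field, which is 4 bytes before the offset grep reports for the type.

```python
import io
import struct
from typing import BinaryIO, Optional


def find_top_level_box(stream: BinaryIO, wanted: bytes = b"moov") -> Optional[int]:
    """Return the offset of the first top-level box of type `wanted`, or None."""
    stream.seek(0, io.SEEK_END)
    file_size = stream.tell()
    pos = 0
    while pos + 8 <= file_size:
        stream.seek(pos)
        header = stream.read(8)
        if len(header) < 8:
            return None
        size, box_type = struct.unpack(">I4s", header)
        min_size = 8
        if size == 1:
            # 64-bit size stored right after the type field.
            ext = stream.read(8)
            if len(ext) < 8:
                return None
            (size,) = struct.unpack(">Q", ext)
            min_size = 16
        elif size == 0:
            # Box runs to the end of the file.
            size = file_size - pos
        if box_type == wanted:
            return pos
        if size < min_size:
            return None  # corrupt size field, avoid looping forever
        pos += size  # skip the payload without reading it
    return None


# Usage (assumes audio.mp4 exists locally):
# with open("audio.mp4", "rb") as f:
#     print(find_top_level_box(f))
```

For the file in this post the expected result would be 30569, since grep’s 30573 points at the type field rather than at the start of the box.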

Now that the cli soup is out of the way, let’s dive in. trak is what we’re interested in: it tells us that there is a video or audio track. The dump looks misaligned because box boundaries rarely line up with xxd’s 16-byte rows. Under trak, if you look closely, you can find information such as:

  • soun: this is an audio track (vide for video; the handler type is 4 bytes)
  • under stsd is n+mp4a, which is the last byte of the entry size (0x6e) followed by our audio codec’s ID
  • under mdhd we have track info, and if we decode it:
0000 0020           = box size (32 bytes)
6d64 6864           = "mdhd"
00                  = version
00 0000             = flags
0000 0000           = creation time
0000 0000           = modification time
0000 bb80           = timescale (48000 - this is audio, 48kHz)
0001 79b0           = duration in timescale units (96688)
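The same decode can be written with Python’s struct module. This is a minimal sketch that assumes a version 0 mdhd box (version 1 uses 64-bit times and duration, which it does not handle); parse_mdhd_v0 is a hypothetical name. The bytes are copied from the dump above, and the last four bytes (15c7 0000), which the table leaves out, hold the packed language code.

```python
import struct

# The 32 mdhd bytes from the dump above, box header included.
MDHD = bytes.fromhex(
    "000000206d646864"  # size = 32, type = "mdhd"
    "00000000"          # version (1 byte) + flags (3 bytes)
    "0000000000000000"  # creation time, modification time
    "0000bb80000179b0"  # timescale, duration
    "15c70000"          # packed language, pre_defined
)


def parse_mdhd_v0(box: bytes) -> dict:
    """Decode a version 0 mdhd box, header included."""
    _size, box_type, version_flags = struct.unpack_from(">I4sI", box, 0)
    if box_type != b"mdhd" or version_flags >> 24 != 0:
        raise ValueError("expected a version 0 mdhd box")
    _ctime, _mtime, timescale, duration, lang = struct.unpack_from(">IIIIH", box, 12)
    # Language is three 5-bit letters, each stored as (ASCII code - 0x60).
    language = "".join(chr(((lang >> shift) & 0x1F) + 0x60) for shift in (10, 5, 0))
    return {
        "timescale": timescale,
        "duration": duration,
        "seconds": duration / timescale,
        "language": language,
    }


info = parse_mdhd_v0(MDHD)
print(info["timescale"], info["duration"], info["language"])  # 48000 96688 eng
print(round(info["seconds"], 3))  # 2.014
```

Dividing duration by timescale gives the track length in seconds, which is where the roughly 2 second figure comes from.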

If you’re wondering what we’re doing, we’re literally reading bytes and parsing them into data. Notice the 0000 0020 6d64 6864 in the earlier output.

So now, if we take everything we just found out together: an mp4a codec (AAC), a soun handler, and a roughly 2 second stream (96688 / 48000 ≈ 2.01 s). audio.mp4 is an audio file.

To confirm, here is ffprobe’s output:

ffprobe audio.mp4 
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'audio.mp4':
  Duration: 00:00:02.00, start: 0.000000, bitrate: 126 kb/s
  Stream #0:0[0x1](eng): Audio: aac (LC) (mp4a / 0x6134706D), 48000 Hz, stereo, fltp, 121 kb/s (default)
    Metadata:
      handler_name    : SoundHandler
      vendor_id       : [0][0][0][0]

Voila!

But… why?

Right. All this moov parsing stuff is very cool but we can just install ffmpeg, all it takes is package-manager install-command ffmpeg innit? Yes.

For now, let’s assume you’re an API, a backend server to some frontend, handing out CDN URLs and letting the frontend upload whatever it wishes to each URL. How do you know what was uploaded? The naive approach is to let the frontend parse the video file (e.g. with mediainfo.js) and send the metadata to your backend, where you store it.

This does play out nicely until… you need a mobile application. You tell the mobile dev that you need the output of mediainfo; they look it up and say they can’t use it because it’s not “react-native supported”. Now the mobile dev either needs to find a different library, or parse the bytes themselves, or…

Let backend handle it?

That is pretty smart innit, just a simple os.exec(ffprobe) and some string parsing and voila, you have your metadata. Not so fast. That requires your prod server to have the ffprobe binary, launch a separate process, and string-match the output, just to get some information that is already available in the file. Instead we can write our own parser, and only request specific bytes from the file server:

import httpx

# parse_tracks() is not defined here; it comes from the full implementation.


def main(url: str):
    READ_RANGE = 131072  # 128 KiB in bytes

    with httpx.Client() as client:
        head = client.head(url)
        size = int(head.headers['content-length'])

        # Try head first
        r = client.get(url, headers={'Range': f'bytes=0-{READ_RANGE-1}'})
        moov_data = r.content

        if moov_data.find(b'moov') == -1:
            # Try tail; clamp the start so small files don't yield a negative offset
            start = max(0, size - READ_RANGE)
            r = client.get(url, headers={'Range': f'bytes={start}-{size-1}'})
            moov_data = r.content

            if moov_data.find(b'moov') == -1:
                print("No moov found")
                return

        tracks = parse_tracks(moov_data)
        for t in tracks:
            print(t)

This way, we both avoid depending on a 3rd party binary, and skip the (arguably minimal) latency that comes from launching a process! As long as the file server honors Range requests, we fetch what we need, parse, and return.

Here is the full implementation in Python.
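For a sense of what the parsing side involves, here is a minimal sketch. It is not the linked implementation: sketch_parse_tracks, boxes, find and parse_trak are hypothetical names chosen for illustration. It assumes the whole moov box is inside the fetched bytes, trusts the first moov match (which could in principle be a false positive), handles only version 0 mdhd and 32-bit box sizes, and reads only the first stsd entry. Resolution and frame rate are left out.

```python
import struct
from typing import Dict, Iterator, List, Optional, Tuple


def boxes(buf: bytes, start: int, end: int) -> Iterator[Tuple[bytes, int, int]]:
    """Yield (type, payload_start, payload_end) for boxes in buf[start:end]."""
    pos = start
    while pos + 8 <= end:
        size, kind = struct.unpack_from(">I4s", buf, pos)
        if size < 8 or pos + size > end:
            break  # truncated, corrupt, or a 64-bit/open-ended size (not handled)
        yield kind, pos + 8, pos + size
        pos += size


def find(buf: bytes, start: int, end: int, *path: bytes) -> Optional[Tuple[int, int]]:
    """Follow a path of nested box types; return the last payload range."""
    for want in path:
        for kind, s, e in boxes(buf, start, end):
            if kind == want:
                start, end = s, e
                break
        else:
            return None
    return start, end


def parse_trak(buf: bytes, start: int, end: int) -> Dict:
    info: Dict = {}
    hdlr = find(buf, start, end, b"mdia", b"hdlr")
    if hdlr and hdlr[1] - hdlr[0] >= 12:
        # payload: version/flags (4), pre_defined (4), handler type (4)
        info["handler"] = buf[hdlr[0] + 8:hdlr[0] + 12].decode("latin-1")
    mdhd = find(buf, start, end, b"mdia", b"mdhd")
    if mdhd and mdhd[1] - mdhd[0] >= 20 and buf[mdhd[0]] == 0:
        # payload: version/flags (4), creation (4), modification (4), then these two
        timescale, duration = struct.unpack_from(">II", buf, mdhd[0] + 12)
        info["timescale"], info["duration"] = timescale, duration
        if timescale:
            info["seconds"] = duration / timescale
    stsd = find(buf, start, end, b"mdia", b"minf", b"stbl", b"stsd")
    if stsd and stsd[1] - stsd[0] >= 16:
        # payload: version/flags (4), entry count (4), entry size (4), codec ID (4)
        info["codec"] = buf[stsd[0] + 12:stsd[0] + 16].decode("latin-1")
    return info


def sketch_parse_tracks(data: bytes) -> List[Dict]:
    idx = data.find(b"moov")
    if idx < 4:
        return []  # not found, or no room for the size field before it
    (moov_size,) = struct.unpack_from(">I", data, idx - 4)
    moov_end = min(idx - 4 + moov_size, len(data))
    return [parse_trak(data, s, e)
            for kind, s, e in boxes(data, idx + 4, moov_end) if kind == b"trak"]
```

Given the moov bytes of audio.mp4, this should report one track with handler soun, codec mp4a, timescale 48000 and duration 96688, matching the manual decode earlier in the post.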

Unless…

Well, this only applies to mp4/mov files, and you can extend it to png/jpg/webm as well, which covers our use case of “upload files to social media”.

The implementation above also does not take into account fragmented mp4 (fmp4), used in streaming. If your use case requires robust handling of all media codecs, including proprietary ones from the 90s and 2000s, using mediainfo/ffprobe is the way to go.

Otherwise? You probably do not need mediainfo.

footnotes
1: https://www.reddit.com/r/linuxmasterrace/comments/qp5zme/file_extensions_are_hints_as_to_what_might_be_in/
https://medium.com/@ekondur/why-file-extensions-are-not-reliable-0f56e25e5fd2

2: https://www.jonaharagon.com/posts/linus-tech-tips-hacked-whos-to-blame/
