100 % 5

I saw this video about split earlier today and thought I noticed something odd.

Very succinctly, here’s what the video shows:

$ seq 1 100 > f
$ split -n l/5 f
$ ls
f  xaa  xab  xac  xad  xae

The command splits f, a file with 100 lines, into 5 “equally-sized” parts, each with a name that starts with x.

But… they’re not actually the same size:

$ wc *
     100     100     292 f
      23      23      60 xaa
      19      19      57 xab
      19      19      57 xac
      20      20      60 xad
      19      19      58 xae
     200     200     584 total

Now, if I had been paying attention to the video, I would’ve understood that it’s splitting the file by byte count (instead of line count). But I wasn’t. The video also never actually says that the chunks are supposed to be the same size. Oops.

Anyway, I went down a bit of a rabbit hole. I thought I’d share my intuition for understanding what’s going on.

  1. f is 292 bytes. It’s made up of 9 single-digit numbers (1 to 9), 90 double-digit numbers (10 to 99) and 1 triple-digit number (100). Each line also has a newline character. That’s (9 × 1) + (90 × 2) + (1 × 3) + 100 = 292 bytes.

  2. We asked to split f into 5 files, without breaking lines. That’s 292 bytes ÷ 5 chunks = 58.4 bytes per chunk.

  3. So split is probably continually copying some amount of f into a new file, until that file is 58.4 bytes. It needs to keep going until the next newline character, since we told it not to break in the middle of a line. It then repeats the process with more new files, until reaching the end of f.

    Here’s a visualization of the first two chunks. The line with “23” doesn’t fit into the initial 58.4 bytes, which is why the xaa file above is 60 bytes.

    $ hexdump -C -s 0 -n 58.4 f
    00000000  31 0a 32 0a 33 0a 34 0a  35 0a 36 0a 37 0a 38 0a  |1.2.3.4.5.6.7.8.|
    00000010  39 0a 31 30 0a 31 31 0a  31 32 0a 31 33 0a 31 34  |9.10.11.12.13.14|
    00000020  0a 31 35 0a 31 36 0a 31  37 0a 31 38 0a 31 39 0a  |.15.16.17.18.19.|
    00000030  32 30 0a 32 31 0a 32 32  0a 32                    |20.21.22.2|
    0000003a
    $ hexdump -C -s 58.4 -n 58.4 f
    0000003a  33 0a 32 34 0a 32 35 0a  32 36 0a 32 37 0a 32 38  |3.24.25.26.27.28|
    0000004a  0a 32 39 0a 33 30 0a 33  31 0a 33 32 0a 33 33 0a  |.29.30.31.32.33.|
    0000005a  33 34 0a 33 35 0a 33 36  0a 33 37 0a 33 38 0a 33  |34.35.36.37.38.3|
    0000006a  39 0a 34 30 0a 34 31 0a  34 32                    |9.40.41.42|
    00000074
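
The arithmetic and the chunking guess above can be checked with a small Python sketch. This is only a model that reproduces the sizes from the wc output, not GNU split’s actual code. The helper names (make_seq, chunk_sizes) are made up for illustration, and it assumes that chunk boundaries sit at fixed multiples of the target size, that the target is rounded down to 58, and that a line belongs to the chunk in which its first byte falls.

```python
def make_seq(first, last):
    """Bytes that `seq first last` would write: one number per line."""
    return "".join(f"{n}\n" for n in range(first, last + 1)).encode()


def chunk_sizes(data, n):
    """Sizes of n line-preserving chunks under a fixed-boundary model.

    Assumption: boundaries sit at multiples of len(data) // n, and a line
    goes to the chunk in which its first byte falls. The last chunk takes
    whatever is left over.
    """
    target = len(data) // n  # 292 // 5 == 58, the fraction is dropped
    if target == 0:
        raise ValueError("fewer bytes than chunks; this sketch does not handle that")
    sizes = [0] * n
    offset = 0  # byte offset at which the current line starts
    for line in data.splitlines(keepends=True):
        index = min(offset // target, n - 1)
        sizes[index] += len(line)
        offset += len(line)
    return sizes


data = make_seq(1, 100)
print(len(data))             # 292
print(chunk_sizes(data, 5))  # [60, 57, 57, 60, 58]
```

In this model the target has to be the whole number 58. With an unrounded 58.4, the third boundary would land at 175.2, the line “62” (which starts at byte 174) would join the third chunk, and xac would come out as 60 bytes instead of 57.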
    

For a file with 100 double-digit numbers, the output chunks are indeed the same size:

$ seq -w 0 99 > f2 # -w pads all numbers to be the same length, like left-pad
$ split -n l/5 f2
$ wc f2 x*
     100     100     300 f2
      20      20      60 xaa
      20      20      60 xab
      20      20      60 xac
      20      20      60 xad
      20      20      60 xae
     200     200     600 total

Perfect and round, just like the sun.
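
A quick way to see why the padded file comes out even, sketched in Python. It rebuilds the same bytes that seq -w 0 99 writes and checks that the total divides by 5 with nothing left over, and that the per-chunk size is a whole number of 3-byte lines. The variable names are just for illustration.

```python
# Bytes that `seq -w 0 99` would write: 100 lines of two digits plus a newline.
padded = "".join(f"{n:02d}\n" for n in range(100)).encode()

total = len(padded)                  # 100 lines x 3 bytes
target, leftover = divmod(total, 5)  # bytes per chunk, and what does not fit
line_len = 3                         # every line is the same length here
lines_fit_exactly = target % line_len == 0

print(total, target, leftover, lines_fit_exactly)  # 300 60 0 True
```

For comparison, divmod(292, 5) is (58, 2), and the lines of f are not all the same length, so the boundaries cannot all land on a newline.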


Note that this is about GNU split and not BSD split (which is the default on macOS), because the two are very different, probably for a really good reason.