I saw this video about split earlier today and thought I noticed something odd.
Very succinctly, here’s what the video shows:
$ seq 1 100 > f
$ split -n l/5 f
$ ls
f xaa xab xac xad xae
The command splits f, a file with 100 lines, into 5 “equally-sized” parts, each prefixed by an x.
But… they’re not actually the same size:
$ wc *
100 100 292 f
23 23 60 xaa
19 19 57 xab
19 19 57 xac
20 20 60 xad
19 19 58 xae
200 200 584 total
Now, if I had been paying attention to the video, I would’ve understood that it’s splitting the file by byte count (instead of line count). But I wasn’t. The video also never actually says that the chunks are supposed to be the same size. Oops.
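For comparison, splitting by line count is what the -l option does. Here is a small sketch, assuming a scratch directory and a made-up "byline." output prefix (both are illustrative choices, not something from the video) so the pieces don't clobber the x* files above:

```shell
# Split by line count instead of byte count.
# The scratch directory and the "byline." prefix are illustrative choices.
dir=$(mktemp -d)
seq 1 100 > "$dir/f"
split -l 20 "$dir/f" "$dir/byline."

# Print "lines bytes name" for each piece.
for piece in "$dir"/byline.*; do
  echo "$(wc -l < "$piece") $(wc -c < "$piece") ${piece##*/}"
done
```

Each piece comes out at 20 lines, but the byte sizes are 51, 60, 60, 60 and 61, so equal line counts don't give equal byte counts either.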
Anyway, I went down a bit of a rabbit hole. I thought I’d share my intuition for understanding what’s going on.
- f is 292 bytes. It’s made up of 9 single-digit numbers (1 to 9), 90 double-digit numbers (10 to 99) and 1 triple-digit number (100). Each line also ends with a newline character. That’s (9 × 1) + (90 × 2) + (1 × 3) + 100 = 292 bytes.
- We asked to split f into 5 files, without breaking lines. That’s 292 bytes ÷ 5 chunks = 58.4 bytes per chunk.
- So split is probably copying f into a new file until that file reaches 58.4 bytes. It needs to keep going until the next newline character, since we told it not to break in the middle of a line. It then repeats the process with more new files until it reaches the end of f.

Here’s a visualization of the first two chunks. The line with “23” doesn’t fit into the initial 58.4 bytes, which is why the xaa file above is 60 bytes.

$ hexdump -C -s 0 -n 58.4 f
00000000  31 0a 32 0a 33 0a 34 0a  35 0a 36 0a 37 0a 38 0a  |1.2.3.4.5.6.7.8.|
00000010  39 0a 31 30 0a 31 31 0a  31 32 0a 31 33 0a 31 34  |9.10.11.12.13.14|
00000020  0a 31 35 0a 31 36 0a 31  37 0a 31 38 0a 31 39 0a  |.15.16.17.18.19.|
00000030  32 30 0a 32 31 0a 32 32  0a 32                    |20.21.22.2|
0000003a
$ hexdump -C -s 58.4 -n 58.4 f
0000003a  33 0a 32 34 0a 32 35 0a  32 36 0a 32 37 0a 32 38  |3.24.25.26.27.28|
0000004a  0a 32 39 0a 33 30 0a 33  31 0a 33 32 0a 33 33 0a  |.29.30.31.32.33.|
0000005a  33 34 0a 33 35 0a 33 36  0a 33 37 0a 33 38 0a 33  |34.35.36.37.38.3|
0000006a  39 0a 34 30 0a 34 31 0a  34 32                    |9.40.41.42|
00000074
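To check that intuition, here is a rough model of it in awk. It is a sketch and not GNU split’s actual code: the chunk_sizes function is a made-up name, and it assumes that chunk boundaries sit at multiples of the per-chunk size rounded down (58 rather than 58.4), that a line goes into whichever chunk its first byte falls in, and that the last chunk takes whatever is left. Rounding down is an assumption that happens to reproduce the sizes in the wc output above. Real split may differ in its details, and possibly between versions.

```shell
# A simplified model of line-aligned chunking, not GNU split's actual code.
# Assumptions: boundaries sit at multiples of floor(total / n); a line belongs
# to the chunk in which its first byte falls; the last chunk takes the rest.
chunk_sizes() {
  # $1 = number of chunks; the file is read from stdin
  awk -v n="$1" '
    { len[NR] = length($0) + 1; total += len[NR] }  # +1 for the newline
    END {
      size = int(total / n)
      offset = 0; k = 1
      for (i = 1; i <= NR; i++) {
        # Move on to the next chunk once a line starts at or past the boundary.
        while (k < n && offset >= k * size) k++
        bytes[k] += len[i]; lines[k]++
        offset += len[i]
      }
      # Print "lines bytes" for each chunk.
      for (k = 1; k <= n; k++) print lines[k] + 0, bytes[k] + 0
    }'
}

seq 1 100 | chunk_sizes 5
```

For seq 1 100 this prints 23 60, 19 57, 19 57, 20 60 and 19 58, one pair per line, matching the line and byte counts that wc reported for xaa through xae.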
For a file with 100 double-digit numbers, the output chunks are indeed the same size:
$ seq -w 0 99 > f2 # -w pads all numbers to be the same length, like left-pad
$ split -n l/5 f2
$ wc *
100 100 300 f2
20 20 60 xaa
20 20 60 xab
20 20 60 xac
20 20 60 xad
20 20 60 xae
200 200 600 total
Perfect and round, just like the sun.
Note that this is about GNU split and not BSD split (which is the default on macOS), because the two are very different, probably for a really good reason.