Social Media is Stupid

I was feeling particularly impulsive the other day and was about to post a meme to LinkedIn, but then got distracted by their URL preview feature.

If you don’t know: when typing a link, after a few seconds, it shows a little preview of the document at that link. It includes a description and a banner image and an estimated reading time and some other things.

I previewed some links to my personal web sites and here’s what I found:

They start by making a request to a service called voyager: https://linkedin.com/voyager/.../urlPreview/https%3A%2F%2Fexample.com. That responds with some JSON. They’re also making an unrelated “track” request every 2 milliseconds, which, unlike every other word I’ve ever written in my life, is not an exaggeration. Fun stuff.
The preview request appears to tell a bot to visit the typed URL and parse its metadata.
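The last segment of that voyager path is just the typed link, percent-encoded so it fits in a single path segment. Here's a minimal sketch of that encoding step in Python. The helper name `encode_preview_target` is made up for illustration, and the middle of the path stays elided because I don't know what goes there.

```python
from urllib.parse import quote, unquote


def encode_preview_target(url: str) -> str:
    """Percent-encode a URL so it can sit inside one path segment."""
    # quote() leaves "/" unescaped by default; safe="" escapes it (and ":") too.
    return quote(url, safe="")


encoded = encode_preview_target("https://example.com")
print(encoded)           # https%3A%2F%2Fexample.com
print(unquote(encoded))  # https://example.com
```

Decoding with `unquote` gets the original link back, which is presumably what the service does before handing it to the bot.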

And here’s the request from the bot to my domain:

{
    "remote_addr": "108.174.2.215:38142",
    "proto": "HTTP/1.1",
    "method": "GET",
    "host": "kashav.ca",
    "uri": "/",
    "headers":
    {
        "X-Li-Authinfo-Id":
        [
            "babylonia-ingestion"
        ],
        "Range":
        [
            "bytes=0-3145727"
        ],
        "Accept-Encoding":
        [
            "gzip,deflate"
        ],
        "Service-Name":
        [
            "babylonia-ingestion"
        ],
        "X-Li-Calltree-Request-Id":
        [
            "AAAAAAAAAAAAAAAAAAAAAA=="
        ],
        "Accept":
        [
            "*/*"
        ],
        "User-Agent":
        [
            "LinkedInBot/1.0 (compatible; Mozilla/5.0; Apache-HttpClient +http://www.linkedin.com)"
        ],
        "Connection":
        [
            "close"
        ]
    },
    "tls":
    {
        "resumed": false,
        "version": 771,
        "cipher_suite": 49195,
        "proto": "",
        "proto_mutual": true,
        "server_name": "kashav.ca"
    }
}

That Range header means they’re only requesting the first 3145728 bytes (byte ranges are inclusive, so 0-3145727 is exactly 3 MiB). I heard the engineer was trying to recite pi from memory but didn’t do too great.

For documents larger than 3 MiB (about 3.15 MB), they only request the first chunk, so you probably shouldn’t trust the estimated reading time.
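HTTP byte ranges are inclusive at both ends, so the arithmetic is worth a quick check. This is a rough sketch that only handles the simple `bytes=first-last` form the bot sent (no suffix ranges, no multiple ranges). `range_length` is a hypothetical helper, not part of any library.

```python
import re


def range_length(header: str) -> int:
    """Return how many bytes a single 'bytes=first-last' range covers."""
    match = re.fullmatch(r"bytes=(\d+)-(\d+)", header.strip())
    if match is None:
        raise ValueError(f"unsupported Range header: {header!r}")
    first, last = int(match.group(1)), int(match.group(2))
    if last < first:
        raise ValueError("last byte position precedes first")
    # Both positions are inclusive, hence the +1.
    return last - first + 1


length = range_length("bytes=0-3145727")
print(length)                     # 3145728
print(length == 3 * 1024 * 1024)  # True
```

So the cutoff is a round 3 MiB, and anything past that never reaches the preview bot.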

And then obviously I yandexed that babylonia service name and found https://resources.sei.cmu.edu/asset_files/presentation/2018_017_001_519133.pdf. It’s a breakdown of LinkedIn’s content digestion pipelines, just in case you feel like learning something real today.