
Lowly Worm — social channel extension (optional)

Last updated: 2026-04-14 · Reading time: ~25 min · Difficulty: hard

🔦 Optional extension. This chapter is a layer on top of Ch 07-2a — Lowly Worm (newsfeed), which stands on its own and is useful without any of what follows. Read this one only if you want Lowly to also watch a specific social channel (LinkedIn, X, Bluesky, Mastodon, Discord, Slack…) and consolidate what's happening there into your morning edition. If you're not sure whether you need it, deploy 07-2a first, live with it for a couple of weeks, and come back here if you find yourself missing the social-channel half.

TL;DR

  • The idea: your morning Telegram thread should also include filtered notifications (what is happening to you on your primary social channel) and consolidated new messages (DMs grouped by thread and summarized, so one back-and-forth becomes one item instead of forty). This is the shape of the problem, regardless of which platform you're scraping.
  • For me, that channel is LinkedIn, which is where most of my professional contact runs. The rest of this chapter is LinkedIn-specific because that is the surface I have working. Your primary channel may be different β€” the pattern generalizes but the extraction layer is yours to build.
  • LinkedIn is a dark art. The official Python library fails on datacenter IPs. Paid scraper APIs return public keyword matches instead of your actual feed. The DOM is deliberately obfuscated with hashed class names. The only reliable path is Playwright + a persistent Chromium profile authenticated out-of-band. Budget real time for this, and expect a few hours of tune-up every few months.
  • Three surfaces to scrape (plus one optional fourth): the home feed, the notifications stream (filtered for relevance), and DMs (consolidated by thread with LLM-generated summaries via openclaw infer). Profile viewers are the optional fourth.
  • The cost of the extension is zero additional dollars β€” the DM-summarization inference calls ride on your existing OpenAI subscription via openclaw infer. The real cost is maintenance time: Playwright stays fragile, aria-label selectors rot as LinkedIn rebuilds UI, and session profiles expire every few months.

Why you'd want this extension — and why you might not

The core Lowly from Ch 07-2a handles RSS newsfeeds well. What it doesn't give you is any signal on what's happening to you on your social channel — who commented on your last post, who's asking you something in DMs, whether anyone you want to hear from is active today. For someone whose professional network runs through LinkedIn (like me), that's a meaningful gap, and filling it was worth the scraping effort.

The social-extension agent lives inside the same morning Telegram thread as the newsfeed. A typical morning now looks like:

  • πŸ€– AI & Tech: four items from RSS
  • πŸ’° Economics: three items from RSS
  • 🌍 World: two items from RSS
  • πŸ”— LinkedIn: three filtered notifications + two DM threads summarized
  • πŸ›οΈ US Policy: two items from RSS
  • πŸ“‹ Also noted: a few stragglers

The LinkedIn section sits in the same thread as the news, tagged by category, structurally indistinguishable from the RSS sections — the reader can scan one thread and get both the news and the social pulse in the same ninety seconds.

Why you might not. The LinkedIn machinery is real work and real maintenance. The DOM rots every few months. The auth profile expires periodically. The notifications filter needs occasional tuning as LinkedIn adds new noise categories. If any of the following apply, skip this chapter and stick with the core newsfeed from 07-2a:

  • Your primary social channel is not LinkedIn. Nothing below translates directly β€” the extraction layer is LinkedIn-specific. Read this chapter as a pattern (feed + notifications + messages + per-thread summarization via openclaw infer) and build your own on top.
  • You already check the social app directly during the day and don't need a morning summary. Most of the value here is consolidation, not speed. If you already check the app habitually, the agent is mostly redundant.
  • You don't want to maintain a Playwright-based scraper long-term. This is the single biggest ongoing tax of the whole Lowly project. Be honest about whether you'll actually do the tune-up versus let it rot. A dead LinkedIn scraper fails silently β€” "zero items returned, no error" β€” which is worse than no scraper at all.

What success looks like

Three surfaces of your primary social channel, consolidated and delivered in the same morning thread as the newsfeed:

  1. The home feed. A ranked, deduplicated selection of posts from accounts you follow. Same treatment as RSS items — one-sentence extended headline, topic tag, 👍/👎/📖 buttons.
  2. Filtered notifications. What is happening to you on the platform, not just what is in the feed. Someone commented on your post. Someone reacted to a thing you shared last week. Someone from an unusual company viewed your profile. The raw stream is mostly noise (your-own-post engagement metrics, work-anniversary prompts, trending-content roundups) and most of it gets dropped before the LLM ever sees it. Typically 30 raw notifications shrink to 2-5 worth surfacing.
  3. Consolidated DMs. New messages grouped by thread and summarized across each thread's unread chunk — so a 40-message back-and-forth with one person becomes one item ("3 new messages from {person} re: {topic}") instead of forty. The per-thread summarization is the non-obvious part, and it is the single biggest quality improvement I have made to this extension since it shipped.
  4. (Optional) Profile viewers. The "who viewed your profile" signal, filtered for relevance. Most profile views are uninteresting (recruiters you've already ignored, viewers from your own employer); the relevant subset is usually 0-2 per morning. Skippable if you don't care about inbound-attention signal.

What makes this hard — LinkedIn is a dark art across three surfaces

Every obvious approach to LinkedIn fails for a specific reason:

  • The linkedin-api Python library returns CHALLENGE responses from datacenter IPs within minutes of the first successful login, even behind a sticky residential proxy. Cookies invalidate mid-session. The library's author is playing a losing game against LinkedIn's anti-automation team and it shows.
  • Apify and similar scraping APIs return public LinkedIn posts matching a keyword, not your personal feed. They can give you "all posts mentioning AI" but they can't give you "what the people you follow are saying today." Wrong API shape for a personalized digest.
  • RSS feeds for LinkedIn were deprecated years ago. There is no RSS for a home feed.
  • Screen-scraping the mobile app API requires reverse-engineering mutual TLS + signed requests. Full-time job, signatures rotate.

The one path that survives is Playwright driving a persistent Chromium profile authenticated out-of-band — and even that is held together with string and JavaScript.

The auth primitive (shared by all four scrapers)

The auth story is the same for every LinkedIn surface:

  1. A one-time interactive auth step (via scripts/linkedin-auth.py) opens a Chromium window through chrome://inspect remote debugging on a local machine. You log in like a human, solve any CAPTCHA, and the resulting Chromium profile directory gets saved to the agent's workspace. That profile directory is the credential; the agent never sees a password.
  2. A host-cron keep-alive (linkedin-keepalive, every 6 hours) pokes the session just often enough that LinkedIn doesn't flag it idle. Without this, the session dies within a day or two and every scraper silently returns empty.
  3. Extraction uses page.evaluate() with stable aria-label selectors. LinkedIn's class names are hashed (_936a7c6b) and rot in weeks, so the JS extractors — one per surface, committed as scripts/linkedin-*-extract.js — target aria-label attributes like "Open control menu for post by {Name}" instead, because those stay stable across LinkedIn's rebuilds.
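Under those constraints, the shared primitive looks roughly like the sketch below: a minimal sync-Playwright version that reuses the shipped profile directory and runs one surface's JS extractor. The function name, wait strategy, and lazy import are illustrative, not the chapter's exact scripts.

```python
from pathlib import Path

# Profile path from the deploy section; the directory *is* the credential.
PROFILE_DIR = Path.home() / ".openclaw" / "news-digest-workspace" / "linkedin-profile"

def scrape_surface(url: str, extractor_js_path: str) -> list:
    """Open one LinkedIn surface with the authenticated profile and run its JS extractor."""
    from playwright.sync_api import sync_playwright  # lazy import: heavy dependency

    with sync_playwright() as p:
        # Persistent context: cookies and session state live in PROFILE_DIR,
        # so no password ever appears in this code.
        ctx = p.chromium.launch_persistent_context(str(PROFILE_DIR), headless=True)
        try:
            page = ctx.new_page()
            page.goto(url, wait_until="domcontentloaded")
            page.wait_for_timeout(5000)  # give the client-side app time to render
            # The extractor walks the DOM via aria-labels, not hashed class names.
            return page.evaluate(Path(extractor_js_path).read_text())
        finally:
            ctx.close()
```

Each surface then calls this with its own URL and extractor file, layering filtering or summarization on the returned items.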

From there, each of the four surfaces layers its own work on top.

Surface 1 β€” the home feed

linkedin-scrape.py launches a Playwright browser against the persistent profile, navigates to linkedin.com/feed/, waits for the post list to render, and calls page.evaluate() to load linkedin-extract.js. The JS walks the DOM and returns an array of {author, text, link, reactions, timestamp, post_type} items. Python filters duplicates against the previous N days of seen posts via linkedin-seen.json (hashes post text + author, keeps the last 500 entries), and the survivors become candidates for the morning edition's "LinkedIn" section.

This is the easy LinkedIn surface. The DOM is relatively consistent day to day, the aria-labels are well-named, the dedup story is straightforward. If you're only scraping the feed, you can probably keep this alive with quarterly touch-ups.

Surface 2 β€” the notifications stream

linkedin-notifs-extract.js walks the /notifications page, which is a different DOM tree with its own obfuscated class names and its own aria-label conventions. The scraper returns {type, actor, text, link, timestamp} for each notification.

The harder part is filtering. LinkedIn's notifications stream is mostly noise — on an average morning I get a dozen notifications and maybe two are interesting. The ranking step drops the known-uninteresting categories (your-own-post engagement metrics, work-anniversary prompts, trending-content roundups, "your network is talking about X") and keeps only:

  • Direct interactions with your content — comments or reactions on your posts, inbound on things you wrote.
  • Inbound attention from someone new — profile views from unusual companies, connection requests from people you don't already know.
  • Responses to threads you're already in — replies to comments you left on someone else's post.

The filter typically shrinks 30 raw notifications to 2-5 items worth surfacing. It is pure Python — no LLM needed — though the morning-edition LLM can still choose to promote or demote items based on the preference model from Ch 07-2a.
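As a sketch, the keep-list is a one-liner over whatever type field the extractor emits. The category names below are hypothetical placeholders, not LinkedIn's real notification types; only the keep/drop split mirrors the text.

```python
# Hypothetical type tags; the real classifier keys off fields returned
# by linkedin-notifs-extract.js.
KEEP_TYPES = {
    "comment_on_your_post",      # direct interaction with your content
    "reaction_on_your_post",
    "profile_view_unusual",      # inbound attention from someone new
    "connection_request_new",
    "reply_in_your_thread",      # responses to threads you're already in
}
DROP_TYPES = {                   # documented noise: dropped before the LLM sees it
    "own_post_metrics",
    "work_anniversary",
    "trending_roundup",
    "network_is_talking",
}

def filter_notifications(notifs: list) -> list:
    """Keep only the three interesting categories; everything else is noise."""
    return [n for n in notifs if n.get("type") in KEEP_TYPES]
```

Anything not explicitly in the keep-list falls through, which is the safe default when LinkedIn invents a new noise category.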

Surface 3 — LinkedIn DMs, consolidated and summarized by thread

This is the hardest surface, because it needs summarization across time and because it is the surface where a misclick can send a real message to a real contact (see the big pitfall later in this chapter). A 40-message back-and-forth with one person should arrive as one item in the morning edition, not 40. The shape:

  1. linkedin-messages-extract.js walks the Messages panel and extracts each thread's metadata — thread ID, participant name, last-message timestamp, unread count, the full text of any unread messages, and the per-thread URL (item.querySelector('a[href*="messaging"]').href). The URL is the load-bearing field; the Python wrapper uses it to navigate directly to each thread without ever clicking inside the messaging UI. The same aria-label trick keeps the other selectors stable.
  2. Grouping by thread happens in Python (linkedin-scrape.py), and crucially, enrichment is direct navigation, not clicking: for each thread with a real conversation URL, page.goto(thread["url"]) opens it and page.evaluate(...) extracts the bubbles. There is no page.click() and no page.query_selector(text=...) anywhere in the messaging-enrichment loop. Threads whose extractor URL is the bare /messaging/ fallback (date dividers, smart-reply chips, anything without its own anchor) skip enrichment entirely and ship with the preview only. This sounds paranoid — it is paranoid for a reason; the alternative shipped two unintended messages to a real contact on 2026-04-14 (see the pitfall below).
  3. A network-level send-block runs in main(). page.route() handlers on **/voyager/api/messaging/conversations/** and friends abort POST/PUT/PATCH/DELETE. GETs (which are how the page reads the inbox) pass through. This is defense in depth: if a future regression ever reintroduces a stray click on a smart-reply chip, the resulting send request never leaves the browser, and a [send-block] line goes to stderr as a loud signal. Routes are scoped narrowly — global **/* interception slows page loads enough to break extraction (see the second pitfall below).
  4. Summarization uses openclaw infer, not the main LLM cron. A small structured-output prompt takes the thread's unread chunk and returns a one-sentence summary plus a topic tag, via the Python-composer pattern from Ch 06. This runs as a separate inference call rather than inside the main morning-edition LLM cron, because the main cron already has plenty to do and the inference composer is dramatically cheaper for structured-output work.
  5. The summary feeds into cache/morning-items.json as source=linkedin, source_label="LinkedIn Message", with the summarized thread as the extended_headline.
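The per-thread summarization in steps 4-5 might look like the sketch below. The prompt text is mine, and the exact openclaw infer CLI invocation is an assumption; only the pattern (one tiny structured-output call per unread thread) comes from the chapter.

```python
import json
import subprocess

def build_thread_prompt(person: str, unread: list) -> str:
    """Compose the structured-output prompt for one thread's unread chunk."""
    messages = "\n".join(f"- {m}" for m in unread)
    return (
        "Summarize this unread LinkedIn DM chunk in one sentence and pick a topic tag.\n"
        f"Sender: {person}\n"
        f"Unread messages:\n{messages}\n"
        'Reply as JSON: {"summary": "...", "topic": "..."}'
    )

def summarize_thread(person: str, unread: list) -> dict:
    """One small inference call per thread, riding the flat-rate subscription.

    Assumption: openclaw infer accepts the prompt as an argument and prints
    the model's JSON reply to stdout.
    """
    out = subprocess.run(
        ["openclaw", "infer", build_thread_prompt(person, unread)],
        capture_output=True, text=True, check=True,
    )
    return json.loads(out.stdout)
```

The returned {"summary", "topic"} pair maps directly onto the extended_headline and topic tag of a morning-edition item.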

This summarization-via-infer pattern landed recently (commit 6afa924, "restore LinkedIn thread summary via openclaw infer") and it is the single biggest quality improvement I have made to this extension since it shipped. Before it, a morning with three active DM conversations produced a wall of individual unread-message items that each earned their own scroll. After it, the same morning produces three one-line summaries that I can triage in seconds.

The inference calls ride on your OpenAI subscription via openclaw infer — same pattern as the preference-learning LLM judge in Ch 07-2a. Zero additional dollars. If you have three active DM threads in a morning, that's three extra tiny structured-output requests, each a small slice of the flat-rate subscription you're already paying for.

Surface 4 — profile viewers (optional)

linkedin-viewers-extract.js scrapes the "who viewed your profile" panel. This is a quiet channel: most views are uninteresting (recruiters you've already ignored, viewers from your own employer), and the relevant subset is usually 0-2 per morning. The scraper pulls the data, Python filters the obviously-uninteresting categories, and any survivors get folded into the notifications section as a single aggregated item — "2 new profile views this morning: {Company A}, {Company B}."
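As a sketch, the fold-in is a few lines of pure Python. The category values are hypothetical, and the item fields mirror the morning-items.json shape described earlier.

```python
# Hypothetical boredom categories; the real filter keys off whatever
# linkedin-viewers-extract.js returns per viewer.
BORING_CATEGORIES = {"recruiter_ignored", "own_employer"}

def aggregate_viewers(viewers: list):
    """Collapse relevant profile views into one aggregated morning-edition item."""
    relevant = [v for v in viewers if v.get("category") not in BORING_CATEGORIES]
    if not relevant:
        return None  # most mornings: nothing worth surfacing
    companies = ", ".join(v["company"] for v in relevant)
    return {
        "source": "linkedin",
        "source_label": "LinkedIn Notification",
        "extended_headline": (
            f"{len(relevant)} new profile views this morning: {companies}"
        ),
    }
```

Returning None on an empty relevant set keeps the morning edition free of a "0 new profile views" placeholder line.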

This is the most skippable of the four surfaces. If you don't care about profile-view signal, delete the scraper and the filter and you lose nothing material.

Deployment additions

Step 3 of the Ch 07-0 arc — "Write or port the scripts" — is where this chapter lands on top of the existing Lowly Worm deploy. You should already have Ch 07-2a deployed and producing a morning edition before you start.

LinkedIn first-time auth

Before you can scrape anything, you need an authenticated Chromium profile for LinkedIn. This is a one-time interactive step that you run on your laptop, not on the VPS:

  1. Run python3 agents/news-digest/scripts/linkedin-auth.py locally (you need a real browser with a display).
  2. A Chromium window opens at linkedin.com/login. Log in with your real account. Solve any CAPTCHA.
  3. Once you're at the home feed, close the window. The script saves the Chromium profile directory.
  4. Tar the resulting linkedin-profile/ directory and scp it to the VPS at ~/.openclaw/news-digest-workspace/linkedin-profile/. Extract it there.

From then on, linkedin-scrape.py reuses that profile. When it eventually breaks — cookies do eventually expire, and LinkedIn occasionally invalidates stale sessions en masse — you re-run the auth script locally and re-ship the profile. Expect to do this once every few months.

The extra host cron

Adding the social-channel extension adds exactly one host cron to what Ch 07-2a already installs via ops/scripts/install-host-cron.sh:

Host cron: linkedin-keepalive
Schedule: 0 */6 * * * (every 6 hours)
What it does: runs linkedin-keepalive.py, which pokes the Playwright session just often enough to keep LinkedIn from flagging it idle. 300-second timeout, because Playwright is slow to start and slow to navigate.

The linkedin-keepalive entry gets added to install-host-cron.sh alongside the other contract-wrapper crons; re-running install-host-cron.sh picks it up automatically.

Wiring into the morning edition

fetch-and-rank.py in the core Lowly pulls RSS. In the social-extension version, it also:

  1. Launches Playwright with the persistent profile (reusing the keepalive-kept session).
  2. Scrapes the four surfaces — feed, notifications, messages, optionally viewers.
  3. Runs the notifications filter (pure Python, no LLM).
  4. Runs the messages summarization (one openclaw infer call per thread with unread messages).
  5. Merges the social items into the ranked output alongside RSS items, tagged with source=linkedin and an appropriate source_label.

The morning-edition LLM cron then selects, writes extended headlines, groups by topic, and writes morning-items.json — same as the core Lowly, just with more items flowing through it.
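The tagging in step 5 can be sketched as follows. The function name is mine; the labels match the source_label values the smoke test checks for.

```python
def tag_social_items(feed: list, notifs: list, dm_summaries: list) -> list:
    """Give each social item the same shape as an RSS item, with a per-surface label."""
    tagged = []
    for item, label in (
        [(i, "LinkedIn") for i in feed]
        + [(i, "LinkedIn Notification") for i in notifs]
        + [(i, "LinkedIn Message") for i in dm_summaries]
    ):
        # source=linkedin lets the morning-edition LLM treat these uniformly
        # alongside source=rss items in the ranked output.
        tagged.append({**item, "source": "linkedin", "source_label": label})
    return tagged
```

Because the merged items are structurally identical to RSS items, the downstream selection and grouping code needs no LinkedIn-specific branches.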

Smoke test

After the extension deploy, run the morning-edition cron manually and verify the LinkedIn items show up in morning-items.json:

oc cron run <morning-edition-uuid>
# Wait ~5 minutes
python3 -c "import json; items = json.load(open('/home/openclaw/.openclaw/news-digest-workspace/cache/morning-items.json')); print([i for i in items if i.get('source') == 'linkedin'])"

You should see LinkedIn items with source_label values like "LinkedIn", "LinkedIn Notification", and "LinkedIn Message". If none of them appear, the Playwright session is probably dead — run linkedin-keepalive.py by hand to poke it, and if that doesn't work, re-run linkedin-auth.py locally and re-ship the profile.

Pitfalls you'll hit

🧨 Pitfall. Reaching for linkedin-api, Apify, or any paid LinkedIn scraping service instead of Playwright. Why: official libraries fail on datacenter IPs within minutes (CHALLENGE cookies + session invalidation), and keyword-search APIs return random public posts instead of your actual feed. I tried three before landing on Playwright; none of them gave me my LinkedIn feed, which is what you actually want. How to avoid: go straight to Playwright + persistent Chromium profile. Budget a full evening for the first setup. The JS DOM extractors and aria-label selectors are the non-obvious parts — read scripts/linkedin-extract.js before you write anything.

🧨 Pitfall. Docker + --headless=new Chromium + remote debugging port silently fails. Why: Chromium's newer headless mode ignores --remote-debugging-address=0.0.0.0 and binds the debug port to localhost-only inside the container, which means Playwright running inside the same container cannot reach it. The symptom is an unhelpful "connection refused" from Playwright's connect step. How to avoid: run a socat relay inside the container (socat TCP-LISTEN:9222,fork,bind=0.0.0.0 TCP:127.0.0.1:9223) and have Playwright connect to the relay. This is baked into the Dockerfile as a background process.

🧨 Pitfall. (The big one — read this even if you skim the rest.) The DM scraper sends real messages to real contacts because page.click() lands on a smart-reply chip instead of a thread list item. Why: LinkedIn renders both the inbox sidebar AND the auto-opened most-recent conversation under [role="main"] simultaneously, and both surfaces use <li> elements. The conversation pane includes LinkedIn's smart-reply chips ("Reply to conversation with 'Nope'", "Reply to conversation with 'Not at all'", etc.), and linkedin-messages-extract.js's mainEl.querySelectorAll('li') sweeps the chips into thread_summaries as if they were inbox threads. The Python wrapper then iterates the result list and calls page.query_selector(f'text="{sender}"').click(force=True) to "find the thread to click" — but Playwright's text="..." matcher is global: it returns the first element anywhere on the page whose visible text matches, which here is the chip itself. force=True then dispatches the suggested reply directly to LinkedIn's send endpoint. On 2026-04-14 this fired: the morning cron sent two real messages ("Nope" and "Not at all") to a Series-D founder before anyone caught it. (A third chip click failed — Playwright couldn't find the element after the previous click changed the smart-reply set — which is the only reason it wasn't three.) How to avoid: three layers, all of them mandatory.

  1. Direct URL navigation, not clicks. linkedin-messages-extract.js already pulls a per-thread URL into each result item via item.querySelector('a[href*="messaging"]'). Use it. The Python wrapper should page.goto(thread["url"]) and page.evaluate(...) to extract bubbles — never query_selector followed by click() inside the messaging view. Direct navigation is strictly safer because it cannot land on the wrong element.
  2. Bare-URL guard. When the JS extractor can't find a thread anchor (which is what happens for date dividers and smart-reply chips — they have no <a href> of their own), it falls back to the bare https://www.linkedin.com/messaging/ URL. The Python wrapper must treat that as "skip enrichment" and keep the preview-only entry. The guard is a single line: if not url or url.rstrip("/").endswith("/messaging"): enriched.append(thread); continue. Without this, false positives still reach the page-interaction loop.
  3. Network-level send-block. Even with both layers above, install page.route() handlers on LinkedIn's voyager messaging endpoints — **/voyager/api/messaging/conversations/**, **/voyager/api/voyagerMessagingDashMessengerMessages/**, and so on — that abort POST/PUT/PATCH/DELETE. GETs pass through (that's how the page reads the inbox). Print a stderr line on every abort. This is the safety net for the day a future regression reintroduces a stray click; the outbound HTTP request gets killed before it leaves the browser, and the stderr line is the loud signal that the safety net fired.
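Layer 3 can be sketched as a route handler plus a narrowly scoped registration. The endpoint globs come from the text; the handler and installer names are mine.

```python
import sys

BLOCKED_METHODS = {"POST", "PUT", "PATCH", "DELETE"}

def send_block(route):
    """Abort any write to the messaging API; let reads through untouched."""
    req = route.request
    if req.method in BLOCKED_METHODS:
        # Loud stderr signal: the safety net fired, so a stray click got through.
        print(f"[send-block] aborted {req.method} {req.url}", file=sys.stderr)
        route.abort()
    else:
        route.continue_()  # GETs are how the page reads the inbox

def install_send_block(page):
    # Narrow globs only: a global "**/*" route slows loading enough to
    # break extraction (see the page.route pitfall below).
    for pattern in (
        "**/voyager/api/messaging/conversations/**",
        "**/voyager/api/voyagerMessagingDashMessengerMessages/**",
    ):
        page.route(pattern, send_block)
```

The handler is pure dispatch on request method, so it can be unit-tested with a stub route object and no browser at all.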

Pin the contract with a regression test that asserts page.query_selector is never called with a text= selector inside the messaging enrichment loop AND that page.goto is called with each thread's URL. See agents/news-digest/tests/test_linkedin_messages_extract.py::test_scrape_messages_navigates_by_url_never_clicks for the canonical shape — it is the test that would have caught this bug before it shipped, and it now lives in CI specifically to prevent the same shape from coming back.

🧨 Pitfall. Playwright page.route("**/*", handler) cripples page loading. Why: every request matched by a route pattern round-trips through Python — even if the handler just calls route.continue_() immediately. With a global **/* glob, every CSS file, JS bundle, font, image, and XHR pays the IPC cost. On the LinkedIn messaging page (which fires hundreds of requests during initial load) this is enough to push the conversation list past the script's time.sleep(5) wait window, and the JS extractor returns items_found=0. The first draft of the send-block in the pitfall above used **/* and broke message extraction on the first deploy — [send-block] correctly never fired, but messages also dropped to zero. How to avoid: scope route patterns narrowly to the URL families you actually want to intercept — **/voyager/api/messaging/conversations/** and friends — so most requests bypass the handler and go straight to the network at full speed. As a rule of thumb: if your page.route() glob would match a .css or .js URL, it's too broad.

🧨 Pitfall. One stuck LinkedIn thread scrape eats the entire morning cron budget. Why: when this scraper still used page.click() to expand each thread (pre-2026-04-14, before the smart-reply incident above forced a rewrite to direct URL navigation), one bad thread — a loading spinner that never resolved, a modal that stole focus, a network hang — could burn three minutes on Playwright's default click timeout and push the whole cron past its 600-second ceiling. How to avoid: the pitfall is now obsolete on the messaging surface, because direct page.goto(thread["url"]) doesn't have the same retry behavior — but the lesson generalizes. Any time you're tempted to add a new page.click() call inside a per-item loop, hard-cap the timeout (5 seconds is plenty), don't retry on failure, and accept "we missed one item" as a strictly better outcome than "we missed the whole cron". The 600-second cron ceiling is a real wall.

🧨 Pitfall. LinkedIn session silently expires without the keep-alive. Why: LinkedIn flags sessions as idle after a day or two with no activity and invalidates their cookies. The symptom is subtle: linkedin-scrape.py runs, Playwright navigates to /feed, the DOM loads, the extractor returns zero items, and the morning edition just… doesn't have a LinkedIn section that day. No error, no alert. How to avoid: the linkedin-keepalive host cron every 6 hours is non-optional — it exists specifically to prevent this. If you see a morning edition with zero LinkedIn items for two days in a row, run linkedin-keepalive.py by hand; if that doesn't work either, it's time to re-auth.

🧨 Pitfall. Aria-label rot after a LinkedIn UI rebuild. Why: the entire LinkedIn scraper story depends on aria-label attributes being stable across DOM rebuilds. That holds for months at a time, and then LinkedIn pushes a UI update and one or two of the labels shift — "Open control menu for post by {Name}" becomes "Post actions menu for {Name}" or similar. Your scraper stops returning items for that specific interaction type. The symptom is the same "zero items, no error" shape as session expiration. How to avoid: when zero items returns and the keepalive works, open LinkedIn in your own browser, use DevTools to inspect the buttons you're trying to click, and check the current aria-labels against what your JS extractors expect. This happens once every few months, and each fix is a one-line change in one of the linkedin-*-extract.js files.

See also