pluckr

Schema-first, self-healing HTML extraction via LLMs

TypeScript, Zod, Vercel AI SDK, Playwright, SQLite, Redis

the problem

web scraping is fragile. you write CSS selectors, the site changes its markup, your scraper breaks. you fix it, it breaks again. repeat forever.

what it does

define the data you want with a zod schema. pluckr figures out how to extract it. when pages change, it self-heals. no selectors to write or maintain.

how it works

  1. you pass raw HTML and a zod schema describing the shape of data you want
  2. an LLM generates CSS selectors for each field through an agentic tool loop
  3. selectors are tested against the actual page before being committed
  4. working selectors get cached, so repeat extractions skip the LLM entirely and only run the cached selectors
  5. if a selector breaks because the page changed, pluckr detects it and auto-regenerates
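
the steps above can be sketched roughly like this. it is a toy model, not pluckr's actual code: the page is a plain lookup from selector to matched text instead of parsed HTML, the schema is reduced to a list of field names, and `SelectorGenerator`, `applySelectors`, `selectorCache`, and `extract` are all hypothetical names made up for the illustration. the generator argument stands in for the LLM tool loop.

```typescript
// Toy model: a "page" maps a CSS selector to the text it would match.
// A real implementation would query parsed HTML instead.
type Page = Record<string, string>;
type Selectors = Record<string, string>; // field name -> CSS selector
type Extracted = Record<string, string>; // field name -> extracted text

// Hypothetical stand-in for the LLM loop that proposes selectors (step 2).
type SelectorGenerator = (page: Page, fields: string[]) => Selectors;

// Run selectors against the page; return null if any field fails to match.
function applySelectors(
  page: Page,
  fields: string[],
  selectors: Selectors
): Extracted | null {
  const out: Extracted = {};
  for (const field of fields) {
    const selector = selectors[field];
    const value = selector === undefined ? undefined : page[selector];
    if (value === undefined || value === "") return null;
    out[field] = value;
  }
  return out;
}

// Assumed cache layout: one selector set per caller-chosen key.
const selectorCache: Record<string, Selectors> = {};

function extract(
  cacheKey: string,
  page: Page,
  fields: string[],
  generate: SelectorGenerator
): Extracted {
  // Step 4: try cached selectors first; a hit needs no generator call.
  const cached = selectorCache[cacheKey];
  if (cached !== undefined) {
    const hit = applySelectors(page, fields, cached);
    if (hit !== null) return hit;
    // Step 5: cached selectors no longer match, fall through and regenerate.
  }

  // Steps 2 and 3: generate selectors, then test them on the actual page
  // before committing them to the cache.
  const fresh = generate(page, fields);
  const result = applySelectors(page, fields, fresh);
  if (result === null) {
    throw new Error("generated selectors did not match the page");
  }
  selectorCache[cacheKey] = fresh;
  return result;
}
```

the point of the shape is that the expensive call sits behind two cheap checks: a cache lookup and a test run of the selectors. a broken selector shows up as a failed test run, which is what triggers regeneration.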

why sqlite + redis

most scraping runs are one machine, one script, and a local sqlite file is all the cache they need. distributed setups need a cache that every worker can share, so redis is available too. the two backends ship as separate packages (@pluckr/sqlite, @pluckr/redis).
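
one way to picture swappable backends is a small store interface that the extraction code depends on. this is a sketch under assumptions, not the API of the @pluckr packages: `SelectorStore`, `MemoryStore`, and `getOrCreate` are hypothetical names, and the interface is kept synchronous for brevity (a network-backed store such as redis would realistically return promises).

```typescript
type Selectors = Record<string, string>; // field name -> CSS selector

// Hypothetical interface; a sqlite-backed or redis-backed store would
// implement the same three methods.
interface SelectorStore {
  get(key: string): Selectors | undefined;
  set(key: string, selectors: Selectors): void;
  delete(key: string): void;
}

// In-memory stand-in used here in place of a real backend.
class MemoryStore implements SelectorStore {
  private entries: Record<string, Selectors> = {};

  get(key: string): Selectors | undefined {
    return this.entries[key];
  }

  set(key: string, selectors: Selectors): void {
    this.entries[key] = selectors;
  }

  delete(key: string): void {
    delete this.entries[key];
  }
}

// Code that needs selectors only sees the interface, so the backend can be
// swapped without touching it. `create` stands in for the expensive
// generation step and only runs on a cache miss.
function getOrCreate(
  store: SelectorStore,
  key: string,
  create: () => Selectors
): Selectors {
  const existing = store.get(key);
  if (existing !== undefined) return existing;
  const created = create();
  store.set(key, created);
  return created;
}
```

two workers holding the same store instance reuse each other's selectors, which is the distributed case. two workers with separate stores each pay for generation once, which is fine when there is only one worker to begin with.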

works with any LLM

built on the vercel AI SDK, so you can plug in openai, anthropic, google, or any compatible provider. the agentic loop works the same regardless of the model behind it.
