Validated web extraction for agents that need evidence
Crawl pages, diagnose source readiness without an LLM call, then extract typed JSON through your own OpenAI-compatible endpoint. Contract mode adds validation.valid, required fields, missing-field evidence, and a recommended next step.
Featured contract capabilities
First 5-minute test
Apify dry run
{
  "urls": ["https://example.com"],
  "extractMarkdown": true,
  "dryRun": true
}
A dry run crawls the page without emitting an Apify billing event.
Hosted diagnostic API
curl https://www.miaibot.ai/api/v1/diagnose \
  -H "Authorization: Bearer $MAIBOT_KEY" \
  -H "Content-Type: application/json" \
  -d '{"urls":["https://example.com"],"extractContract":"real-estate-listing"}'
The diagnostic endpoint uses the built-in contract readiness check and does not require an LLM endpoint.
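The same request can be prepared from Python. The sketch below is a minimal illustration using only the standard library. It assumes the endpoint accepts a POST with exactly the JSON body shown in the curl call, and the helper name `build_diagnose_request` is hypothetical, not part of TheCrawler. It builds the request without sending it, so the headers and body can be inspected first.

```python
import json
import os
import urllib.request

DIAGNOSE_URL = "https://www.miaibot.ai/api/v1/diagnose"  # URL from the curl example


def build_diagnose_request(urls, contract, api_key):
    """Build (but do not send) the POST request shown in the curl example.

    Hypothetical helper for illustration; not part of TheCrawler.
    """
    body = json.dumps({"urls": list(urls), "extractContract": contract}).encode("utf-8")
    return urllib.request.Request(
        DIAGNOSE_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


request = build_diagnose_request(
    ["https://example.com"],
    "real-estate-listing",
    os.environ.get("MAIBOT_KEY", ""),  # same environment variable as the curl call
)
# To send it: urllib.request.urlopen(request), then json.load() the response.
```

Keeping construction separate from sending makes it easy to log or review the payload before any network call is made.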
Current GitHub source
git clone https://github.com/manchittlab/TheCrawler.git
cd TheCrawler/engine
npm install
npm run build
Use the GitHub source for the newest diagnostic and MCP tools. The published npm package is still an older version.
Diagnose first
Score each source before extraction. The result tells an agent whether the URL is ready, blocked, failed, or too thin.
Extract with a contract
Use the built-in real-estate-listing contract to get normalized JSON, required-field validation, and missing-field evidence.
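As a rough illustration of what required-field validation involves, the sketch below checks one extracted record against a list of required fields and returns an object shaped like the `validation` block in the sample output on this page. The function name `validate_required`, the sample record, and the rule that `None` and empty strings count as missing are assumptions for illustration; the built-in contract may apply different rules.

```python
def validate_required(record, required_fields):
    """Return a validation summary for one extracted record.

    Hypothetical helper: a field counts as missing when it is absent,
    None, or an empty/whitespace-only string (an assumed rule).
    """
    def is_missing(value):
        return value is None or (isinstance(value, str) and not value.strip())

    missing = [f for f in required_fields if is_missing(record.get(f))]
    return {
        "valid": not missing,
        "requiredFields": list(required_fields),
        "missingRequiredFields": missing,
    }


REQUIRED = ["title", "price", "location"]  # field list from the sample output on this page

# Made-up record for illustration only.
listing = {"title": "Two-bed flat", "price": "", "location": "Leeds"}
result = validate_required(listing, REQUIRED)
print(result["valid"], result["missingRequiredFields"])  # prints: False ['price']
```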
Report the blockers
Generate a Markdown readiness report for a workflow without including raw contact details or raw page evidence.
What the agent gets back
{
  "workflowVerdict": "mixed",
  "readyUrls": 1,
  "blockedUrls": 1,
  "recommendedNextStep": {
    "action": "extract-ready-subset"
  }
}

{
  "validation": {
    "valid": true,
    "requiredFields": ["title", "price", "location"],
    "missingRequiredFields": []
  }
}
Local validation used a Rightmove + Realtor workflow: one ready source, one rate-limited source, and a recommendation to extract the ready subset before expanding automation. This is not a claim that every real-estate site works out of the box.
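One way an agent might act on these two responses is sketched below. It uses only the fields shown in the sample output (`workflowVerdict`, `readyUrls`, `recommendedNextStep.action`, `validation.valid`, `missingRequiredFields`). The function names are hypothetical, and any action value other than `extract-ready-subset` is treated as a reason to stop, because this page does not list other values.

```python
import json

# The two sample responses from this page, parsed as-is.
summary = json.loads("""
{"workflowVerdict": "mixed", "readyUrls": 1, "blockedUrls": 1,
 "recommendedNextStep": {"action": "extract-ready-subset"}}
""")
extraction = json.loads("""
{"validation": {"valid": true,
                "requiredFields": ["title", "price", "location"],
                "missingRequiredFields": []}}
""")


def should_extract(summary):
    """Proceed only when at least one URL is ready and extraction is recommended."""
    action = summary.get("recommendedNextStep", {}).get("action")
    return summary.get("readyUrls", 0) > 0 and action == "extract-ready-subset"


def accept_record(extraction):
    """Accept a record only when validation passed with no missing required fields."""
    validation = extraction.get("validation", {})
    return bool(validation.get("valid")) and not validation.get("missingRequiredFields")


print(should_extract(summary), accept_record(extraction))  # prints: True True
```

Gating on both the workflow summary and the per-record validation keeps a mixed verdict from turning into unchecked downstream writes.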
$500 extraction readiness sprint
Send one public web-data workflow. You get a 24-hour readiness report that says which URLs are extract-ready, which are blocked, which fields are missing, and which stack path is sensible: TheCrawler, a Firecrawl-style self-serve API, a custom browser workflow, or no automation for that source.
The $500 sprint is paid by one-off payment link or invoice after fit confirmation. If we continue into setup or hosted usage, the $500 is credited toward the next step. If another tool is the better path, the report says so.
View redacted proof pack
Included
- Up to 25 public URLs
- One target output shape
- Ready / mixed / blocked verdict
- Field-readiness map
- Markdown report + compact JSON evidence
- Stack recommendation: TheCrawler, Firecrawl-style API, custom browser workflow, or do not automate
Boundaries
- No login, paywall, or private-data targets
- No anti-bot bypass promise
- No guarantee every source extracts cleanly
- Paid only after fit confirmation
- $500 credited toward setup or hosted usage if we continue
Use GitHub only for workflows that can be discussed publicly. If the workflow cannot be public, use the private fit-check email button. If the scope fits, we send a one-off payment link or invoice. Work starts after payment clears and the final URL/field list is confirmed.