Data Extraction API
Turn HTML into JSON
The Data Extraction API transforms any Hyperclay site into a queryable data source. Append ?data= followed by your extraction rules to any site URL to get structured JSON.
Example 1: Basic Text Extraction
HTML
<h1>My Tech Blog</h1>
<p class="tagline">Latest in technology</p>
<p class="copyright">© 2024 My Tech Blog</p>
Query
?data={title:h1,tagline:.tagline,footer:.copyright}
Formatted Query
{
title: h1,
tagline: .tagline,
footer: .copyright
}
Result
{
"title": "My Tech Blog",
"tagline": "Latest in technology",
"footer": "© 2024 My Tech Blog"
}
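A query like the one above contains braces, colons, and commas, and an ID selector would add a # that a URL treats as the start of a fragment. The sketch below shows one way to build the request URL from Python by percent-encoding the rules. The helper name build_data_url and the site URL https://mysite.example are hypothetical, and whether the API requires encoding at all is an assumption.

```python
from urllib.parse import quote


def build_data_url(site_url: str, rules: str) -> str:
    """Return site_url with the extraction rules attached as ?data=."""
    # Percent-encode every reserved character so that braces, brackets,
    # and '#' are not misread as URL syntax. Assumption: the API decodes
    # the parameter in the usual way before parsing the rules.
    return f"{site_url}?data={quote(rules, safe='')}"


# Hypothetical site URL; the rules are the query from Example 1.
url = build_data_url(
    "https://mysite.example/",
    "{title:h1,tagline:.tagline,footer:.copyright}",
)
print(url)
```

Fetching the resulting URL with any HTTP client should then return the JSON shown under Result.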
Example 2: Array Extraction
HTML
<article class="post">
<h2 class="post-title">Understanding JavaScript</h2>
<span class="author">Alice Smith</span>
<span class="date">2024-01-15</span>
</article>
<article class="post">
<h2 class="post-title">Python for Data Science</h2>
<span class="author">Bob Johnson</span>
<span class="date">2024-01-14</span>
</article>
<article class="post">
<h2 class="post-title">Rust Performance Tips</h2>
<span class="author">Carol White</span>
<span class="date">2024-01-13</span>
</article>
Query
?data={titles:.post-title[],authors:.author[]}
Formatted Query
{
titles: .post-title[],
authors: .author[]
}
Result
{
"titles": [
"Understanding JavaScript",
"Python for Data Science",
"Rust Performance Tips"
],
"authors": ["Alice Smith", "Bob Johnson", "Carol White"]
}
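The two arrays in this result are extracted independently, so a client that wants one record per post has to pair them by position. The sketch below does that in Python, assuming every post contains exactly one title and one author. The function name pair_posts is hypothetical. The iteration syntax in Example 3 avoids this step by returning records directly.

```python
def pair_posts(result: dict) -> list[dict]:
    """Combine the parallel 'titles' and 'authors' arrays into records."""
    titles = result["titles"]
    authors = result["authors"]
    # Pairing by index is only safe when both arrays have the same length.
    if len(titles) != len(authors):
        raise ValueError("titles and authors have different lengths")
    return [{"title": t, "author": a} for t, a in zip(titles, authors)]


# The result from Example 2.
result = {
    "titles": [
        "Understanding JavaScript",
        "Python for Data Science",
        "Rust Performance Tips",
    ],
    "authors": ["Alice Smith", "Bob Johnson", "Carol White"],
}

posts = pair_posts(result)
print(posts[0])
```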
Example 3: Complex Iteration
HTML
<div class="products">
<div class="product" data-id="1">
<h3 class="name">Widget A</h3>
<span class="price">$19.99</span>
<a href="/products/widget-a">View Details</a>
</div>
<div class="product" data-id="2">
<h3 class="name">Widget B</h3>
<span class="price">$29.99</span>
<a href="/products/widget-b">View Details</a>
</div>
</div>
Query
?data={products:[.product,{name:.name,price:.price,link:a@href,id:@data-id}]}
Formatted Query
{
products: [
.product,
{
name: .name,
price: .price,
link: a@href,
id: @data-id
}
]
}
Result
{
"products": [
{
"name": "Widget A",
"price": "$19.99",
"link": "/products/widget-a",
"id": "1"
},
{
"name": "Widget B",
"price": "$29.99",
"link": "/products/widget-b",
"id": "2"
}
]
}
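Every value in this result is a string, including the price and the id. A client that needs numbers has to convert them itself. The sketch below is one way to do that in Python; the function name parse_products is hypothetical, and stripping a leading dollar sign is an assumption based on the sample data.

```python
from decimal import Decimal


def parse_products(payload: dict) -> list[dict]:
    """Convert the string fields of each extracted product to typed values."""
    products = []
    for item in payload["products"]:
        products.append({
            "id": int(item["id"]),
            "name": item["name"],
            # Assumption: prices look like "$19.99", as in the sample HTML.
            "price": Decimal(item["price"].lstrip("$")),
            "link": item["link"],
        })
    return products


# The result from Example 3.
payload = {
    "products": [
        {"name": "Widget A", "price": "$19.99",
         "link": "/products/widget-a", "id": "1"},
        {"name": "Widget B", "price": "$29.99",
         "link": "/products/widget-b", "id": "2"},
    ]
}

products = parse_products(payload)
print(sum(p["price"] for p in products))
```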
Syntax Reference
Basic Selectors
- Tag: h1, p, article
- Class: .classname
- ID: #unique-id
- Current element: . (useful in iterations)
Arrays
- Add [] suffix to get all matching elements: .post-title[]
Attributes & Properties
- Use @ to extract attributes: a@href, img@src
- DOM properties: input@value, @checked, @disabled
Iteration
- [selector, shape] - Iterate and extract structured data
- Example: [.post, {title:.title, date:.date}]
Nesting
- Objects can be nested: {meta:{author:.author,date:.date}}
- Arrays can contain objects: [.item, {name:.name}]
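Because the query syntax mirrors nested objects and arrays, longer queries can be assembled from ordinary data structures instead of being typed by hand. The Python sketch below serializes dicts, lists, and selector strings into the compact form used in the examples. The function to_query is a hypothetical helper, not part of the API, and it assumes keys and selectors need no escaping.

```python
def to_query(shape) -> str:
    """Serialize nested dicts, lists, and selector strings to query syntax."""
    if isinstance(shape, dict):
        # Objects become {key:value,...} with unquoted keys.
        pairs = (f"{key}:{to_query(value)}" for key, value in shape.items())
        return "{" + ",".join(pairs) + "}"
    if isinstance(shape, (list, tuple)):
        # Iterations become [selector,shape].
        return "[" + ",".join(to_query(part) for part in shape) + "]"
    if isinstance(shape, str):
        # Selectors such as ".name", "a@href", or ".post-title[]" pass through.
        return shape
    raise TypeError(f"unsupported query element: {shape!r}")


query = to_query({
    "products": [
        ".product",
        {"name": ".name", "price": ".price", "link": "a@href", "id": "@data-id"},
    ]
})
print(query)
```

The printed string is the query from Example 3, ready to be placed after ?data=.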
Response Format
- Success: Returns extracted JSON data
- Missing elements: Return null
- Empty arrays: Return []
- Errors: Return HTTP status codes with error messages
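A client therefore has three cases to handle: a normal payload, null or [] for content that was not found, and an HTTP error status. The Python sketch below covers all three using only the standard library. The names fetch_extracted and text_or_default are hypothetical, and the error body is treated as plain text because its exact format is not specified here.

```python
import json
from urllib.error import HTTPError
from urllib.request import urlopen


def fetch_extracted(url: str, opener=urlopen) -> dict:
    """Fetch a ?data= URL and return the parsed JSON payload."""
    try:
        with opener(url) as response:
            return json.load(response)
    except HTTPError as err:
        # Assumption: the error message is readable text in the body.
        detail = err.read().decode("utf-8", errors="replace")
        raise RuntimeError(
            f"extraction failed with HTTP {err.code}: {detail}"
        ) from err


def text_or_default(data: dict, key: str, default: str = "") -> str:
    """Return data[key], or default when the element was missing (null)."""
    value = data.get(key)
    return default if value is None else value
```

The opener parameter exists only so the function can be exercised without a network connection; in normal use it can be left at its default.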
Caching
Responses are cached for 5 minutes. Check headers:
- X-Cache: HIT - From cache
- X-Cache: MISS - Fresh extraction
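A client that cares about freshness can inspect this header before trusting a response. The Python sketch below classifies a response from its headers. The function name cache_status and its return labels are hypothetical; only the X-Cache header and its HIT and MISS values come from the description above.

```python
def cache_status(headers) -> str:
    """Classify a response as 'cached', 'fresh', or 'unknown' from X-Cache."""
    # Header names are case-insensitive, so normalize them before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    value = normalized.get("x-cache", "").strip().upper()
    if value == "HIT":
        return "cached"
    if value == "MISS":
        return "fresh"
    return "unknown"


print(cache_status({"X-Cache": "HIT"}))
```

The function accepts any mapping with an items() method, which includes a plain dict and the headers object returned by urllib.request.urlopen. A "cached" response may be up to 5 minutes old.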
Use Cases
Simple CMS
Turn any HTML page into a content source. Extract blog posts, product catalogs, or navigation menus to power other applications.
Simple API
Provide structured data access to your Hyperclay sites without building a backend. Perfect for dashboards, integrations, and monitoring.
Limitations
- Static content only - No JavaScript execution
- Text and attributes only - Not raw HTML
- Read-only - Cannot modify sites