Data Extraction API

Turn HTML into JSON

The Data Extraction API turns any Hyperclay site into a queryable data source. Append a ?data= parameter containing extraction rules to any site URL, and the response is structured JSON.
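
As a rough sketch of how a client might build such a URL, the Python below appends an extraction query as the data parameter. The helper name build_extraction_url and the site URL are hypothetical, and percent-encoding the query is an assumption about what the API accepts; the examples on this page show the query unencoded.

```python
from urllib.parse import quote


def build_extraction_url(site_url: str, query: str) -> str:
    """Append an extraction query to a site URL as the data parameter."""
    # Assumption: the API accepts a percent-encoded value. Encoding keeps
    # characters such as '#', '{' and '[' from being misread in the URL.
    encoded = quote(query, safe="")
    separator = "&" if "?" in site_url else "?"
    return f"{site_url}{separator}data={encoded}"


url = build_extraction_url(
    "https://example.com",  # hypothetical site URL
    "{title:h1,tagline:.tagline,footer:.copyright}",
)
print(url)
```

Encoding matters most for ID selectors: an unencoded # would start a URL fragment, and the rest of the query would never reach the server.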

Example 1: Basic Text Extraction

HTML

<h1>My Tech Blog</h1>
<p class="tagline">Latest in technology</p>
<p class="copyright">© 2024 My Tech Blog</p>

Query

?data={title:h1,tagline:.tagline,footer:.copyright}

Formatted Query

{
  title: h1,
  tagline: .tagline,
  footer: .copyright
}

Result

{
  "title": "My Tech Blog",
  "tagline": "Latest in technology",
  "footer": "© 2024 My Tech Blog"
}

Example 2: Array Extraction

HTML

<article class="post">
  <h2 class="post-title">Understanding JavaScript</h2>
  <span class="author">Alice Smith</span>
  <span class="date">2024-01-15</span>
</article>
<article class="post">
  <h2 class="post-title">Python for Data Science</h2>
  <span class="author">Bob Johnson</span>
  <span class="date">2024-01-14</span>
</article>
<article class="post">
  <h2 class="post-title">Rust Performance Tips</h2>
  <span class="author">Carol White</span>
  <span class="date">2024-01-13</span>
</article>

Query

?data={titles:.post-title[],authors:.author[]}

Formatted Query

{
  titles: .post-title[],
  authors: .author[]
}

Result

{
  "titles": [
    "Understanding JavaScript",
    "Python for Data Science",
    "Rust Performance Tips"
  ],
  "authors": ["Alice Smith", "Bob Johnson", "Carol White"]
}
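
The two arrays in this result are independent lists, so a client that wants one record per post has to pair them by position. The sketch below does that in Python; the function name pair_posts is hypothetical, and it assumes every post has exactly one title and one author so that the arrays line up.

```python
import json

# The result from Example 2, as the API would return it.
response_body = """
{
  "titles": ["Understanding JavaScript", "Python for Data Science", "Rust Performance Tips"],
  "authors": ["Alice Smith", "Bob Johnson", "Carol White"]
}
"""


def pair_posts(result: dict) -> list[dict]:
    """Combine the parallel titles/authors arrays into one record per post."""
    titles = result.get("titles") or []
    authors = result.get("authors") or []
    # Assumption: the arrays align by position. If a post lacks an author,
    # the lengths differ and pairing by index would silently mismatch.
    if len(titles) != len(authors):
        raise ValueError("titles and authors have different lengths")
    return [{"title": t, "author": a} for t, a in zip(titles, authors)]


posts = pair_posts(json.loads(response_body))
for post in posts:
    print(post["title"], "by", post["author"])
```

When fields must stay together per element, the iteration syntax in Example 3 avoids this pairing step.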

Example 3: Complex Iteration

HTML

<div class="products">
  <div class="product" data-id="1">
    <h3 class="name">Widget A</h3>
    <span class="price">$19.99</span>
    <a href="/products/widget-a">View Details</a>
  </div>
  <div class="product" data-id="2">
    <h3 class="name">Widget B</h3>
    <span class="price">$29.99</span>
    <a href="/products/widget-b">View Details</a>
  </div>
</div>

Query

?data={products:[.product,{name:.name,price:.price,link:a@href,id:@data-id}]}

Formatted Query

{
  products: [
    .product,
    {
      name: .name,
      price: .price,
      link: a@href,
      id: @data-id
    }
  ]
}

Result

{
  "products": [
    {
      "name": "Widget A",
      "price": "$19.99",
      "link": "/products/widget-a",
      "id": "1"
    },
    {
      "name": "Widget B",
      "price": "$29.99",
      "link": "/products/widget-b",
      "id": "2"
    }
  ]
}
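
Every value in this result is a string, including the id and the price. A consumer that needs numbers has to convert them itself. The sketch below shows one way to do that in Python; the Product class and parse_product helper are hypothetical, and it assumes prices always carry a leading dollar sign as they do in this example.

```python
import json
from dataclasses import dataclass
from decimal import Decimal

# The result from Example 3, as the API would return it.
response_body = """
{"products": [
  {"name": "Widget A", "price": "$19.99", "link": "/products/widget-a", "id": "1"},
  {"name": "Widget B", "price": "$29.99", "link": "/products/widget-b", "id": "2"}
]}
"""


@dataclass
class Product:
    id: int
    name: str
    price: Decimal
    link: str


def parse_product(raw: dict) -> Product:
    """Convert one extracted record from strings into typed fields."""
    return Product(
        id=int(raw["id"]),
        name=raw["name"],
        # Assumption: prices look like "$19.99"; strip the currency symbol.
        price=Decimal(raw["price"].lstrip("$")),
        link=raw["link"],
    )


products = [parse_product(p) for p in json.loads(response_body)["products"]]
total = sum(p.price for p in products)
print(total)
```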

Syntax Reference

Basic Selectors

  • Tag: h1, p, article
  • Class: .classname
  • ID: #unique-id
  • Current element: . (useful in iterations)

Arrays

  • Add [] suffix to get all matching elements: .post-title[]

Attributes & Properties

  • Use @ to extract attributes: a@href, img@src
  • DOM properties: input@value, @checked, @disabled

Iteration

  • [selector, shape] - Iterate over every element matching selector and extract shape from each one
  • Example: [.post, {title:.title, date:.date}]

Nesting

  • Objects can be nested: {meta:{author:.author,date:.date}}
  • Arrays can contain objects: [.item, {name:.name}]
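
Writing nested queries by hand gets error-prone as they grow. The sketch below builds the compact query string from ordinary Python dicts, lists, and strings, following the syntax described above. The function name to_query is hypothetical and not part of the API; it only concatenates text and does not validate selectors.

```python
def to_query(shape) -> str:
    """Serialize nested Python data into the compact query syntax."""
    if isinstance(shape, str):
        # A selector such as ".name", "a@href", or ".post-title[]".
        return shape
    if isinstance(shape, dict):
        # Objects: {key:selector,key:selector}
        fields = ",".join(f"{key}:{to_query(value)}" for key, value in shape.items())
        return "{" + fields + "}"
    if isinstance(shape, (list, tuple)):
        # Iteration: [selector,shape]
        return "[" + ",".join(to_query(item) for item in shape) + "]"
    raise TypeError(f"unsupported query part: {shape!r}")


query = to_query({
    "products": [
        ".product",
        {"name": ".name", "price": ".price", "link": "a@href", "id": "@data-id"},
    ]
})
print(query)
```

For this input the output is the same query string used in Example 3, without the leading ?data= prefix.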

Response Format

  • Success: Returns extracted JSON data
  • Missing elements: Return null
  • Empty arrays: Return []
  • Errors: Return HTTP status codes with error messages
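
Client code usually needs to tell these cases apart. The sketch below takes a status code and response body, raises on errors, and lists which fields came back null. The names ExtractionError, parse_extraction_response, and find_missing are hypothetical. This page does not say which status codes or error body format the API uses, so treating any status of 400 or above as an error is an assumption.

```python
import json


class ExtractionError(Exception):
    """Raised when the API responds with an error status."""


def parse_extraction_response(status: int, body: str):
    """Return the parsed JSON, or raise if the status signals an error."""
    # Assumption: errors use 4xx/5xx codes and the body holds the message.
    if status >= 400:
        raise ExtractionError(f"HTTP {status}: {body}")
    return json.loads(body)


def find_missing(data, prefix: str = "") -> list[str]:
    """List paths whose value is null, i.e. selectors that matched nothing."""
    missing = []
    if data is None:
        missing.append(prefix or "<root>")
    elif isinstance(data, dict):
        for key, value in data.items():
            child = f"{prefix}.{key}" if prefix else key
            missing.extend(find_missing(value, child))
    elif isinstance(data, list):
        # An empty list contributes nothing: [] is a valid "no matches" result.
        for index, item in enumerate(data):
            missing.extend(find_missing(item, f"{prefix}[{index}]"))
    return missing


data = parse_extraction_response(
    200, '{"title": "My Tech Blog", "tagline": null, "tags": []}'
)
print(find_missing(data))
```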

Caching

Responses are cached for 5 minutes. Check the X-Cache response header to see whether a response came from the cache:

  • X-Cache: HIT - From cache
  • X-Cache: MISS - Fresh extraction
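
A client that cares about freshness can inspect this header after each request. The sketch below reads it from a plain dict of response headers; the function name cache_status and the returned labels are hypothetical, and any value other than HIT or MISS is reported as unknown.

```python
CACHE_TTL_SECONDS = 5 * 60  # responses are cached for 5 minutes


def cache_status(headers: dict) -> str:
    """Classify a response as 'cached', 'fresh', or 'unknown' from X-Cache."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    value = normalized.get("x-cache", "").strip().upper()
    if value == "HIT":
        return "cached"
    if value == "MISS":
        return "fresh"
    return "unknown"


print(cache_status({"X-Cache": "HIT"}))
print(cache_status({"x-cache": "MISS"}))
```

A cached response may be up to CACHE_TTL_SECONDS old, so recent edits to a site might not appear in the JSON until the cache entry expires.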

Use Cases

Simple CMS

Turn any HTML page into a content source. Extract blog posts, product catalogs, or navigation menus to power other applications.

Simple API

Provide structured data access to your Hyperclay sites without building a backend. Perfect for dashboards, integrations, and monitoring.

Limitations

  • Static content only - No JavaScript execution
  • Text and attributes only - Not raw HTML
  • Read-only - Cannot modify sites