Data Extraction API

Turn HTML into JSON

The Data Extraction API turns any Hyperclay site into a queryable data source. Append ?data= followed by your extraction rules to any site URL, and the response is structured JSON.
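When the request is built from a script rather than typed into a browser, the rules usually need to be percent-encoded. A minimal sketch in Python, assuming the server decodes the data parameter like any other query string value; the site URL and the helper name build_extraction_url are hypothetical:

```python
from urllib.parse import quote


def build_extraction_url(page_url: str, rules: str) -> str:
    """Append percent-encoded extraction rules as the `data` query parameter."""
    # Use "&" if the page URL already carries a query string, "?" otherwise.
    separator = "&" if "?" in page_url else "?"
    # safe="" encodes every reserved character, including braces, quotes and colons.
    return f"{page_url}{separator}data={quote(rules, safe='')}"


# Hypothetical site URL with a one-field rule set.
url = build_extraction_url("https://example.hyperclay.com/blog", '{title:"h1"}')
print(url)
# prints: https://example.hyperclay.com/blog?data=%7Btitle%3A%22h1%22%7D
```

Browsers generally perform this encoding automatically when the query is typed into the address bar.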

Example 1: Basic Text Extraction

HTML

<h1>My Tech Blog</h1>
<p class="tagline">Latest in technology</p>
<p class="copyright">© 2024 My Tech Blog</p>

Query

?data={title:"h1",tagline:".tagline",footer:".copyright"}

Formatted Query

{
  title: "h1",
  tagline: ".tagline",
  footer: ".copyright"
}

Result

{
  "title": "My Tech Blog",
  "tagline": "Latest in technology",
  "footer": "© 2024 My Tech Blog"
}

Example 2: Array Extraction

HTML

<article class="post">
  <h2 class="post-title">Understanding JavaScript</h2>
  <span class="author">Alice Smith</span>
  <span class="date">2024-01-15</span>
</article>
<article class="post">
  <h2 class="post-title">Python for Data Science</h2>
  <span class="author">Bob Johnson</span>
  <span class="date">2024-01-14</span>
</article>
<article class="post">
  <h2 class="post-title">Rust Performance Tips</h2>
  <span class="author">Carol White</span>
  <span class="date">2024-01-13</span>
</article>

Query

?data={titles:".post-title[]",authors:".author[]"}

Formatted Query

{
  titles: ".post-title[]",
  authors: ".author[]"
}

Result

{
  "titles": [
    "Understanding JavaScript",
    "Python for Data Science",
    "Rust Performance Tips"
  ],
  "authors": ["Alice Smith", "Bob Johnson", "Carol White"]
}
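The two arrays in this result are independent lists, so a client has to pair them up itself. A small sketch of doing that in Python, assuming both arrays come back in document order (as the result above suggests) and guarding against a length mismatch:

```python
import json

# Result copied from Example 2 above.
result = json.loads("""
{
  "titles": ["Understanding JavaScript", "Python for Data Science", "Rust Performance Tips"],
  "authors": ["Alice Smith", "Bob Johnson", "Carol White"]
}
""")

titles, authors = result["titles"], result["authors"]
# Pairing by index is only safe when both arrays have the same length.
if len(titles) != len(authors):
    raise ValueError("titles and authors differ in length; cannot pair by index")

posts = [{"title": t, "author": a} for t, a in zip(titles, authors)]
for post in posts:
    print(f'{post["title"]} by {post["author"]}')
```

If some posts might lack an author, the iteration form shown in Example 3 is likely the safer choice, because it keeps each post's fields together.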

Example 3: Complex Iteration

HTML

<div class="products">
  <div class="product" data-id="1">
    <h3 class="name">Widget A</h3>
    <span class="price">$19.99</span>
    <a href="/products/widget-a">View Details</a>
  </div>
  <div class="product" data-id="2">
    <h3 class="name">Widget B</h3>
    <span class="price">$29.99</span>
    <a href="/products/widget-b">View Details</a>
  </div>
</div>

Query

?data={products:[".product",{name:".name",price:".price",link:"a@href",id:"@data-id"}]}

Formatted Query

{
  products: [
    ".product",
    {
      name: ".name",
      price: ".price",
      link: "a@href",
      id: "@data-id"
    }
  ]
}

Result

{
  "products": [
    {
      "name": "Widget A",
      "price": "$19.99",
      "link": "/products/widget-a",
      "id": "1"
    },
    {
      "name": "Widget B",
      "price": "$29.99",
      "link": "/products/widget-b",
      "id": "2"
    }
  ]
}
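Every extracted value in this result is a string, including the price and the id. A client that wants numbers has to convert them. A minimal sketch in Python; the helper parse_price is hypothetical and assumes prices always carry a leading dollar sign:

```python
import json
from decimal import Decimal

# Result copied from Example 3 above.
result = json.loads("""
{
  "products": [
    {"name": "Widget A", "price": "$19.99", "link": "/products/widget-a", "id": "1"},
    {"name": "Widget B", "price": "$29.99", "link": "/products/widget-b", "id": "2"}
  ]
}
""")


def parse_price(text: str) -> Decimal:
    """Convert a price string such as "$19.99" into a Decimal."""
    return Decimal(text.lstrip("$"))


# Extracted ids are strings; convert them when integers are needed.
ids = [int(product["id"]) for product in result["products"]]
total = sum((parse_price(p["price"]) for p in result["products"]), Decimal("0"))
print(ids, total)
# prints: [1, 2] 49.98
```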

Fancy Example

https://panphora.hyperclay.com/data?data={
  siteName: ".site-name",
  description: ".site-description",
  newsletterDescription: ".newsletter-description",
  posts: ["[post]", {
    date: ".post-date@date",
    description: ".post-description",
    note: ".post-note",
    projects: ["[project]", {
      type: ".project-type@project_type",
      name: ".project-name",
      url: ".project-name@href",
      description: ".project-description"
    }]
  }]
}

Try it out: Go to panphora.hyperclay.com/data?data… 

Syntax Reference

Basic Selectors

  • Tag: "h1", "p", "article"
  • Class: ".classname"
  • ID: "#unique-id"
  • Attribute: "[attr]", "[attr='value']" (e.g., "[post]", "[data-id='1']")
  • Current element: "." (useful in iterations)

Arrays

  • Add [] suffix to get all matching elements: ".post-title[]"

Attributes & Properties

  • Use @ to extract attributes: "a@href", "img@src"
  • DOM properties: "input@value", "@checked", "@disabled"
  • Extract from custom attributes: ".post-date@date", ".project-type@project_type"

Iteration

  • [selector, shape] - Two-element array where:
    • First element: selector to iterate over (string)
    • Second element: shape object to extract from each match
  • Example: ["[post]", {title: ".title", date: ".date"}]
  • Selectors are scoped to each matched element
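To make the scoping rule concrete, here is a local emulation of the Example 3 query using only Python's standard library. It is a sketch of the idea, not how the service is implemented: ElementTree's limited XPath stands in for CSS selectors, and it only works here because the sample HTML happens to be well-formed XML.

```python
import xml.etree.ElementTree as ET

# HTML from Example 3.
HTML = """
<div class="products">
  <div class="product" data-id="1">
    <h3 class="name">Widget A</h3>
    <span class="price">$19.99</span>
    <a href="/products/widget-a">View Details</a>
  </div>
  <div class="product" data-id="2">
    <h3 class="name">Widget B</h3>
    <span class="price">$29.99</span>
    <a href="/products/widget-b">View Details</a>
  </div>
</div>
"""

root = ET.fromstring(HTML.strip())

products = []
# Outer selector ".product": iterate over every matching element.
for item in root.findall(".//div[@class='product']"):
    # Inner lookups start from `item`, so they only see that product's subtree.
    products.append({
        "name": item.find(".//*[@class='name']").text,    # ".name"
        "price": item.find(".//*[@class='price']").text,  # ".price"
        "link": item.find(".//a").get("href"),            # "a@href"
        "id": item.get("data-id"),                        # "@data-id" on the item itself
    })

print(products)
```

Because each inner lookup starts from the current product, ".name" can never pick up the name of a different product.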

Nesting

  • Objects can be nested: {meta: {author: ".author", date: ".date"}}
  • Arrays can contain objects: [".item", {name: ".name"}]
  • Arrays can be nested: ["[post]", {projects: ["[project]", {name: ".name"}]}]

Important: Quoting in URLs

All selectors must be quoted strings in the actual URL. The two forms below show the same query, first formatted for readability and then as it appears in the URL.

Formatted (for documentation):

{ posts: [".post", {title: ".title", date: ".date"}] }

Actual URL (what you type):

?data={posts:[".post",{title:".title",date:".date"}]}
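Writing the compact form by hand gets error-prone for nested queries. The sketch below generates it from ordinary Python dicts, lists, and strings. The function to_query is hypothetical, and it assumes keys are plain identifiers and that selectors contain no double quotes, since escaping rules are not documented here.

```python
def to_query(shape) -> str:
    """Serialize a nested dict/list/str shape into the compact ?data= syntax."""
    if isinstance(shape, str):
        if '"' in shape:
            # Escaping rules are not documented, so refuse rather than guess.
            raise ValueError(f"selector contains a double quote: {shape!r}")
        return f'"{shape}"'  # selectors are always quoted
    if isinstance(shape, dict):
        # Keys are emitted unquoted, as in the examples on this page.
        return "{" + ",".join(f"{key}:{to_query(value)}" for key, value in shape.items()) + "}"
    if isinstance(shape, (list, tuple)):
        return "[" + ",".join(to_query(part) for part in shape) + "]"
    raise TypeError(f"unsupported shape type: {type(shape).__name__}")


shape = {"posts": [".post", {"title": ".title", "date": ".date"}]}
print("?data=" + to_query(shape))
# prints: ?data={posts:[".post",{title:".title",date:".date"}]}
```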

Response Format

  • Success: Returns extracted JSON data
  • Missing elements: Return null
  • Empty arrays: Return []
  • Errors: Return HTTP status codes with error messages
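Since a missing element yields null and an unmatched array selector yields [], a client can tell the two cases apart after parsing. A minimal sketch in Python; the payload and the helper find_gaps are made up for illustration, and the format of error response bodies is not covered because it is not described here:

```python
import json


def find_gaps(data: dict) -> dict:
    """List top-level keys whose selector matched nothing."""
    # JSON null parses to None: the single-element selector found no match.
    missing = sorted(key for key, value in data.items() if value is None)
    # An empty list means an array selector matched no elements.
    empty = sorted(key for key, value in data.items() if value == [])
    return {"missing": missing, "empty": empty}


# Hypothetical response body.
payload = '{"title": "My Tech Blog", "tagline": null, "titles": []}'
gaps = find_gaps(json.loads(payload))
print(gaps)
# prints: {'missing': ['tagline'], 'empty': ['titles']}
```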

Caching

Responses are cached for 5 minutes. Check the X-Cache response header to see whether a response came from the cache:

  • X-Cache: HIT - From cache
  • X-Cache: MISS - Fresh extraction
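A client can read this header to decide whether the data might be up to 5 minutes old. A small sketch in Python that works on any mapping of response headers; the helper cache_status is hypothetical, and header values other than HIT and MISS are treated as unknown:

```python
from typing import Mapping, Optional


def cache_status(headers: Mapping[str, str]) -> Optional[bool]:
    """Return True for a cache hit, False for a miss, None if undetermined."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    value = normalized.get("x-cache", "").strip().upper()
    if value == "HIT":
        return True
    if value == "MISS":
        return False
    return None


print(cache_status({"X-Cache": "HIT"}))    # prints: True
print(cache_status({"x-cache": "MISS"}))   # prints: False
print(cache_status({}))                    # prints: None
```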

Use Cases

Simple CMS

Turn any HTML page into a content source. Extract blog posts, product catalogs, or navigation menus to power other applications.

Simple API

Provide structured data access to your Hyperclay sites without building a backend. Perfect for dashboards, integrations, and monitoring.

Limitations

  • Static content only - No JavaScript execution
  • Text and attributes only - Not raw HTML
  • Read-only - Cannot modify sites