Automating YOLO Synthetic Data Generation with AI Tools

Learn how to automate the creation of labeled datasets for computer vision tasks using AI-powered synthetic data generation.

Creating high-quality labeled datasets for computer vision tasks has always been a tedious and time-consuming process. Traditionally, you'd have to manually draw bounding boxes on thousands of images, which is both expensive and slow. But thanks to AI tools like Cursor and Composer, we can now automate a big chunk of this workflow. By using synthetic data generation techniques, we can efficiently produce structured YOLO datasets without all the manual hassle.

The Problem: Labeled UI Elements for Computer Use APIs

In my work building computer use agent APIs, I need a lot of training data that represents various UI elements: things like buttons, input fields, rows, and containers. While there are some datasets out there, they're often incomplete, and manually labeling UI components is just too slow.

Instead, we can use AI-generated synthetic data to create robust training samples. This approach lets us:

  1. Generate labeled UI screenshots without manual annotation.
  2. Capture multiple variations of elements to improve model generalization.
  3. Train YOLO models efficiently with minimal human effort.

Interactive Visualization

[Interactive demo: YOLO detections overlaid on a Slack interface screenshot, with a before/after toggle and filters for Containers, Rows, and Elements. Detected classes: list_section, form_section, scroll_section, card_section; header_row, content_row, data_row; button, text_input, text_label, icon, media_element. Listed confidences range from 15% to 96%.]

Step 1: Taking a Screenshot and Generating a Mockup

The process starts with a screenshot of an application UI that we want to analyze. Let's say a customer needs automation for Slack. We take a screenshot of Slack and use it as the foundation for generating training data.

Using Cursor (or Composer), we feed the screenshot into the tool and ask it to generate an HTML + CSS representation of the UI. This lets us replicate the UI structure programmatically. Here's how it works:

  • Attach the screenshot to Cursor.
  • Instruct it to generate an index.html file and corresponding CSS that matches the UI.
  • Make sure the elements have meaningful, structured naming conventions that align with YOLO labels.
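
One way to make the naming convention concrete is to fix the label schema up front and include it in the prompt. The sketch below is an assumption rather than the exact setup used here: it uses the class names this post's model detects, and it invents a hypothetical `data-yolo-label` attribute for tagging elements (the AI tool will not add it unless asked).

```python
# Hypothetical label schema: list order defines the YOLO class ids.
YOLO_CLASSES = [
    "list_section", "form_section", "scroll_section", "card_section",  # containers
    "header_row", "content_row", "data_row",                            # rows
    "button", "text_input", "text_label", "icon", "media_element",      # elements
]
CLASS_TO_ID = {name: idx for idx, name in enumerate(YOLO_CLASSES)}

# Assumed convention: every labeled element carries data-yolo-label="<class>".
LABEL_ATTRIBUTE = "data-yolo-label"


def build_prompt(classes=YOLO_CLASSES, attribute=LABEL_ATTRIBUTE):
    """Return an instruction to paste into the AI tool alongside the screenshot."""
    class_list = ", ".join(classes)
    return (
        "Generate an index.html file and matching CSS that reproduce the attached "
        f'screenshot. Tag every container, row, and element with {attribute}="<class>", '
        f"where <class> is one of: {class_list}."
    )


if __name__ == "__main__":
    print(build_prompt())
```

Keeping the schema in one list means the same names can later drive both the DOM query and the class ids in the annotation files.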

Step 2: Enhancing and Iterating the Synthetic UI

The AI-generated HTML is rarely accurate on the first try. With Composer, we describe each fix in natural language and iterate until the mockup matches the real UI, then ask for variations:

Refinement Steps

  • Ask Composer to adjust element positioning to match the real UI.
  • Use natural language to request theme variations and layout changes.
  • Let Composer generate multiple versions with diverse content.

Slack UI Example

  • The model initially missed some container elements.
  • The rich text editor's buttons were incomplete.
  • The sidebar structure needed some tweaks.

Step 3: Extracting Bounding Box Data with Playwright

Once we have an accurate HTML representation, we use Playwright to render the page in a headless browser and extract bounding boxes for each UI element. To ensure our model works across different screen sizes, we render each page in multiple resolutions:

Resolution Variations

  • Desktop: 1920x1080, 1440x900
  • Laptop: 1366x768, 1280x800
  • Tablet/Mobile: 1024x768, 768x1024

Benefits

  • Captures responsive layout changes
  • Improves detection across screen sizes
  • Ensures consistent element recognition
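
The multi-resolution rendering could look roughly like this with Playwright's Python API. It is a minimal sketch under a few assumptions: the mockup is a local HTML file, labeled elements carry a hypothetical `data-yolo-label` attribute, and `output_stem` and `render_all` are illustrative names rather than part of any library.

```python
from pathlib import Path

# Viewports listed in this post, as (width, height).
VIEWPORTS = [(1920, 1080), (1440, 900), (1366, 768), (1280, 800), (1024, 768), (768, 1024)]

# JavaScript run inside the page: one box per element tagged with the
# assumed data-yolo-label attribute. Coordinates are viewport-relative pixels.
EXTRACT_JS = """
() => Array.from(document.querySelectorAll('[data-yolo-label]')).map(el => {
    const r = el.getBoundingClientRect();
    return {label: el.dataset.yoloLabel, x: r.x, y: r.y, width: r.width, height: r.height};
})
""".strip()


def output_stem(html_path, width, height):
    """Build a file stem such as 'slack_1920x1080' for one page and viewport."""
    return f"{Path(html_path).stem}_{width}x{height}"


def render_all(html_path, out_dir):
    """Screenshot the page at every viewport and return {stem: [boxes]}."""
    # Imported lazily so the helpers above work without Playwright installed.
    from playwright.sync_api import sync_playwright

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    results = {}
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        for width, height in VIEWPORTS:
            page = browser.new_page(viewport={"width": width, "height": height})
            page.goto(Path(html_path).resolve().as_uri())
            stem = output_stem(html_path, width, height)
            # Default screenshot covers the viewport only, matching the box coordinates.
            page.screenshot(path=str(out_dir / f"{stem}.png"))
            results[stem] = page.evaluate(EXTRACT_JS)
            page.close()
        browser.close()
    return results
```

Calling `render_all("index.html", "renders")` would produce one PNG per viewport plus the raw pixel boxes, which still need converting to YOLO's normalized format.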

Extraction Process

  1. Render the synthetic UI in a headless browser at each target resolution.
  2. Use JavaScript to query elements and extract their x, y, width, and height properties, accounting for responsive layout changes.
  3. Map these coordinates to YOLO format, creating separate annotation sets for each resolution.
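
The coordinate mapping in the last step is small enough to show in full. A YOLO label line is `class x_center y_center width height`, with all four values normalized to the image size. The function below is a sketch: the name `to_yolo_line` and the choice to clip boxes to the viewport (and drop ones entirely off-screen) are assumptions, not necessarily what the original pipeline does.

```python
def to_yolo_line(box, class_id, img_w, img_h):
    """Convert a pixel box (x, y, width, height; top-left origin) to a YOLO label line.

    Returns None when the box lies entirely outside the image.
    """
    # Clip to the visible image so partially scrolled-off elements stay valid.
    x1 = max(0.0, box["x"])
    y1 = max(0.0, box["y"])
    x2 = min(float(img_w), box["x"] + box["width"])
    y2 = min(float(img_h), box["y"] + box["height"])
    if x2 <= x1 or y2 <= y1:
        return None

    # YOLO expects the box center and size, normalized to [0, 1].
    cx = (x1 + x2) / 2 / img_w
    cy = (y1 + y2) / 2 / img_h
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


if __name__ == "__main__":
    centered = {"x": 480, "y": 270, "width": 960, "height": 540}
    print(to_yolo_line(centered, 7, 1920, 1080))
    # prints: 7 0.500000 0.500000 0.500000 0.500000
```

Because the values are normalized, the same element yields different label lines at each resolution, which is why each viewport gets its own annotation file.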

Step 4: Creating a YOLO Dataset

With the bounding box information extracted, we finalize the YOLO dataset by:

Dataset Creation

  1. Organizing images and annotations in the standard YOLO format.
  2. Ensuring class labels are consistent with the dataset schema.
  3. Adding variations in text, themes, and UI structure for robustness.
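
As a rough illustration of the on-disk layout, the sketch below assumes the directory convention commonly used with Ultralytics-style YOLO training (`images/<split>`, `labels/<split>`, and a `data.yaml` listing class names). The helper names `init_dataset` and `write_labels` are hypothetical, and the YAML is written by hand to avoid an extra dependency.

```python
import tempfile
from pathlib import Path


def init_dataset(root, class_names):
    """Create images/ and labels/ folders for train and val, plus data.yaml."""
    root = Path(root)
    for split in ("train", "val"):
        (root / "images" / split).mkdir(parents=True, exist_ok=True)
        (root / "labels" / split).mkdir(parents=True, exist_ok=True)

    # Class ids follow the order of class_names, so keep that order stable.
    lines = [f"path: {root.as_posix()}", "train: images/train", "val: images/val", "names:"]
    lines += [f"  {idx}: {name}" for idx, name in enumerate(class_names)]
    yaml_path = root / "data.yaml"
    yaml_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return yaml_path


def write_labels(root, split, stem, yolo_lines):
    """Write one label file; its stem must match the image file's stem."""
    label_path = Path(root) / "labels" / split / f"{stem}.txt"
    label_path.write_text("\n".join(yolo_lines) + "\n", encoding="utf-8")
    return label_path


if __name__ == "__main__":
    demo_root = tempfile.mkdtemp()
    print(init_dataset(demo_root, ["button", "icon"]).read_text())
    print(write_labels(demo_root, "train", "slack_1920x1080", ["0 0.5 0.5 0.5 0.5"]))
```

The screenshot for `slack_1920x1080` would then be saved as `images/train/slack_1920x1080.png` so the trainer can pair each image with its label file by name.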

Slack Dataset Example

  • Used different UI themes to enhance model generalization.
  • Randomized usernames and channel names for variability.
  • Added UI variations like message threads and emojis.
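
Content randomization like this can also be scripted instead of prompted for each variant. The following is only a sketch of the idea: the HTML fragment, the name pools, the theme class names, and the `data-yolo-label` attribute are all made up for illustration, and a seeded random generator keeps the output reproducible.

```python
import random
from string import Template

# Hypothetical message-row fragment; a real mockup would template the full page.
ROW_TEMPLATE = Template(
    '<div class="message $theme" data-yolo-label="content_row">'
    '<span data-yolo-label="text_label">$user</span> in #$channel</div>'
)

# Illustrative value pools, not taken from any real workspace.
USERS = ["alice", "bob", "carol", "dave"]
CHANNELS = ["general", "random", "design", "support"]
THEMES = ["theme-light", "theme-dark"]


def make_variant(rng):
    """Fill the template with one random user, channel, and theme."""
    return ROW_TEMPLATE.substitute(
        user=rng.choice(USERS),
        channel=rng.choice(CHANNELS),
        theme=rng.choice(THEMES),
    )


def make_variants(count, seed=0):
    """Return `count` HTML fragments; the same seed gives the same fragments."""
    rng = random.Random(seed)
    return [make_variant(rng) for _ in range(count)]


if __name__ == "__main__":
    for fragment in make_variants(3, seed=42):
        print(fragment)
```

Since the labels live in the markup, every generated variant is annotated automatically once it is rendered and its bounding boxes are extracted.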

Conclusion: The Power of Synthetic Data for AI Training

By combining AI-generated UI mockups with automated bounding box extraction, we can dramatically speed up dataset creation for training YOLO models. This approach offers:

Scalability

Generate thousands of labeled samples with minimal effort.

Diversity

Ensure datasets include multiple UI states and themes.

Efficiency

Reduce manual annotation work through automation.

This method isn't just limited to Slack—it can be applied to any application, whether it's a web app, mobile UI, or even a desktop interface.