Evaluating Webpage Fact Extraction with Braintrust - Part 1

Oct 22, 2024

Introduction

In AI-driven web scraping, accurately identifying and extracting facts from webpages can be a challenge. Whether it's determining if a page is a blog post, press release, or directory, the data needs to be structured correctly for downstream applications. This is where evaluations come into play, allowing us to measure the accuracy and reliability of the extracted information. In this post, I’ll walk you through how we integrate Braintrust to evaluate prompt-driven web scraping and how we use a custom JSON scorer to ensure that entities and facts are extracted correctly from webpages.

Why Evaluations Matter in Web Scraping

Scraping is more than just grabbing HTML content—it's about identifying and extracting meaningful entities and facts. Here’s why evaluations are crucial:

  • Ensure accuracy: As we scrape and process data from web pages, we need to verify that the correct page type and details are extracted.

  • Improve reliability: Constant feedback on the scraping process helps refine the LLM prompts we use, making the system more robust over time.

  • Automate evaluation: By integrating evaluation into the development loop, we automate checks that would otherwise be manual, saving time and improving iteration cycles.

How We Extract Facts from Web Pages

Here’s the basic process:

  1. Crawl the website: We first gather the web content.

  2. Strip HTML tags: Clean the content to prepare it for language model processing.

  3. Prompt for fact extraction: We use an LLM prompt to extract relevant facts like page type, date, title, and content.
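
As a rough illustration, here is a minimal sketch of the first two steps, assuming requests and BeautifulSoup (the function name and details are illustrative, not our production code); the extraction call itself is sketched after the prompt below:

    import requests
    from bs4 import BeautifulSoup

    def fetch_page_text(url: str) -> str:
        """Fetch a page and return its visible text with HTML tags stripped."""
        response = requests.get(url, timeout=30)
        response.raise_for_status()

        soup = BeautifulSoup(response.text, "html.parser")
        # Drop script/style blocks so only readable content remains.
        for tag in soup(["script", "style", "noscript"]):
            tag.decompose()

        # Collapse whitespace into a single space-separated string.
        return " ".join(soup.get_text(separator=" ").split())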

Here’s an example of a prompt we use to identify the type of page:

    Analyze the following content and determine if it's a blog post, a press release, a case study, a directory (a page dedicated to links to other pages), or something else. A directory might also be called an index page.
    If it's a blog post, press release, or case study, also extract the date, title, author, and the full content text. Additionally, provide a brief summary of the content.
    Return a JSON object with keys "type", "date", "title", "author", "content", and "summary".
    The "type" should be either "blog_post", "press_release", "case_study", "directory", or "other".
    For "directory" and "other", leave "date", "title", "author", "content", and "summary" as empty strings.

    Examples:
    {"type": "blog_post", "date": "2023-05-15", "title": "New Features Announcement", "author": "John Doe", "content": "Full text of the blog post...", "summary": "Brief summary of the blog post..."}
    {"type": "press_release", "date": "2023-06-01", "title": "Company Expansion", "author": "Jane Smith", "content": "Full text of the press release...", "summary": "Brief summary of the press release..."}
    {"type": "case_study", "date": "2023-07-01", "title": "Success Story: Client X", "author": "Alice Johnson", "content": "Full text of the case study...", "summary": "Brief summary of the case study..."}
    {"type": "directory", "date": "", "title": "", "author": "", "content": "", "summary": ""}
    {"type": "other", "date": "", "title": "", "author": "", "content": "", "summary": ""}

    Content:
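
In practice, the stripped page text is appended after "Content:" and the completed prompt is sent to the model, which returns the JSON object described above. Here is a minimal sketch of that call, assuming the OpenAI Python client (the model name is a placeholder):

    import json
    from openai import OpenAI

    client = OpenAI()

    def extract_facts(page_text: str, prompt: str) -> dict:
        """Send the extraction prompt plus stripped page text to the model and parse the JSON reply."""
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            response_format={"type": "json_object"},
            messages=[{"role": "user", "content": f"{prompt}\n{page_text}"}],
        )
        return json.loads(response.choices[0].message.content)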

Custom JSON Scorer for Fact Evaluation

To ensure that we’re extracting the right information, we need a robust way to evaluate the data. That’s where a custom JSON scorer comes in. This scorer evaluates two aspects:

  1. Schema Scoring: It checks whether the structure of the JSON object matches the expected schema (e.g., are all the required fields present).

  2. Value Scoring: It compares the actual values for matched keys, ensuring semantic similarity using cosine similarity.

For example, if the scraped page should be a "press release" but is misclassified as a "blog post," or if the extracted title doesn’t match the actual title on the page, the scorer will reflect these discrepancies.
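
The schema check can be as simple as comparing key sets. Here is one plausible way to express it (a sketch, not the exact production implementation):

    def schema_score(expected_obj: dict, output_obj: dict) -> float:
        """Score schema overlap: 1.0 when the key sets match exactly, lower when keys are missing or unexpected."""
        expected_keys = set(expected_obj.keys())
        output_keys = set(output_obj.keys())
        if not expected_keys and not output_keys:
            return 1.0
        # Jaccard overlap penalizes both missing and extra keys.
        return len(expected_keys & output_keys) / len(expected_keys | output_keys)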

Here’s a quick code snippet of the value scorer in action:

        # Score only the keys both objects share; schema differences are handled by the schema scorer.
        keys1 = set(json_obj1.keys())
        keys2 = set(json_obj2.keys())
        # Sort the intersection so the two value lists below are paired deterministically.
        matched_keys = sorted(keys1.intersection(keys2))

        if not matched_keys:
            logger.info("No matched keys between JSON objects. Returning value similarity score of 0.0.")
            return 0.0

        # Stringify values so dates, lists, and nested objects can all be embedded.
        values1 = [str(json_obj1[key]) for key in matched_keys]
        values2 = [str(json_obj2[key]) for key in matched_keys]

        embeddings1 = self.get_embeddings_cached(values1)
        embeddings2 = self.get_embeddings_cached(values2)

        # Cosine similarity between corresponding values, averaged across the matched keys.
        similarity_scores = [
            self.cosine_similarity(emb1, emb2) for emb1, emb2 in zip(embeddings1, embeddings2)
        ]

        average_similarity = sum(similarity_scores) / len(similarity_scores) if similarity_scores else 0.0
        logger.debug(f"Average value similarity: {average_similarity}")
        return average_similarity

By integrating the JSON scorer with Braintrust, we can automatically evaluate the accuracy of the extracted facts. This allows us to iterate on the scraping process with real-time feedback.
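
Here is a minimal sketch of how such a scorer can be wired into a Braintrust eval, assuming the Python SDK's Eval entry point; JSONScorer, extract_facts, and EXTRACTION_PROMPT are hypothetical names standing in for the pieces sketched above:

    from braintrust import Eval

    # Hypothetical modules standing in for the code sketched earlier in this post.
    from scorers import JSONScorer
    from extraction import extract_facts, EXTRACTION_PROMPT

    json_scorer = JSONScorer()

    def score_extraction(input, output, expected):
        """Blend schema and value similarity into one 0-1 score (equal weights are an assumption)."""
        return 0.5 * json_scorer.schema_score(expected, output) + 0.5 * json_scorer.value_score(expected, output)

    Eval(
        "webpage-fact-extraction",  # Braintrust project name (placeholder)
        data=lambda: [
            {
                "input": "<stripped page text>",
                "expected": {"type": "directory", "date": "", "title": "",
                             "author": "", "content": "", "summary": ""},
            },
        ],
        task=lambda page_text: extract_facts(page_text, EXTRACTION_PROMPT),
        scores=[score_extraction],
    )

Each run then appears in Braintrust with per-example scores, so prompt changes can be compared against earlier experiments.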

Conclusion

Fact extraction from webpages is a powerful tool, but it’s only as reliable as the evaluation process you use. By integrating Braintrust and building custom scoring mechanisms, you can ensure that the entities and facts you scrape are accurate and meaningful. As LLMs become more integrated into workflows, effective evaluation will be crucial to building reliable and high-quality AI-driven solutions.