ChatGPT, especially with the advancements in GPT-4 and beyond, is getting smarter, faster, and more powerful in ways that were unimaginable just a few years ago. The next iterations—whatever OpenAI calls them—are set to push those limits even further. And while I’ve seen firsthand how impressive these models can be, I’m also thinking a lot about how we get ChatGPT to reason accurately, particularly when analyzing research and government documents.
The Problem: AI’s Internal Logic vs. Reliable Research
Here’s the challenge: ChatGPT has its own internal logic. It’s built on a large language model that processes and generates text based on patterns learned from training data, filling gaps through inference. But in research, such as when analyzing government documents from UC Berkeley’s Institute of Governmental Studies Library (IGSL), you can’t always depend on inference alone.
For example, I’ve been exploring ways to analyze documents through different cultural and jurisdictional lenses—say, someone from Hong Kong comparing California’s policies to their own. To do this properly, I build knowledge profiles and structured data models to filter ChatGPT’s reasoning. A simple spreadsheet, paired with the right prompting techniques, can go a long way.
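To make that concrete, here’s a minimal sketch of what I mean by a knowledge profile, assuming it lives in a simple two-column spreadsheet (attribute, value) exported as CSV. The file name, the column names, and the wording of the prompt are placeholders I’ve made up for illustration; the point is that the profile gets folded into the prompt so ChatGPT reads the document through that reader’s lens.

```python
import csv

def load_profile(path):
    """Read a knowledge profile from a two-column spreadsheet (attribute, value)."""
    with open(path, newline="", encoding="utf-8") as f:
        return {row["attribute"]: row["value"] for row in csv.DictReader(f)}

def build_prompt(profile, document_text):
    """Fold the profile into the prompt so the model analyzes the document
    through that reader's jurisdictional and cultural lens."""
    lens = "\n".join(f"- {key}: {value}" for key, value in profile.items())
    return (
        "Analyze the document below strictly from the perspective described "
        "by this reader profile. Note where California's policies differ from "
        "the reader's own jurisdiction.\n\n"
        f"Reader profile:\n{lens}\n\n"
        f"Document:\n{document_text}"
    )

# Example usage (the file name and columns are hypothetical):
# profile = load_profile("profile_hong_kong.csv")
# prompt = build_prompt(profile, open("report_1982.txt").read())
```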
But even with the best framework in place, there’s an issue:
What Happens When the Data Itself Is Skewed?
Government documents (especially older reports, policies, and planning materials) aren’t always structured in a way that makes sense to a human reader, let alone a machine. Imagine trying to read a book where every few pages, the story resets to a different chapter, some pages are sideways, and entire sections repeat or disappear altogether. That’s what happens with some of these historical government documents.
None of these issues are mistakes by UC Berkeley or IGSL; they’re the natural complexities of historical government documents.
For instance, a single government publication might combine multiple reports into one, where a city’s transportation policy from 1982 is stapled right next to a housing policy from 1995, with no clear indicator that these are separate documents. If ChatGPT is simply reading top-to-bottom, it might treat them as one continuous thought and draw conclusions from material that was never meant to go together.
Other documents might have text that jumps around, such as a table of contents on page 10 instead of at the beginning or a key finding buried on page 50 but referenced on page 2. To a human, this is annoying but manageable. But for a machine, it presents a complex challenge of maintaining context while recognizing that not all information flows in a straight line.
And then there’s the issue of page formatting itself—handwritten reports, typewritten text that didn’t scan cleanly, or pages numbered incorrectly so that page 42 somehow ends up between pages 17 and 18. A machine can’t always recognize these misalignments unless it has a structured way to check them.
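One structured way to check for exactly that kind of misalignment is to compare the printed page numbers against the physical order of the scan. The sketch below assumes the pages have already been extracted to plain text (by an OCR or PDF step that isn’t shown) and that a printed page number, where one exists, shows up as a digits-only line; neither assumption will hold for every document, which is fine, because anything the check can’t resolve simply becomes something to flag.

```python
import re

def printed_page_number(page_text):
    """Try to recover the printed page number from a page's extracted text.
    Returns None if no plausible number is found (common for handwritten or
    badly scanned pages)."""
    # Rough heuristic: a short digits-only line, usually a header or footer.
    for line in page_text.splitlines():
        line = line.strip()
        if re.fullmatch(r"\d{1,4}", line):
            return int(line)
    return None

def find_misordered_pages(pages):
    """Compare printed page numbers with physical scan order and flag jumps,
    e.g. a printed '42' sitting between printed '17' and '18'."""
    flags = []
    previous = None
    for position, text in enumerate(pages):
        number = printed_page_number(text)
        if number is None:
            flags.append((position, "no printed page number detected"))
        elif previous is not None and number != previous + 1:
            flags.append((position, f"printed page {number} follows printed page {previous}"))
        if number is not None:
            previous = number
    return flags
```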
This isn’t the fault of UC Berkeley’s collection. It’s simply a reality of working with older government documents that were created before modern word processors or standardized formatting. Some were typed on early computers or even handwritten, while others were assembled as multi-part publications, meaning they were related but didn’t necessarily flow in a clear sequence.
Testing Solutions: A Pre-Processing Framework for Skewed Data
Right now, I’m testing a new framework (sketched in code after the list below) that helps ChatGPT:
✅ Scan for structural inconsistencies first, before jumping into analysis.
✅ Red-flag pages with misaligned data, so a human can review them.
✅ Identify and flag handwritten text, such as marginal notes or annotations, that may provide additional context or require human verification.
✅ Justify why a page or section was flagged, instead of just skipping content.
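To show the shape of that pass, here’s a minimal sketch. It assumes pages already extracted to text in scan order, plus an optional per-page OCR confidence score; the 0.6 threshold and the specific checks are illustrative guesses rather than a finished method.

```python
from dataclasses import dataclass

@dataclass
class PageFlag:
    page_index: int   # physical position in the scan, starting at 0
    reason: str       # human-readable justification for the flag

def preprocess(pages, ocr_confidence=None):
    """Scan extracted page texts for structural problems BEFORE any analysis.

    `pages` is a list of page texts in physical scan order. `ocr_confidence`
    is an optional parallel list of 0-1 scores from whatever OCR step you use
    (the name and the 0.6 cutoff stand in for your own pipeline's signal).
    Returns PageFlag records for a human to review; nothing is silently
    skipped, and every flag carries a justification.
    """
    flags = []
    for i, text in enumerate(pages):
        stripped = text.strip()
        if not stripped:
            flags.append(PageFlag(i, "page is empty or unreadable after extraction"))
            continue
        if ocr_confidence is not None and ocr_confidence[i] < 0.6:
            flags.append(PageFlag(i, "low OCR confidence; possibly handwritten or a poor scan"))
        if i > 0 and stripped == pages[i - 1].strip():
            flags.append(PageFlag(i, "appears to duplicate the previous page"))
    # Page-order checks like find_misordered_pages from the earlier sketch can
    # be folded in here as additional flags before anything reaches ChatGPT.
    return flags
```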
The idea is simple: before ChatGPT makes conclusions, it should verify the integrity of what it’s reading. This isn’t just about pagination or formatting. A truly intelligent system should be able to:
🔹 Recognize when text flows logically—independent of page numbers.
🔹 Identify misaligned sections without falsely assuming they are errors.
🔹 Detect when multiple reports exist within a single publication and analyze them separately (see the splitting sketch right after this list).
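On that last point, a first pass at finding report boundaries doesn’t need deep reasoning. Even a crude cue, like a cover-style all-caps heading followed by a standalone year, can suggest where one report ends and the next begins, so each segment can be handed to ChatGPT on its own. The heading pattern below is an assumption that will misfire on some layouts; it’s meant only to show the shape of the approach.

```python
import re

# Naive boundary cue: an all-caps title-like line near the top of a page,
# plus a standalone 4-digit year, e.g. "REGIONAL TRANSPORTATION PLAN" ... "1982".
TITLE_LINE = re.compile(r"^[A-Z][A-Z0-9 ,.'&-]{10,}$")
YEAR_LINE = re.compile(r"^(19|20)\d{2}$")

def split_into_reports(pages):
    """Group physically consecutive pages into separate reports whenever a
    page looks like a new cover/title page. Returns a list of page lists."""
    reports = []
    current = []
    for text in pages:
        lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
        looks_like_cover = (
            any(TITLE_LINE.match(ln) for ln in lines[:5])
            and any(YEAR_LINE.match(ln) for ln in lines[:15])
        )
        if looks_like_cover and current:
            reports.append(current)
            current = []
        current.append(text)
    if current:
        reports.append(current)
    return reports

# Each element of split_into_reports(pages) can then be analyzed on its own,
# so a 1982 transportation policy never bleeds into a 1995 housing policy.
```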
That requires more than just reading—it requires reasoning beyond a simple linear process.
Final Thoughts
I believe that when we use AI to extract insights, it’s just as important to make sure the data it’s analyzing is in order. If a document has flaws or nuances, such as handwritten notes, overlapping sections, or inconsistent formatting, ChatGPT needs a way to recognize and account for those complexities in its output rather than processing them as standard text and simply inferring meaning from it.
This way of analyzing documents is a work in progress, but it’s exciting to see how even small tweaks in how we structure and guide ChatGPT’s reasoning can make a huge difference in research accuracy.
My name is Nick, and I enjoy teaching and speaking about the intersection of research, ChatGPT, and prompt engineering. My work focuses on developing easy-to-use frameworks and strategies that ensure AI doesn’t just generate answers, but also verifies and checks itself—helping researchers use ChatGPT more effectively and responsibly. If you have questions, need help setting up, or want to improve your prompts, feel free to reach out—I’d love to help!