Data cleaning basics

Retrieval is only as good as the data the system is fed. Not even the best frontier models can read minds; they may be prone to inventing answers, but that’s not the same thing. If you need to find specific information in your mountains of data, that data needs to be cleaned first. This is where the preprocessing pipeline comes in.

You could have duplicate files, malformed data, or just totally irrelevant information clogging up your queries. Cleaning must be done early and often.

Identification

Before you can clean anything, you have to know what you have. That means cataloguing your data sources: shared drives, email threads, CRMs, wikis, chat logs, PDFs gathering dust in folders no one’s opened in three years. Most businesses accumulate information like sediment on a riverbed––slowly, constantly, and without intent. The identification step is about surfacing all of it and deciding what’s actually worth ingesting.

This is also where you establish deduplication boundaries before anything enters the pipeline. Ingesting the same document twice doubles your noise, not your signal. A canonical source needs to be established for every piece of information. If the same policy document lives in three different folders in slightly different versions, you want one, and you want the right one.
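As a minimal sketch of that dedup boundary: hash the content (normalized for whitespace and case, so trivially reformatted copies collapse to one key) and keep a single canonical copy per hash. The "keep the most recently modified copy" rule here is an illustrative assumption; your system of record may dictate a different choice.

```python
import hashlib

def content_hash(text: str) -> str:
    """Hash normalized content so identical documents collapse to one key."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def pick_canonical(docs: list[dict]) -> list[dict]:
    """Among exact-content duplicates, keep one canonical copy.

    This sketch prefers the most recently modified copy; a real
    pipeline might instead prefer a designated system of record.
    """
    canonical: dict[str, dict] = {}
    for doc in docs:
        key = content_hash(doc["text"])
        current = canonical.get(key)
        if current is None or doc["modified"] > current["modified"]:
            canonical[key] = doc
    return list(canonical.values())
```

Note that this catches only exact (post-normalization) duplicates; near-duplicates are a separate problem, handled later in the pipeline.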

Parsing

Most real-world business data isn’t clean plaintext sitting in a database: it’s PDFs with two-column layouts, Word documents with tracked changes, spreadsheets with merged cells, scanned images that have never been OCR’d. Parsing is the step that converts all of that into something a language model can actually read.
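One common way to structure this step is a parser registry keyed by file extension. The sketch below is illustrative: only the plaintext parser is real, and in practice the registry would be filled out with wrappers around actual PDF, Office, and OCR libraries.

```python
from pathlib import Path
from typing import Callable

def parse_plaintext(path: Path) -> str:
    return path.read_text(encoding="utf-8", errors="replace")

# Registry mapping extensions to parser functions. The entries here are
# placeholders: a real pipeline would register wrappers around a PDF
# extractor, a .docx reader, a spreadsheet parser, an OCR engine, etc.
PARSERS: dict[str, Callable[[Path], str]] = {
    ".txt": parse_plaintext,
    ".md": parse_plaintext,
}

def parse_document(path: Path) -> str:
    """Route a file to the right parser based on its extension."""
    parser = PARSERS.get(path.suffix.lower())
    if parser is None:
        raise ValueError(f"no parser registered for {path.suffix!r}")
    return parser(path)
```

Failing loudly on unknown formats is deliberate: a file that silently skips parsing is a file that silently never makes it into retrieval.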

Beyond format conversion, this is where you strip the structural noise: page headers and footers, navigation menus, boilerplate legal disclaimers that appear on every document, auto-generated timestamps, and whatever else the source system decided to staple to the content. These elements confuse retrieval without contributing meaning. A chunk of text that opens with “Page 4 of 17 | Confidential | Internal Use Only” is going to score similarity hits on every other document with that same header.
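Recurring headers and footers like that one can be stripped with line-anchored patterns. The specific patterns below are assumptions about what the source documents contain; in practice you build this list by inspecting your own corpus.

```python
import re

# Illustrative noise patterns; tune these to what actually recurs in
# your documents.
NOISE_PATTERNS = [
    re.compile(r"^Page \d+ of \d+.*$", re.MULTILINE),
    re.compile(r"^\s*Confidential\s*\|\s*Internal Use Only\s*$", re.MULTILINE),
]

def strip_structural_noise(text: str) -> str:
    for pattern in NOISE_PATTERNS:
        text = pattern.sub("", text)
    # Collapse the blank lines left behind by removed headers/footers.
    return re.sub(r"\n{3,}", "\n\n", text).strip()
```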

Cleaning and Normalizing

Parsing gets you plaintext. Cleaning gets you useful plaintext. Duplicates can survive the identification step––two documents with different filenames and different creation dates but identical content will slip through if you’re not checking at the content level. Fuzzy deduplication (detecting near-duplicates, not just exact copies) is often necessary here, since people love to copy-paste documents and change one line.
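A minimal sketch of fuzzy matching, using the standard library's sequence matcher; the 0.9 threshold is an illustrative starting point, not a recommendation. Note that pairwise comparison like this is quadratic in corpus size, so at scale the usual approach is shingling plus MinHash or similar locality-sensitive hashing.

```python
from difflib import SequenceMatcher

def near_duplicates(a: str, b: str, threshold: float = 0.9) -> bool:
    """Flag near-duplicates that exact-hash dedup misses, e.g. a
    copy-pasted document where someone changed one word."""
    return SequenceMatcher(None, a, b).ratio() >= threshold
```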

Low-value content is anything that adds tokens without adding information: auto-replies, out-of-office messages ingested from email threads, template boilerplate, heavily formatted tables that lost their structure during parsing and are now just columns of random numbers. Filtering these out at this stage keeps your vector store lean. This is also the moment to enrich with metadata: document source, date, author, department, content type, etc. That metadata travels with every chunk and becomes critical for filtering results.
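A sketch of both ideas together: a marker-based filter for low-value content and an enrichment step that wraps the cleaned text with its metadata. The marker patterns and metadata fields are illustrative assumptions.

```python
import re

# Illustrative low-value markers; build this list from what actually
# shows up in your sources.
LOW_VALUE_MARKERS = [
    re.compile(r"automatic reply", re.IGNORECASE),
    re.compile(r"out of (the )?office", re.IGNORECASE),
    re.compile(r"this email and any attachments are confidential", re.IGNORECASE),
]

def is_low_value(text: str) -> bool:
    return any(p.search(text) for p in LOW_VALUE_MARKERS)

def enrich(text: str, *, source: str, author: str,
           department: str, content_type: str) -> dict:
    """Wrap a cleaned document with metadata that travels with every
    chunk downstream and enables filtered retrieval."""
    return {
        "text": text,
        "metadata": {
            "source": source,
            "author": author,
            "department": department,
            "content_type": content_type,
        },
    }
```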

Chunking

The language model doesn’t retrieve whole documents. It retrieves chunks––discrete passages of text that are small enough to fit cleanly into context alongside other relevant results, but large enough to carry a coherent idea. Too small, and you’re retrieving sentence fragments that don’t stand alone. Too large, and your chunks become too broad to rank well against a specific query.
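The simplest version of this is a fixed-size chunker with overlap, so an idea that straddles a boundary survives in at least one chunk. Character counts stand in for token counts here; real pipelines usually count tokens and prefer splitting on paragraph or sentence boundaries, and the sizes below are illustrative.

```python
def chunk_text(text: str, max_chars: int = 800, overlap: int = 100) -> list[str]:
    """Split text into fixed-size chunks with overlapping edges."""
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        # Step forward by less than the chunk size so consecutive
        # chunks share an overlapping window.
        start += max_chars - overlap
    return chunks
```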

Once the chunks are sized appropriately, each one gets converted into a vector––a numerical representation of its semantic meaning. This is the embedding step, and it’s what makes similarity search possible. Rather than matching keywords, the retrieval system finds chunks that are conceptually close to your query, even if they don’t share a single word with it. Those vectors, paired with their metadata and a reference back to the source document, get written to a vector store. That’s the index your RAG system will query at runtime.
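A toy version of the index-and-query flow: hand-made three-dimensional vectors stand in for real embedding-model output, and brute-force cosine scoring stands in for a vector store's approximate-nearest-neighbor index. Everything here is a sketch of the mechanism, not a production pattern.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: how close two vectors point in direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], index: list[tuple[str, list[float]]],
          k: int = 2) -> list[str]:
    """Return the ids of the k chunks most similar to the query.

    Brute force over every stored vector; real vector stores use
    approximate nearest-neighbor search to avoid scanning everything.
    """
    scored = sorted(index, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [chunk_id for chunk_id, _ in scored[:k]]
```

With real embeddings, the query text is run through the same embedding model as the chunks, so "conceptually close" is measured in the same vector space.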

Done right, the pipeline is invisible to the end user. They ask a question, the system finds the right content, and the model synthesizes a grounded answer. Done wrong, it becomes a liability––confident responses built on top of garbage data. The pipeline is unglamorous work, but it’s where the quality of your RAG system is actually determined.

All of this is laborious work, and many of our clients rely on us to clean their data, too. Reach out to us and schedule a call if you’d like to learn more.

We’ll be writing in greater detail on each step of the cleaning process in future posts.