Clean your generative AI data
Last updated: 24 September 2024
When uploading grounding data (such as Word documents or PDFs) to your generative AI chatbot, it is important to ensure the data is clean and well-organised. Clean data helps the AI provide more accurate and useful responses, improving the overall quality of interactions.
This document outlines the key steps to follow when preparing your data before uploading it to the platform.
Why clean data matters
Generative AI relies on clear, well-structured information to understand and generate accurate responses. Excessive formatting, inconsistent text, or irrelevant information can confuse the AI, leading to lower-quality outputs.
By cleaning your data beforehand, you help the AI focus on the essential content and ensure the responses it generates are relevant and accurate.
Data cleaning best practices
If possible, avoid PDFs. Although our platform supports them, PDFs tend to be poorly formatted.
Particularly avoid PDFs that contain images or scanned text.
Whenever possible, use Word documents.
Remove unnecessary formatting. Avoid using complex formatting as it can interfere with the AI’s ability to process the text properly.
Remove headers and footers, page numbers, footnotes, and any special fonts, colours, or backgrounds.
Keep paragraphs and basic headings.
Simplify and standardise text. Ensure the text is easy to read and consistent throughout the document.
Avoid long paragraphs; break them into smaller, readable chunks.
Use consistent capitalisation and punctuation.
Remove unnecessary symbols, special characters, and emojis.
Replace any shorthand, abbreviations, or jargon with clear, plain language to avoid confusion.
Remove unnecessary content. Ensure that only relevant information is included in the data.
Remove irrelevant sections such as personal notes, comments, or outdated content.
Remove images, tables, and charts (they will not be interpreted).
Remove links.
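Some of the text-level steps above (removing links, symbols, and emojis, and breaking up long paragraphs) can be partly automated before upload. The following is a minimal Python sketch, not a feature of the platform. It assumes the document has already been exported as plain text. The function names and the 80-word paragraph limit are hypothetical choices for illustration, because this guide does not define what counts as a long paragraph. The symbol filter is deliberately rough: it also drops characters such as the degree and copyright signs, and the link filter removes any punctuation attached to the end of a URL, so review the output before uploading.

```python
import re
import unicodedata
from typing import List

# Hypothetical threshold: the guide gives no number for a "long" paragraph.
MAX_PARAGRAPH_WORDS = 80

# Matches bare URLs starting with http://, https://, or www.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")


def strip_links(text: str) -> str:
    """Remove bare URLs from the text."""
    return URL_PATTERN.sub("", text)


def _is_unwanted(ch: str) -> bool:
    """Return True for emojis, pictographs, and invisible format characters."""
    # "So" covers most emojis and pictographs (and also signs such as the
    # degree and copyright signs); "Cf" covers invisible format characters
    # such as zero-width joiners.
    if unicodedata.category(ch) in ("So", "Cf"):
        return True
    code = ord(ch)
    # Variation selectors and emoji skin-tone modifiers.
    return 0xFE00 <= code <= 0xFE0F or 0x1F3FB <= code <= 0x1F3FF


def strip_symbols(text: str) -> str:
    """Drop unwanted symbols, keeping letters, digits, and punctuation."""
    return "".join(ch for ch in text if not _is_unwanted(ch))


def split_long_paragraph(paragraph: str,
                         max_words: int = MAX_PARAGRAPH_WORDS) -> List[str]:
    """Split a paragraph at sentence boundaries into chunks of at most
    max_words words (a single longer sentence is kept whole)."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]
    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for sentence in sentences:
        words = len(sentence.split())
        if current and count + words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += words
    if current:
        chunks.append(" ".join(current))
    return chunks


def clean_text(text: str) -> str:
    """Apply the steps in order and return paragraphs separated by blank lines."""
    text = unicodedata.normalize("NFC", text)
    text = strip_links(text)
    text = strip_symbols(text)
    paragraphs: List[str] = []
    for block in re.split(r"\n\s*\n", text):
        block = re.sub(r"\s+", " ", block).strip()  # collapse stray whitespace
        if block:
            paragraphs.extend(split_long_paragraph(block))
    return "\n\n".join(paragraphs)


if __name__ == "__main__":
    sample = "Our office is open 🎉 every weekday.\n\n\nVisit https://example.com/hours today."
    print(clean_text(sample))
```

A script like this only handles mechanical clean-up. Replacing jargon, removing outdated sections, and checking that the remaining text still reads correctly once links are removed need a manual review.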