Section 2: Data Preparation (14%)
This section focuses on preparing data for Retrieval-Augmented Generation (RAG) applications. It involves chunking documents, filtering extraneous content, choosing appropriate Python packages for document extraction, and evaluating retrieval quality.
2.1 Apply a chunking strategy for a given document structure and model constraints
Key Points:
- Chunking: Split large documents into smaller pieces that fit within embedding model and LLM context windows.
- Fixed-size chunking: Split by token/character count; simple but can cut mid-sentence.
- Recursive character splitting: Tries to split at paragraph → sentence → word boundaries; most common default.
- Semantic chunking: Uses embedding similarity to group related content; more expensive but higher quality.
- Document-structure-aware chunking: Respects headers, sections, tables; best for structured documents like PDFs or HTML.
- Chunk size affects retrieval: Larger chunks provide more context but may include irrelevant content; smaller chunks are more precise but may miss context.
- Overlap: Prevents information loss at chunk boundaries — typical overlap is 10–20% of chunk size.
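The fixed-size strategy with overlap can be sketched in plain Python. This is a minimal illustration, not a production splitter; the function name and parameter defaults are hypothetical choices (a 500-character chunk with 75-character overlap, i.e. 15%):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 75) -> list[str]:
    """Split text into fixed-size chunks with overlapping boundaries.

    Each chunk starts (chunk_size - overlap) characters after the
    previous one, so consecutive chunks share `overlap` characters.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks
```

In practice, libraries such as LangChain provide recursive character splitters that prefer paragraph, sentence, and word boundaries over hard cuts; the sketch above only demonstrates the overlap mechanics described in the bullet points.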



