Text Chunking
This guide explains how TMGMT Lara Translate handles large content by splitting it into chunks that fit within Lara's API limits.
Why Text Chunking is Needed
Lara API Limitations
- 10,000 character limit per translation request
- HTML included in character count (not just text content)
- Safety buffer: Uses 9,900 character limit (100-char margin)
- Timeout prevention: Avoids API timeouts with large content
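Because markup counts toward the limit, a page with little visible text can still need chunking. The sketch below illustrates this with plain PHP string functions. The constant and function names are hypothetical and not part of the module, and it assumes the limit is counted in UTF-8 characters rather than bytes.

```php
<?php
// Hypothetical names for illustration; not the module's API.
const LARA_SAFE_LIMIT = 9900; // 10,000-character limit minus the 100-character margin

/** Returns true when the raw HTML (tags included) exceeds the safe limit. */
function needsChunking(string $html): bool {
    return mb_strlen($html, 'UTF-8') > LARA_SAFE_LIMIT;
}

// 400 list items: 30 characters of HTML each, only 4 of them visible text.
$html = str_repeat('<li><a href="/x">item</a></li>', 400);

printf("HTML length: %d\n", mb_strlen($html, 'UTF-8'));             // 12000
printf("Text length: %d\n", mb_strlen(strip_tags($html), 'UTF-8')); // 1600
var_dump(needsChunking($html));                                      // bool(true)
```

The visible text alone would fit in a single request, but the HTML that is actually sent does not.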
Content That Exceeds Limits
- Long articles and documentation pages
- HTML-rich content with many tags
- Product descriptions with detailed specifications
- Multi-page content combined into single items
Text Chunking Service
RecursiveCharacterTextSplitter
The module uses a recursive text splitter modeled on LangChain's RecursiveCharacterTextSplitter:
Key Features:
- Hierarchical splitting: Paragraphs → Lines → Words → Characters
- Context preservation: Overlapping chunks maintain translation context
- Language-aware: Different splitting patterns for code vs prose
- UTF-8 safe: Proper multibyte character handling
- Configurable: Customizable chunk sizes and separators
Default Configuration for Lara
$splitter->configure([
'chunk_size' => 9900, // 100-char safety buffer
'chunk_overlap' => 200, // Context overlap
'separators' => ["\n\n", "\n", " ", ""], // Hierarchical
'keep_separator' => KeepSeparator::End,
]);
Chunking Process
1. Length Detection
The module checks total HTML length (including tags):
$totalLength = $this->textSplitter->getTextLength($htmlContent);
if ($totalLength > self::MAX_LENGTH) {
// Use chunking
$chunks = $this->textSplitter->splitText($htmlContent);
} else {
// No chunking needed
$chunks = [$htmlContent];
}
2. Hierarchical Splitting
Algorithm tries separators in order:
- Paragraphs (`\n\n`): Largest semantic units
- Lines (`\n`): Smaller units
- Spaces (` `): Individual words
- Characters (`""`): Last resort for very long words
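The separator hierarchy can be sketched as a small recursive function. This is a simplified illustration, not the module's implementation: `recursiveSplit()` is a hypothetical name, overlap is left out, and each separator is kept at the end of the piece it terminates so that concatenating the chunks reproduces the input.

```php
<?php
/**
 * Simplified recursive splitter (illustrative only, no overlap).
 * Tries each separator in order and falls back to the next one only
 * for pieces that are still longer than $chunkSize.
 */
function recursiveSplit(string $text, array $separators, int $chunkSize): array {
    if (mb_strlen($text) <= $chunkSize) {
        return $text === '' ? [] : [$text];
    }
    $separator = array_shift($separators);
    if ($separator === null || $separator === '') {
        // Last resort: cut at character boundaries (multibyte safe).
        return mb_str_split($text, $chunkSize);
    }

    $pieces = explode($separator, $text);
    $last = count($pieces) - 1;
    $chunks = [];
    $current = '';
    foreach ($pieces as $i => $piece) {
        if ($i < $last) {
            $piece .= $separator; // keep the separator at the end of the piece
        }
        if ($piece === '') {
            continue;
        }
        if (mb_strlen($current . $piece) <= $chunkSize) {
            $current .= $piece; // greedily merge small pieces
            continue;
        }
        if ($current !== '') {
            $chunks[] = $current;
            $current = '';
        }
        if (mb_strlen($piece) <= $chunkSize) {
            $current = $piece;
        } else {
            // Piece is still too long: retry with the finer separators.
            array_push($chunks, ...recursiveSplit($piece, $separators, $chunkSize));
        }
    }
    if ($current !== '') {
        $chunks[] = $current;
    }
    return $chunks;
}

print_r(recursiveSplit('one two three', [' ', ''], 8));
// Two chunks: "one two " and "three"
```

Words are only broken apart when a single word is longer than the chunk size, which is the "last resort" case in the list above.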
3. Overlap Handling
// Example: 2,000-character paragraph, chunk_size 1,100, chunk_overlap 200
$chunks = [
    "Characters 1 to 1,100 of the paragraph",
    "Characters 901 to 2,000: begins with the last 200 characters of the previous chunk",
];
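To make the overlap concrete, here is a minimal fixed-window sketch. It is an assumption-laden simplification: `slidingWindowChunks()` is a hypothetical helper that cuts at fixed character offsets, whereas the splitter described in this guide places overlaps at separator boundaries.

```php
<?php
/**
 * Cuts $text into windows of $chunkSize characters where each window
 * starts $overlap characters before the previous one ended.
 * Illustrative only; not the module's algorithm.
 */
function slidingWindowChunks(string $text, int $chunkSize, int $overlap): array {
    if ($chunkSize <= 0 || $overlap < 0 || $overlap >= $chunkSize) {
        throw new InvalidArgumentException('Require 0 <= overlap < chunkSize.');
    }
    $stride = $chunkSize - $overlap; // how far each window advances
    $length = mb_strlen($text);
    $chunks = [];
    for ($start = 0; $start < $length; $start += $stride) {
        $chunks[] = mb_substr($text, $start, $chunkSize);
        if ($start + $chunkSize >= $length) {
            break; // this window already reached the end of the text
        }
    }
    return $chunks;
}

print_r(slidingWindowChunks('abcdefghij', 4, 1));
// "abcd", "defg", "ghij": each chunk starts with the last character of the previous one
```

Because overlapping text is sent twice, any reassembly step has to account for the duplicated region.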
4. Reassembly
After translation, chunks are reassembled:
$translatedHtml = $this->textSplitter->reassembleChunks($translatedChunks);
Language-Specific Splitting
HTML Content
$htmlSplitter = $this->textSplitter->forLanguage('html', [
'chunk_size' => 9900,
'chunk_overlap' => 200,
]);
$chunks = $htmlSplitter->splitText($htmlContent);
Splits at block element boundaries (<div>, <p>, etc.) when possible.
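One way to picture block-boundary splitting is to cut the HTML after closing block tags and then greedily merge neighbouring blocks up to the chunk size. The sketch below is only an approximation of that idea: `splitHtmlAtBlocks()` and its tag list are hypothetical, nested blocks are not tracked, and a single block longer than the chunk size is returned as-is rather than split further.

```php
<?php
/**
 * Splits HTML after closing block tags and merges adjacent blocks
 * while the merged chunk stays within $chunkSize characters.
 * Illustrative only; not the module's HTML mode.
 */
function splitHtmlAtBlocks(string $html, int $chunkSize): array {
    // Zero-width split: the lookbehind matches the position right after a closing tag.
    $blocks = preg_split('~(?<=</p>|</div>|</li>|</h[1-6]>)~i', $html, -1, PREG_SPLIT_NO_EMPTY) ?: [];

    $chunks = [];
    $current = '';
    foreach ($blocks as $block) {
        if ($current !== '' && mb_strlen($current . $block) > $chunkSize) {
            $chunks[] = $current; // adding this block would overflow the chunk
            $current = '';
        }
        $current .= $block;
    }
    if ($current !== '') {
        $chunks[] = $current;
    }
    return $chunks;
}

print_r(splitHtmlAtBlocks('<p>One</p><div>Two</div><p>Three</p>', 25));
// Two chunks: "<p>One</p><div>Two</div>" and "<p>Three</p>"
```

Every chunk produced this way begins and ends on a tag boundary, so no element is cut in the middle of its text.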
Markdown Content
$mdSplitter = $this->textSplitter->forLanguage('markdown');
$chunks = $mdSplitter->splitText($markdown);
Recognizes:
- Headers (#, ##, ###)
- Code blocks (```)
- Horizontal rules (---)
- List items
Code Content
$phpSplitter = $this->textSplitter->forLanguage('php');
$chunks = $phpSplitter->splitText($phpCode);
Splits at:
- Function boundaries
- Class declarations
- Method signatures
- Code blocks
Configuration Options
Chunk Size Settings
Recommended sizes for different scenarios:
| Content Type | Chunk Size (chars) | Overlap (chars) | Use Case |
|---|---|---|---|
| Short articles | 5000 | 200 | Fast processing |
| Long documentation | 8000 | 400 | Balance speed/context |
| Technical specs | 9900 | 500 | Maximum safety |
| Product pages | 6000 | 300 | Good balance |
Separator Strategies
KeepSeparator::End (Default)
- Separators removed from chunks
- Added to reassembly
- Clean final output
KeepSeparator::Yes
- Separators kept as separate elements
- Useful for preserving structure
- Example:
["<p>Sentence 1</p>", "<p>Sentence 2</p>"]
KeepSeparator::No
- Separators discarded completely
- Continuous text flow
- Example:
"Sentence 1. Sentence 2"
Custom Length Functions
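// Word-based counting instead of characters (a rough stand-in for tokens).
// chunk_size and chunk_overlap are then measured in words, so lower
// chunk_size enough that each chunk still fits Lara's character limit.
$splitter->configure([
    'length_function' => function (string $text): int {
        // Split on any whitespace run and ignore empty pieces.
        return count(preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY));
    },
]);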
Translation Flow with Chunking
flowchart TD
A[TMGMT Job Item] --> B{Content > 9900 chars?}
B -->|No| C[Direct Translation]
B -->|Yes| D[Create RecursiveCharacterTextSplitter]
D --> E[Split into chunks]
E --> F[Translate each chunk]
F --> I{All chunks successful?}
I -->|Yes| G[Reassemble chunks]
I -->|No| J[Job Item Failed]
G --> H[Save translated data]
C --> H
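The flow can be sketched in a few lines of PHP. Everything here is an assumption made for illustration: `translateWithChunking()` is a hypothetical function, the `$split` and `$translate` callables stand in for the splitter and the Lara API call, and chunks are assumed not to overlap so that plain concatenation is a valid reassembly.

```php
<?php
/**
 * Translates $html directly when it fits, otherwise chunk by chunk.
 * Throws if any chunk fails, so a partial translation is never returned.
 * Illustrative only; not the module's translator plugin.
 */
function translateWithChunking(
    string $html,
    callable $split,      // fn(string): string[]  (hypothetical splitter)
    callable $translate,  // fn(string): ?string   (hypothetical API call, null on failure)
    int $maxLength = 9900
): string {
    $chunks = mb_strlen($html) > $maxLength ? $split($html) : [$html];

    $translated = [];
    foreach ($chunks as $index => $chunk) {
        $result = $translate($chunk);
        if (!is_string($result)) {
            throw new RuntimeException(
                sprintf('Chunk %d of %d failed.', $index + 1, count($chunks))
            );
        }
        $translated[] = $result;
    }
    // Reassembly: safe only because this sketch assumes non-overlapping chunks.
    return implode('', $translated);
}

// Usage with stand-ins: 5-character chunks and an "uppercase" translator.
$split = fn(string $text): array => mb_str_split($text, 5);
echo translateWithChunking('hello world', $split, 'strtoupper', 5), "\n"; // HELLO WORLD
```

Failing the whole item on the first bad chunk mirrors the "Job Item Failed" branch of the diagram.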
Performance Considerations
Memory Efficiency
- Chunk-by-chunk processing: Only one chunk is sent per API request, although the full source text is held in memory while it is split
- Streaming approach: Minimal memory footprint
- Garbage collection: Automatic cleanup of temporary data
Processing Time
- Linear scaling: Time proportional to content size
- API limits: Each chunk respects 10k character limit
- Parallel potential: Chunks could be translated concurrently
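As a rough planning aid, the number of API requests can be estimated from the content length, chunk size, and overlap. The helper below is hypothetical and gives a lower bound: it assumes every chunk is filled completely, while separator-based splitting usually produces somewhat shorter chunks and therefore a few more requests.

```php
<?php
/**
 * Minimum number of chunks needed to cover $length characters when each
 * chunk holds $chunkSize characters and repeats $overlap of the previous one.
 * Hypothetical helper for estimation; not part of the module.
 */
function estimateChunkCount(int $length, int $chunkSize = 9900, int $overlap = 200): int {
    if ($chunkSize <= 0 || $overlap < 0 || $overlap >= $chunkSize) {
        throw new InvalidArgumentException('Require 0 <= overlap < chunkSize.');
    }
    if ($length <= $chunkSize) {
        return 1;
    }
    $stride = $chunkSize - $overlap; // new characters contributed by each extra chunk
    // Integer ceiling of ($length - $overlap) / $stride.
    return intdiv($length - $overlap + $stride - 1, $stride);
}

echo estimateChunkCount(25000), "\n"; // 3
```

With the defaults, each additional request covers 9,700 new characters, which is why processing time grows roughly linearly with content size.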
Quality Impact
- Context preservation: 200-char overlap maintains meaning
- Semantic boundaries: Paragraph/line splitting reduces translation fragmentation
- Reassembly fidelity: Exact reassembly preserves structure
Troubleshooting Chunking
Issues and Solutions
Poor Splitting Quality
Problem: Chunks split in mid-sentence or mid-word
Solutions:
// Increase overlap for better context
$splitter->configure(['chunk_overlap' => 500]);
// Use different separator order
$splitter->configure(['separators' => [". ", "? ", " ", ""]]);
Too Many Small Chunks
Problem: Excessive API calls, slow processing
Solutions:
// Increase chunk size (9900 is the maximum that keeps the safety buffer)
$splitter->configure(['chunk_size' => 9900]);
// Prefer coarser separators so chunks break at paragraphs and lines first
$splitter->configure(['separators' => ["\n\n", "\n", " "]]);
Reassembly Issues
Problem: Missing content or broken HTML after reassembly
Solutions:
// Use End separator strategy
$splitter->configure(['keep_separator' => KeepSeparator::End]);
// Validate reassembled content
$validator = \Drupal::service(TextSplitterValidator::class);
$isValid = $validator->validateHtml($reassembled);
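If you want a quick independent check for broken HTML after reassembly, a stack-based tag balance test is often enough to catch a chunk that was cut inside an element. The function below is a hypothetical stand-in, not the module's validator. It ignores comments and assumes attribute values contain no `>` and do not end in `/`.

```php
<?php
/**
 * Returns true when every opening tag has a matching closing tag in the
 * right order. Void elements and self-closing tags are skipped.
 * Illustrative only; a regex scan is not a full HTML parser.
 */
function hasBalancedTags(string $html): bool {
    $void = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img',
             'input', 'link', 'meta', 'source', 'track', 'wbr'];
    preg_match_all('~<(/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*?(/?)>~', $html, $matches, PREG_SET_ORDER);

    $stack = [];
    foreach ($matches as [, $closing, $name, $selfClosing]) {
        $name = strtolower($name);
        if ($selfClosing === '/' || in_array($name, $void, true)) {
            continue; // nothing to balance
        }
        if ($closing === '') {
            $stack[] = $name; // opening tag
            continue;
        }
        if (array_pop($stack) !== $name) {
            return false; // closing tag does not match the innermost open tag
        }
    }
    return $stack === []; // leftover entries are unclosed tags
}

var_dump(hasBalancedTags('<p>Hi <strong>there</strong><br></p>')); // bool(true)
var_dump(hasBalancedTags('<p>Hi <strong>there</p>'));              // bool(false)
```

Running the same check on the source HTML first helps distinguish problems introduced by chunking from markup that was already unbalanced.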
Best Practices
Configuration Guidelines
- Start with defaults: 9900 chars, 200 overlap
- Monitor performance: Adjust based on your content patterns
- Test with samples: Verify chunking quality before production
- Consider content type: Use language-specific modes
Content Preparation
- Clean HTML: Remove unnecessary attributes and comments
- Logical structure: Use semantic HTML elements properly
- Avoid deep nesting: Simpler structure chunks better
- Consistent formatting: Use same HTML patterns
Monitoring Tips
# Monitor chunking behavior
drush watchdog:show --type=tmgmt_laratranslate | grep -i "chunk\|split"
# Check chunk sizes
drush tmgmt-job-item:list --format=yaml | grep "word_count"
# Monitor translation quality
drush tmgmt-job-item:list --status=completed
Advanced Usage
Custom Splitter Configuration
// For very long technical documents
$technicalSplitter = $this->textSplitter->configure([
'chunk_size' => 9900,
'chunk_overlap' => 500,
'separators' => ["\n\n", "\n", ". ", "; ", " ", ""],
'keep_separator' => KeepSeparator::End,
'length_function' => 'count_words', // Custom counting: 'count_words' must be a callable you define
]);
// For legal documents requiring sentence integrity
$legalSplitter = $this->textSplitter->configure([
'chunk_size' => 7000,
'chunk_overlap' => 300,
'separators' => [". ", "! ", "? ", " ", ""],
'keep_separator' => KeepSeparator::End,
]);
Testing Chunking
// Test splitting quality
$splitter = \Drupal::service(RecursiveCharacterTextSplitter::class);
$testText = str_repeat('Test paragraph. ', 1000); // 16,000 chars, above the 9,900 limit
$chunks = $splitter->splitText($testText);
// Verify chunk sizes
foreach ($chunks as $i => $chunk) {
$length = $splitter->getTextLength($chunk);
echo "Chunk " . ($i + 1) . ": {$length} chars\n";
}
// Test reassembly
$reassembled = $splitter->reassembleChunks($chunks);
echo "Original length: " . $splitter->getTextLength($testText) . "\n";
echo "Reassembled length: " . $splitter->getTextLength($reassembled) . "\n";
Related Documentation
- Architecture Overview - System implementation details
- Configuration Guide - Provider setup options
- Troubleshooting Guide - Common issues and solutions