Text Chunking

This guide explains how TMGMT Lara Translate handles large content by splitting it into chunks that fit within Lara's API limits.

Why Text Chunking is Needed

Lara API Limitations

10,000 character limit per translation request
HTML included in character count (not just text content)
Safety buffer: Uses 9,900 character limit (100-char margin)
Timeout prevention: Avoids API timeouts with large content
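
Because markup counts toward the limit, a page with little visible text can still exceed it. The sketch below illustrates the difference; `lara_length_report()` is a hypothetical helper written for this guide, not part of the module, and it assumes the limit is measured in UTF-8 characters.

```php
<?php
// Hypothetical helper (not part of the module): compare the visible-text
// length of a fragment with its full HTML length, both in characters.
function lara_length_report(string $html, int $limit = 9900): array {
    $htmlLength = mb_strlen($html, 'UTF-8');             // tags included
    $textLength = mb_strlen(strip_tags($html), 'UTF-8'); // visible text only

    return [
        'html_length' => $htmlLength,
        'text_length' => $textLength,
        'needs_chunking' => $htmlLength > $limit,
    ];
}

// 250 short paragraphs: 11 visible characters each, 49 characters with markup.
$html = str_repeat('<p class="intro"><strong>Hello</strong> world</p>', 250);
$report = lara_length_report($html);

printf(
    "HTML: %d chars, text: %d chars, chunking needed: %s\n",
    $report['html_length'],
    $report['text_length'],
    $report['needs_chunking'] ? 'yes' : 'no'
);
```

Here the visible text is well under the limit, but the HTML length is not, so the content would have to be chunked.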

Content That Exceeds Limits

Long articles and documentation pages
HTML-rich content with many tags
Product descriptions with detailed specifications
Multi-page content combined into single items

Text Chunking Service

RecursiveCharacterTextSplitter

The module uses a recursive character text splitter based on LangChain's algorithm.

Key Features:

Hierarchical splitting: Paragraphs → Lines → Words → Characters
Context preservation: Overlapping chunks maintain translation context
Language-aware: Different splitting patterns for code vs prose
UTF-8 safe: Proper multibyte character handling
Configurable: Customizable chunk sizes and separators

Default Configuration for Lara

$splitter->configure([
  'chunk_size' => 9900,        // 100-char safety buffer
  'chunk_overlap' => 200,         // Context overlap
  'separators' => ["\n\n", "\n", " ", ""], // Hierarchical
  'keep_separator' => KeepSeparator::End,
]);

Chunking Process

1. Length Detection

The module checks total HTML length (including tags):

$totalLength = $this->textSplitter->getTextLength($htmlContent);

if ($totalLength > self::MAX_LENGTH) {
    // Content exceeds the limit: split it into chunks
    $chunks = $this->textSplitter->splitText($htmlContent);
} else {
    // Content fits in a single request
    $chunks = [$htmlContent];
}

2. Hierarchical Splitting

The algorithm tries separators in order, moving to the next one only when a piece is still larger than the chunk size:

Paragraphs ("\n\n"): Largest semantic units
Lines ("\n"): Smaller units
Spaces (" "): Individual words
Characters (""): Last resort for very long words
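
The fallback order can be sketched in a few lines. This is a simplified illustration, not the module's implementation: `sketch_recursive_split()` is a hypothetical name, it ignores overlap, and it drops a separator wherever it cuts.

```php
<?php
// Simplified sketch of recursive splitting (no overlap; a separator is
// dropped wherever a cut happens). Hypothetical code, not the module's API.
function sketch_recursive_split(string $text, int $chunkSize, array $separators): array {
    if (mb_strlen($text) <= $chunkSize) {
        return $text === '' ? [] : [$text];
    }

    $separator = array_shift($separators);
    if ($separator === null || $separator === '') {
        // Last resort: hard-cut into fixed-size pieces.
        return mb_str_split($text, $chunkSize);
    }

    $chunks = [];
    $current = '';
    foreach (explode($separator, $text) as $piece) {
        if (mb_strlen($piece) > $chunkSize) {
            // The piece alone is too large: flush what we have, then retry
            // the piece with the remaining, finer separators.
            if ($current !== '') {
                $chunks[] = $current;
                $current = '';
            }
            array_push($chunks, ...sketch_recursive_split($piece, $chunkSize, $separators));
            continue;
        }

        // Merge small pieces back together while they still fit.
        $candidate = $current === '' ? $piece : $current . $separator . $piece;
        if (mb_strlen($candidate) <= $chunkSize) {
            $current = $candidate;
        } else {
            $chunks[] = $current;
            $current = $piece;
        }
    }
    if ($current !== '') {
        $chunks[] = $current;
    }

    return $chunks;
}

$chunks = sketch_recursive_split(
    "alpha beta\n\ngamma delta epsilon\n\nzeta",
    12,
    ["\n\n", "\n", " ", ""]
);
print_r($chunks);
```

The first and last paragraphs fit as they are; only the middle paragraph is too long, so it alone falls through to the space separator.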

3. Overlap Handling

// Example: 200-character overlap between consecutive chunks
$chunks = [
    "First chunk: text up to the chunk size limit",
    "Second chunk: repeats the last 200 characters of the first chunk, then continues",
];
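
To make the overlap concrete, here is a minimal sliding-window sketch. It is an assumption-laden illustration: `sketch_overlapping_windows()` is hypothetical, it cuts at fixed character positions rather than at separators, and it does not show how the module removes the repeated text during reassembly.

```php
<?php
// Hypothetical sliding-window sketch: fixed-size windows, each repeating the
// last $overlap characters of the previous window.
function sketch_overlapping_windows(string $text, int $chunkSize, int $overlap): array {
    if ($overlap < 0 || $overlap >= $chunkSize) {
        throw new InvalidArgumentException('Overlap must be smaller than the chunk size.');
    }

    $length = mb_strlen($text);
    $step = $chunkSize - $overlap; // how far each window advances
    $chunks = [];

    for ($start = 0; $start < $length; $start += $step) {
        $chunks[] = mb_substr($text, $start, $chunkSize);
        if ($start + $chunkSize >= $length) {
            break; // this window already reached the end of the text
        }
    }

    return $chunks;
}

// Small numbers so the overlap is easy to see: size 10, overlap 3.
$chunks = sketch_overlapping_windows('abcdefghijklmnopqrstuvwxyz', 10, 3);
print_r($chunks);
```

Each chunk after the first starts with the final three characters of the chunk before it, which is the same relationship the 200-character default creates at full scale.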

4. Reassembly

After translation, chunks are reassembled:

$translatedHtml = $this->textSplitter->reassembleChunks($translatedChunks);

Language-Specific Splitting

HTML Content

$htmlSplitter = $this->textSplitter->forLanguage('html', [
    'chunk_size' => 9900,
    'chunk_overlap' => 200,
]);

$chunks = $htmlSplitter->splitText($htmlContent);

Splits at block element boundaries (<div> and similar) when possible.
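
As a rough illustration of boundary-aware splitting, the sketch below cuts after closing block-level tags. The tag list and the function name `sketch_split_html_blocks()` are assumptions made for this example; the module's actual separator patterns may differ, and this version does not try to keep nested blocks together.

```php
<?php
// Hypothetical sketch: cut after closing block-level tags so each piece ends
// on a block boundary. The tag list is an assumption for illustration.
const SKETCH_BLOCK_TAGS = 'div|p|ul|ol|table|blockquote|h[1-6]';

function sketch_split_html_blocks(string $html): array {
    // Capture the closing tags so they are returned alongside the text.
    $parts = preg_split(
        '~(</(?:' . SKETCH_BLOCK_TAGS . ')>)~i',
        $html,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );

    $pieces = [];
    $current = '';
    foreach ($parts as $part) {
        $current .= $part;
        // A closing block tag ends the current piece.
        if (preg_match('~^</(?:' . SKETCH_BLOCK_TAGS . ')>$~i', $part) === 1) {
            $pieces[] = $current;
            $current = '';
        }
    }
    if ($current !== '') {
        $pieces[] = $current; // trailing content after the last block
    }

    return $pieces;
}

$html = '<h2>Title</h2><p>First.</p><p>Second <em>one</em>.</p>Tail';
$pieces = sketch_split_html_blocks($html);
print_r($pieces);
```

Inline tags such as `<em>` never trigger a cut, and concatenating the pieces gives back the original string, so no markup is lost at this stage.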

Markdown Content

$mdSplitter = $this->textSplitter->forLanguage('markdown');
$chunks = $mdSplitter->splitText($markdown);

Recognizes:

Headers (#, ##, ###)
Code blocks (triple-backtick fences)
Horizontal rules (---)
List items
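
Header-based splitting can be pictured with a single regular expression. This is only a sketch under simplifying assumptions: `sketch_split_markdown_sections()` is a hypothetical name, it handles ATX headers only, and it does not skip `#` lines that appear inside code blocks.

```php
<?php
// Hypothetical sketch: start a new section at every ATX header line
// ("# ", "## ", ... up to six "#"). Headers inside code fences are not
// treated specially here.
function sketch_split_markdown_sections(string $markdown): array {
    // Zero-width match at the start of each header line, so the header
    // stays attached to the section it introduces.
    return preg_split('/^(?=#{1,6} )/m', $markdown, -1, PREG_SPLIT_NO_EMPTY);
}

$markdown = "# Title\nIntro text.\n\n## Setup\nSteps here.\n";
$sections = sketch_split_markdown_sections($markdown);
print_r($sections);
```

Every section begins with its own header, which keeps a heading and the text it introduces in the same translation request whenever they fit.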

Code Content

$phpSplitter = $this->textSplitter->forLanguage('php');
$chunks = $phpSplitter->splitText($phpCode);

Splits at:

Function boundaries
Class declarations
Method signatures
Code blocks

Configuration Options

Chunk Size Settings

Recommended sizes for different scenarios:

Content Type	Chunk Size	Overlap	Use Case
Short articles	5000	200	Fast processing
Long documentation	8000	400	Balance speed/context
Technical specs	9900	500	Maximum safety
Product pages	6000	300	Good balance
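
Chunk size and overlap together determine how many API requests a document needs. The estimator below is a hypothetical helper that assumes every chunk is filled to the maximum; because the real splitter cuts at natural boundaries, the result is best read as a lower bound.

```php
<?php
// Hypothetical estimator (not part of the module): minimum number of chunks
// if every chunk is filled completely and repeats $overlap characters.
function sketch_estimate_chunks(int $length, int $chunkSize, int $overlap): int {
    if ($overlap < 0 || $overlap >= $chunkSize) {
        throw new InvalidArgumentException('Overlap must be smaller than the chunk size.');
    }
    if ($length <= $chunkSize) {
        return 1;
    }

    // After the first chunk, each further chunk adds ($chunkSize - $overlap)
    // new characters.
    return 1 + (int) ceil(($length - $chunkSize) / ($chunkSize - $overlap));
}

// The four rows of the table above, applied to a 50,000-character document.
foreach ([[5000, 200], [8000, 400], [9900, 500], [6000, 300]] as [$size, $overlap]) {
    printf(
        "size %d, overlap %d: at least %d requests\n",
        $size,
        $overlap,
        sketch_estimate_chunks(50000, $size, $overlap)
    );
}
```

For a document of that length, the smallest chunk size needs roughly twice as many requests as the largest one.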

Separator Strategies

KeepSeparator::End (Default)

Separators removed from chunks
Added to reassembly
Clean final output

KeepSeparator::Yes

Separators kept as separate elements
Useful for preserving structure
Example: ["Sentence 1", "Sentence 2"]

KeepSeparator::No

Separators discarded completely
Continuous text flow
Example: "Sentence 1. Sentence 2"

Custom Length Functions

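// Word-based counting (a rough stand-in for tokens) instead of characters
$splitter->configure([
  'length_function' => function (string $text): int {
    // Split on any run of whitespace and ignore empty pieces
    return count(preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY));
  },
]);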
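
Whatever length function is used, it should count multibyte characters consistently. The sketch below contrasts byte length with character length; the closure name `$utf8Length` is illustrative, and passing it as `length_function` assumes the option accepts any PHP callable, as the example above suggests.

```php
<?php
// A character-based length callable. strlen() counts bytes, so it
// overstates the length of any text containing multibyte characters.
$utf8Length = static function (string $text): int {
    return mb_strlen($text, 'UTF-8');
};

$word = "h\u{00e9}llo";                     // "héllo": 5 characters, 6 bytes
$japanese = "\u{65e5}\u{672c}\u{8a9e}";     // 3 characters, 9 bytes

printf("%s: %d chars, %d bytes\n", $word, $utf8Length($word), strlen($word));
printf("%s: %d chars, %d bytes\n", $japanese, $utf8Length($japanese), strlen($japanese));
```

For languages written mostly in multibyte scripts, a byte-based count would trigger chunking far earlier than a character-based one.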

Translation Flow with Chunking

flowchart TD
    A[TMGMT Job Item] --> B{Content > 9900 chars?}
    B -->|No| C[Direct Translation]
    B -->|Yes| D[Create RecursiveCharacterTextSplitter]
    D --> E[Split into chunks]
    E --> F[Translate each chunk]
    F --> I{All chunks successful?}
    I -->|Yes| G[Reassemble chunks]
    I -->|No| J[Job Item Failed]
    G --> H[Save translated data]
    C --> H
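
The same flow can be expressed as a short function. Everything here is a stand-in: `sketch_translate_item()` is hypothetical, and the `$split`, `$translate`, and `$join` callables take the place of the module's splitter, the Lara API call, and reassembly.

```php
<?php
// Hypothetical flow sketch. $split, $translate and $join stand in for the
// module's splitter, the Lara API call and reassembly; none are real APIs.
function sketch_translate_item(
    string $html,
    callable $split,
    callable $translate,
    callable $join,
    int $limit = 9900
): string {
    // Only split when the content exceeds the limit.
    $chunks = mb_strlen($html) > $limit ? $split($html) : [$html];

    $translated = [];
    foreach ($chunks as $index => $chunk) {
        try {
            $translated[] = $translate($chunk);
        } catch (Throwable $e) {
            // One failed chunk fails the whole item; nothing partial is returned.
            throw new RuntimeException("Chunk {$index} failed: " . $e->getMessage(), 0, $e);
        }
    }

    return $join($translated);
}

// Toy stand-ins: 5-character chunks, "translation" by upper-casing.
$split = fn (string $text): array => mb_str_split($text, 5);
$join = fn (array $parts): string => implode('', $parts);

echo sketch_translate_item('hello world', $split, 'strtoupper', $join, 5), "\n";
```

Failing the whole item when a single chunk fails avoids saving a translation with a missing section.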

Performance Considerations

Memory Efficiency

Chunk-by-chunk processing: Never loads entire document
Streaming approach: Minimal memory footprint
Garbage collection: Automatic cleanup of temporary data

Processing Time

Linear scaling: Time proportional to content size
API limits: Each chunk respects 10k character limit
Parallel potential: Chunks could be translated concurrently

Quality Impact

Context preservation: 200-char overlap maintains meaning
Semantic boundaries: Paragraph/line splitting reduces translation fragmentation
Reassembly fidelity: Exact reassembly preserves structure

Troubleshooting Chunking

Issues and Solutions

Poor Splitting Quality

Problem: Chunks split mid-sentence or mid-word

Solutions:

// Increase overlap for better context
$splitter->configure(['chunk_overlap' => 500]);

// Use different separator order
$splitter->configure(['separators' => [". ", "? ", " ", ""]]);

Too Many Small Chunks

Problem: Excessive API calls, slow processing

Solutions:

// Raise chunk size to the maximum safe value
$splitter->configure(['chunk_size' => 9900]);

// Prefer coarse separators so chunks fill up before splitting
$splitter->configure(['separators' => ["\n\n", "\n", " "]]);

Reassembly Issues

Problem: Missing content or broken HTML after reassembly

Solutions:

// Use End separator strategy
$splitter->configure(['keep_separator' => KeepSeparator::End]);

// Validate reassembled content
$validator = \Drupal::service(TextSplitterValidator::class);
$isValid = $validator->validateHtml($reassembled);
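
For a quick check outside the module, a tag-balance test catches the most common reassembly failure, an element opened in one chunk and never closed. The function below is a hypothetical, deliberately naive sketch: it is not a full HTML parser and will misread comments or attribute values that contain `>`.

```php
<?php
// Hypothetical, naive balance check: every opening tag must be closed in the
// right order. Not a real HTML parser.
function sketch_tags_balanced(string $html): bool {
    // Elements that never have a closing tag.
    $void = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
             'link', 'meta', 'source', 'track', 'wbr'];

    preg_match_all('~<(/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*?(/?)>~', $html, $matches, PREG_SET_ORDER);

    $stack = [];
    foreach ($matches as [, $closing, $name, $selfClosing]) {
        $name = strtolower($name);
        if ($selfClosing === '/' || in_array($name, $void, true)) {
            continue; // nothing to balance
        }
        if ($closing === '') {
            $stack[] = $name; // opening tag
            continue;
        }
        if (array_pop($stack) !== $name) {
            return false; // closing tag does not match the last opened tag
        }
    }

    return $stack === []; // anything left open means a missing close
}

var_dump(sketch_tags_balanced('<div><p>Hi<br></p></div>'));
var_dump(sketch_tags_balanced('<div><p>Hi</div>'));
```

Running the check on the source HTML first helps separate problems introduced by chunking from markup that was already unbalanced.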

Best Practices

Configuration Guidelines

Start with defaults: 9900 chars, 200 overlap
Monitor performance: Adjust based on your content patterns
Test with samples: Verify chunking quality before production
Consider content type: Use language-specific modes

Content Preparation

Clean HTML: Remove unnecessary attributes and comments
Logical structure: Use semantic HTML elements properly
Avoid deep nesting: Simpler structure chunks better
Consistent formatting: Use same HTML patterns

Monitoring Tips

# Monitor chunking behavior
drush watchdog:show --type=tmgmt_laratranslate | grep -i "chunk\|split"

# Check chunk sizes
drush tmgmt-job-item:list --format=yaml | grep "word_count"

# Monitor translation quality
drush tmgmt-job-item:list --status=completed

Advanced Usage

Custom Splitter Configuration

// For very long technical documents
$technicalSplitter = $this->textSplitter->configure([
    'chunk_size' => 9900,
    'chunk_overlap' => 500,
    'separators' => ["\n\n", "\n", ". ", "; ", " ", ""],
    'keep_separator' => KeepSeparator::End,
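    'length_function' => 'count_words', // count_words is not a PHP built-in: supply your own callable; chunk_size is then measured in its units, not characters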
]);

// For legal documents requiring sentence integrity
$legalSplitter = $this->textSplitter->configure([
    'chunk_size' => 7000,
    'chunk_overlap' => 300,
    'separators' => [". ", "! ", "? ", " ", ""],
    'keep_separator' => KeepSeparator::End,
]);

Testing Chunking

// Test splitting quality
$splitter = \Drupal::service(RecursiveCharacterTextSplitter::class);

$testText = str_repeat('Test paragraph. ', 200);
$chunks = $splitter->splitText($testText);

// Verify chunk sizes
foreach ($chunks as $i => $chunk) {
    $length = $splitter->getTextLength($chunk);
    echo "Chunk " . ($i + 1) . ": {$length} chars\n";
}

// Test reassembly
$reassembled = $splitter->reassembleChunks($chunks);
echo "Original length: " . $splitter->getTextLength($testText) . "\n";
echo "Reassembled length: " . $splitter->getTextLength($reassembled) . "\n";
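
Printing lengths is useful for inspection, but an automated check is easier to rerun. The helper below is a hypothetical addition that reports any chunk over the limit; it works on a plain array of strings and does not depend on the module.

```php
<?php
// Hypothetical helper: return the lengths of all chunks that exceed the
// limit, keyed by their position in the input array.
function sketch_oversized_chunks(array $chunks, int $limit = 9900): array {
    $lengths = array_map(
        static fn (string $chunk): int => mb_strlen($chunk, 'UTF-8'),
        $chunks
    );

    // array_filter() preserves keys, so the result shows which chunks failed.
    return array_filter($lengths, static fn (int $length): bool => $length > $limit);
}

$chunks = [str_repeat('a', 9900), str_repeat('b', 9901), 'short'];
$oversized = sketch_oversized_chunks($chunks);

if ($oversized === []) {
    echo "All chunks fit within the limit\n";
} else {
    foreach ($oversized as $index => $length) {
        echo "Chunk {$index} is too long: {$length} chars\n";
    }
}
```

An empty result means every chunk can be sent as a single request.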

Related Documentation

Architecture Overview: System implementation details
Configuration Guide: Provider setup options
Troubleshooting Guide: Common issues and solutions
