Text Chunking

This guide explains how TMGMT Lara Translate handles large content by splitting it into chunks that fit within Lara's API limits.

Why Text Chunking is Needed

Lara API Limitations

  • 10,000 character limit per translation request
  • HTML included in character count (not just text content)
  • Safety buffer: Uses 9,900 character limit (100-char margin)
  • Timeout prevention: Avoids API timeouts with large content
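
Because markup counts toward the limit, a page whose visible text looks short enough can still be too long. The sketch below illustrates the difference; `LARA_SAFE_LIMIT` and `needsChunking()` are hypothetical names chosen for this example, not part of the module.

```php
<?php
// Hypothetical constant: the 10,000-character API limit minus the 100-character margin.
const LARA_SAFE_LIMIT = 9900;

// Hypothetical helper: true when the full markup exceeds the buffered limit.
function needsChunking(string $html, int $limit = LARA_SAFE_LIMIT): bool
{
    // mb_strlen counts characters rather than bytes, so multibyte text is measured correctly.
    return mb_strlen($html, 'UTF-8') > $limit;
}

// 9,890 visible characters wrapped in 21 characters of markup.
$html = '<p class="intro">' . str_repeat('é', 9890) . '</p>';

echo 'Text only:   ', mb_strlen(strip_tags($html), 'UTF-8'), "\n";
echo 'With markup: ', mb_strlen($html, 'UTF-8'), "\n";
echo 'Needs chunking: ', needsChunking($html) ? 'yes' : 'no', "\n";
```

The text alone would fit within the buffered limit, but the tags push the request over it.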

Content That Exceeds Limits

  • Long articles and documentation pages
  • HTML-rich content with many tags
  • Product descriptions with detailed specifications
  • Multi-page content combined into single items

Text Chunking Service

RecursiveCharacterTextSplitter

The module uses a recursive text splitter modeled on LangChain's RecursiveCharacterTextSplitter algorithm:

Key Features:

  • Hierarchical splitting: Paragraphs → Lines → Words → Characters
  • Context preservation: Overlapping chunks maintain translation context
  • Language-aware: Different splitting patterns for code vs prose
  • UTF-8 safe: Proper multibyte character handling
  • Configurable: Customizable chunk sizes and separators

Default Configuration for Lara

$splitter->configure([
  'chunk_size' => 9900,        // 100-char safety buffer
  'chunk_overlap' => 200,         // Context overlap
  'separators' => ["\n\n", "\n", " ", ""], // Hierarchical
  'keep_separator' => KeepSeparator::End,
]);

Chunking Process

1. Length Detection

The module checks total HTML length (including tags):

$totalLength = $this->textSplitter->getTextLength($htmlContent);

if ($totalLength > self::MAX_LENGTH) {
    // Use chunking
    $chunks = $this->textSplitter->splitText($htmlContent);
} else {
    // No chunking needed
    $chunks = [$htmlContent];
}

2. Hierarchical Splitting

Algorithm tries separators in order:

  1. Paragraphs (\n\n): Largest semantic units
  2. Lines (\n): Smaller units
  3. Spaces (" "): Individual words
  4. Characters (""): Last resort for very long words
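
The separator order above can be sketched as a small recursive function. This is a simplified illustration and not the module's implementation: it ignores overlap, always keeps each separator at the end of the piece before it, and `recursiveSplit()` is a hypothetical name.

```php
<?php
/**
 * Simplified recursive split (no overlap). Tries each separator in turn and
 * only falls back to the next one for pieces that are still too long.
 *
 * @param string[] $separators Ordered from coarsest to finest.
 * @return string[]
 */
function recursiveSplit(string $text, int $chunkSize, array $separators): array
{
    if (mb_strlen($text) <= $chunkSize) {
        return [$text];
    }

    $separator = array_shift($separators);

    if ($separator === null || $separator === '') {
        // Last resort: cut at fixed character offsets.
        return mb_str_split($text, $chunkSize);
    }

    // Zero-width split after every separator, so the separator stays attached.
    $pattern = '/(?<=' . preg_quote($separator, '/') . ')/u';
    $pieces = preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY) ?: [$text];

    $chunks = [];
    $current = '';
    foreach ($pieces as $piece) {
        if (mb_strlen($piece) > $chunkSize) {
            // Still too long: flush what we have, then retry with finer separators.
            if ($current !== '') {
                $chunks[] = $current;
                $current = '';
            }
            array_push($chunks, ...recursiveSplit($piece, $chunkSize, $separators));
        } elseif (mb_strlen($current . $piece) > $chunkSize) {
            // Adding this piece would overflow: start a new chunk.
            $chunks[] = $current;
            $current = $piece;
        } else {
            $current .= $piece;
        }
    }
    if ($current !== '') {
        $chunks[] = $current;
    }

    return $chunks;
}

$text = "First paragraph.\n\nSecond paragraph that is a bit longer.\n\nThird.";
$chunks = recursiveSplit($text, 40, ["\n\n", "\n", " ", ""]);

foreach ($chunks as $i => $chunk) {
    echo 'Chunk ', $i + 1, ': ', mb_strlen($chunk), " chars\n";
}
```

Because no characters are dropped, concatenating the chunks returns the original text, which makes the sketch easy to check against sample content.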

3. Overlap Handling

// Example: 200-character overlap between consecutive chunks
$chunks = [
    "First chunk ... ending with the 200 characters that the next chunk repeats",
    "Second chunk, starting with those same 200 characters and then continuing ...",
];
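
To make the overlap concrete, here is a minimal sketch that cuts text into fixed-size windows, each starting `chunk_size - chunk_overlap` characters after the previous one. It deliberately ignores separators, so it shows only the overlap arithmetic and not the module's separator-aware behavior; `splitWithOverlap()` is a hypothetical helper.

```php
<?php
/**
 * Fixed-window split with overlap (illustration only).
 *
 * @return string[]
 */
function splitWithOverlap(string $text, int $chunkSize, int $overlap): array
{
    if ($overlap < 0 || $overlap >= $chunkSize) {
        throw new InvalidArgumentException('Overlap must be >= 0 and smaller than the chunk size.');
    }

    $length = mb_strlen($text);
    $step = $chunkSize - $overlap; // how far each window advances
    $chunks = [];

    for ($start = 0; $start < $length; $start += $step) {
        $chunks[] = mb_substr($text, $start, $chunkSize);
        if ($start + $chunkSize >= $length) {
            break; // this window already reached the end of the text
        }
    }

    return $chunks;
}

$text = str_repeat('0123456789', 200); // 2,000 characters
$chunks = splitWithOverlap($text, 1000, 200);

// The tail of one chunk is repeated at the head of the next.
$tail = mb_substr($chunks[0], -200);
$head = mb_substr($chunks[1], 0, 200);
echo count($chunks), " chunks, overlap matches: ", $tail === $head ? 'yes' : 'no', "\n";
```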

4. Reassembly

After translation, chunks are reassembled:

$translatedHtml = $this->textSplitter->reassembleChunks($translatedChunks);

Language-Specific Splitting

HTML Content

$htmlSplitter = $this->textSplitter->forLanguage('html', [
    'chunk_size' => 9900,
    'chunk_overlap' => 200,
]);

$chunks = $htmlSplitter->splitText($htmlContent);

Splits at block element boundaries (<div>, <p>, etc.) when possible.
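
One way to find such boundaries is a zero-width regular-expression split placed before each opening block tag. The sketch below is an assumption about how this could be done, not the module's code; the tag list and the `splitAtBlockBoundaries()` name are illustrative.

```php
<?php
/**
 * Splits markup before each opening block-level tag (illustrative tag list).
 *
 * @return string[]
 */
function splitAtBlockBoundaries(string $html): array
{
    // Lookahead keeps the tag itself in the piece that follows the split point.
    $pattern = '/(?=<(?:div|p|h[1-6]|ul|ol|table|blockquote|section)\b)/i';

    return preg_split($pattern, $html, -1, PREG_SPLIT_NO_EMPTY);
}

$html = '<h2>Title</h2><p>First <em>paragraph</em>.</p><div><p>Nested</p></div>';
$pieces = splitAtBlockBoundaries($html);

foreach ($pieces as $piece) {
    echo $piece, "\n";
}
```

Inline tags such as `<em>` never trigger a split, so sentences stay intact. Pieces can still cut through a nested container (the `<div>` above), so individual pieces are not guaranteed to be well-formed on their own; concatenating them restores the original markup.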

Markdown Content

$mdSplitter = $this->textSplitter->forLanguage('markdown');
$chunks = $mdSplitter->splitText($markdown);

Recognizes:

  • Headers (#, ##, ###)
  • Code blocks (```)
  • Horizontal rules (---)
  • List items

Code Content

$phpSplitter = $this->textSplitter->forLanguage('php');
$chunks = $phpSplitter->splitText($phpCode);

Splits at:

  • Function boundaries
  • Class declarations
  • Method signatures
  • Code blocks

Configuration Options

Chunk Size Settings

Recommended sizes for different scenarios:

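| Content Type       | Chunk Size | Overlap | Use Case              |
|--------------------|------------|---------|-----------------------|
| Short articles     | 5000       | 200     | Fast processing       |
| Long documentation | 8000       | 400     | Balance speed/context |
| Technical specs    | 9900       | 500     | Maximum safety        |
| Product pages      | 6000       | 300     | Good balance          |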
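
Chunk size and overlap together determine how many API requests a document needs. Assuming fixed-size windows, the count is `ceil((length - overlap) / (chunk_size - overlap))`. Treat this as a rough lower bound, since splitting at semantic boundaries usually produces chunks shorter than the maximum. `estimateChunkCount()` is a hypothetical helper for comparing the settings above.

```php
<?php
/**
 * Rough lower bound on the number of chunks for a text of $length characters.
 */
function estimateChunkCount(int $length, int $chunkSize, int $overlap): int
{
    if ($overlap < 0 || $overlap >= $chunkSize) {
        throw new InvalidArgumentException('Overlap must be >= 0 and smaller than the chunk size.');
    }
    if ($length <= $chunkSize) {
        return 1; // fits in a single request
    }

    // Each chunk after the first contributes only (chunkSize - overlap) new characters.
    return (int) ceil(($length - $overlap) / ($chunkSize - $overlap));
}

// Compare the recommended settings for a 50,000-character document.
$settings = [
    'Short articles'     => [5000, 200],
    'Long documentation' => [8000, 400],
    'Technical specs'    => [9900, 500],
    'Product pages'      => [6000, 300],
];

foreach ($settings as $label => [$size, $overlap]) {
    printf("%-20s %d requests\n", $label, estimateChunkCount(50000, $size, $overlap));
}
```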

Separator Strategies

KeepSeparator::End (Default)

  • Separators removed from chunks
  • Added to reassembly
  • Clean final output

KeepSeparator::Yes

  • Separators kept as separate elements
  • Useful for preserving structure
  • Example: ["<p>Sentence 1</p>", "<p>Sentence 2</p>"]

KeepSeparator::No

  • Separators discarded completely
  • Continuous text flow
  • Example: "Sentence 1. Sentence 2"

Custom Length Functions

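// Word-based counting instead of characters
// (chunk_size and chunk_overlap are then measured in words)
$splitter->configure([
  'length_function' => function (string $text): int {
    // Split on any whitespace run so repeated spaces are not counted as words
    return count(preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY));
  },
]);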

Translation Flow with Chunking

flowchart TD
    A[TMGMT Job Item] --> B{Content > 9900 chars?}
    B -->|No| C[Direct Translation]
    B -->|Yes| D[Create RecursiveCharacterTextSplitter]
    D --> E[Split into chunks]
    E --> F[Translate each chunk]
    F --> I{All chunks successful?}
    I -->|Yes| G[Reassemble chunks]
    I -->|No| J[Job Item Failed]
    G --> H[Save translated data]
    C --> H
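
The "translate each chunk, fail the item if any chunk fails" branch of the flow can be sketched as follows. The `$translate` callable is a stand-in for the real Lara API call, and `translateChunks()` is a hypothetical helper, so this shows the control flow only.

```php
<?php
/**
 * Translates chunks in order and aborts on the first failure.
 *
 * @param string[] $chunks
 * @param callable $translate Stand-in for the API call: fn(string): string
 * @return string[] Translated chunks, in the original order.
 */
function translateChunks(array $chunks, callable $translate): array
{
    $translated = [];
    foreach ($chunks as $index => $chunk) {
        try {
            $translated[] = $translate($chunk);
        } catch (Throwable $e) {
            // A single failed chunk fails the whole item; partial output is discarded.
            throw new RuntimeException(
                sprintf('Chunk %d of %d failed: %s', $index + 1, count($chunks), $e->getMessage()),
                0,
                $e
            );
        }
    }

    return $translated;
}

// Fake translator for demonstration: upper-cases the chunk.
$fakeTranslate = fn (string $chunk): string => strtoupper($chunk);

$result = translateChunks(['<p>one</p>', '<p>two</p>'], $fakeTranslate);
echo implode('', $result), "\n";
```

Failing fast keeps a partially translated item from being reassembled and saved as if it were complete.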

Performance Considerations

Memory Efficiency

  • Chunk-by-chunk processing: Never loads entire document
  • Streaming approach: Minimal memory footprint
  • Garbage collection: Automatic cleanup of temporary data

Processing Time

  • Linear scaling: Time proportional to content size
  • API limits: Each chunk respects 10k character limit
  • Parallel potential: Chunks could be translated concurrently

Quality Impact

  • Context preservation: 200-char overlap maintains meaning
  • Semantic boundaries: Paragraph/line splitting reduces translation fragmentation
  • Reassembly fidelity: Exact reassembly preserves structure

Troubleshooting Chunking

Issues and Solutions

Poor Splitting Quality

Problem: Chunks are split mid-sentence or mid-word

Solutions:

// Increase overlap for better context
$splitter->configure(['chunk_overlap' => 500]);

// Use different separator order
$splitter->configure(['separators' => [". ", "? ", " ", ""]]);

Too Many Small Chunks

Problem: Excessive API calls, slow processing

Solutions:

// Raise chunk size toward the 9,900-character maximum
$splitter->configure(['chunk_size' => 9900]);

// Prefer coarser separators (paragraphs and lines before words)
$splitter->configure(['separators' => ["\n\n", "\n", " "]]);

Reassembly Issues

Problem: Missing content or broken HTML after reassembly

Solutions:

// Use End separator strategy
$splitter->configure(['keep_separator' => KeepSeparator::End]);

// Validate reassembled content
$validator = \Drupal::service(TextSplitterValidator::class);
$isValid = $validator->validateHtml($reassembled);

Best Practices

Configuration Guidelines

  1. Start with defaults: 9900 chars, 200 overlap
  2. Monitor performance: Adjust based on your content patterns
  3. Test with samples: Verify chunking quality before production
  4. Consider content type: Use language-specific modes

Content Preparation

  1. Clean HTML: Remove unnecessary attributes and comments
  2. Logical structure: Use semantic HTML elements properly
  3. Avoid deep nesting: Simpler structure chunks better
  4. Consistent formatting: Use same HTML patterns

Monitoring Tips

# Monitor chunking behavior
drush watchdog:show --type=tmgmt_laratranslate | grep -i "chunk\|split"

# Check chunk sizes
drush tmgmt-job-item:list --format=yaml | grep "word_count"

# Monitor translation quality
drush tmgmt-job-item:list --status=completed

Advanced Usage

Custom Splitter Configuration

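// For very long technical documents
$technicalSplitter = $this->textSplitter->configure([
    'chunk_size' => 9900,
    'chunk_overlap' => 500,
    'separators' => ["\n\n", "\n", ". ", "; ", " ", ""],
    'keep_separator' => KeepSeparator::End,
    // 'count_words' must be a callable you define (it is not a PHP built-in).
    // A word-based length function makes chunk_size count words, so lower
    // chunk_size accordingly to stay under Lara's character limit.
    'length_function' => 'count_words',
]);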

// For legal documents requiring sentence integrity
$legalSplitter = $this->textSplitter->configure([
    'chunk_size' => 7000,
    'chunk_overlap' => 300,
    'separators' => [". ", "! ", "? ", " ", ""],
    'keep_separator' => KeepSeparator::End,
]);

Testing Chunking

// Test splitting quality
$splitter = \Drupal::service(RecursiveCharacterTextSplitter::class);

// 32,000 characters, well above the 9,900 limit, so splitting is actually exercised
$testText = str_repeat('Test paragraph. ', 2000);
$chunks = $splitter->splitText($testText);

// Verify chunk sizes
foreach ($chunks as $i => $chunk) {
    $length = $splitter->getTextLength($chunk);
    echo "Chunk " . ($i + 1) . ": {$length} chars\n";
}

// Test reassembly
$reassembled = $splitter->reassembleChunks($chunks);
echo "Original length: " . $splitter->getTextLength($testText) . "\n";
echo "Reassembled length: " . $splitter->getTextLength($reassembled) . "\n";