# AGENTS.md - Guide for AI Agents Working on chardet

This document helps AI agents understand the chardet codebase and work effectively with it.

## Project Overview

**chardet** is a Python port of Mozilla's Universal Character Encoding Detector algorithm. It automatically detects the character encoding of text documents by analyzing their byte patterns, character distributions, and coding schemes.

- **Original Algorithm**: Based on Mozilla's work described at https://www-archive.mozilla.org/projects/intl/universalcharsetdetection
- **Python Porter**: Mark Pilgrim
- **Primary Maintainer**: Dan Blanchard (@dan-blanchard)
- **License**: LGPL v2.1+
- **Python Support**: 3.10+ (maintain compatibility with Python 3.10)
- **Package Manager**: This project uses [uv](https://docs.astral.sh/uv/) for dependency management

**Important**: Prefix all Python commands with `uv run` (e.g., `uv run pytest test.py` instead of `pytest test.py`)

### Key Design Principles

1. **Legacy Support Focus**: chardet is slower than modern alternatives (charset-normalizer, cChardet) but exists for:
   - Legacy projects that cannot migrate to charset-normalizer
   - Python implementations not supported by cChardet (PyPy, IronPython)

2. **Detection Algorithm**: Uses a composite approach with three complementary methods:
   - **Coding Scheme Method**: Parallel state machines detect invalid byte sequences
   - **Character Distribution Method**: Analyzes frequency of characters (unigrams) for multi-byte encodings
   - **2-Char Sequence Distribution Method**: Analyzes bigram frequencies for single-byte encodings

3. **Accuracy Over Speed**: Performance improvements are welcome, but accuracy is the primary goal.
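
To make the coding scheme method from the list above more concrete, here is a minimal sketch of a validity check for UTF-8 written as a small state machine. It is a simplified illustration, not the tables chardet keeps in `mbcssm.py`: the function name `is_plausible_utf8` is hypothetical, and the sketch skips overlong-form and surrogate checks.

```python
def is_plausible_utf8(data: bytes) -> bool:
    """Return False as soon as a byte is illegal for the current state."""
    pending = 0  # continuation bytes still expected for the current character
    for byte in data:
        if pending:
            # Inside a multi-byte character: only 0b10xxxxxx is legal here.
            if byte & 0xC0 != 0x80:
                return False
            pending -= 1
        elif byte < 0x80:
            continue  # plain ASCII, stay in the start state
        elif 0xC2 <= byte <= 0xDF:
            pending = 1  # lead byte of a 2-byte character
        elif 0xE0 <= byte <= 0xEF:
            pending = 2  # lead byte of a 3-byte character
        elif 0xF0 <= byte <= 0xF4:
            pending = 3  # lead byte of a 4-byte character
        else:
            return False  # stray continuation byte or invalid lead byte
    # This sketch treats input that ends mid-character as invalid.
    return pending == 0
```

A single illegal transition is enough to rule an encoding out, which is why this method is good at eliminating candidates but needs the distribution methods to choose among the encodings that remain.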

## Architecture Overview

See `NOTES.rst` for detailed class hierarchy. Key components:

### Core Detection Flow

```
UniversalDetector
  ├─> CharSetGroupProber (abstract orchestrator)
  │    ├─> MBCSGroupProber (multi-byte charsets: UTF-8, GB18030, Big5, EUC-*, Shift-JIS, etc.)
  │    │    └─> MultiByteCharSetProber (uses CodingStateMachine + CharDistributionAnalysis)
  │    └─> SBCSGroupProber (single-byte charsets: ISO-8859-*, Windows-125*, etc.)
  │         └─> SingleByteCharSetProber (uses precedence matrix/bigram model)
```

### Key Classes & Their Roles

- **UniversalDetector** (`universaldetector.py`): Entry point, coordinates all probers
- **CharSetProber** (`charsetprober.py`): Abstract base class for all probers
- **CharSetGroupProber** (`charsetgroupprober.py`): Runs multiple related probers simultaneously
- **CodingStateMachine** (`codingstatemachine.py`): Detects invalid byte sequences using state machines
- **CharDistributionAnalysis** (`chardistribution.py`): Analyzes 2-byte character unigram frequencies
- **SingleByteCharSetProber** (`sbcharsetprober.py`): Uses precedence matrices (bigrams) for single-byte encodings
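
The orchestration idea behind a group prober can be sketched roughly as follows. `ToyProber` and `ToyGroupProber` are hypothetical stand-ins written for this guide, not chardet classes; they only borrow the `feed()` and `get_confidence()` method names, and the `prior` values are made up.

```python
class ToyProber:
    """Hypothetical stand-in for a real prober."""

    def __init__(self, charset_name: str, prior: float) -> None:
        self.charset_name = charset_name
        self.prior = prior  # made-up confidence while the charset is plausible
        self.active = True

    def feed(self, data: bytes) -> None:
        # Rule the charset out once the bytes cannot be decoded with it.
        try:
            data.decode(self.charset_name)
        except UnicodeDecodeError:
            self.active = False

    def get_confidence(self) -> float:
        return self.prior if self.active else 0.0


class ToyGroupProber:
    """Feeds every still-active child and reports the most confident one."""

    def __init__(self, probers) -> None:
        self.probers = list(probers)

    def feed(self, data: bytes) -> None:
        for prober in self.probers:
            if prober.active:  # skip children that were already ruled out
                prober.feed(data)

    def best_guess(self):
        best = max(self.probers, key=lambda p: p.get_confidence(), default=None)
        if best is None or best.get_confidence() == 0.0:
            return None
        return best.charset_name
```

With an ASCII child and a UTF-8 child, feeding plain ASCII text leaves the higher-prior ASCII child on top; feeding accented UTF-8 text afterwards deactivates it and the UTF-8 child wins.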

## File Organization

### Language Models (Generated - Do Not Edit Manually)

**Bigram Models** (for single-byte encodings):

- `lang*model.py` (e.g., `langbulgarianmodel.py`, `langfrenchmodel.py`, etc.)
- Generated by `create_language_model.py`
- Used by `SingleByteCharSetProber` via `SBCSGroupProber`
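
As a rough intuition for what a bigram model captures, the toy sketch below learns the most frequent letter pairs from a training text and scores a sample by the share of its pairs that appear in that set. The helpers `train_bigrams` and `bigram_score` are hypothetical; the generated models are far more detailed (they work on byte frequency orders and a precedence matrix), so treat this only as an illustration of the principle.

```python
from collections import Counter


def _letters(text: str) -> list[str]:
    # Keep letters and spaces only, lowercased, so pairs are comparable.
    return [c for c in text.lower() if c.isalpha() or c == " "]


def train_bigrams(text: str, top_n: int = 50) -> set[str]:
    """Return the top_n most frequent adjacent character pairs."""
    letters = _letters(text)
    pairs = Counter(a + b for a, b in zip(letters, letters[1:]))
    return {pair for pair, _ in pairs.most_common(top_n)}


def bigram_score(sample: str, common_pairs: set[str]) -> float:
    """Fraction of the sample's adjacent pairs found in the trained set."""
    letters = _letters(sample)
    pairs = [a + b for a, b in zip(letters, letters[1:])]
    if not pairs:
        return 0.0
    return sum(pair in common_pairs for pair in pairs) / len(pairs)
```

Bytes decoded with the wrong single-byte encoding tend to produce pairs that are rare in the target language, so their score drops relative to the correct encoding.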

**Unigram Models** (for multi-byte encodings):

- `*freq.py` (e.g., `big5freq.py`, `euckrfreq.py`, `gb2312freq.py`, etc.)
- Generated separately (older process)
- Used by `CharDistributionAnalysis`

### Prober Files

**Multi-byte Probers**:

- `big5prober.py`, `cp949prober.py`, `eucjpprober.py`, `euckrprober.py`
- `gb18030prober.py`, `sjisprober.py`, `johabprober.py`
- `mbcharsetprober.py` (base class), `mbcsgroupprober.py` (orchestrator)

**Single-byte Probers**:

- `latin1prober.py`, `hebrewprober.py`, `macromanprober.py`
- `sbcharsetprober.py` (base class), `sbcsgroupprober.py` (orchestrator)

**Coding Scheme Files**:

- `utf8prober.py`, `utf1632prober.py`
- `escprober.py` (escape sequences), `escsm.py` (escape state machines)
- `codingstatemachine.py`, `mbcssm.py` (state machine definitions)

### Supporting Files

- `__init__.py`: Main `detect()` and `detect_all()` functions
- `cli/`: Command-line tool (`chardetect`)
- `enums.py`: Constants and enumerations
- `metadata/`: Language metadata
- `version.py`: Version string

### Training Data & Scripts

- `wiki_*.txt`: Cached Wikipedia training data (generated, not committed for licensing reasons)
- `create_language_model.py`: Retrains bigram models from Wikipedia
- `convert_language_model.py`: Utility for converting between model formats
- `bench.py`: Benchmarking script

## Development Workflow

### Testing

**Run all tests**:

```bash
uv run pytest test.py
```

**Test structure**:

- `test.py`: Main test runner that iterates through `tests/` directory
- `tests/`: Contains subdirectories named by encoding (e.g., `tests/utf-8/`, `tests/iso-8859-1/`)
- Each subdirectory contains sample files (.txt, .html, .xml, .srt) in that encoding
- Tests verify that `chardet.detect()` returns the correct encoding name
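
The directory convention above can be turned into test cases along these lines. This is a sketch under assumptions, not the code in `test.py`: `collect_cases` and its `language_suffixes` parameter are hypothetical names, and the real runner may normalize directory names differently.

```python
from pathlib import Path

SAMPLE_SUFFIXES = {".txt", ".html", ".xml", ".srt"}


def collect_cases(tests_dir: Path, language_suffixes=()) -> list[tuple[Path, str]]:
    """Pair each sample file with the encoding named by its directory."""
    cases = []
    for path in sorted(tests_dir.glob("*/*")):
        if not path.is_file() or path.suffix.lower() not in SAMPLE_SUFFIXES:
            continue  # ignore files that are not recognized samples
        expected = path.parent.name.lower()
        # Assumption: a trailing "-<language>" part is not part of the encoding.
        for suffix in language_suffixes:
            if expected.endswith(suffix):
                expected = expected[: -len(suffix)]
                break
        cases.append((path, expected))
    return cases
```

Each `(path, expected)` pair would then become one parametrized test that compares `expected` with the encoding reported for the file's bytes.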

**Expected Failures**:

- See `EXPECTED_FAILURES` set in `test.py` for known failing test cases
- Use `pytest.mark.xfail` for expected failures

### Linting & Formatting

**Pre-commit** (automatically runs on commit):

```bash
pre-commit install  # Install hooks
```

**Manual linting** (recommended before committing):

```bash
uv run ruff check .        # Check for errors
uv run ruff format .       # Format code
```

**Configuration**:

- Uses ruff for linting and formatting (configured in `pyproject.toml`)
- Follows Black-style formatting
- Ignores E501 (line length) but aims for reasonable line lengths

### Building & Distribution

```bash
# Build distribution
uv run python -m build

# Development environment is managed by uv (see uv.lock)
# Install dependencies with:
uv sync
```

### Version Management

- Versions are automatically managed by `hatch-vcs` from git tags
- See `[tool.hatch.version]` in `pyproject.toml`

## Common Tasks

### 1. Adding Support for a New Encoding

**For multi-byte encodings:**

1. Create a new prober class in `chardet/` (e.g., `mynewencodingprober.py`)
2. Inherit from `MultiByteCharSetProber`
3. Define state machine in `mbcssm.py` or create new SM file
4. Add character distribution data if applicable (create `mynewencodingfreq.py`)
5. Register prober in `MBCSGroupProber` (`mbcsgroupprober.py`)
6. Add test files to `tests/mynewencoding/`
7. Update README.rst with new encoding

**For single-byte encodings:**

1. Add language to `chardet/metadata/languages.py` if needed
2. Generate training model using `create_language_model.py` (requires Wikipedia data)
3. The script will create `langmylanguagemodel.py` automatically
4. Register in `SBCSGroupProber` (`sbcsgroupprober.py`)
5. Add test files to `tests/mynewencoding-mylanguage/`
6. Update README.rst with new encoding

### 2. Fixing Detection Accuracy Issues

**Debugging steps:**

1. Add a test case file to the appropriate `tests/<encoding>/` directory
2. Run `uv run pytest test.py -k <filename>` to verify failure
3. Use `chardet.detect_all(data, ignore_threshold=True)`, with `data` holding the file's raw bytes, to see all prober results
4. Check which prober is producing the wrong result
5. Examine the prober's state machine or language model
6. Adjust thresholds, state machines, or retrain models as needed

**Key confidence thresholds:**

- See `get_confidence()` methods in prober classes
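- Two thresholds come up most often: `SHORTCUT_THRESHOLD` (0.95, the confidence at which a prober's answer is accepted early) and `MINIMUM_THRESHOLD` (the floor below which results are not reported); check the source for the current values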
- Adjust carefully - affects overall detection accuracy

### 3. Improving Performance

**Known bottlenecks:**

- Character distribution analysis (unigram/bigram lookups)
- State machine transitions
- Multiple probers running in parallel

**Optimization guidelines:**

- Profile before optimizing: use `bench.py` or Python profilers
- Avoid breaking existing detection accuracy
- Test thoroughly after changes (run full test suite)
- Consider using `__slots__` for frequently instantiated classes
- Optimize hot paths (inner loops in `feed()` methods)
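
For the "profile before optimizing" step, a small helper built on the standard library's `cProfile` and `pstats` is often enough. The helper name `profile_call` is hypothetical and not part of this repository; it wraps any callable, so it could be pointed at `chardet.detect` with a sample payload.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile and return (result, report_text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so callers of slow inner loops float to the top.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()
```

Comparing the report before and after a change, on the same input, gives a quick check that an optimization actually moved the hot path rather than just rearranging it.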

### 4. Updating Documentation

**Key documentation files:**

- `README.rst`: User-facing documentation (installation, usage, encodings list)
- `NOTES.rst`: Internal architecture and developer notes
- `docs/`: Sphinx documentation (hosted at https://chardet.readthedocs.io/)

**After documentation changes:**

- No special linting/testing required for docs-only changes
- Preview Sphinx docs locally: `cd docs && make html`

### 5. Retraining Language Models

1. Ensure dependencies are installed: `uv sync`
2. Run training script:
   ```bash
   uv run python create_language_model.py <language> --max-pages 20000
   ```
3. Script downloads Wikipedia articles and generates `wiki_<language>.txt` cache
4. Generates/updates `lang<language>model.py` in working directory
5. Move generated `lang<language>model.py` to `chardet/` directory
6. Commit `lang*model.py` files after moving
7. Test thoroughly with existing and new test files

**DO NOT**:

- Modify `wiki_*.txt` files manually
- Delete `wiki_*.txt` files (needed for licensing compliance)

## Important Constraints

### Python Compatibility

- **Minimum version**: Python 3.10
- **Do not use**:
  - Syntax or features introduced after Python 3.10
  - Standard library methods not available in 3.10
  - Type hints that require `from __future__ import annotations` unless already present

### Performance Philosophy

- Focus on correctness and compatibility over speed
- Performance improvements are welcome but secondary to accuracy

### Breaking Changes

- Avoid breaking the public API (`detect()`, `detect_all()`, `UniversalDetector`)
- Maintain backward compatibility with existing detection behavior
- Add new parameters as optional with sensible defaults
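
One way to follow the last point is to add new options as keyword-only parameters whose defaults reproduce the old behavior. The function below is a hypothetical toy, not chardet's `detect()`; both `detect_sketch` and `min_confidence` are invented for illustration.

```python
def detect_sketch(data: bytes, *, min_confidence: float = 0.0) -> dict:
    """Toy detector: confidence is simply the share of ASCII bytes."""
    if not data:
        return {"encoding": None, "confidence": 0.0}
    ascii_ratio = sum(byte < 0x80 for byte in data) / len(data)
    # The new option only changes results when a caller opts in.
    if ascii_ratio < min_confidence:
        return {"encoding": None, "confidence": 0.0}
    return {"encoding": "ascii", "confidence": ascii_ratio}
```

Existing calls such as `detect_sketch(data)` keep returning what they did before, and because the parameter is keyword-only, later additions cannot silently change the meaning of positional arguments.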

## Code Style Guidelines

### General Principles

- Follow PEP 8 (enforced by ruff)
- Use Black-style formatting
- Type hints are optional but encouraged
- Docstrings for public APIs

### Naming Conventions

- Classes: `PascalCase` (e.g., `UniversalDetector`)
- Functions/methods: `snake_case` (e.g., `get_confidence()`)
- Constants: `UPPER_SNAKE_CASE` (e.g., `MINIMUM_THRESHOLD`)
- Private members: prefix with `_` (e.g., `_mDistributionAnalyzer`)

### Comments

- Comment complex algorithms or non-obvious logic
- Don't over-comment obvious code
- Prefer self-documenting code with clear variable names

## Testing Best Practices

### Adding Test Files

1. Place files in `tests/<encoding>/` or `tests/<encoding>-<language>/`
2. Use standard extensions: `.txt`, `.html`, `.xml`, `.srt`
3. Ensure files are valid in their claimed encoding
4. Test files should be representative of real-world usage
5. Include both typical and edge cases
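
Step 3 can be checked mechanically before a file is committed. The helper below is a sketch (the name `decodes_cleanly` is hypothetical) that relies only on Python's built-in codecs, so it works only for encodings Python itself knows by that name.

```python
import codecs
from pathlib import Path


def decodes_cleanly(path: Path, encoding: str) -> bool:
    """Return True if the file's bytes decode strictly in the given encoding."""
    codecs.lookup(encoding)  # raises LookupError for unknown encoding names
    try:
        path.read_bytes().decode(encoding, errors="strict")
    except UnicodeDecodeError:
        return False
    return True
```

This is a necessary check rather than a sufficient one: some single-byte encodings, such as ISO-8859-1, accept every byte sequence, so a clean decode does not prove the file was really written in that encoding.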

### Writing Tests

- Tests are automatically generated from `tests/` directory structure
- Add to `EXPECTED_FAILURES` for known issues (with issue tracker reference)
- Use hypothesis for property-based testing when applicable

## Debugging Tips

### Understanding Detection Results

```python
# Run with: uv run python your_script.py
import chardet

# Sample input; replace with the bytes you are investigating
byte_data = "Déjà vu, naïve café".encode("utf-8")

# Get single best result
result = chardet.detect(byte_data)
print(result)  # a dict with 'encoding', 'confidence', and 'language' keys

# Get all results (for debugging), including those below the threshold
all_results = chardet.detect_all(byte_data, ignore_threshold=True)
for r in all_results:
    print(f"{r['encoding']}: {r['confidence']}")
```

### Common Issues

1. **False positives**: Prober confidence too high for wrong encoding
   - Solution: Adjust confidence thresholds or improve state machines

2. **False negatives**: Correct encoding not detected
   - Solution: Add more training data or adjust minimum thresholds

3. **Slow detection**: Processing large files takes too long
   - Solution: Profile and optimize hot paths, consider early exits

## Resources

### External Documentation

- **Original Mozilla Paper**: https://www-archive.mozilla.org/projects/intl/universalcharsetdetection
- **Mozilla Source Code**: https://dxr.mozilla.org/mozilla/source/intl/chardet/
- **User Documentation**: https://chardet.readthedocs.io/
- **Repository**: https://github.com/chardet/chardet
- **Issues**: https://github.com/chardet/chardet/issues

### Related Projects

- **charset-normalizer**: Modern, faster alternative (pure Python)
- **cChardet**: Fast C-based implementation (CPython only)
- **uchardet**: C++ port used by various applications

## Quick Reference

### Common Commands

```bash
# Run tests
uv run pytest test.py

# Run specific test
uv run pytest test.py -k "test_encoding_detection[tests/utf-8/sample.txt-utf-8]"

# Lint code
uv run ruff check .

# Format code
uv run ruff format .

# Build distribution
uv run python -m build

# Run chardetect CLI
uv run chardetect file1.txt file2.txt
```

### Entry Points for Code Changes

- **Detection logic**: `universaldetector.py`, `*prober.py` files
- **State machines**: `codingstatemachine.py`, `*sm.py` files
- **Language models**: `lang*model.py`, `*freq.py` files (generated)
- **API**: `__init__.py`
- **CLI**: `cli/chardetect.py`

## Workflow Checklist

Before committing changes:

- [ ] Run `uv run pytest test.py` - all tests pass (or expected failures unchanged)
- [ ] Run `uv run ruff check .` - no new linting errors
- [ ] Run `uv run ruff format .` - code is formatted
- [ ] Update README.rst if adding new encodings or changing API
- [ ] Update NOTES.rst if changing architecture
- [ ] Verify Python 3.10+ compatibility
- [ ] Test with real-world files if fixing detection accuracy
- [ ] Consider performance impact if changing hot paths

## Questions?

When uncertain:

- Check existing code for patterns and precedents
- Review `NOTES.rst` for architecture details
- Look at similar encodings/probers for examples
- Ask the maintainer if fundamental design decisions are needed
