
Python & AI/ML Coding Standards

Feb 2026


1. Code Style & Formatting

Automated consistency, zero debates

Black + isort on every save — Required

Configure Black (line-length 88) and isort (profile=black) in pyproject.toml. Add pre-commit hooks so unformatted code never reaches the repo. No more formatting debates.

Tools: Black, isort, pre-commit

Ruff as your single linter — Required

Ruff replaces Flake8, pylint, and pycodestyle with a single Rust-powered tool (10-100x faster). Enable rules: E, F, W, I, N, UP, S, B, A, C4, SIM, TCH, RUF. Block merge on lint errors.

Tools: Ruff

Type hints everywhere + mypy strict — Required

Type hints on all function signatures and class attributes. Enable mypy --strict in CI. For ML: annotate tensor shapes in docstrings or inline comments (e.g., # shape: (batch, seq_len, d_model)).

Tools: mypy, pyright
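
For example, a fully annotated function with tensor shapes documented in inline comments; the attention-style names and shapes are illustrative, not prescribed by this standard:

    import torch

    def attention_scores(
        query: torch.Tensor,  # shape: (batch, seq_len, d_model)
        key: torch.Tensor,    # shape: (batch, seq_len, d_model)
        temperature: float = 1.0,
    ) -> torch.Tensor:        # shape: (batch, seq_len, seq_len)
        """Compute pairwise similarity scores between positions."""
        scores = query @ key.transpose(-2, -1)  # batched matmul
        return scores / temperature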

Organized imports: stdlib then third-party then local — Required

Group imports: (1) standard library, (2) third-party (numpy, torch, sklearn), (3) local project. One blank line between groups. isort handles this automatically. No wildcard imports.

Tools: isort
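
For example, a module header grouped the way isort (profile=black) orders it; the local package and function names are placeholders:

    # 1) standard library
    import logging
    from pathlib import Path

    # 2) third-party
    import numpy as np
    import torch

    # 3) local project (placeholder package name)
    from my_project.features import build_feature_matrix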

One pyproject.toml for all tool config — Recommended

Consolidate all tool configs into pyproject.toml. No scattered setup.cfg, .flake8, or mypy.ini. Pin the supported Python version with requires-python. Use hatchling, setuptools, or flit as the build backend.


2. Naming & Project Structure

Predictable, searchable, self-documenting

PEP 8 naming with no exceptions — Required

snake_case for functions/variables/modules. PascalCase for classes. UPPER_SNAKE for constants. _private prefix for internal APIs. Booleans as questions: is_trained, has_converged.

Name by intent, not type — Required

Avoid: df, model, X_train. Prefer: customer_transactions, churn_classifier, training_features. Exception: short-lived loop vars and well-known ML conventions (X, y) in small scopes.
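
A short sketch combining both naming rules; the churn-model class and attribute names are hypothetical examples, not required names:

    import numpy as np

    MAX_RETRIES = 3  # UPPER_SNAKE constant


    class ChurnClassifier:  # PascalCase class, named by intent
        def __init__(self) -> None:
            self.is_trained = False               # boolean reads as a question
            self._feature_names: list[str] = []   # leading underscore = internal

        def fit(self, training_features: np.ndarray, churn_labels: np.ndarray) -> None:
            # snake_case arguments named by intent rather than X / y
            self.is_trained = True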

src/ layout with subpackages by feature — Recommended

src/<project>/ with subpackages by feature. Co-locate tests beside source. ML-specific: separate /notebooks, /configs, /data (gitignored), /models (gitignored), and /src for production code.
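
One possible layout that satisfies this rule (the project name is a placeholder):

    my_project/
    ├── pyproject.toml
    ├── configs/          # experiment configs (Hydra/OmegaConf)
    ├── notebooks/        # exploration only
    ├── data/             # gitignored
    ├── models/           # gitignored artifacts
    └── src/
        └── my_project/
            ├── features/
            │   ├── build.py
            │   └── test_build.py   # tests co-located with source
            └── training/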

No magic numbers, use constants or configs — Required

Extract all hyperparameters to config files (YAML/TOML) or dataclasses. Use Hydra, OmegaConf, or Pydantic Settings for config management. Makes experiments reproducible.

Tools: Hydra, OmegaConf, Pydantic
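
A minimal sketch using a frozen dataclass; the hyperparameter names and defaults are illustrative, and Hydra or Pydantic Settings can populate the same structure from YAML/TOML:

    from dataclasses import dataclass


    @dataclass(frozen=True)
    class TrainingConfig:
        """All hyperparameters in one place instead of magic numbers."""

        learning_rate: float = 3e-4
        batch_size: int = 64
        num_epochs: int = 20
        seed: int = 42


    config = TrainingConfig()  # or built from a config file at startup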

Separate notebooks from production code — Required

Notebooks for exploration only, never production logic. Extract reusable code into .py modules immediately. Use nbstripout to strip outputs from committed notebooks.

Tools: nbstripout


3. Error Handling & Logging

Fail gracefully, debug quickly

Never silently swallow exceptions — Required

Every except block must log with context, re-raise, or return a meaningful error. except: pass is forbidden. For ML: catch specific failures (data loading, GPU OOM) with enough context to reproduce them.
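
A sketch of a compliant except block, using a plain file read as the failure-prone operation:

    import logging

    logger = logging.getLogger(__name__)


    def load_dataset(path: str) -> list[str]:
        try:
            with open(path, encoding="utf-8") as handle:
                return handle.readlines()
        except FileNotFoundError:
            # Log with enough context to reproduce, then re-raise.
            logger.exception("Failed to load dataset", extra={"path": path})
            raise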

Structured logging with structlog — Required

Use structlog or loguru instead of print(). Log as JSON in production. Include: timestamp, severity, experiment_id, model_version. Add GPU memory and training step to ML logs.

Tools: structlog, loguru
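
A minimal structlog sketch following the fields listed above; the experiment values are placeholders:

    import structlog

    structlog.configure(
        processors=[
            structlog.processors.add_log_level,
            structlog.processors.TimeStamper(fmt="iso"),
            structlog.processors.JSONRenderer(),  # JSON output for production
        ]
    )

    log = structlog.get_logger().bind(experiment_id="exp-042", model_version="1.3.0")
    log.info("epoch_finished", step=1200, gpu_mem_gb=11.4)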

Define domain-specific exceptions — Recommended

Create: DataValidationError, ModelNotTrainedError, PipelineTimeoutError, InferenceError. Enables precise handling and better error messages. Map to HTTP codes at API boundaries.
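
For example, a small hierarchy rooted in one base class (the base class name is a suggestion, not part of this standard):

    class MLPipelineError(Exception):
        """Base class so callers can catch all pipeline failures at once."""


    class DataValidationError(MLPipelineError):
        """Raised when input data fails schema or range checks."""


    class ModelNotTrainedError(MLPipelineError):
        """Raised when predict() is called before training."""


    class InferenceError(MLPipelineError):
        """Raised when a prediction request cannot be served."""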

Validate inputs at boundaries with Pydantic — Required

Pydantic BaseModel for API inputs, configs, pipeline interfaces. Validate data schemas before training with pandera. Fail fast: reject bad data before a 3-hour training run.

Tools: Pydantic, pandera
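
A Pydantic sketch of boundary validation; the field names and bounds are illustrative:

    from pydantic import BaseModel, Field


    class ScoringRequest(BaseModel):
        customer_id: str
        tenure_months: int = Field(ge=0, le=600)
        monthly_spend: float = Field(ge=0)


    # Raises pydantic.ValidationError before any model code runs.
    request = ScoringRequest(customer_id="c-123", tenure_months=14, monthly_spend=59.9)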

Never log secrets, PII, or model weights — Required

Sanitize logs: no API keys, user data, or raw model parameters. Be cautious with training samples containing PII. Audit log output. Compliance requirement (GDPR, CCPA).


4. Testing & Code Review

Ship with confidence

Test behavior, not implementation — Required

Use pytest. Test what code does, not how. Assert that outputs are correct for given inputs. Follow the Arrange-Act-Assert pattern. For ML: test output shapes, prediction ranges, and preprocessing determinism.

Tools: pytest
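
An Arrange-Act-Assert sketch; normalize_amounts is a toy function defined inline so the example is self-contained:

    import pytest


    def normalize_amounts(amounts: list[float]) -> list[float]:
        # Toy implementation standing in for real production code.
        peak = max(amounts)
        return [value / peak for value in amounts]


    def test_normalize_amounts_scales_to_unit_range():
        # Arrange
        raw_amounts = [0.0, 50.0, 100.0]

        # Act
        normalized = normalize_amounts(raw_amounts)

        # Assert behavior (output values), not implementation details
        assert normalized == pytest.approx([0.0, 0.5, 1.0])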

Testing pyramid: unit, integration, E2E — Required

Many fast unit tests, some integration tests (API, DB, pipeline stages), few E2E tests (full train-to-inference). Target 70-80% coverage on business logic. @pytest.mark.slow for heavy tests.

Tools: pytest-cov
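
Heavy tests carry the slow marker so the fast suite can exclude them (e.g. pytest -m "not slow"); the test name is illustrative:

    import pytest


    @pytest.mark.slow
    def test_full_training_run_converges():
        ...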

ML-specific: test pipelines and model contracts — Required

Test: data loading schema, deterministic preprocessing (set seeds), model input shapes, valid prediction ranges, and that a saved and reloaded model produces the same output. Use fixtures for synthetic data.

Tools: pytest fixtures
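
A contract-test sketch with a deterministic synthetic-data fixture; the tiny ChurnClassifier here is a stand-in model so the example is self-contained:

    import pytest
    import torch
    from torch import nn


    class ChurnClassifier(nn.Module):
        """Tiny stand-in model used only to make the test sketch runnable."""

        def __init__(self, num_features: int) -> None:
            super().__init__()
            self.head = nn.Sequential(nn.Linear(num_features, 1), nn.Sigmoid())

        def forward(self, features: torch.Tensor) -> torch.Tensor:
            return self.head(features)


    @pytest.fixture
    def synthetic_batch() -> torch.Tensor:
        torch.manual_seed(0)        # deterministic fixture data
        return torch.randn(8, 16)   # shape: (batch, num_features)


    def test_model_output_shape_and_range(synthetic_batch: torch.Tensor) -> None:
        model = ChurnClassifier(num_features=16)
        probabilities = model(synthetic_batch)

        assert probabilities.shape == (8, 1)
        assert ((probabilities >= 0) & (probabilities <= 1)).all()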

CI blocks merge on any failure — Required

Every PR triggers: Ruff lint, mypy check, pytest. Single failure blocks merge. Keep unit tests <5 min. Flaky tests are bugs. Use GitHub Actions or GitLab CI.

Tools: GitHub Actions

Small PRs, review for logic not style — Required

Style enforced by Black + Ruff. Humans review for: correctness, edge cases, error handling, security, performance. 1 approval required. PRs <400 lines. Use PR templates.


5. AI/ML Best Practices

Reproducible, responsible, production-ready

Pin seeds everywhere for reproducibility — Required

Set seeds: random, numpy, torch, tensorflow, PYTHONHASHSEED. Use deterministic algorithms (torch.use_deterministic_algorithms). Log full env: Python version, packages, GPU, CUDA.

Tools: random.seed, torch.manual_seed
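
A typical seed-pinning helper, assuming PyTorch and NumPy are the frameworks in use:

    import os
    import random

    import numpy as np
    import torch


    def set_global_seed(seed: int = 42) -> None:
        """Pin every RNG this process touches."""
        os.environ["PYTHONHASHSEED"] = str(seed)
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # Some CUDA ops may additionally require CUBLAS_WORKSPACE_CONFIG.
        torch.use_deterministic_algorithms(True)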

Version data, models, configs, and code — Required

DVC or MLflow for data/model versioning. Track experiments with W&B, MLflow, or Neptune. Configs alongside code (Hydra). Every experiment reproducible from commit hash + config.

Tools: DVC, MLflow, W&B, Hydra
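
A minimal MLflow tracking sketch; the experiment name, values, and config path are placeholders, and DVC covers the data side from the CLI:

    import mlflow

    mlflow.set_experiment("churn-classifier")  # placeholder experiment name

    with mlflow.start_run():
        mlflow.log_params({"learning_rate": 3e-4, "batch_size": 64})
        # ... training loop ...
        mlflow.log_metric("val_auc", 0.91)
        mlflow.log_artifact("configs/train.yaml")  # keep the config with the run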

Separate training, eval, and inference code — Required

Clean interfaces: Trainer.train(), Evaluator.evaluate(), Predictor.predict(). Each independently testable. Makes it trivial to swap models or deploy to different targets.
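
One way to keep the roles separate is to declare them as Protocols so each can be tested and swapped independently; this is a sketch, not the only acceptable design:

    from typing import Protocol

    import numpy as np


    class Trainer(Protocol):
        def train(self, features: np.ndarray, labels: np.ndarray) -> None: ...


    class Evaluator(Protocol):
        def evaluate(self, features: np.ndarray, labels: np.ndarray) -> dict[str, float]: ...


    class Predictor(Protocol):
        def predict(self, features: np.ndarray) -> np.ndarray: ...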

Validate data quality before and after transforms — Required

Use pandera or great_expectations for data schemas. Validate: column types, value ranges, nulls, distribution drift. Run on raw input AND after preprocessing. Fail pipeline on check failure.

Tools: pandera, great_expectations
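
A pandera sketch of such a schema; the column names, ranges, and sample DataFrame are illustrative:

    import pandas as pd
    import pandera as pa

    transactions_schema = pa.DataFrameSchema(
        {
            "customer_id": pa.Column(str, nullable=False),
            "amount": pa.Column(float, checks=pa.Check.ge(0)),
            "churn_probability": pa.Column(float, checks=pa.Check.in_range(0, 1)),
        }
    )

    raw_transactions = pd.DataFrame(
        {"customer_id": ["c-1"], "amount": [42.0], "churn_probability": [0.07]}
    )
    # Raises SchemaErrors (failing the pipeline) if any check is violated.
    validated = transactions_schema.validate(raw_transactions, lazy=True)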

Treat AI-generated code as untrusted input — Required

Always review Copilot/Claude output for correctness, security, standards adherence. Run full lint + type check + tests. AI does not know your architecture. Never blindly accept.

Ship a model card with every model — Recommended

Every deployed model needs a model card: intended use, limitations, training data summary, eval metrics, bias analysis, failure modes. Consider fairness metrics for user-facing models.

Tools: Model Cards, SHAP, LIME

Containerize and pin for deployment — Required

Docker for reproducible environments. Pin ALL deps with pip-compile or Poetry lock. Pin CUDA/cuDNN in Dockerfile. No 'latest' tags. Test inference in container before deploy.

Tools: Docker, pip-tools, Poetry


Summary

Section                       Total  Required  Recommended
Code Style & Formatting           5         4            1
Naming & Project Structure        5         4            1
Error Handling & Logging          5         4            1
Testing & Code Review             5         5            0
AI/ML Best Practices              7         6            1
Total                            27        23            4