Last week Anthropic released an open-source “circuit tracing” toolkit that turns a language model’s opaque activations into an attribution graph you can inspect and even tinker with. The MIT-licensed library on GitHub automatically uncovers the causal paths linking input tokens, hidden features, and output logits, displays them in an interactive Neuronpedia viewer, and lets researchers tweak individual features to see how the model’s answer shifts. Anthropic’s demos already show the tool surfacing multilingual features in compact Llama-3 and Gemma checkpoints.
Why this matters for Pombo Labs
Our mission is to build trustworthy AI that speaks all Sierra Leonean languages and English with equal fluency. A multilingual system like that must learn several delicate skills at once:
- Cross-language alignment. The model should map concepts consistently across languages. Attribution graphs can reveal whether tokens that carry the same idea in different tongues converge on the same hidden features or whether the model is silently keeping separate “thought streams.”
- Error forensics. When the model mistranslates an idiom, we’ll be able to trace exactly which feature (or missing feature) blocked the correct meaning, instead of guessing from the output. That makes debugging faster than blind fine-tuning.
- Bias and safety audits. By intervening on features we suspect encode gender or regional stereotypes, we can measure how much those circuits steer the final answer and design targeted mitigations.
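To make the cross-language alignment check concrete, here is a minimal sketch of how we might compare attribution graphs for parallel sentences. It assumes we have already extracted, per prompt, the set of feature IDs an attribution graph marks as influential; the feature IDs below are invented toy data, not output from any real model.

```python
# Sketch: quantify cross-language feature alignment from attribution graphs.
# Assumes influential-feature sets were already extracted per prompt;
# the IDs below are hypothetical toy data.

def feature_overlap(features_a: set[int], features_b: set[int]) -> float:
    """Jaccard overlap between the influential-feature sets of two prompts."""
    if not features_a and not features_b:
        return 0.0
    return len(features_a & features_b) / len(features_a | features_b)

# Hypothetical influential features for the same sentence in Krio and English.
krio_features = {101, 205, 333, 478}
english_features = {101, 205, 512, 478}

score = feature_overlap(krio_features, english_features)
print(f"cross-language feature overlap: {score:.2f}")  # prints 0.60
```

A high overlap suggests the two languages converge on shared circuits; a persistently low one would be evidence of the separate “thought streams” described above.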
Our adoption roadmap
- Prototype phase (June – July 2025).
- Fine-tune an open 2–8 B-parameter model (Gemma-2B or Llama-3-8B) on our existing bilingual corpora.
- Attach Anthropic’s transcoder layers and generate graphs for translation, summarization, and code-switching prompts.
- Diagnostic library (August 2025).
- Script recurring checks that flag new training runs whenever a core circuit (e.g., “Krio → English negation mapping”) degrades.
- Store and version those graphs in the repo alongside the model weights, treating interpretability as an artifact, not an afterthought.
- Community release (Q4 2025).
- Publish the graphs for low-resource languages on Neuronpedia and invite external contributors to annotate intriguing motifs.
- Upstream any improvements (e.g., token-level saliency overlays) back to the original Anthropic repo, closing the open-source loop.
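The diagnostic-library step above could be sketched as a recurring check over versioned graphs. This assumes each run stores a JSON file mapping circuit names to an attribution weight; the file layout, the `krio_english_negation` key, and the threshold are all our own hypothetical choices, not anything prescribed by Anthropic’s toolkit.

```python
# Sketch: flag training runs where a named circuit degrades relative to a
# baseline. The JSON layout and circuit names are hypothetical.
import json
import tempfile
from pathlib import Path

DEGRADATION_THRESHOLD = 0.8  # flag if a circuit drops below 80% of baseline

def check_circuits(baseline_path: Path, candidate_path: Path) -> list[str]:
    """Return the names of circuits whose attribution weight regressed."""
    baseline = json.loads(baseline_path.read_text())
    candidate = json.loads(candidate_path.read_text())
    return [
        name
        for name, base_weight in baseline.items()
        if candidate.get(name, 0.0) < base_weight * DEGRADATION_THRESHOLD
    ]

# Demo with toy graph summaries written to a temporary directory.
with tempfile.TemporaryDirectory() as d:
    base, cand = Path(d) / "baseline.json", Path(d) / "run42.json"
    base.write_text(json.dumps({"krio_english_negation": 0.92,
                                "number_agreement": 0.75}))
    cand.write_text(json.dumps({"krio_english_negation": 0.40,
                                "number_agreement": 0.74}))
    print(check_circuits(base, cand))  # prints ['krio_english_negation']
```

Running this in CI against every new checkpoint would turn “the negation circuit broke” from a post-hoc discovery into a failed build.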
Open source is the accelerator
Keeping this pipeline public forces us to engineer clarity, not hacks; lets outside experts replicate our findings; and lowers the entry bar for students in Sierra Leone who want to learn interpretability without a giant GPU cluster (Gemma tracing already fits in free Colab). More importantly, it means our local-language insights can flow back into global safety research instead of staying siloed.
If you’d like to help annotate circuits or donate domain-specific data for Sierra Leonean local languages, reach out. The sooner we can see what our models are thinking, the sooner we can trust them to teach the next generation.