Fast, state-of-the-art tokenizers, optimized for both research and production. Tokenizers provides implementations of today’s most widely used tokenizers, with a focus on performance and versatility. These tokenizers are also used in Transformers. The library can train new vocabularies and tokenize text, and it is extremely fast at both thanks to its Rust implementation: tokenizing a GB of text takes less than 20 seconds on a server’s CPU. It is easy to use, but also extremely versatile. Alignment tracking is complete: even with destructive normalization, it is always possible to recover the part of the original sentence that corresponds to any token. The library also handles all the pre-processing: truncation, padding, and adding the special tokens your model needs.
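
The alignment tracking described above can be illustrated with a small, library-independent sketch. It is not the Tokenizers API or its implementation; the function names `normalize_with_alignment` and `tokenize_with_offsets` are hypothetical, and the normalizer (lowercasing plus accent stripping) and whitespace splitting are assumptions chosen only to show how offsets into the original text can survive a destructive normalization.

```python
import unicodedata


def normalize_with_alignment(text):
    """Lowercase and strip accents (a destructive normalization), recording
    for every output character the index of the original character it came from."""
    out_chars = []
    origin = []  # origin[i] = index in `text` that produced out_chars[i]
    for i, ch in enumerate(text):
        # NFD splits an accented letter into base letter + combining marks.
        for piece in unicodedata.normalize("NFD", ch):
            if unicodedata.combining(piece):
                continue  # drop the accent itself
            for low in piece.lower():
                out_chars.append(low)
                origin.append(i)
    return "".join(out_chars), origin


def tokenize_with_offsets(text):
    """Split the normalized text on whitespace and return (token, (start, end))
    pairs whose offsets index the ORIGINAL, un-normalized text."""
    normalized, origin = normalize_with_alignment(text)
    tokens = []
    start = None
    # The trailing sentinel space flushes the last token.
    for pos, ch in enumerate(normalized + " "):
        if ch.isspace():
            if start is not None:
                span = (origin[start], origin[pos - 1] + 1)
                tokens.append((normalized[start:pos], span))
                start = None
        elif start is None:
            start = pos
    return tokens


# "Héllo  Wörld", written with escapes so the accented letters are precomposed.
text = "H\u00e9llo  W\u00f6rld"
for token, (s, e) in tokenize_with_offsets(text):
    print(token, (s, e), repr(text[s:e]))
```

Each token is produced from the normalized text (`hello`, `world`), yet slicing the original string with the recorded offsets returns the untouched source spans, accents and capitalization included.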

Features

  • Train new vocabularies and tokenize, using today’s most used tokenizers
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server’s CPU
  • Easy to use, but also extremely versatile
  • Designed for both research and production
  • Full alignment tracking
  • Does all the pre-processing: truncation, padding, and adding the special tokens your model needs
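
To make the pre-processing item above concrete, here is a minimal sketch of what truncation, special-token insertion, and padding do to a sequence of token ids. It is a plain-Python illustration, not the library's API: the function name `prepare` and the default ids (`cls_id=101`, `sep_id=102`, `pad_id=0`) are hypothetical values picked for the example.

```python
def prepare(ids, max_length, cls_id=101, sep_id=102, pad_id=0):
    """Truncate `ids`, wrap them in two special tokens, then pad to `max_length`.

    All default ids are hypothetical; a real model defines its own.
    Returns (input_ids, attention_mask), both of length `max_length`.
    """
    if max_length < 2:
        raise ValueError("max_length must leave room for the two special tokens")

    body = list(ids)[: max_length - 2]       # truncation
    wrapped = [cls_id] + body + [sep_id]     # special tokens
    n_pad = max_length - len(wrapped)
    input_ids = wrapped + [pad_id] * n_pad   # padding
    # 1 marks real tokens (including special ones), 0 marks padding.
    attention_mask = [1] * len(wrapped) + [0] * n_pad
    return input_ids, attention_mask


short_ids, short_mask = prepare([7, 8, 9], max_length=8)
long_ids, long_mask = prepare([1, 2, 3, 4, 5, 6], max_length=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The short input is padded up to the target length, while the long one is cut so that the special tokens still fit; in both cases every output has the same length, which is what batching requires.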

License

Apache License V2.0

Additional Project Details

Programming Language

Rust

Related Categories

Rust Artificial Intelligence Software, Rust Machine Learning Software

Registered

2023-03-23