ScandiProb: Hybrid Language ID Classifier

By Ian Rodriguez

Enter text or upload a file to get independent probabilities that it is written in Norwegian, Swedish, Danish, or None of the Above / Non-Scandinavian. Only the first 512 tokens of the input are used.
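"Independent" presumably means that each label is scored on its own rather than normalized against the others, so the four scores need not sum to 1. Below is a minimal sketch of that idea, assuming a per-label sigmoid over raw logits. The label order, the function names, and the list-based truncation are hypothetical stand-ins for illustration, not the app's actual code.

```python
import math

# Hypothetical label order; the real model's output order is not stated.
LABELS = ["Norwegian", "Swedish", "Danish", "Other"]
MAX_TOKENS = 512  # only the first 512 tokens are used


def truncate(tokens, limit=MAX_TOKENS):
    """Keep only the first `limit` tokens of an already-tokenized input."""
    return tokens[:limit]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def independent_probs(logits):
    """Map one raw logit per label to its own probability (no softmax)."""
    return {label: sigmoid(z) for label, z in zip(LABELS, logits)}
```

Because each label gets its own sigmoid, a text that looks plausibly Norwegian and plausibly Danish can score high on both, which a softmax would not allow.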

This model combines a fine-tuned ScandiBERT, trained on a limited amount of OPUS-100 data, with regex-enforced heuristics. It achieves ~93% macro-F1 on the OPUS-100 test set and ~84% macro-F1 on the comprehensive SLIDE eval set, using only a fraction of the training data used in SLIDE.
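The exact heuristics are not described here, so the following is only an illustrative sketch of how regex rules could adjust a classifier's per-label probabilities. It assumes one simple orthographic cue: Swedish writes ä and ö where Danish and Norwegian write æ and ø. The function name, the label names, and the `damp` factor are all hypothetical.

```python
import re

# Orthographic cues: Swedish uses ä/ö, Danish and Norwegian use æ/ø.
SWEDISH_LETTERS = re.compile(r"[äö]", re.IGNORECASE)
DANO_NORWEGIAN_LETTERS = re.compile(r"[æø]", re.IGNORECASE)


def apply_heuristics(text, probs, damp=0.5):
    """Return a copy of `probs` with unlikely labels damped by regex cues.

    `probs` maps label names to model probabilities; `damp` is a
    hypothetical scaling factor, not a value taken from the model.
    """
    adjusted = dict(probs)
    has_sv = bool(SWEDISH_LETTERS.search(text))
    has_da_no = bool(DANO_NORWEGIAN_LETTERS.search(text))
    if has_sv and not has_da_no:
        # Swedish-style letters only: make Norwegian and Danish less likely.
        for label in ("Norwegian", "Danish"):
            adjusted[label] *= damp
    elif has_da_no and not has_sv:
        # Danish/Norwegian-style letters only: make Swedish less likely.
        adjusted["Swedish"] *= damp
    return adjusted
```

Scaling the probabilities instead of overriding them keeps the model's judgment in play when a text contains loanwords or names with the other language's letters.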