How to train your Dragon! erm Language Detector!

General Programming · 2016-04-09

In the previous post we described the general process of training an N-Gram language detector. If you have not yet read it, I suggest that you have a look at that post first. In this post we will go through the implementation details and create a simple language detector using Node. To keep this post concise I will set the language detection parameters - the number of top n-grams to select and the n-gram length - manually. In a future post I will show you how to optimise the selection of these parameters by splitting the dataset into three parts: training, validation, and test.

Language Profiles

One of the most important parts of an N-Gram language detector is the creation of language profiles. For this we need some sample text in each language we want to detect. There are many datasets out there which can be used for this purpose, but for this post we will use the Universal Declaration of Human Rights Database since i) it has been translated into over 460 different languages and dialects and ii) it's free.

One question you might have is whether the choice of source dataset affects the quality of our detector. If you remember from the previous post, the N-Gram distribution for the top 300 or so N-Grams is highly correlated with the language. It is only when we start making use of more N-Grams that our choice of source document becomes important, since it starts skewing the N-Gram distribution - making it topical.

Having justified our dataset choice, go ahead and download the Declaration of Human Rights PDF for each language you are interested in from here and convert these PDFs to text. Feel free to use my language subset if you want to get started immediately. This dataset includes the following languages:

  • czc.txt - Czech (Cesky)
  • dns.txt - Danish (Dansk)
  • dut.txt - Dutch (Nederlands)
  • eng.txt - English
  • frn.txt - French (Français)
  • ger.txt - German (Deutsch)
  • grk.txt - Greek (Ellinika)
  • hng.txt - Hungarian
  • itn.txt - Italian
  • jpn.txt - Japanese (Nihongo)
  • lat.txt - Latvian
  • lit.txt - Lithuanian (Lietuviskai)
  • ltn.txt - Latin (Latina)
  • ltn1.txt - Latin (Latina)
  • lux.txt - Luxembourgish (Lëtzebuergeusch)
  • mls.txt - Maltese
  • por.txt - Portuguese
  • rmn1.txt - Romani
  • rum.txt - Romanian (Româna)
  • rus.txt - Russian (Russky)
  • spn.txt - Spanish (Español)
  • ukr.txt - Ukrainian (Ukrayins'ka)
  • yps.txt - Yapese

Now that we have a sample text file for each language, let us create the language profiles using Node. To start with, let's create a package.json similar to the following:

{
  "name": "language-detector",
  "version": "1.0.0",
  "description": "Language Detector Using NGrams",
  "main": "train.js",
  "author": "Mark Galea",
  "scripts": {
    "train": "./train.js"
  },
  "dependencies": {
    "glob": "^7.0.3"
  }
}

and run npm install to install the required dependencies. Note that the scripts section contains a train command which will invoke the ./train.js JavaScript file.

After downloading all of our dependencies, let us create the function generateProfile which will create an N-Gram profile. Since the language profiles and the document profile are generated using the same process, we will extract this common functionality into the file ngram-utils.js.

function sortNgrams(ngrams){
    return Object.keys(ngrams).map(function(key) {
        return {'ngram': key, 'freq': ngrams[key]};
    }).sort(function(first, second) {
        // If the frequency is the same favour larger ngrams.
        if (second['freq'] === first['freq']){
            return second['ngram'].length - first['ngram'].length;
        }
        return second['freq'] - first['freq'];
    }).map(function(ngram, index){
        ngram['index'] = index;
        return ngram;
    });
}

function merge(obj1, obj2){
    var obj3 = {};
    for (var attrname in obj1) { obj3[attrname] = obj1[attrname]; }
    for (var attrname in obj2) { obj3[attrname] = obj2[attrname]; }
    return obj3;
}

function getNgrams(text, n){
    var ngrams = {};
    var content = text
            // Replace each period (and any trailing whitespace) with a boundary marker.
            .replace(/\.\s*/g, '_')
            // Discard all digits.
            .replace(/[0-9]/g, "")
            // Discard all punctuation except for the apostrophe.
            .replace(/[!"#$%&()*+,\-/:;<=>?@[\]^`{|}~]/g, "")
            // Collapse whitespace runs into a single boundary marker.
            .replace(/\s+/g, '_');
    for (var i = 0; i < content.length - (n - 1); i++){
        var token = content.substring(i, i + n);
        if (token in ngrams){
            ngrams[token] += 1;
        } else {
            ngrams[token] = 1;
        }
    }
    return ngrams;
}

exports.generateProfile = function(text, topN){
    var biGrams = getNgrams(text, 2);
    var triGrams = getNgrams(text, 3);
    var ngrams = merge(biGrams, triGrams);
    var sortedNgrams = sortNgrams(ngrams);
    return sortedNgrams.slice(0, topN);
};

The code is pretty self-explanatory:

  • The getNgrams function takes an input string text together with the n-gram length n and returns a dictionary of n-gram frequencies.
  • The sortNgrams function takes a dictionary of n-grams, sorts them by frequency, and stores each n-gram's rank in the index field.
  • The generateProfile function combines these two, merging the bi-gram and tri-gram counts and returning the top N results based on frequency.
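As a quick sanity check, here is a sketch of what getNgrams produces on a toy string. The logic is inlined in simplified form (only whitespace handling) so the snippet runs standalone; the full clean-up lives in ngram-utils.js:

```javascript
// Simplified bigram counter: whitespace runs become the '_' boundary
// marker, mirroring the clean-up step in ngram-utils.js.
function getNgrams(text, n) {
    var ngrams = {};
    var content = text.toLowerCase().replace(/\s+/g, '_');
    for (var i = 0; i < content.length - (n - 1); i++) {
        var token = content.substring(i, i + n);
        ngrams[token] = (ngrams[token] || 0) + 1;
    }
    return ngrams;
}

var biGrams = getNgrams("the theatre", 2);
// "the theatre" -> "the_theatre"; "th" and "he" each occur twice.
console.log(biGrams['th']); // 2
console.log(biGrams['he']); // 2
```

Repeated n-grams accumulate counts, which is exactly what sortNgrams later turns into ranks.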

Finally, let us create a language profile for each language present in the subset folder and save all profiles in the file language-profile.json. We will make use of this file in the next section.

#!/usr/bin/env node
var glob = require("glob");
var path = require("path");
var fs = require('fs');
var ngramUtils = require('./ngram-utils');

console.log("Started Application");
glob("subset/*.txt", function (er, files) {
    var languageProfile = {};
    files.forEach(function(file){
        var lang = path.basename(file, '.txt');
        console.log("Training [Language: ", lang, "] [File: ", file, "]");
        var text = fs.readFileSync(file, 'utf8');
        languageProfile[lang] = ngramUtils.generateProfile(text, 300);
    });
    fs.writeFileSync('language-profile.json', JSON.stringify(languageProfile));
    console.log("Written Language Profile to File [language-profile.json]");
});

Now that we have created our language profiles, how can we use these profiles to detect the language of a text fragment?

What we need to do is use the generateProfile function to create a document profile, and then use this document profile to calculate a simple rank-order statistic which we will call the "out of place" measure.
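To make the measure concrete, here is a toy sketch with made-up three-entry profiles (not real language data):

```javascript
// Toy illustration of the "out of place" measure. The profiles below are
// invented for illustration: each maps an n-gram to its rank in the profile.
var languageProfile = { 'th': 0, 'he': 1, 'an': 2 };
var documentProfile = { 'he': 0, 'th': 1, 'xq': 2 };
var NOT_FOUND = 3; // penalty for n-grams absent from the language profile

var score = 0;
Object.keys(documentProfile).forEach(function (ngram) {
    if (ngram in languageProfile) {
        // Distance between the n-gram's rank in the two profiles.
        score += Math.abs(languageProfile[ngram] - documentProfile[ngram]);
    } else {
        score += NOT_FOUND;
    }
});
console.log(score); // 'he': |1-0|=1, 'th': |0-1|=1, 'xq': missing -> 3; total 5
```

The closer the two rank orderings are, the smaller the score; identical orderings score zero.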

Before we start, let's expose the detect command in the package.json - so that we can invoke detection using npm run detect - and create the initial detect.js file:

{
  "name": "language-detector",
  "version": "1.0.0",
  "description": "Language Detector Using NGrams",
  "main": "train.js",
  "author": "Mark Galea",
  "scripts": {
    "train": "./train.js",
    "detect": "./detect.js"
  },
  "dependencies": {
    "glob": "^7.0.3"
  }
}
#!/usr/bin/env node
var fs = require('fs');
var ngramUtils = require('./ngram-utils');

Now let us read the language profiles we have created in the previous step:

console.log("Reading Language Profiles from [language-profile.json]");
var languageProfiles = JSON.parse(fs.readFileSync('language-profile.json', 'utf-8'));

Next, let us read the user's text fragment and generate the document profile:

var text = process.argv[2]; 
console.log("Determining Language for [text: ", text, "]");
var documentProfile = ngramUtils.generateProfile(text, 300);

Now it's time to compute the "out of place" measure for each N-Gram present in the document profile:

var NOT_FOUND = 300; // Penalty when an n-gram is missing from a language profile
                     // (set here to the profile size; pick a value to suit).
var scores = {};
var languages = Object.keys(languageProfiles);
languages.forEach(function(language){
    scores[language] = 0;
});

documentProfile.forEach(function(documentNgram){
    var documentIndex = documentNgram.index;

    languages.forEach(function(language){
        var languageProfile = languageProfiles[language];
        var languageNgram = languageProfile.filter(function(languageNgram){
            return languageNgram.ngram == documentNgram.ngram;
        });

        if (languageNgram.length == 1){
            // We found the ngram so we compute the out of place measure.
            scores[language] += Math.abs(languageNgram[0].index - documentIndex);
        } else {
            // We did not find the ngram so we penalise.
            scores[language] += NOT_FOUND;
        }
    });
});

Finally, let us sort the score results and present them to the user. Note that the lower the score the "closer" the text fragment is to the language.

function sortScores(scores){
    return Object.keys(scores).map(function(language) {
        return {'language': language, 'score': scores[language]};
    }).sort(function(first, second) {
        return first['score'] - second['score'];
    });
}
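The sorted scores can then be printed, for example like this (a sketch with hypothetical score values; the object shape follows the sortScores output):

```javascript
// Print the ranked languages, best (lowest score) first.
// The scores below are made-up placeholders for illustration.
var sortedScores = [
    { language: 'eng', score: 1250 },
    { language: 'dut', score: 4810 },
    { language: 'ger', score: 5170 }
];
sortedScores.forEach(function (entry, rank) {
    console.log((rank + 1) + ". " + entry.language + " (score: " + entry.score + ")");
});
// The best guess is simply the first entry.
console.log("Detected language:", sortedScores[0].language);
```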

Source Code

If you are having trouble with the implementation, I have uploaded the complete source code on GitHub. Once you have cloned the repository, run npm install and create the language profiles by executing:

npm run train

You can try the detector by executing:

npm run detect 

As an example, running the detector with the phrase "What is the weather today?" will detect that the phrase is in English:

npm run detect "What is the weather today?"

Running the detector on the same phrase translated into Maltese will detect that the phrase is in Maltese:

npm run detect "X'inhu l-temp illum?"


In this post we have implemented an N-Gram language detector using Node. Hope you guys had fun reading this post! Stay safe and keep hacking!

Content by Mark Galea (source link).

