Some time ago, while studying TF-IDF and the Jaccard similarity coefficient for computing text similarity (the recommended articles at the bottom of some of my blog posts are calculated with this approach), I ran into the problem of Chinese word segmentation. I didn't have the energy to dig into it at the time, so I simply used the Python "jieba" segmenter.
Coincidentally, I heard that my former company recently held an algorithm competition whose task was to segment the Chinese text of the novel "The Three-Body Problem"! In my spare time I wrote a Node.js version of the algorithm in about a hundred lines of code. It is still very rough, but I wanted to write something down to commemorate it.
Word segmentation is one of the more fundamental technologies widely applied in areas such as search and natural language processing. Since English words are separated by spaces, word segmentation for English texts is not that complex. The challenge arises in languages like Chinese, where there are no obvious separations between words, making it quite difficult to segment.
Common approaches fall roughly into two camps: the first splits text into words based on statistical analysis, while the second uses natural language analysis to work out the constituents of a sentence and segments accordingly. Each has its pros and cons, and neither clearly outperforms the other.
Given our limited resources, large-scale machine learning is not realistic, so we take the first, statistics-based approach.
Now that we have the method, the question arises: how do we know which combinations of characters in a sentence constitute a word? Maintaining a dictionary for this judgment? But then comes another question, where does the dictionary come from? Searching online? But what’s the difference between that and directly using someone else’s library? Is there a way for machines to learn and determine words on their own?
Of course, there is! We can find all possible combinations of words in a sentence and then determine the frequency of these combinations; those with frequencies above a certain threshold can be classified as phrases.
OK, we have a "viable" method, but is it really feasible to record every character combination in an entire novel? The numbers would be astronomical, right?!! Hold on, let's take it step by step. I'll leave you in suspense here; later we will tackle this with trie trees and HMMs (Hidden Markov Models).
Let’s get started!
Before starting, here is the link to the implementation code.
OK, Let's Do It!
First, we obtain our material from The Three-Body Problem II: The Dark Forest and save it temporarily as text.txt. Then create a new file named task.js, and we can start coding.
// Import the fs module
var fs = require('fs')
// Create some variables; we will talk about them later
var trie = {}
var tempWords = {};
// Start reading the file text.txt
fs.readFile('./text.txt', function (err, data) {
    if (err) throw err;
    // Convert the buffer we read into a string,
    // Note: for large files, be careful with chunk processing to avoid memory overflow.
    var str = data.toString();
    // Next, we need to "clean" our material,
    // First, replace all non-Chinese characters with "@",
    // "@" is just a temporary placeholder; it can be any character you like
    // For example:
    // "大家好!:)我是Mofei" => "大家好@@@我是@@@@@"
    str = str.replace(/[^\u4e00-\u9fa5]/g, '@');
    // Then collapse runs of "@" into a single space
    // For example:
    // "大家好@@@我是@@@@@" => "大家好 我是 "
    str = str.replace(/@+/g, ' ')
    split(str);
})
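The comment above about memory deserves a quick aside: for inputs much larger than one novel, you can read the file as a stream instead of loading it all at once. Below is a minimal sketch of that idea, reusing the same cleaning regex and the wordsToTrie() / trieToWords() / wordsToArrAndRank() functions defined later in this post (note that a chunk boundary may cut one sentence in two, which barely affects the frequency counts):
// Aside: stream the file chunk by chunk for very large inputs
var stream = fs.createReadStream('./text.txt');
// Decoding on the stream keeps multi-byte Chinese characters intact across chunk boundaries
stream.setEncoding('utf8');
stream.on('data', function (chunk) {
    // Same cleaning as above: non-Chinese characters become separators
    var cleaned = chunk.replace(/[^\u4e00-\u9fa5]/g, '@').replace(/@+/g, ' ');
    cleaned.split(' ').forEach(function (sentence) {
        // Feed every 2- to 4-character combination into the trie (see wordsToTrie below)
        for (var j = 0; j + 2 <= sentence.length; j++) {
            wordsToTrie(sentence.substr(j, 4));
        }
    });
});
stream.on('end', function () {
    // Build the dictionary and print the result only once, after all chunks are read
    trieToWords();
    wordsToArrAndRank();
});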
Either way, we now have a relatively clean dataset, so let's focus on the main task.
Here we need to handle all possible combinations of characters in a sentence. Given that most Chinese words typically have four characters or fewer, let's take four characters as an example:
For instance, the string "欢迎大家关注" can be combined as follows:
欢迎大家
迎大家关
大家关注
欢迎大
迎大家
大家关
家关注
欢迎
迎大
大家
家关
关注
Six characters already yield 12 combinations; should we record every combination for every sentence? STOP! If that is your plan, your computer is already crying. For a full-length novel this would be catastrophic! So is there a better way to store them? That's precisely the trie tree! I won't explain the data structure itself here; it is easy to look up.
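Before moving on to the trie, here is a tiny helper, purely for illustration (it is not part of the final code), that enumerates the 2- to 4-character combinations listed above:
// Illustration only: list every 2- to 4-character substring of a sentence
function listCombinations(sentence) {
    var result = [];
    for (var len = 4; len >= 2; len--) {
        for (var start = 0; start + len <= sentence.length; start++) {
            result.push(sentence.substr(start, len));
        }
    }
    return result;
}
console.log(listCombinations('欢迎大家关注')); // prints the 12 combinations above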
On top of the trie, I added a count field len to each node to help us make judgments.
Inserting each of the combinations above into the trie and counting how many times every node is passed through (the len field), we arrive at a structure like this:
{
    "欢": {
        "迎": {
            "大": {
                "家": {
                    "len": 1
                },
                "len": 2
            },
            "len": 3
        },
        "len": 3
    },
    "迎": {
        "大": {
            "家": {
                "关": {
                    "len": 1
                },
                "len": 2
            },
            "len": 3
        },
        "len": 3
    },
    "大": {
        "家": {
            "关": {
                "注": {
                    "len": 1
                },
                "len": 2
            },
            "len": 3
        },
        "len": 3
    },
    "家": {
        "关": {
            "注": {
                "len": 1
            },
            "len": 2
        },
        "len": 2
    },
    "关": {
        "注": {
            "len": 1
        },
        "len": 1
    }
}
From the len values above, we can simplify this to the following few "words" (walking down from each root node until len changes abruptly):
欢迎
迎大
大家
家关
关注
OK, we have simplified from 12 combinations to 5.
Wait a moment! WTF!!! What are 迎大 and 家关? Are you kidding me? Are these even words?
Don't rush; did you notice the len field on each node? And so far we have fed it only a single sentence. Let's tentatively call this trie tree our model (it is really just the result of training).
"What the heck? Training?"
Exactly!! If we train on more sentences, the len values will start to differ significantly. We can then set a policy, for example discarding every combination whose len is below 10, or keeping a combination only while its len does not fall too far below the count of the node above it, and so on.
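To make that concrete, here is a minimal sketch of one such policy. The thresholds MIN_LEN and MAX_DROP and the function name extractWords are my own illustrative choices, not part of the original code:
// Illustrative pruning policy: keep a path only while its count stays high enough
// and does not drop too sharply compared with its parent node
var MIN_LEN = 10;   // discard combinations seen fewer than 10 times
var MAX_DROP = 0.5; // stop extending a word once a child keeps less than half of its parent's count

function extractWords(node, prefix, parentLen, dict) {
    for (var ch in node) {
        if (ch === 'len') {
            continue;
        }
        var child = node[ch];
        if (child.len < MIN_LEN) {
            continue;
        }
        if (parentLen > 0 && child.len / parentLen < MAX_DROP) {
            continue;
        }
        var word = prefix + ch;
        if (word.length >= 2) {
            dict[word] = child.len;
        }
        extractWords(child, word, child.len, dict);
    }
    return dict;
}
// Usage: var dict = extractWords(trie, '', 0, {});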
After running it over a few articles, you will find that you really have trained a dictionary! If you got this far, let's pop a bottle of champagne to celebrate!
Now, back to our code:
// Process the sentences
function split(str) {
    // Split the cleaned string by spaces,
    // which gives us an array of sentences
    str = str.split(' ');
    for (var i = 0; i < str.length; i++) {
        var words = str[i];
        if (words.length <= 1) {
            // Discard sentences that are a single character (or empty)
            continue;
        }
        if (words.length === 2) {
            // Two characters can be processed directly
            wordsToTrie(words);
        } else {
            // For longer sentences,
            // feed every combination of characters into the trie
            // (j < words.length - 1 so that the final two-character combination is included)
            for (var j = 0; j < words.length - 1; j++) {
                // We take 4 as the cap, meaning the maximum word length is 4
                // You can choose a larger value, but the larger it is, the more performance it costs
                wordsToTrie(words.substr(j, 4));
            }
        }
    }
    // Convert the trie tree into a word dictionary
    // That is, {"你好":14,"大楼":45}
    trieToWords();
    // Output the result
    wordsToArrAndRank();
}
function wordsToTrie(str) {
    var words = str.split('');
    // stopWords are easily understood; some words occur too frequently in the language and have no actual meaning,
    // such as "我的", "这个", etc., which do not help with searching or determining the characteristics of an article,
    // so we can filter them out.
    // There are existing stopword dictionaries online; here I selected a small portion
    var stopWords = ["和", "与", "你", "我", "两", "这", "把", "那", "个", "他", "您", "它", "们", "是", "的", "一", "了", "在"]
    if (stopWords.indexOf(words[0]) !== -1) {
        return false;
    }
    var temp = trie;
    for (var i = 0; i < words.length; i++) {
        // Store the string in the trie tree, character by character
        temp = saveToTrie(temp, words[i]);
    }
}
// Store strings in the trie tree
function saveToTrie(obj, chart) {
    // If the node does not exist yet, create it
    // For example, {"你":{len:0}} + "我" =>
    // {"你":{"我":{len:0},len:0}}
    obj[chart] = obj[chart] || {
        len: 0
    }
    // Then increment the count
    obj[chart].len += 1;
    return obj[chart];
}
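As a quick sanity check (this snippet is just for illustration and not part of the final code), inserting the word "大家" twice into an empty trie gives the expected counts:
// Illustration: insert "大家" twice into an empty trie
var demo = {};
var node;
node = saveToTrie(demo, '大');
saveToTrie(node, '家');
node = saveToTrie(demo, '大');
saveToTrie(node, '家');
console.log(JSON.stringify(demo));
// => {"大":{"len":2,"家":{"len":2}}}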
function trieToWords() {
    combine(trie, '');
}
Thus, our trie tree has been "trained" successfully, and now we will handle some minor cleanup work.
// Convert the trie tree into a word dictionary
// For example:
// {"你好":14,"大楼":45}
// Here you can set your own rules,
// such as a len threshold, stopping when len changes abnormally, etc.
// Example:
// {"你": {
//     "好": {
//         "明": {
//             "天": {
//                 len: 2
//             },
//             len: 2
//         },
//         len: 14
//     },
//     len: 14
// }}
// In this case, we can determine that the word ends at "好",
// because the weight of the subsequent "明" is only 2, clearly lower than the 14 before it.
function combine(obj, str) {
    var hasChild = false;
    var pow = obj.len;
    for (var key in obj) {
        if (key === 'len') {
            continue;
        }
        // Only follow children that occur often enough (6 is the frequency threshold used here)
        if (obj[key].len <= 6) {
            continue;
        }
        hasChild = true;
        combine(obj[key], str + key);
    }
    // If no frequent child extends this path, record it as a word candidate
    if (!hasChild && str.length >= 2) {
        tempWords[str] = tempWords[str] || 0;
        tempWords[str] = Math.max(tempWords[str], pow);
    }
}
Next, it’s time to output the results.
// Output results and sort by word frequency
function wordsToArrAndRank() {
    var wordsArr = [];
    var keys = [];
    for (var i in tempWords) {
        keys.push(i);
    }
    keys = '|' + keys.join('|') + '|';
    for (var i in tempWords) {
        // Note these regexes: they were added to satisfy the competition rules
        // and could actually be removed.
        // The rule is:
        // for "苹果" and "青苹果", keep only "青苹果";
        // for "宝石" and "红宝石", keep only "红宝石".
        // Of course, you could also compare weights here; I will not demonstrate that
        var noLeft = !RegExp('[^|]+' + i + '\\|').test(keys);
        var noRight = !RegExp('\\|+' + i + '[^|]+').test(keys);
        if (noLeft && noRight) {
            wordsArr.push([i, tempWords[i]]);
        }
    }
    // Sort by word frequency (ascending)
    wordsArr.sort(function (a, b) {
        return a[1] - b[1];
    });
    // Output
    console.log(JSON.stringify(wordsArr));
    console.log(wordsArr);
}
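As a quick illustration of that prefix/suffix filter (the words and counts below are made up for the example):
// Made-up example of the filter: "苹果" is the tail of "青苹果", so only "青苹果" survives
var demoWords = { '苹果': 20, '青苹果': 12 };
var demoKeys = '|' + Object.keys(demoWords).join('|') + '|'; // "|苹果|青苹果|"
console.log(RegExp('[^|]+' + '苹果' + '\\|').test(demoKeys));   // true  -> "苹果" is dropped
console.log(RegExp('[^|]+' + '青苹果' + '\\|').test(demoKeys)); // false -> "青苹果" passes this check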
Below are the results obtained using this algorithm (only the top 20 results are shown for reference):
['必须', 59 ],
[ '第一次', 60 ],
[ '眼睛', 60 ],
[ '首先', 61 ],
[ '来自', 62 ],
[ '三体世界', 63 ],
[ '似乎', 64 ],
[ '立刻', 64 ],
[ '变得', 65 ],
[ '突然', 69 ],
[ '信息', 69 ],
[ '字幕', 70 ],
[ '虽然', 76 ],
[ '东方延绪', 81 ],
[ '几乎', 81 ],
[ '正在', 85 ],
[ '完全', 87 ],
[ '自然选择', 91 ],
[ '面壁计划', 94 ],
[ '雷迪亚兹', 195 ]
Appendix: code portal