I downloaded the entire English Wikipedia dump and extracted all paragraphs into separate sentences in a single 15 GB plain-text file (the uploaded file is compressed to 4 GB). Using the simple application, you can find at most 1000 sentences containing a given phrase.
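For anyone curious how such a lookup might work, here is a minimal Python sketch that streams the file line by line and stops after 1000 matches. The file name `Single.txt` and the function name are illustrative assumptions, not code from the actual application.

```python
# Minimal sketch of a phrase lookup over a one-sentence-per-line file.
# "Single.txt" and the 1000-result cap mirror the description above;
# everything else is just an illustration.
MAX_RESULTS = 1000

def find_sentences(path: str, phrase: str, limit: int = MAX_RESULTS) -> list[str]:
    """Stream the file and collect up to `limit` sentences containing `phrase`."""
    matches = []
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        for line in f:
            if phrase in line:
                matches.append(line.rstrip("\n"))
                if len(matches) >= limit:
                    break  # stop early; the file is ~15 GB
    return matches

if __name__ == "__main__":
    for sentence in find_sentences("Single.txt", "machine learning"):
        print(sentence)
```

Streaming line by line keeps memory flat no matter how large the file is, which matters for a 15 GB input.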
There appear to be 4.89 million duplicate lines, or approximately 450 MB of wasted file size.
Wiki_Single_Sentences.exe — why won't this program run?
What is the error message?
Could you please describe the exact defect, so that I can work on it?
There are millions of identical sentences in the Single.txt file. You can use EmEditor (30-day trial) to deduplicate the file. Notepad++ may be able to do the same.
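If you prefer a scriptable route instead of an editor, here is a rough Python sketch that keeps only an 8-byte hash of each line it has already seen, writing the first occurrence of every sentence to a new file. The file names are illustrative; with a couple hundred million unique lines the hash set will still need several GB of RAM, so an external `sort -u` is the lower-memory alternative.

```python
# Rough dedupe sketch: keep an 8-byte hash per unique line instead of the
# full sentence text. There is a small (astronomically unlikely) risk that
# a hash collision drops a distinct sentence.
import hashlib

def dedupe(src: str, dst: str) -> None:
    seen = set()
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        for line in fin:
            digest = hashlib.blake2b(line, digest_size=8).digest()
            if digest not in seen:
                seen.add(digest)
                fout.write(line)  # write only the first occurrence

dedupe("Single.txt", "Single_dedup.txt")
```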