Changes between Version 3 and Version 4 of CharedTool


Timestamp: Oct 26, 2024, 10:19:49 PM (8 months ago)
Author: qstengl
Comment: --

  • CharedTool

    v3  v4
     4   4  Chared is a software tool that detects the character encoding of a text document, provided the language of the document is known. The language of the text has to be specified as an input parameter so that the corresponding language model can be used. The package contains models for a wide range of languages (currently 57, covering all major languages). Furthermore, it provides a training script to learn models for additional languages from a set of user-supplied sample HTML pages in the given language. The detection algorithm is based on determining the similarity of byte trigram vectors. In general, Chared should be more accurate than character encoding detection tools that work without language constraints. This is an important advantage, allowing the precise character decoding needed for building large textual corpora. The tool has been used for building corpora in American Spanish, Arabic, Czech, French, Japanese, Russian, Tajik, and six Turkic languages, consisting of 70 billion tokens altogether.
     5   5
     6      == Handle
     7      [http://hdl.handle.net/11858/00-097C-0000-000D-F67A-9]
         6  == Source
         7  [https://corpus.tools/wiki/Chared]
     8   8
     9   9  == Acknowledgements
    10  10  This software has been developed at the [http://nlp.fi.muni.cz/en/nlpc Natural Language Processing Centre] of [http://www.muni.cz/ Masaryk University in Brno] with financial support from [http://presemt.eu PRESEMT] and [http://www.sketchengine.co.uk Lexical Computing Ltd.]
        11
        12  If you use the system, please cite the related publication as well as the LINDAT/CLARIAH infrastructure: [http://hdl.handle.net/11858/00-097C-0000-000D-F67A-9]
    11  13
    12  14  {{{