Chardet: The Universal Character Encoding Detector

Detects:

ASCII, UTF-8, UTF-16 (2 variants), UTF-32 (4 variants)
Big5, GB2312, EUC-TW, HZ-GB-2312, ISO-2022-CN (Traditional and Simplified Chinese)
EUC-JP, SHIFT_JIS, CP932 (aka MS932), ISO-2022-JP (Japanese)
EUC-KR, ISO-2022-KR (Korean)
KOI8-R, x-mac-cyrillic (previously MacCyrillic), IBM855, IBM866, ISO-8859-5, windows-1251 (Cyrillic)
ISO-8859-5, windows-1251 (Bulgarian)
ISO-8859-1, windows-1252 (Western European languages)
ISO-8859-7, windows-1253 (Greek)
ISO-8859-8, windows-1255 (Visual and Logical Hebrew)
TIS-620 (Thai)

Note:
The ISO-8859-2 and windows-1250 (Hungarian) probers have been temporarily disabled until the models can be retrained.

About

This is a port to Go of the excellent Python chardet library <https://github.com/chardet/chardet>. It is based on the Mozilla statistical encoding detector. v0.0.7 is based on the chardet version 0.0.4 (Dec 20).

Usage

The simplest way to use chardet is the package-level exported Detect function:

package main

import (
	"fmt"
	"github.com/olaure/chardet"
)

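func main() {
	// Raw bytes whose encoding is unknown (here, UTF-8 encoded Devanagari text).
	data := []byte("नमस्कार")
	// Detect returns the most likely encoding along with a confidence score.
	detected := chardet.Detect(data)
	fmt.Printf(
		"Detected character set : %v with confidence %v\n",
		detected.Encoding, detected.Confidence,
	)
}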

Another option is the DetectShortestUTF8 function, which looks for the decoding whose resulting string has the lowest count of characters in the Unicode categories C (control), S (symbol), and P (punctuation):

package main

import (
	"fmt"
	"github.com/olaure/chardet"
)

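func main() {
	// Raw bytes whose encoding is unknown (here, UTF-8 encoded Devanagari text).
	data := []byte("नमस्कार")
	// Prefer the candidate decoding with the fewest C, S and P category characters.
	detected := chardet.DetectShortestUTF8(data)
	fmt.Printf(
		"Detected character set : %v with confidence %v\n",
		detected.Encoding, detected.Confidence,
	)
}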

This function will therefore not necessarily return the encoding with the highest probability, unless that probability is the maximum possible.
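
To make the category-counting idea concrete, here is a minimal sketch using only the Go standard library. It is not the code used by this package: the helper names countCSP and pickLowestCSP are hypothetical, and the candidate strings are assumed to have already been decoded to UTF-8 by some other means. It only illustrates why a wrong decoding tends to score higher: mis-decoded bytes often turn into symbols or control characters.

```go
package main

import (
	"fmt"
	"unicode"
)

// countCSP counts the runes in s that fall in the Unicode general
// categories C (other/control), S (symbol) or P (punctuation).
// Hypothetical helper, not part of the chardet package.
func countCSP(s string) int {
	n := 0
	for _, r := range s {
		if unicode.In(r, unicode.C, unicode.S, unicode.P) {
			n++
		}
	}
	return n
}

// pickLowestCSP returns the candidate with the fewest C/S/P runes.
// Ties go to the earliest candidate; an empty slice yields "".
func pickLowestCSP(candidates []string) string {
	best := ""
	bestCount := -1
	for _, c := range candidates {
		if n := countCSP(c); bestCount < 0 || n < bestCount {
			best, bestCount = c, n
		}
	}
	return best
}

func main() {
	// The same two bytes (0xC3 0xA9) read as UTF-8 and as ISO-8859-1.
	asUTF8 := "café"
	asLatin1 := "cafÃ©"
	fmt.Println(countCSP(asUTF8), countCSP(asLatin1))
	fmt.Println(pickLowestCSP([]string{asLatin1, asUTF8}))
}
```

In this sketch the ISO-8859-1 reading contains the symbol "©", so the UTF-8 reading is preferred even though no probability model is consulted.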

