[BUG] A mix of 8-bit/16-bit chars sent to iconv #1451

Open
erankor opened this issue Aug 24, 2022 · 8 comments

Comments

@erankor

erankor commented Aug 24, 2022

Necessary information

  • Is this a regression (i.e. did it work before)? NO
  • What platform did you use? Linux
  • What were the used arguments? ./ccextractor test.ts -svc all[UTF-16BE] -nofc -12

Video links

http://cdnapi.kaltura.com/p/2035982/playManifest/entryId/1_frxnu0yr/flavorId/1_tr3kiz6l/format/download/a.ts

Additional information

Hi all,

I have a TS file with CEA-708 subtitles in Japanese and Chinese that fails to decode properly.
After some debugging, I found that if I patch the function write_utf16_char here -
https://github.com/CCExtractor/ccextractor/blob/master/src/lib_ccx/ccx_decoders_708_output.c#L113
to always output 2-byte chars (I changed the condition to if (1)) and specify an encoding of UTF-16BE, it decodes properly.

This code looks off to me, as it creates a mix of 8-bit and 16-bit chars with no clear encoding (it's neither UTF-8 nor UTF-16).
Maybe when iconv is used, the function should always output 2-byte chars?
Alternatively, it could use 2 bytes for ALL chars if there is ANY char that doesn't fit in 1 byte, which would also be OK (but this sounds more complex to do).
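
To make the two behaviours concrete, here is a minimal sketch in C. It is not the actual write_utf16_char source: the function names write_char_mixed and write_char_utf16be are hypothetical, and the first one only models the mixed-width behaviour as described above (one byte when the value fits in 8 bits, two bytes otherwise). The second models the proposed change of always writing a full big-endian 16-bit unit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the reported behaviour: values that fit in 8 bits
 * are written as a single byte, everything else as two big-endian bytes.
 * Returns the number of bytes written to out. */
size_t write_char_mixed(uint16_t code_unit, unsigned char *out)
{
    assert(out != NULL);
    if (code_unit > 0xFF) {
        out[0] = (unsigned char)(code_unit >> 8);
        out[1] = (unsigned char)(code_unit & 0xFF);
        return 2;
    }
    out[0] = (unsigned char)code_unit;
    return 1;
}

/* Hypothetical model of the proposed fix: every code unit is written as
 * two big-endian bytes, so the stream is valid UTF-16BE input for iconv.
 * Returns the number of bytes written to out (always 2). */
size_t write_char_utf16be(uint16_t code_unit, unsigned char *out)
{
    assert(out != NULL);
    out[0] = (unsigned char)(code_unit >> 8);
    out[1] = (unsigned char)(code_unit & 0xFF);
    return 2;
}
```

With these definitions, a space (U+0020) becomes the single byte 20 in the mixed variant and the two bytes 00 20 in the fixed-width variant, while a CJK character such as U+77E5 becomes 77 E5 in both.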

Btw, VLC decodes the Japanese & Chinese properly, after changing the 'preferred closed captions decoder' setting from 608 to 708.

Thanks!

Eran

@PunitLodha
Member

Could you share the output of ccextractor --version?

@erankor
Author

erankor commented Aug 24, 2022

./ccextractor --version
CCExtractor 0.89, Carlos Fernandez Sanz, Volker Quetschke.
Teletext portions taken from Petr Kutalek's telxcc
--------------------------------------------------------------------------
CCExtractor detailed version info
        Version: 0.89
        Git commit: b793f16343dc442bcb977387fcef08195e71dd7c
        Compilation date: 2022-08-23
        File SHA256: 259ccd18d508a3aed03149080853f98d1bce57672ce20c9b715953227621c9d9
Libraries used by CCExtractor
        Tesseract Version: 3.03
        Leptonica Version: leptonica-1.70
        libGPAC Version: 1.0.1
        zlib: 1.2.11
        utf8proc Version: 2.4.0
        protobuf-c Version: 1.3.1
        libpng Version: 1.6.37
        FreeType
        libhash
        nuklear
        libzvbi

@PunitLodha
Member

You are using version 0.89. Could you try the latest version (0.94)?

@erankor
Author

erankor commented Aug 24, 2022

I reverted my change and pulled the latest master. It now decodes something (which is better than the previous version, IIRC), but every space in the text still throws the decoding off, and I get some non-printable chars in the output.

Output without any code changes -
1
00:00:01,068 --> 00:00:03,770
人々が私を知‰挰弰栰䴰Ź섰漠時間管理につい‰晦<U+F830>䐰昰䐰縰

Output after forcing write_utf16_char to always write 2 bytes per char -
1
00:00:01,068 --> 00:00:03,770
人々が私を知 ったとき、私は 時間管理につい て書いています

I don't speak Japanese myself :) but Google Translate confirms the fixed version is better.
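
The corruption after each space is consistent with a byte misalignment, which the following C sketch reproduces. It rests on assumptions: encode_mixed is a hypothetical model of the mixed-width writer described in this issue, and decode_utf16be is a simplified stand-in for a UTF-16BE decoder such as iconv (it just pairs up bytes). Feeding it the start of the second phrase, 知 (U+77E5), a space (U+0020), っ (U+3063), た (U+305F), produces the bytes 77 E5 20 30 63 30 5F. Read two at a time, the space and the first byte of っ combine into U+2030 (‰), the character that follows 知 in the unpatched output, and every later character is shifted by one byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the mixed-width writer: one byte for values that
 * fit in 8 bits, two big-endian bytes otherwise. Returns the byte count. */
size_t encode_mixed(const uint16_t *units, size_t n, unsigned char *out)
{
    size_t len = 0;
    for (size_t i = 0; i < n; i++) {
        if (units[i] > 0xFF)
            out[len++] = (unsigned char)(units[i] >> 8);
        out[len++] = (unsigned char)(units[i] & 0xFF);
    }
    return len;
}

/* Simplified stand-in for a UTF-16BE decoder: consumes two bytes per code
 * unit and drops a trailing odd byte. Returns the number of units read. */
size_t decode_utf16be(const unsigned char *in, size_t len, uint16_t *units)
{
    size_t n = 0;
    for (size_t i = 0; i + 1 < len; i += 2)
        units[n++] = (uint16_t)((in[i] << 8) | in[i + 1]);
    return n;
}
```

Until the first single-byte character, the decoded units match the input. After it, the decoder stays out of step until another single-byte character happens to restore alignment, which would explain why only the text between pairs of spaces is garbled.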

Current version -

./ccextractor --version
CCExtractor 0.94, Carlos Fernandez Sanz, Volker Quetschke.
Teletext portions taken from Petr Kutalek's telxcc
--------------------------------------------------------------------------
CCExtractor detailed version info
        Version: 0.94
        Git commit: 4cb474c5a36b61bafec4a2379c4d0b240e44359b
        Compilation date: 2022-08-24
        CEA-708 decoder: C
        File SHA256: 8fd4f5625eb6aadb30532a2ff9f29adaec4b60a77916e3f001d5f4e59d4d08e9
Libraries used by CCExtractor
        Tesseract Version: 3.03
        Leptonica Version: leptonica-1.70
        libGPAC Version: 1.0.1
        zlib: 1.2.11
        utf8proc Version: 2.4.0
        protobuf-c Version: 1.3.1
        libpng Version: 1.6.37
        FreeType
        libhash
        nuklear
        libzvbi

@PunitLodha
Member

You could send a PR. If it doesn't cause any issues with the other tests, we can merge it.

@ArchitBhonsle
Contributor

Was this fixed? I could make a simple pull request with the specified changes.

@cfsmp3
Contributor

cfsmp3 commented Feb 26, 2023

> Was this fixed? I could make a simple pull request with the specified changes.

Probably not if it's still open :-)
Feel free to give it a shot.

@prateekmedia
Member

Created a PR: #1571
