[BUG] A mix of 8-bit/16-bit chars sent to iconv #1451

erankor · 2022-08-24T06:49:56Z

Necessary information

Is this a regression (i.e. did it work before)? NO
What platform did you use? Linux
What were the used arguments? ./ccextractor test.ts -svc all[UTF-16BE] -nofc -12

Video links

http://cdnapi.kaltura.com/p/2035982/playManifest/entryId/1_frxnu0yr/flavorId/1_tr3kiz6l/format/download/a.ts

Additional information

Hi all,

I have a TS file with 708 subtitles in Japanese & Chinese that fails to decode properly.
After some debugging, I found that if I patch the function write_utf16_char here -
https://github.com/CCExtractor/ccextractor/blob/master/src/lib_ccx/ccx_decoders_708_output.c#L113
to always output 2-byte chars (I changed the if to if (1)), and specify an encoding of UTF-16BE, it decodes properly.

This code looks off to me, as it creates a mix of 8-bit & 16-bit chars with no clear encoding (it's neither UTF-8 nor UTF-16).
Maybe when iconv is used, the function should always output 2-byte chars?
Alternatively, it could use 2 bytes for ALL chars if ANY char doesn't fit in 1 byte, which would also be OK (but this sounds more complex to do).
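
To make the first suggestion concrete, here is a minimal sketch of a fixed-width writer. It is not the CCExtractor code: put_utf16be and encode_utf16be are hypothetical names chosen for illustration, and the sketch assumes the caller already has 16-bit code units and a buffer of at least 2 bytes per unit.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not the actual CCExtractor function): always emits
 * exactly two bytes per 16-bit code unit, high byte first (big-endian). */
static size_t put_utf16be(uint16_t unit, unsigned char *out)
{
    out[0] = (unsigned char)(unit >> 8);   /* 0x00 for ASCII-range chars */
    out[1] = (unsigned char)(unit & 0xFF); /* low byte */
    return 2;
}

/* Encodes n code units into out, which must hold at least 2 * n bytes.
 * Returns the number of bytes written. */
static size_t encode_utf16be(const uint16_t *units, size_t n, unsigned char *out)
{
    size_t len = 0;
    for (size_t i = 0; i < n; i++)
        len += put_utf16be(units[i], out + len);
    return len;
}
```

With this shape, a space (U+0020) is written as 00 20 rather than a lone 20, so every char occupies the same width and a UTF-16BE consumer such as iconv stays aligned.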

Btw, VLC decodes the Japanese & Chinese properly, after changing the 'preferred closed captions decoder' setting from 608 to 708.

Thanks!

Eran


PunitLodha · 2022-08-24T13:10:34Z

Could you share the output of ccextractor --version?

erankor · 2022-08-24T13:17:30Z

./ccextractor --version
CCExtractor 0.89, Carlos Fernandez Sanz, Volker Quetschke.
Teletext portions taken from Petr Kutalek's telxcc
--------------------------------------------------------------------------
CCExtractor detailed version info
        Version: 0.89
        Git commit: b793f16343dc442bcb977387fcef08195e71dd7c
        Compilation date: 2022-08-23
        File SHA256: 259ccd18d508a3aed03149080853f98d1bce57672ce20c9b715953227621c9d9
Libraries used by CCExtractor
        Tesseract Version: 3.03
        Leptonica Version: leptonica-1.70
        libGPAC Version: 1.0.1
        zlib: 1.2.11
        utf8proc Version: 2.4.0
        protobuf-c Version: 1.3.1
        libpng Version: 1.6.37
        FreeType
        libhash
        nuklear
        libzvbi

PunitLodha · 2022-08-24T13:20:15Z

You are using version 0.89. Could you try the latest version (0.94)?

erankor · 2022-08-24T14:03:03Z

I reverted my change and pulled the latest master. It now decodes something (which is better than the previous version, IIRC), but every space in the text still messes it up, and I get some non-printable chars in the output.

Output without any code changes -
1
00:00:01,068 --> 00:00:03,770
人々が私を知‰挰弰栰䴰Ź섰漠時間管理につい‰晦<U+F830>䐰昰䐰縰

Output after forcing write_utf16_char to always write 2 bytes -
1
00:00:01,068 --> 00:00:03,770
人々が私を知ったとき、私は時間管理について書いています

I don't speak Japanese myself :) but google translate can confirm the fixed version is better.
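
A small sketch of why a single one-byte char can derail everything after it. The mixed-width writer below is only modeled on the behaviour described in this issue (it is hypothetical, not copied from CCExtractor), and decode_utf16be stands in for any consumer that reads the stream as UTF-16BE, two bytes at a time.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mixed-width writer, modeled on the behaviour described in
 * this issue: one byte when the high byte is zero, two bytes otherwise. */
static size_t encode_mixed(const uint16_t *units, size_t n, unsigned char *out)
{
    size_t len = 0;
    for (size_t i = 0; i < n; i++) {
        if (units[i] >> 8)
            out[len++] = (unsigned char)(units[i] >> 8);
        out[len++] = (unsigned char)(units[i] & 0xFF);
    }
    return len;
}

/* Reads the byte stream as UTF-16BE, two bytes per code unit.
 * Returns the number of whole units; *leftover receives the count of
 * dangling bytes at the end (0 or 1). */
static size_t decode_utf16be(const unsigned char *in, size_t len,
                             uint16_t *units, size_t *leftover)
{
    size_t n = 0;
    for (size_t i = 0; i + 1 < len; i += 2)
        units[n++] = (uint16_t)((in[i] << 8) | in[i + 1]);
    *leftover = len % 2;
    return n;
}
```

Feeding U+3042, U+0020, U+3044 through encode_mixed gives the bytes 30 42 20 30 44. Read back in pairs, that is U+3042, then U+2030 (the lone 20 glued to the next char's high byte), then one dangling byte. U+2030 is the ‰ sign, which also shows up in the garbled output above, so that output looks consistent with this kind of misalignment.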

Current version -

./ccextractor --version
CCExtractor 0.94, Carlos Fernandez Sanz, Volker Quetschke.
Teletext portions taken from Petr Kutalek's telxcc
--------------------------------------------------------------------------
CCExtractor detailed version info
        Version: 0.94
        Git commit: 4cb474c5a36b61bafec4a2379c4d0b240e44359b
        Compilation date: 2022-08-24
        CEA-708 decoder: C
        File SHA256: 8fd4f5625eb6aadb30532a2ff9f29adaec4b60a77916e3f001d5f4e59d4d08e9
Libraries used by CCExtractor
        Tesseract Version: 3.03
        Leptonica Version: leptonica-1.70
        libGPAC Version: 1.0.1
        zlib: 1.2.11
        utf8proc Version: 2.4.0
        protobuf-c Version: 1.3.1
        libpng Version: 1.6.37
        FreeType
        libhash
        nuklear
        libzvbi

PunitLodha · 2022-08-24T17:54:12Z

You could send a PR. If it doesn't cause any issues with the other tests, we can merge it.

ArchitBhonsle · 2023-02-26T12:31:46Z

Was this fixed? I could make a simple pull request with the specified changes.

cfsmp3 · 2023-02-26T18:37:37Z

> Was this fixed? I could make a simple pull request with the specified changes.

Probably not if it's still open :-)
Feel free to give it a shot.

prateekmedia · 2023-09-26T18:10:21Z

Created a PR: #1571

cfsmp3 added the GSOC-2023 label Mar 17, 2023

prateekmedia mentioned this issue Sep 26, 2023

[FIX] Always write 2 bytes for UTF-16BE #1571

Closed
