.srt character encoding troubles

Forum rules
Please make sure you follow the Problem Reporting Guidelines before posting if you want a reply
mikeaoller
Posts: 31
Joined: Sat Sep 22, 2012 9:09 am

Post by mikeaoller » Thu May 23, 2019 1:00 am

Hi,

when trying to find working Russian subtitles for the current GoT season, I came across at least 3 types of differently encoded .srt files. Here are some examples:
1
00:00:00,000 --> 00:00:04,162
WWW.MY-SUBS.COM

1
00:00:02,029 --> 00:00:04,300
Мы приносили друг другу страдания.

2
00:00:04,529 --> 00:00:07,237
Мы отбирали друг у друга любимых.

00:00:00,000 --> 00:00:04,162
WWW.MY-SUBS.COM

1
00:00:01,578 --> 00:00:03,827
Ã˚ Ô˘ËÌˇÎË ÁÎÓ ‰Û„ ‰Û„Û.

2
00:00:04,494 --> 00:00:07,202
Ã˚ Î˯ËÎË ‰Û„ ‰Û„‡ ÚÂı,
ÍÓ„Ó Ï˚ β·ËÎË.

00:00:00,000 --> 00:00:04,162
WWW.MY-SUBS.COM

1
00:02:06,973 --> 00:02:08,015
ОстороР♪РЅРѕ.

2
00:02:10,307 --> 00:02:11,307
Шевелись!

Why is that? What are the differences? Why do only some of them work in UMS? How do I convert non-working files (either simply not showing when playing the video or being recognised as "unknown", although they should all be Russian) to a working standard? What would the working standard be?

Please, dear developers, shed a bit of light on this.

Thank you & best wishes,
Jonathan

Nadahar
Posts: 1476
Joined: Tue Jun 09, 2015 5:57 pm

Re: .srt character encoding troubles

Post by Nadahar » Thu May 23, 2019 7:51 am

It all depends on the character encoding used in the .srt (text) files. UMS will try to auto-detect the encoding, but this auto-detection is quite unreliable. The reason is that there's no way to know which character encoding is correct without judging whether the resulting text makes sense. To do that, you'd need to try all the different encodings, and then have some kind of AI analyze which resulting text makes the most sense. As such, converting them to UTF-8 is probably a good idea.
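To illustrate the "try the encodings and see which result makes sense" idea, here is a minimal Python sketch. It is not how UMS detects encodings; the candidate list, the function names and the scoring rule (share of non-ASCII characters that are lowercase Russian letters) are all assumptions made for this example, and the heuristic only makes sense for text that is expected to be Russian.

```python
# Hypothetical helper, NOT UMS code: guess the encoding of Russian text by
# decoding the raw bytes with several candidates and scoring each result.
CANDIDATES = ["utf-8-sig", "cp1251", "koi8_r", "cp866", "mac_cyrillic"]


def russian_score(text):
    """Share of non-ASCII characters that are lowercase Russian letters."""
    non_ascii = [ch for ch in text if ord(ch) > 127]
    if not non_ascii:
        return 0.0
    lower = sum(1 for ch in non_ascii if "а" <= ch <= "я" or ch == "ё")
    return lower / len(non_ascii)


def guess_encoding(raw, candidates=CANDIDATES):
    """Return (encoding, score) for the best-scoring candidate."""
    best = (None, -1.0)
    for enc in candidates:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding
        score = russian_score(text)
        if score > best[1]:
            best = (enc, score)
    return best


if __name__ == "__main__":
    sample = "Мы приносили друг другу страдания.".encode("cp1251")
    print(guess_encoding(sample))
```

A wrong candidate often decodes without any error and simply produces different letters, which is why a plausibility score is needed at all, and why a short or unusual file can easily be misjudged.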

When it comes to which language UMS recognizes them as, there's also an auto-detection, but this is even less reliable. It can only detect relatively few languages, and relies on a statistical analysis of the content using n-grams. This leaves a lot to be desired. There's an easy way for you to tell UMS (and some other programs) what language the subtitles are, though: simply add the ISO 639 two- or three-letter code to the filename just before the extension. Example:

myvideo.srt --> myvideo.rus.srt
When the language is specified in the file-name, auto-detection won't even be attempted, so it will always be correct.
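For a batch of files, the renaming can be scripted. The following is a small Python sketch; the function name `with_language_code` is made up for this example, and it only computes the new name following the pattern shown above.

```python
from pathlib import Path


def with_language_code(path, code):
    """Return the path with the language code inserted before the extension."""
    p = Path(path)
    # stem is the file name without its last suffix, suffix is e.g. ".srt"
    return p.with_name(f"{p.stem}.{code}{p.suffix}")


if __name__ == "__main__":
    print(with_language_code("myvideo.srt", "rus"))
    # To actually rename a file on disk:
    # Path("myvideo.srt").rename(with_language_code("myvideo.srt", "rus"))
```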

mikeaoller

Re: .srt character encoding troubles

Post by mikeaoller » Thu May 23, 2019 7:21 pm

Thank you, Nadahar!

I really wonder if the developers / moderators are on holiday at the moment?
Nadahar wrote:
Thu May 23, 2019 7:51 am
It all depends on the character encoding used in the .srt (text) files. UMS will try to auto-detect the encoding, but this auto-detection is quite unreliable. The reason is that there's no way to know what character encoding is valid without interpreting the resulting text as being valid or not. To do that, you'd need to try all different encodings, and then have some kind of an AI analyze what resulting text makes most sense. As such, converting them to UTF-8 is probably a good idea.
All three files are, as stated by Notepad++, encoded as UTF-8-BOM. Nonetheless the characters show up completely differently. For Russian subs I'd expect Cyrillic characters as in the example quoted first...but no. Funnily enough, the .srt files actually showing Cyrillic characters are recognised as "unknown" and don't work. So?

I'm really, really missing unequivocal and transparent guidelines on how a file needs to be encoded for which language to work with UMS.
When it comes to which language UMS recognizes them as, there's also an auto-detection, but this is even less reliable. It can only detect relatively few languages, and relies on a statistical analysis of the content using n-grams. This leaves a lot to be desired. There's an easy way for you to tell UMS (and some other programs) what language the subtitles are though, simply add the ISO-639 two or three letter code to the filename just before the extension. Example

myvideo.srt --> myvideo.rus.srt
When the language is specified in the file-name, auto-detection won't even be attempted, so it will always be correct.
That's how I renamed all the files. Nonetheless UMS recognises only two of six as "Russian", the others are detected as "unknown".

I would really need a way to convert the files to something usable by UMS...

Nadahar

Re: .srt character encoding troubles

Post by Nadahar » Fri May 24, 2019 12:45 am

If the naming doesn't work, there must be some bug. When it comes to usable, did you convert them into UTF-8? That can be done using, for example, Notepad++.
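The same conversion can also be scripted. Here is a minimal Python sketch, assuming you already know the real source encoding (`cp1251` in the usage line is only a guess for Russian subtitles, and `convert_to_utf8` and the file names are made up for this example). It writes UTF-8 without a BOM and leaves the line endings untouched.

```python
from pathlib import Path


def convert_to_utf8(src, dst, source_encoding):
    """Re-encode a text file from source_encoding to UTF-8 (no BOM)."""
    raw = Path(src).read_bytes()
    # Raises UnicodeDecodeError if the bytes are invalid in source_encoding.
    # A wrong but "compatible" encoding will NOT raise; it yields wrong letters.
    text = raw.decode(source_encoding)
    # Writing bytes avoids any newline translation by the platform.
    Path(dst).write_bytes(text.encode("utf-8"))


if __name__ == "__main__":
    # Hypothetical file names; "cp1251" is an assumption about the source file.
    convert_to_utf8("myvideo.rus.srt", "myvideo.utf8.rus.srt", "cp1251")
```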

mikeaoller

Re: .srt character encoding troubles

Post by mikeaoller » Fri May 24, 2019 9:50 pm

mikeaoller wrote:
Thu May 23, 2019 7:21 pm
All three files are, as stated by Notepad++, encoded as UTF-8-BOM. Nonetheless the characters show up completely differently. For Russian subs I'd expect Cyrillic characters as in the example quoted first...but no. Funnily enough, the .srt files actually showing Cyrillic characters are recognised as "unknown" and don't work. So?

I'm really, really missing unequivocal and transparent guidelines on how a file needs to be encoded for which language to work with UMS.
As I said. All files are UTF-8-BOM. 2 of them work, 4 others don't.

Nadahar

Re: .srt character encoding troubles

Post by Nadahar » Fri May 24, 2019 11:14 pm

Something isn't right - if they are all UTF-8 with BOM but only some show Cyrillic characters, they aren't REALLY UTF-8. It is possible to add a UTF-8 BOM (the BOM is just some "magic bytes" at the very start of the file) even if the encoding is really something else. However, most software will trust the BOM and interpret the file as UTF-8. If the actual encoding isn't UTF-8, characters will show as "garbage".
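To check this on a file, one can look at the BOM and the UTF-8 validity separately. The Python sketch below does that; the function names are made up for this example. The second function covers a different possibility that is only an assumption here and has not been confirmed for the attached files: text that was decoded with the wrong encoding once and then saved again as valid UTF-8. The third sample in the first post looks like that pattern (UTF-8 bytes read as Windows-1251).

```python
import codecs


def bom_and_utf8_status(raw):
    """Return (has_utf8_bom, rest_is_valid_utf8) for raw file bytes."""
    has_bom = raw.startswith(codecs.BOM_UTF8)
    body = raw[len(codecs.BOM_UTF8):] if has_bom else raw
    try:
        body.decode("utf-8")
        valid = True
    except UnicodeDecodeError:
        valid = False
    return has_bom, valid


def undo_double_encoding(text, wrong_encoding="cp1251"):
    """Reverse 'UTF-8 bytes were read as wrong_encoding' (assumed scenario)."""
    # Raises UnicodeEncodeError/UnicodeDecodeError if the assumption is wrong.
    return text.encode(wrong_encoding).decode("utf-8")


if __name__ == "__main__":
    fake_utf8 = codecs.BOM_UTF8 + "Мы".encode("cp1251")
    print(bom_and_utf8_status(fake_utf8))  # BOM present, content not UTF-8
    garbled = "Шевелись!".encode("utf-8").decode("cp1251")
    print(garbled, "->", undo_double_encoding(garbled))
```

If the first check reports a BOM and valid UTF-8 but the text is still unreadable, the damage happened before the file was saved, and converting it to UTF-8 again will not help.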

When it comes to "the developers", they often aren't very responsive. That said, I used to be a UMS developer, and I wrote most of the subtitles code currently in use. If I'm not good enough for you, I doubt you will find fulfillment here. Another reason why the developers don't answer might be that you haven't followed the problem reporting guidelines as stated in red above. In most cases, like here, all anyone can do without the debug files is guess - and that's often not worth the time.

If you want this figured out, you should attach the debug files. In addition, you should zip at least one working and one non-working SRT file and attach that zip file. With that, I can probably find the actual reason it doesn't work instead of guessing.
Last edited by Nadahar on Sat May 25, 2019 12:24 am, edited 1 time in total.

squadjot
Moderator
Posts: 651
Joined: Fri Jun 01, 2012 4:24 am

Re: .srt character encoding troubles

Post by squadjot » Fri May 24, 2019 11:24 pm

Nadahar, I just want to say thank you for being active on the forum.
It's a shame you aren't on the development team anymore, but it's so great that you stick around anyway.
Your knowledge, insights and willingness to help other users is priceless.

Nadahar

Re: .srt character encoding troubles

Post by Nadahar » Fri May 24, 2019 11:46 pm

Thank you for your kind words, squadjot :)

mikeaoller

Re: .srt character encoding troubles

Post by mikeaoller » Sat May 25, 2019 12:20 am

Nadahar wrote:
Fri May 24, 2019 11:14 pm
Another reason why the developers don't answer might be that you haven't followed the problem reporting guidelines as stated in red above. In most cases, like here, all anyone can do without the debug files is guess - and that's often not worth the time.

If you want this figured out, you should attach the debug files. In addition, you should zip at least one working and one non-working SRT file and attach that zip file. With that, I can probably find the actual reason it doesn't work instead of guessing.
At first, this wasn't a support request, but only a request for clarification. But within the next hour or so, I'll upload the debug files and two srt files here. Thank you for helping out.

mikeaoller

Re: .srt character encoding troubles

Post by mikeaoller » Sat May 25, 2019 1:30 am

Here are the requested files. The srt.zip contains 3 srt files in 3 different encodings; only the first one works properly, although it shows the most "garbage". I named the files "working_..." and "not_working...".
Attachments
ums_dbg.zip
(91.62 KiB) Downloaded 36 times
srt.zip
(39.91 KiB) Downloaded 37 times
