Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UTF-8 in JSON response isn't handled #97

Open
grv87 opened this issue Feb 24, 2015 · 12 comments
Open

UTF-8 in JSON response isn't handled #97

grv87 opened this issue Feb 24, 2015 · 12 comments

Comments

@grv87
Copy link

grv87 commented Feb 24, 2015

I use the following Request:

    Request.Method = WebMethod.HttpPost
    Request.ResponseFormat = WebFormat.Json

Response is JSON with strings in UTF-8, which aren't converted into VBA strings.

@timhall
Copy link
Member

timhall commented Mar 2, 2015

Hi @grv87, thanks for bringing this up, I've been worried about this issue as I hadn't set up good ways to test it just yet.

Can you provide some more details on what was expected vs actual? Did it complete parsing the request/response without errors, but parse it incorrectly or were errors thrown when parsing? Is the API public or do you have an example that I can try locally?

@grv87
Copy link
Author

grv87 commented Mar 3, 2015

It completes parsing without errors, but incorrectly. Response.Data contains mojibake.

I've put my problematic example here: https://yadi.sk/i/7_Fs3GTyf2HA7 (I hope it is not overcomplicated. If it is --- let me know, I'll make simpler one).

@yadimon
Copy link

yadimon commented Apr 9, 2015

i think its like:
PHP:
echo json_encode("text with symbols like ü")
and vba:
Debug.Print Response.Content
gives
text with symbols like \u00fc

possible solution:
https://gist.github.com/yadimon/ce1d04b88de17064bfae

@grv87
Copy link
Author

grv87 commented Apr 10, 2015

Content is not escaped, it is in plain UTF-8.

@Bazuum
Copy link

Bazuum commented Apr 14, 2015

This function worked for me, give it a try:

Function UTF8_Decode(ByVal sStr As String)
    Dim l As Long, sUTF8 As String, iChar As Integer, iChar2 As Integer
    For l = 1 To Len(sStr)
        iChar = Asc(Mid(sStr, l, 1))
        If iChar > 127 Then
            If Not iChar And 32 Then ' 2 chars
            iChar2 = Asc(Mid(sStr, l + 1, 1))
            sUTF8 = sUTF8 & ChrW$(((31 And iChar) * 64 + (63 And iChar2)))
            l = l + 1
        Else
            Dim iChar3 As Integer
            iChar2 = Asc(Mid(sStr, l + 1, 1))
            iChar3 = Asc(Mid(sStr, l + 2, 1))
            sUTF8 = sUTF8 & ChrW$(((iChar And 15) * 16 * 256) + ((iChar2 And 63) * 64) + (iChar3 And 63))
            l = l + 2
        End If
            Else
            sUTF8 = sUTF8 & Chr$(iChar)
        End If
    Next l
    UTF8_Decode = sUTF8
End Function
  • Add some Error handling...

@grv87
Copy link
Author

grv87 commented Apr 23, 2015

Doesn't work when I use it like this:

For Each kv In Response.Data("rows")
    Value = UTF8_Decode(kv("NAME"))
Next

kv("NAME") is already mojibake -- something like ??N?????N??. Asc(Mid(sStr, l, 1)) gives 63, as is for question mark.

@hjoukl
Copy link

hjoukl commented May 20, 2019

Really cool library!

While I don't really have VBA experience my take on the utf-8-issues is:

The fundamental problem is that when the actual JSON parsing happens, the text data has already been "converted" from the UTF-8 encoded data as returned by the http request to a VBA string.

Looks like the WinHttpRequest doesn't properly decode the data from the response and encode to the VBA "unicode" encoding. You're basically not able to control the decoder encoding used to populate the IWinHttpRequest::ResponseText property.

So it might make sense to rather base the JSON parsing and text access on the ResponseBody byte array,
i.e. both the population of Response.Data as well as Response.Content.

Indeed, I've been able to properly access JSON data using these modifications to WebHelpers:

Public Function ParseByFormat(Value As String, Format As WebFormat, _
    Optional CustomFormat As String = "", Optional Bytes As Variant) As Object

    On Error GoTo web_ErrorHandling

    ' Don't attempt to parse blank values
    If Value = "" And CustomFormat = "" Then
        Exit Function
    End If

    Select Case Format
    Case WebFormat.Json
        ' Set ParseByFormat = ParseJson(Value)
        ' Hm, contrary to what I thought JSON isn't required to be UTF-8-encoded,
        ' see e.g. https://www.ietf.org/rfc/rfc4627.txt
        ' So to do this properly you'd either
        ' - need to respect http header encoding information, if given
        ' - detect UTF-encoding notion by looking at the first bytes, see also
        '   https://www.ietf.org/rfc/rfc4627.txt
        Set ParseByFormat = ParseJson(Utf8BytesToString(Bytes))
    Case WebFormat.FormUrlEncoded
        Set ParseByFormat = ParseUrlEncoded(Value)

The actual conversion function is gracefully taken from here (with the very basic modification of
using Variant instead of Byte array as the function parameter):
https://www.di-mgt.com.au/howto-convert-vba-unicode-to-utf8.html#Utf8BytesToString

''' Maps a character string to a UTF-16 (wide character) string
Private Declare Function MultiByteToWideChar Lib "kernel32" ( _
    ByVal CodePage As Long, _
    ByVal dwFlags As Long, _
    ByVal lpMultiByteStr As Long, _
    ByVal cchMultiByte As Long, _
    ByVal lpWideCharStr As Long, _
    ByVal cchWideChar As Long _
    ) As Long

' CodePage constant for UTF-8
Private Const CP_UTF8 = 65001

''' Gracefully taken from https://www.di-mgt.com.au/howto-convert-vba-unicode-to-utf8.html#Utf8BytesToString
''' Return length of byte array or zero if uninitialized
Private Function BytesLength(abBytes() As Byte) As Long
    ' Trap error if array is uninitialized
    On Error Resume Next
    BytesLength = UBound(abBytes) - LBound(abBytes) + 1
End Function

''' Return VBA "Unicode" string from byte array encoded in UTF-8
'Changed signature to Variant instead of Byte array
'Public Function Utf8BytesToString(abUtf8Array() As Byte) As String
Public Function Utf8BytesToString(Bytes As Variant) As String
    Dim abUtf8Array() As Byte
    Dim nBytes As Long
    Dim nChars As Long
    Dim strOut As String
    
    abUtf8Array = Bytes
    
    Utf8BytesToString = ""
    ' Catch uninitialized input array
    nBytes = BytesLength(abUtf8Array)
    If nBytes <= 0 Then Exit Function
    ' Get number of characters in output string
    nChars = MultiByteToWideChar(CP_UTF8, 0&, VarPtr(abUtf8Array(0)), nBytes, 0&, 0&)
    ' Dimension output buffer to receive string
    strOut = String(nChars, 0)
    nChars = MultiByteToWideChar(CP_UTF8, 0&, VarPtr(abUtf8Array(0)), nBytes, StrPtr(strOut), nChars)
    Utf8BytesToString = Left$(strOut, nChars)
End Function

As denoted in the snippet, there are still problems to consider with this approach:

Caution: I've no idea whatsoever about VBA/Win/Excel/Mac compatibility regarding the conversion function ;-).

@hjoukl
Copy link

hjoukl commented May 20, 2019

Followup: Along the same lines, a rather hacky way to address the Response.Content property by resetting it in case of JSON data:

Private Sub web_LoadValues(web_Headers As String, web_Content As String, web_Body As Variant, web_Request As WebRequest)
    ' Convert content to data by format
    If web_Request.ResponseFormat <> WebFormat.PlainText Then
        On Error Resume Next
        Set Me.Data = _
            WebHelpers.ParseByFormat(web_Content, web_Request.ResponseFormat, web_Request.CustomResponseFormat, web_Body)

        If Err.Number <> 0 Then
            WebHelpers.LogError Err.Description, Err.Source, Err.Number
            Err.Clear
        End If
        On Error GoTo 0
        
        On Error Resume Next
        Select Case web_Request.ResponseFormat
        Case WebFormat.Json
            ' TODO: Cater for other encodings: JSON does not need to be UTF-8
            Me.Content = Utf8BytesToString(web_Body)
        End Select
        If Err.Number <> 0 Then
            WebHelpers.LogError Err.Description, Err.Source, Err.Number
            Err.Clear
        End If
        On Error GoTo 0
    End If
' ...

(please bear with my obvious VBA deficiencies)

@timhall
Copy link
Member

timhall commented May 20, 2019

Thanks for the notes @hjoukl. This issue is still a very high priority, I just haven't been able to do it in a cross-platform manner.

  • My plan is to separate encoding and formatting, since, like you said, you can have JSON without UTF-8 and UTF-8 without JSON.
  • Some libc notes for Mac: Character Set Handling
  • I've accepted an RFC to implement it for Windows behind a feature flag, but haven't had a chance to work on implementing it just yet.

@hjoukl
Copy link

hjoukl commented May 20, 2019

Thanks for your clarification @timhall. I guess the problem wrt to interop is the usage of the "kernel32" function to power the utf-8 decoder, then.

Keep up the good work!

@sokol92
Copy link

sokol92 commented Jun 25, 2019

Thank you very much for your wonderful programs!
Нere is pure VBA code to convert strings to utf8 encoding and back. Correctness on the Windows platform was checked by comparing the results of the Adodb.Stream methods. Performance - approximately 10 Mb per second.
Maybe it will be useful for Mac.

Function VBA_StrToUtf8(p_s As String) As Byte()
    Dim arr() As Byte, l_s As Long, i As Long, n As Long, j As Long, n1 As Long, n2 As Long, n3 As Long
   
    l_s = VBA.Len(p_s)
    ReDim arr(1 To 4 * l_s)
   
    For i = 1 To l_s
        n = VBA.AscW(Mid(p_s, i, 1))
        If n < 0 Then
            n = n + 65536
        End If
              
        If n <= 127 Then ' 1 Byte
            arr(j + 1) = n
            j = j + 1
        Else
            n1 = n Mod 64
            n = n \ 64
            If n <= 31 Then ' 2 Bytes
                arr(j + 1) = 192 + n
                arr(j + 2) = 128 + n1
                j = j + 2
          
            Else
                n2 = n Mod 64
                n = n \ 64
                If n <= 15 Then ' 3 bytes
                    arr(j + 1) = 224 + n
                    arr(j + 2) = 128 + n2
                    arr(j + 3) = 128 + n1
                    j = j + 3
            
                Else            ' 4 bytes
                    n3 = n Mod 64
                    n = n \ 64
                    arr(j + 1) = 240 + n
                    arr(j + 2) = 128 + n3
                    arr(j + 3) = 128 + n2
                    arr(j + 4) = 128 + n1
                    j = j + 4
                End If
            End If
        End If
    Next i
        
    ReDim Preserve arr(1 To j)
    VBA_StrToUtf8 = arr
End Function

Function VBA_Utf8ToStr(p_arr() As Byte) As String
   
    Dim s As String, i As Long, i1 As Long, i2 As Long, j As Long, k As Long, n As Long
    i1 = LBound(p_arr)
    i2 = UBound(p_arr)
   
    If i2 < i1 Then
        Exit Function
    End If
   
    s = String(i2 - i1 + 1, ";")
    i = i1 - 1
   
    Do While i < i2
        n = p_arr(i + 1)
        Select Case n
          
            Case 0 To 127  ' 1 byte
                k = n
                i = i + 1
         
            Case 192 To 223 ' 2 bytes
                k = (n Mod 32) * 64 + p_arr(i + 2) Mod 64
                i = i + 2
         
            Case 224 To 239 ' 3 bytes
                k = (n Mod 16) * 64 * 64 + (p_arr(i + 2) Mod 64) * 64 + (p_arr(i + 3) Mod 64)
                i = i + 3
         
            Case 240 To 247 ' 4 bytes
                k = (n Mod 16) * 64 * 64 * 64 + (p_arr(i + 2) Mod 64) * 64 * 64 + (p_arr(i + 3) Mod 64) * 64 + (p_arr(i + 4) Mod 64)
                i = i + 4
       
            Case Else  ' error
                k = -1
                i = i + 1
        End Select
      
        If k >= 0 Then
            j = j + 1
            Mid$(s, j) = VBA.ChrW(k)
        End If

    Loop
   
    If j >= 1 Then
        VBA_Utf8ToStr = Left$(s, j)
    End If
End Function

@fzumstein
Copy link

@sokol92 I wasn't able to get this to work on macOS, but I may not have tried at the correct place. So adding this info to your post would help to confirm.

The only thing that did work for me on macOS was to use this function on the parsed data (i.e., what you get back from MyResponse.Data, see also https://stackoverflow.com/a/64624336/918626:

Function Utf8ToUtf16(ByVal strText As String) As String
    ' macOs only: apparently, Excel uses UTF-16 to represent string literals
    ' Taken from https://stackoverflow.com/a/64624336/918626
    Dim i&, l1&, l2&, l3&, l4&, l&
    For i = 1 To Len(strText)
        l1 = Asc(Mid(strText, i, 1))
        If i + 1 <= Len(strText) Then l2 = Asc(Mid(strText, i + 1, 1))
        If i + 2 <= Len(strText) Then l3 = Asc(Mid(strText, i + 2, 1))
        If i + 3 <= Len(strText) Then l4 = Asc(Mid(strText, i + 3, 1))
        Select Case l1
        Case 1 To 127
            l = l1
        Case 194 To 223
            l = ((l1 And &H1F) * 2 ^ 6) Or (l2 And &H3F)
            i = i + 1
        Case 224 To 239
            l = ((l1 And &HF) * 2 ^ 12) Or ((l2 And &H3F) * 2 ^ 6) Or (l3 And &H3F)
            i = i + 2
        Case 240 To 255
            l = ((l1 And &H7) * 2 ^ 18) Or ((l2 And &H3F) * 2 ^ 12) Or ((l3 And &H3F) * 2 ^ 6) Or (l4 And &H3F)
            i = i + 4
        Case Else
            l = 63 ' question mark
        End Select
        Utf8ToUtf16 = Utf8ToUtf16 & IIf(l < 55296, WorksheetFunction.Unichar(l), "?")
    Next i
End Function

(for Windows #305 seem to work nicely, but needs minor changes so it's ignored on macOS and works across 32/64bit Excel versions)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants