Non accepted characters with "Modify Fields" for UNICODE-to-STRING conversion

Comments


  • Adrian Williams

    I cannot see your attachment with the error message. Can you please re-post it and, if possible, include a sanitized sample of the data that causes the error?

    When I import some sample data from a .csv file with the FileCharacterSet property set to AutoDetect, the data is imported with a unicode data type. If I then set the Type property for a field in the Modify Fields node, the data is converted to the string data type.

     

  • Stéphane O

    here is the error message

     

     

  • Adrian Williams

    I assume the undefined characters in the value generating the error have acute accents. I created some test data (see attached file) and imported it using the AutoDetect option on the CSV/Delimited File node. The import worked, and the conversion to string type in the Modify Fields node also worked as expected.

    Can you open your source CSV data file in Notepad++ and check the encoding being reported for the file (Notepad++ shows it in the status bar and under the Encoding menu)?

     

    Attached files


    TestData_UTF-8.csv

     

  • Adrian Williams

    It would be helpful to have a small sample of the data that is causing the issue.

    If required, you can use the Submit a request link to open a ticket and upload the data to us.
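
    If the source file is too large to share or to open in an editor, a small sample can be pulled out with a script. The following is a minimal sketch in standalone Python 3.7+ (run outside Analyze), assuming the problem lines are the ones containing bytes outside the ASCII range. The function name `find_non_ascii_lines` and the `limit` parameter are made up for this illustration.

```python
def find_non_ascii_lines(path, limit=10):
    """Yield (line_number, raw_bytes) for lines holding any non-ASCII byte.

    The file is read in binary mode, one line at a time, so nothing is
    decoded and a multi-gigabyte file is never loaded into memory at once.
    """
    found = 0
    with open(path, "rb") as handle:
        for line_number, raw in enumerate(handle, start=1):
            if not raw.isascii():  # bytes.isascii() needs Python 3.7+
                yield line_number, raw.rstrip(b"\r\n")
                found += 1
                if found >= limit:  # stop early once the sample is big enough
                    return
```

    Printing each result with `print(line_number, raw)` shows the raw bytes (for example `\xe9`), which can be pasted into a post after removing anything sensitive.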

  • Stéphane O

    My source is a 3 GB file.

    I've identified these lines:

     

  • Adrian Williams

    In Data3Sixty Analyze, can you add another CSV/Delimited File node to the canvas and configure it to import the data as before? Then switch to the Define tab in the node properties panel, scroll down to the bottom, and add a third output pin to the node.

    When you run the node there will be a single record on the out3 pin which provides details of the results of the auto-detection process. Please post this information to us.

    The characters you indicate in your post are valid in the Windows-1252 code page and the ISO-8859-1 character set, so there should be no issue handling those characters in Analyze within either a unicode data type field or a string type field.
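
    As a quick check of that point, the sketch below (standalone Python 3, not Analyze node code) decodes the single byte 0xE9 with both single-byte encodings and then tries it as UTF-8. The variable names are illustrative only.

```python
raw = b"\xe9"  # the byte used for e-acute in single-byte Western encodings

as_cp1252 = raw.decode("cp1252")       # Windows-1252
as_latin1 = raw.decode("iso-8859-1")   # ISO-8859-1 (Latin-1)

# On its own, 0xE9 is not a complete UTF-8 sequence, so decoding fails.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# In UTF-8 the same character is stored as two bytes.
utf8_bytes = "\u00e9".encode("utf-8")

print(as_cp1252, as_latin1, utf8_ok, utf8_bytes)
```

    Both single-byte decodes give the same character, while the UTF-8 attempt fails. A file containing this byte will therefore only be read correctly if it is treated as Windows-1252 or ISO-8859-1 rather than UTF-8.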

     

    I created a text file with the data that you indicated causes the issue (attached). The view of the data in a hex editor is this:

    The highlighted byte is the first e with the acute accent (0xE9). When this data is viewed in Notepad++ the characters are displayed as expected:

     

    Can you also let us know what locale your machine is configured to use?

     

    Attached files

    Identified_Lines_w_Extended_Chars.txt

     

  • Adrian Williams

    If you want to replace the problematic characters you could use the following in a Transform node:

    out1.field=in1.field.decode("ascii","replace")

    or

    out1.field=in1.field.decode("ascii","ignore")
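
    To illustrate the difference between the two error handlers, here is a minimal sketch in standalone Python 3. It is an assumption that the Transform node's Python environment behaves the same way, and the sample value is made up.

```python
raw = b"caf\xe9"  # made-up sample: "caf" followed by the byte 0xE9

# "replace" substitutes U+FFFD (the Unicode replacement character)
# for each byte that is not valid ASCII.
replaced = raw.decode("ascii", "replace")

# "ignore" silently drops each byte that is not valid ASCII.
ignored = raw.decode("ascii", "ignore")

# Going the other way (text to bytes), "replace" substitutes "?" instead.
encoded = "caf\u00e9".encode("ascii", "replace")

print(repr(replaced), repr(ignored), repr(encoded))
```

    Either handler loses the accented character, so this is a workaround for getting the data through rather than a fix for the underlying encoding mismatch.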

