
Strip HTML

Comments

6 comments

  • Gerard Cafaro

    I use the regex code example from this Stack Overflow answer for stripping HTML: https://stackoverflow.com/a/12982689

    I embedded the import and the function in a Transform node's ConfigureFields property, and I call the function as needed from the same Transform node's ProcessRecords property.
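
For readers who cannot open the link, here is a minimal sketch of the kind of regex-based tag stripper being described. It is plain Python and does not use any Transform node API; the names `TAG_RE` and `strip_html` are hypothetical, and a regex like this is only a rough approach that can misfire on unusual markup.

```python
import re

# Compiled once. The non-greedy .*? stops at the first '>', so each
# tag is matched on its own instead of swallowing the text between tags.
TAG_RE = re.compile(r'<.*?>')

def strip_html(raw_html):
    """Return raw_html with anything that looks like an HTML tag removed."""
    return TAG_RE.sub('', raw_html)

print(strip_html('<p>Hello <b>world</b></p>'))  # Hello world
```

Compiling the pattern at module level mirrors the idea of doing the setup once and reusing it for every record.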

     

  • Geoff

    Oh! You put functions in the config section!? That may be my mistake... Thanks!

  • Gerard Cafaro

    You could put functions in the ProcessRecords property too, but then the function would be re-defined each time a record is processed. The ConfigureFields property is processed only once per node execution, so the function is defined once. Either works, but putting functions in ConfigureFields yields better performance. The same holds true for importing libraries.

    Also, if a function is within ConfigureFields, it would need to be defined prior to its first call. The code is processed sequentially and will throw an error if the function is defined at the bottom of ConfigureFields but called earlier.
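
The two points above (define once, and define before the first call) can be sketched in plain Python. This is only an illustration under assumptions: the "setup" and "per-record" sections stand in for ConfigureFields and ProcessRecords, and `records`, `strip_html`, and `not_yet_defined` are hypothetical names rather than anything from the node's API.

```python
import re

# --- setup section: runs once (stands in for ConfigureFields) ---
TAG_RE = re.compile(r'<.*?>')

def strip_html(text):
    return TAG_RE.sub('', text)

# --- per-record section: runs for every record (stands in for ProcessRecords) ---
records = ['<p>first</p>', '<div>second</div>']
cleaned = [strip_html(body) for body in records]
print(cleaned)  # ['first', 'second']

# Code runs top to bottom, so calling a function before its def
# statement has executed raises NameError.
try:
    not_yet_defined('x')
    call_failed = False
except NameError:
    call_failed = True

def not_yet_defined(text):
    return text

print(call_failed)  # True
```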

  • Geoff

    Unable to process input field (4) 'BODY' as a 'str' type. Error seen on record 2, on input (0) 'in1'.

    hmm... 

     

    EDIT: if I flip it back to Unicode it works... THANKS!!!

  • Gerard Cafaro

    I had a comment half-typed about str vs. unicode, and I see you edited your note with the same findings. Glad to hear you got it working! It's recommended to use unicode across the board when dealing with strings and text.

  • Geoff

    OK... one more. I'm not great with regex stuff. An email body has this in it:

    <!--
    @font-face
    {font-family:"Cambria Math"}
    @font-face
    {font-family:Calibri}
    @font-face
    {font-family:"Segoe UI"}
    p.MsoNormal, li.MsoNormal, div.MsoNormal
    {margin:0in;
    font-size:11.0pt;
    font-family:"Calibri",sans-serif}
    a:link, span.MsoHyperlink
    {color:#0563C1;
    text-decoration:underline}
    .MsoChpDefault
    {font-size:10.0pt}
    @page WordSection1
    {margin:1.0in 1.0in 1.0in 1.0in}
    div.WordSection1
    {}
    -->

    So how can I use re to remove everything between <!-- and -->?

    I tried this:

    cleanr = re.compile('<!--.*?-->')

    It didn't work.
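
A likely reason the pattern above matches nothing: by default `.` does not match newline characters, and the comment in the email spans many lines. Passing the `re.DOTALL` flag lets `.` cross line breaks. The sketch below assumes that is the cause; `COMMENT_RE`, `strip_comments`, and the shortened `sample` string are hypothetical.

```python
import re

# re.DOTALL makes '.' match newlines too, so the non-greedy pattern
# can span a comment that runs over several lines.
COMMENT_RE = re.compile(r'<!--.*?-->', re.DOTALL)

def strip_comments(text):
    """Remove HTML comments, including multi-line ones."""
    return COMMENT_RE.sub('', text)

sample = 'Hello<!--\n@font-face\n{font-family:Calibri}\n-->World'

# Without the flag, the multi-line comment is left untouched.
unchanged = re.sub(r'<!--.*?-->', '', sample)
print(unchanged == sample)     # True

print(strip_comments(sample))  # HelloWorld
```

Running the comment removal before the generic tag stripper keeps the two steps simple to reason about.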

     

     

