Strip HTML
I have a flow that processes information from emails. One of the fields is the BODY, which is all HTML formatted. I need to strip the HTML out of it so I have just the text.
It's not really simple HTML, so I'm not sure it's possible...
Thanks!
-
I use the regex code example from this Stack Overflow answer for stripping HTML: https://stackoverflow.com/a/12982689
I embedded the import and the function within a Transform node's ConfigureFields property, and then call the function as needed from the same Transform node's ProcessRecords property.
-
You could put functions in the ProcessRecords property too, but then the function would get redefined each time a record is processed. The ConfigureFields property is only processed once per node execution, so it defines the function once. Either works, but putting functions in ConfigureFields would yield better performance. The same holds true for importing libraries.
Also, if a function is within ConfigureFields, it would need to be defined prior to its first call. The code is processed sequentially and will throw an error if the function is defined at the bottom of ConfigureFields but called earlier.
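A minimal sketch of the define-once, define-before-use pattern described above, in plain Python. The function name cleanhtml and the sample string are illustrative assumptions, and the comments only indicate where each part would sit in the node; no node-specific API is shown.

```python
# Would sit in ConfigureFields: runs once per node execution.
import re

# Compile once so the pattern is not rebuilt for every record.
TAG_RE = re.compile(r'<.*?>')

def cleanhtml(raw_html):
    """Return raw_html with anything that looks like a tag removed."""
    return TAG_RE.sub('', raw_html)

# Would sit in ProcessRecords: runs per record, after the definition above.
body = '<p>Hello <b>world</b></p>'  # hypothetical stand-in for the BODY field
text = cleanhtml(body)
```

Because the import, the compiled pattern, and the function all appear before the first call, the sequential-processing issue mentioned above does not arise.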
-
OK, one more. I'm not great with regex. An email has this in it:
<!--
@font-face
{font-family:"Cambria Math"}
@font-face
{font-family:Calibri}
@font-face
{font-family:"Segoe UI"}
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
font-size:11.0pt;
font-family:"Calibri",sans-serif}
a:link, span.MsoHyperlink
{color:#0563C1;
text-decoration:underline}
.MsoChpDefault
{font-size:10.0pt}
@page WordSection1
{margin:1.0in 1.0in 1.0in 1.0in}
div.WordSection1
{}
-->
So how can I use re to remove everything between <!-- and -->?
I tried this:
cleanr = re.compile('<!--.*?-->')
It didn't work.
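-
The pattern itself looks reasonable. The likely problem is that . does not match newline characters by default, and the comment block spans many lines. A sketch, assuming the body is held in a single Python string: passing re.DOTALL lets . cross line breaks. The names COMMENT_RE, TAG_RE and strip_html, and the sample text, are illustrative.

```python
import re

# re.DOTALL makes '.' match newlines too, so a multi-line comment is matched.
COMMENT_RE = re.compile(r'<!--.*?-->', re.DOTALL)
# Same flag for tags, in case a tag's attributes wrap onto another line.
TAG_RE = re.compile(r'<.*?>', re.DOTALL)

def strip_html(raw_html):
    """Remove HTML comments first, then any remaining tags."""
    without_comments = COMMENT_RE.sub('', raw_html)
    return TAG_RE.sub('', without_comments)

# Hypothetical sample resembling the email body above.
sample = '<p>Hello</p><!--\n@font-face\n{font-family:Calibri}\n-->world'
result = strip_html(sample)
```

Comments are removed before tags so that a stray > inside the CSS cannot end a tag match early. Regex stripping is still a heuristic; for very messy email HTML, a proper HTML parser may be more reliable.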