27-Feb-2007, 04:29 AM   #1
Sangeetha
Regex to match certain HTML attributes

Regexes are sometimes quite challenging. I've been banging my head on this table for hours now and want to stop. Please help me get rid of this headache!

I need to remove all style and class attributes in an HTML file whilst leaving all other attributes untouched. I just need the regex for this - I've written a generic filter that uses the Regex, but I just can't seem to get this one to work (I'm failing to get the regex to ignore other attributes between the tag and the style=...).

Given the following HTML (which came from pasting from the truly awful MS Werd - I really couldn't invent this rubbish if I tried!):

<H1 style="MARGIN: 0cm 0cm 0pt"><FONT color=#000000>blah blah<SPAN style="mso-spacerun: yes">&nbsp; </SPAN></font></H1>
<P class=MsoNormal style="MARGIN: 0cm 0cm 0pt; TEXT-ALIGN: justify"><?xml:namespace prefix = st1 ns = "urn:schemas-microsoft-com:office:smarttags" /><st1:PlaceName w:st="on"><SPAN style="FONT-SIZE: 10pt; COLOR: #ff9900; FONT-FAMILY: 'Century Gothic'">blah blah</SPAN></st1:PlaceName><SPAN style="FONT-SIZE: 10pt; COLOR: #ff9900; FONT-FAMILY: 'Century Gothic'">

I need just the regex and the replacement string. It should:
- match (so that it can remove) style and class attributes
- work with and without quotes around the value - note that 'Century Gothic' is wrapped in single quotes inside a double-quoted value
- assume the attribute quotes are "double" (or missing)
- allow the attributes to appear in *any* order in the tag
- leave all other attributes and tags in situ
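
To make the requirements above concrete, here is a rough sketch of one pattern that seems to satisfy them, written in Python only for illustration (the regex engine actually in use is not stated, and the names `STYLE_OR_CLASS` and `strip_style_and_class` are made up for this example). The replacement string is simply empty. It assumes values are double-quoted or unquoted, and that no `>` or `<` appears inside any attribute value; it is not a substitute for a real HTML parser.

```python
import re

# Hypothetical pattern: one style/class attribute, with its leading whitespace.
STYLE_OR_CLASS = re.compile(
    r"""\s+(?:style|class)      # whitespace, then the attribute name
        \s*=\s*                 # equals sign, optional spaces around it
        (?:"[^"]*"              # double-quoted value (may contain ' inside)
          |[^\s"'>]+)           # or an unquoted value
        (?=[^<>]*>)             # only if the next angle bracket is a closing >
    """,
    re.IGNORECASE | re.VERBOSE,
)

def strip_style_and_class(html: str) -> str:
    """Remove style= and class= attributes, leaving everything else as-is."""
    return STYLE_OR_CLASS.sub("", html)

sample = '<P class=MsoNormal style="MARGIN: 0cm 0cm 0pt"><FONT color=#000000>blah</FONT></P>'
print(strip_style_and_class(sample))
# -> <P><FONT color=#000000>blah</FONT></P>
```

The lookahead is what keeps the match inside a tag: text such as `a class=foo` between tags is followed by `<` before any `>`, so it is left alone. Because each attribute is matched on its own, the order of attributes within the tag does not matter.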

I've other regexes that clean the rest of the vomit - at least ten of them!

For a bonus, if anyone has the name of the idiot who created the Werd HTML engine... I'd just love to write to his/her mother and tell her how her child is messing with people's heads.