I had to clean up HTML generated from an MS Word document by hand. Honestly, it is a pain to manually search for and replace all the junk that Word generates, so I wrote a text conversion function in PHP that automatically cleans up the MS Word junk and outputs HTML entities.
It also helps when users paste from Microsoft Word into a web form. If you would rather not rely on the "Paste from Word" feature of TinyMCE or FCKeditor, which does not seem to work most of the time, this is a simple server-side solution to strip and replace Word formatting for clean HTML output.
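<?php
/**
 * Remove empty elements left behind by Word and convert
 * non-ASCII characters to HTML entities.
 */
function word_cleanup($str)
{
    // Single quotes keep \1 as a regex backreference; inside double quotes
    // PHP would read it as an octal escape. The &nbsp; alternative is assumed
    // here so that elements such as <p>&nbsp;</p> are removed too.
    $pattern = '/<(\w+)>(?:\s|&nbsp;)*<\/\1>/';
    $str = preg_replace($pattern, '', $str);
    // Note: the 'HTML-ENTITIES' target is deprecated as of PHP 8.2.
    return mb_convert_encoding($str, 'HTML-ENTITIES', 'UTF-8');
}
?>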
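The function above only drops empty elements and encodes special characters. Word output usually carries more than that, so here is a rough sketch of a pre-pass that could run before it. The function name strip_word_markup and the three patterns are my own assumptions for illustration, not part of the original function, and they only cover a few common cases (conditional comments, Office namespace tags such as <o:p>, and Mso* class attributes).

```php
<?php
// Hypothetical helper (name and patterns are illustrative assumptions):
// strips a few common pieces of Word-specific markup from an HTML string.
function strip_word_markup($str)
{
    // Conditional comments, e.g. <!--[if gte mso 9]> ... <![endif]-->
    $str = preg_replace('/<!--\[if[^\]]*\]>.*?<!\[endif\]-->/is', '', $str);

    // Office namespace tags, e.g. <o:p> and </o:p> (the text inside is kept)
    $str = preg_replace('/<\/?[a-z]+:[a-z0-9]+[^>]*>/i', '', $str);

    // Word class attributes, e.g. class="MsoNormal"
    $str = preg_replace('/\s+class=(["\']?)Mso\w*\1/i', '', $str);

    return $str;
}

$html = '<!--[if gte mso 9]><xml></xml><![endif]-->'
      . '<p class="MsoNormal">Hello<o:p></o:p></p>';

echo strip_word_markup($html); // prints <p>Hello</p>
```

The result can then be passed to word_cleanup() to remove any elements the pre-pass left empty. Regular expressions are a blunt tool for HTML, so for anything beyond pasted snippets a real HTML parser or sanitizer is probably the safer choice.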