Microsoft Word Document HTML Cleanup in PHP


I had to clean up HTML exported from an MS Word document by hand.  Honestly, it is a pain to manually search for and replace all the junk that Word generates.  So I have written a text conversion function in PHP that automatically cleans up the MS Word junk and outputs HTML entities.

It also helps when pasting from Microsoft Word into a web form. If you would rather not rely on the "Paste From Word" feature in TinyMCE or FCKEditor, which does not seem to work most of the time, this is a simple server-side solution to strip and replace Word formatting for clean HTML output.


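<?php
function word_cleanup($str)
{
    // Remove elements that contain only whitespace or &nbsp;, e.g. <p>&nbsp;</p>.
    // Single quotes keep \1 as a regex backreference; in double quotes PHP
    // would read it as an octal escape.
    $pattern = '/<(\w+)>(\s|&nbsp;)*<\/\1>/';
    $str = preg_replace($pattern, '', $str);

    // Convert non-ASCII characters (curly quotes, dashes) to HTML entities.
    // Note: the 'HTML-ENTITIES' encoding is deprecated as of PHP 8.2.
    return mb_convert_encoding($str, 'HTML-ENTITIES', 'UTF-8');
}
?>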
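
Here is a rough usage sketch that applies the same two steps to a made-up fragment of Word-style HTML. The sample string and the name word_cleanup_sketch are my own assumptions for illustration. It also swaps in mb_encode_numericentity() for the 'HTML-ENTITIES' conversion, so the entities come out numeric rather than named, and it needs the mbstring extension and PHP 7 or later.

```php
<?php
// Hypothetical variant of word_cleanup(), for illustration only.
function word_cleanup_sketch(string $str): string
{
    // Drop elements that contain only whitespace or &nbsp;, e.g. <p>&nbsp;</p>.
    // Fall back to the input if preg_replace() reports an error (null).
    $str = preg_replace('/<(\w+)>(\s|&nbsp;)*<\/\1>/', '', $str) ?? $str;

    // Encode every code point from U+0080 upward as a decimal numeric entity.
    // Map layout: [range start, range end, offset, mask].
    $map = [0x80, 0x10FFFF, 0, 0xFFFFFF];
    return mb_encode_numericentity($str, $map, 'UTF-8');
}

// Made-up sample: an empty paragraph plus curly quotes pasted from Word.
$dirty = "<p>&nbsp;</p><p>\u{201C}Hello\u{201D}</p>";
echo word_cleanup_sketch($dirty), "\n";
// prints: <p>&#8220;Hello&#8221;</p>
```

The empty paragraph disappears and the curly quotes become entities, while the tags themselves are left alone. That is why I avoid htmlentities() here: it would also escape the angle brackets.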

Remove junk from a Word file being parsed by PHP

Have you tried to search for a word in an MS Word file? I tried to use your parser with little success. (There was no error using it ;-)) It just didn't remove the header and footer of the document.