I've had a persistent problem: I need to handle large blobs of HTML and perform batch operations on them, such as updating all of the image links or reaching into various elements to update their attributes. There are several common ways of doing this, but I would hardly call them easy or effective given the amount of work involved. Regular expressions made things easier and faster, but they still weren't 100% effective at finding and changing elements and attributes, because real-world HTML has too many irregularities (broken tags, for example).
My main problem has been with "WYSIWYG" content-manager utilities and components, which let you edit HTML within a textbox on a live website. All of the products I have tried (including my current one) have pathing issues. Whether the cause is instantiating the control outside the root of the website, or running on Windows Server 2003 where the control is incompatible with the IIS 6 metabase, there is always a problem when the control tries to convert your local path into a URL. I have yet to find a control with an override property that lets you simply specify the URL you want inserted for images, regardless of the local path you used to upload and save them.
My solution cost some research time, but I found that by referencing the Microsoft.mshtml assembly in .NET, you can load an HTML fragment or document and work with it as an object. Microsoft.mshtml is a little-known interop assembly that wraps mshtml.dll, the HTML parsing and rendering engine that ships with Internet Explorer (the documentation I found refers to IE 5.5). According to Microsoft, it provides a rich set of text editing and Web authoring features, enabling host applications to support a fully WYSIWYG HTML editing experience. That description is about the only useful information I could find on it.
Some caveats to using Microsoft.mshtml:
- You must write "using mshtml;" rather than "using Microsoft.mshtml;" (you would of course expect the latter), because the namespace inside the assembly is simply mshtml.
- The .DLL is about 7 MB and is normally NOT copied local, so you must make sure to deploy it with your project. I didn't find it readily available on Windows Server 2003.
- The size of the .DLL alone may scare some people away from deploying it. Of course, you only need to copy it to the target server once.
Currently I use mshtml to rewrite the image paths in the content that comes out of my WYSIWYG editor each time it is saved. For some reason the editor always wants to re-path the images two levels higher, adding that as a prefix to every image path, so I must make sure all image paths are reconstructed and correct every time the content is saved.
Also, when I store content for e-mail jobs, all image links must be full URLs rather than local paths, so I rewrite the pathing each time a dynamic email is saved. This technique comes in very handy, because to date I have not seen any editor control with an override that allows URL pathing instead of local pathing. It surprises me that these companies do not consider that their clients may use the control for email campaigns as well.
Here is a sample method utilizing the mshtml assembly:
    // Requires a reference to Microsoft.mshtml and "using mshtml;" at the top of the file.
    public static string RewriteImagesURLForContent(string markup, string newURL)
    {
        const string attributeMatch = "src";

        if (!string.IsNullOrEmpty(markup))
        {
            // Read the HTML into a document object to enable parsing.
            IHTMLDocument2 doc = new HTMLDocumentClass();
            doc.write(new object[] { markup });
            doc.close();

            // Loop through every image in the document.
            // To visit every element instead, iterate over doc.body.all.
            foreach (IHTMLElement el in (IHTMLElementCollection)doc.images)
            {
                // Get the current value of the attribute; skip the image
                // if there is no value to work with.
                object src = el.getAttribute(attributeMatch, 0);
                if (src == null)
                {
                    continue;
                }

                // Keep only the image's file name (everything after the last "/").
                string imgAttribute = src.ToString();
                string imgName = imgAttribute.Substring(imgAttribute.LastIndexOf("/") + 1);

                // Set the attribute to the URL/path we need for linking.
                el.setAttribute(attributeMatch, newURL + imgName, 0);
            }

            // Note: outerHTML includes the enclosing body tags.
            markup = doc.body.outerHTML;
        }

        return markup;
    }
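
The path handling inside the loop is plain string work, so it can be tried out without mshtml at all. Here is a minimal sketch of just that step. The names ImagePathSketch and RebaseImagePath are hypothetical (they are not part of mshtml or of my method above), and treating a backslash as a separator is my assumption for local Windows-style paths. The method above only splits on "/".

```csharp
using System;

// Hypothetical helper, not part of mshtml: isolates the string handling
// used when re-pathing a single image.
public static class ImagePathSketch
{
    // Keeps only the file name from a src value and prepends baseUrl.
    // Assumption: both '/' and '\' count as separators, so local
    // Windows-style paths are handled as well as URLs.
    public static string RebaseImagePath(string src, string baseUrl)
    {
        if (string.IsNullOrEmpty(src))
        {
            return src;
        }

        // LastIndexOfAny returns -1 when no separator is found,
        // so Substring(0) keeps the whole value as the file name.
        int lastSeparator = src.LastIndexOfAny(new[] { '/', '\\' });
        string fileName = src.Substring(lastSeparator + 1);

        return baseUrl + fileName;
    }
}
```

For example, calling ImagePathSketch.RebaseImagePath("../../images/logo.gif", "http://www.example.com/images/") gives back "http://www.example.com/images/logo.gif". The example.com URL is only a placeholder.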