This regular expression stuff is pretty cool

I had to give up using The Regulator, that regular expression test bench. Being a .NET thing, its regexes are almost Perl-compatible, in almost the same way PHP's regex functions are almost Perl-compatible. What I worked out with The Regulator was seriously close, and I got there much faster than I would have without such a learning environment.

I'm glad it was free though, because I don't need it anymore. My PHP (and Perl and Python, if I ever have to) IDE is Komodo, which has a nice debugger that I can use to test these beasties now that I have a grip on them. Best of all, I now have the same open source regex engine as in PHP 4 in the form of a Windows DLL, so what I learn here will be directly applicable to my Windows stuff.

The point of all this is data scrubbing. Older incarnations of P6 had all manner of embedded CSS that, laid on top of the current CSS, renders major sections of the text illegible. And I have graphics scattered between three directories and some storage space at Earthlink.net. By the time I realized it, regenerating the whole site was simply not worth dealing with my web host's complaints about hogging the CPU. With the move to Drupal, which generates pages dynamically, I've decided it's an ideal time to fix stuff. So I'm replacing all the offending styled <div> tags with simple <blockquote> tags. I'm also scanning the whole site for image tags. I want to identify all the images that live in spaces that belong to me, move them someplace sensibly organized and rewrite the tags. Finally, I get to tune my .htaccess file.
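
As a rough illustration of that kind of scrubbing (not the actual script used here), a sketch in Python might look like the following. The sample markup, the patterns, and the function names are all assumptions made up for the example, and the <div> pattern only copes with styled <div>s that do not contain other <div>s.

```python
import re

# Hypothetical sample of old markup: a <div> with an inline style, plus an image tag.
SAMPLE = (
    '<p>Intro</p>'
    '<div style="color: #333; margin: 1em">Quoted text</div>'
    '<img src="/images/a.gif" alt="a">'
)

# A <div> that carries a style attribute, its body, and the closing tag.
# Assumption: styled <div>s are not nested inside one another.
STYLED_DIV = re.compile(
    r'<div\b[^>]*\bstyle\s*=[^>]*>(.*?)</div>',
    re.IGNORECASE | re.DOTALL,
)

# The src value of an <img> tag, in single or double quotes.
IMG_SRC = re.compile(
    r'<img\b[^>]*?\bsrc\s*=\s*["\']([^"\']+)["\']',
    re.IGNORECASE,
)


def divs_to_blockquotes(html):
    """Replace each styled <div>...</div> with a plain <blockquote>."""
    return STYLED_DIV.sub(r'<blockquote>\1</blockquote>', html)


def image_sources(html):
    """Return the src of every <img> tag found, in document order."""
    return IMG_SRC.findall(html)


if __name__ == "__main__":
    print(divs_to_blockquotes(SAMPLE))
    print(image_sources(SAMPLE))
```

The list returned by image_sources is what you would check against your own directories before moving files and rewriting the tags.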

Which leads me to a slightly contrarian position.

At first, single-post archives under Movable Type 2.x were named [message no.].html or some such. Someone figured out that dirify thing and suddenly we all have files named the_title_of_the_post.php or some such. I'm not sure that was an improvement.

Say after I post this message I decide the title should have been "These regular expressions are pretty cool." Before the dirify trick, the original file is replaced. After the dirify trick, a new file with a name constructed from the new title is created. The old file is still there though, just orphaned. So if you've ever edited a title in your blog posts, you've got these ghost files lying about.
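
To make the ghost-file problem concrete, here is a small Python sketch. The dirify function below is only an approximation written for this example, not Movable Type's actual code, and the .php extension is assumed from the file names mentioned above.

```python
import re


def dirify(title):
    """Approximate a dirify-style file name: lowercase, no punctuation, underscores."""
    slug = title.lower()
    slug = re.sub(r'[^a-z0-9\s_]', '', slug)   # drop punctuation
    slug = re.sub(r'\s+', '_', slug.strip())   # runs of whitespace become one underscore
    return slug


old_name = dirify("This regular expression stuff is pretty cool") + ".php"
new_name = dirify("These regular expressions are pretty cool.") + ".php"

print(old_name)  # this_regular_expression_stuff_is_pretty_cool.php
print(new_name)  # these_regular_expressions_are_pretty_cool.php

# The names differ, so a rebuild writes a second file and leaves the first one orphaned.
print(old_name != new_name)  # True
```

With the older numeric naming, both versions of the post would map to the same file name, so the edit would simply overwrite it.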

Plus setting up the redirects will be no joke.

Posted by Prometheus 6 on August 3, 2004 - 3:18pm :: Tech
 
 


Some have done custom 404 scripts that look directly into the MT data store (much easier if you are using MySQL) to generate the new url from the post id. But that sounds like work ;-)

Posted by  Pat (not verified) on August 3, 2004 - 5:08pm.

Plus, what if you have two posts on the same day with similar names, where the words used in the URL are all the same? It screws everything right up.

Posted by  drublood (not verified) on August 4, 2004 - 8:28am.

This is why I wouldn't go near dirify. Besides, those truncated post titles look garbled and silly. Like the text of a Bush speech.

Posted by  Waveflux (not verified) on August 4, 2004 - 10:49am.

You'll note my titles aren't truncated for that very reason: I do several Title Part 1, Title Part 2 things per month.

I know how the redirects can be handled. There are adjustments I can make during my data cleanup phase. Drupal lets you create aliases for any post so the trick is to set up dirify-ed aliases. The same sort of script I'm using to clean up the styled <div>s can handle setting up the aliases.

Posted by  P6 (not verified) on August 4, 2004 - 1:05pm.