How To: Automated Feeds:
Very Advanced Text Feed from Multiple Directories with
File Name Filters and a Regular Expression Search
I'm going to assume that you already know how to use RSS DreamFeeder fairly
well. If you are not already familiar with it then I would recommend that you
first go through one of the other tutorials. If you're ready to proceed then
we should talk about the scenario you are going to be working with.
The feed you're going to build is a promotional feed of press releases for
both products and public relations. It will be an Automated RSS text feed,
and you will use the Advanced interface to build it.
In this scenario ACME has two groups that release content for the press. The
first is Product Management, who issue press releases related to new and updated
products. The second is Public Relations, who issue press content for promotional
purposes, including both press releases and press notes. They both have their
own directories which contain this content, in 04_ProductManagement/releases
and 06_PublicRelations/content respectively. A further examination of the PR/content
directory reveals that PR has mixed both releases and notes in the same directory.
If you are to build a feed with just press releases then you will need to filter
out the notes.

If you open an example release from both pm/releases and pr/content you'll
find that they look very similar and that key areas are the same. The template
region has the same name (PageContent), and the headline uses the same tag (H3)
with no style (the style is overriding the normal display properties of the
H3 tag).


As I have said before, consistency in design is a key element of effectively
conveying information -- and templates and style sheets let you do that well.
Templates control page structure and style sheets control the graphical presentation
of content. These two tools allow content to be restricted in placement within
the document (template regions) and adherent to a predefined visual order (style
selection - also called classing).
Moreover, classing for styles (or dropping content in a template region) is also
tagging with metadata. If a headline is classed HomePageHeadline because the style
sheet says so, then we can reverse that relationship and say that any text/data
with the class HomePageHeadline is a headline. What something looks like tells
you what it is, and what it is determines what it looks like.
You should also notice the names of the files. All releases are called WHATEVERRelease.html.
The PM folks and the PR folks have slightly different naming conventions, but
they both agree on calling a release Release.html. Enforcing naming conventions,
especially on a very large website, can be difficult, but it is absolutely
worthwhile because you can then learn a good deal about the content before the
file is even opened. No serious attempt at a large-scale website should forgo
naming conventions -- you'll suffer for it.
So start a new feed by pressing the new feed button in the RSS DreamFeeder
floating panel.

When the dialog is displayed you are presented with the basic
interface, so click on the Advanced tab.

The first panel of the Advanced interface provides fields for descriptive
content for the feed. The only required fields are Title and Description. Now
go to Feed Settings from the Category list.

In Feed Settings you will decide what type of feed you are building (Text
Feed) and what file format to use (RSS 2.0). Then tell it to collect content
from Files and to have your own computer do the work of updating the feed
(Local Processing). Next, provide the Site Settings.

Under Site Settings you'll give RSS DreamFeeder the Base URL that it will
use to translate the local links to full URLs for the feed. Once entered move
on to Summarize.
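To picture what that translation involves, here is a minimal Python sketch that
joins a site-relative path onto a Base URL. It is only an illustration: the
function name to_full_url and the address http://www.example.com are placeholders
of mine, and I am not claiming this is how RSS DreamFeeder does it internally.

```python
from urllib.parse import urljoin


def to_full_url(base_url: str, local_path: str) -> str:
    """Turn a site-relative path into a full URL for the feed."""
    # Normalise Windows separators and drop a leading slash so the path
    # is resolved relative to base_url rather than to the host root.
    relative = local_path.replace("\\", "/").lstrip("/")
    # Without a trailing slash, urljoin would replace the last path
    # segment of the base instead of appending to it.
    if not base_url.endswith("/"):
        base_url += "/"
    return urljoin(base_url, relative)


if __name__ == "__main__":
    # http://www.example.com is a placeholder Base URL.
    print(to_full_url("http://www.example.com", "pr/content/090528TastyRelease.html"))
```

The trailing-slash check matters more than it looks: a Base URL entered without
one would otherwise lose its last path segment when joined.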

Under Summarize you will define where the files reside within the website
that you want RSS DreamFeeder to extract content from. You'll be extracting
content from two directories so select Directories and then use the plus button
on the right to add pm/releases and pr/content to the list of directories.
Then tell it that the file names end with Release.html so that RSS DreamFeeder
will only grab the files that match the naming convention.
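If it helps to see the idea outside the dialog, the following Python sketch
gathers files from several directories and keeps only those whose names end
with Release.html. The function find_releases is a name I made up for the
example; it mimics the effect of the settings above and is not RSS DreamFeeder's
own code.

```python
from pathlib import Path


def find_releases(site_root, directories, suffix="Release.html"):
    """Return files in the given directories whose names end with suffix."""
    matches = []
    for directory in directories:
        folder = Path(site_root) / directory
        if not folder.is_dir():
            continue  # skip directories that do not exist
        # Only the directory itself is scanned, not its subdirectories.
        for path in sorted(folder.iterdir()):
            if path.is_file() and path.name.endswith(suffix):
                matches.append(path)
    return matches


if __name__ == "__main__":
    # Run from the site root; the directory names are the two from this scenario.
    for release in find_releases(".", ["pm/releases", "pr/content"]):
        print(release)
```

Because the test is on the end of the file name, the press notes that PR keeps
in the same directory are skipped without ever being opened.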



Under Elements you can decide which elements of the feed you are going to
include. But in this case we're going to stick to the basic set.

Now launch the Content Sampler and sample the Headline and the Story (the whole
PageContent template region). Then press the Done button to return to the Edit
dialog.

When you return to the Edit dialog you'll come back to the same Elements panel.
Now go to the panel for defining content extraction for Headline. You'll see
that it already has the H3 tag defined, but I like to be more precise when I
can. That way, if the page changes and another H3 appears before this one (in
the template, for example), these settings will still work. The headline was
within the story, so restrict the location to "Within the Story".
Now move on to Story.
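To see why the "Within the Story" restriction is worth the extra click, here is
a rough Python sketch of the same idea: find the PageContent region first, then
take the first H3 inside it. I am assuming the region is marked with
Dreamweaver's InstanceBeginEditable / InstanceEndEditable comments, and the
sample markup and headline text are invented for the example.

```python
import re

# Assumed region markers: Dreamweaver template instance comments.
REGION = re.compile(
    r'<!--\s*InstanceBeginEditable\s+name="PageContent"\s*-->'
    r"(.*?)"
    r"<!--\s*InstanceEndEditable\s*-->",
    re.DOTALL | re.IGNORECASE,
)
H3 = re.compile(r"<h3\b[^>]*>(.*?)</h3>", re.DOTALL | re.IGNORECASE)
TAGS = re.compile(r"<[^>]+>")


def headline_within_story(html):
    """Return the text of the first H3 inside PageContent, or None."""
    story = REGION.search(html)
    if story is None:
        return None
    h3 = H3.search(story.group(1))
    if h3 is None:
        return None
    # Strip any inline tags nested inside the headline.
    return TAGS.sub("", h3.group(1)).strip()


# Invented sample page: one H3 outside the story, one inside it.
SAMPLE = """
<h3>Site navigation</h3>
<!-- InstanceBeginEditable name="PageContent" -->
<h3 class="headline">ACME Launches New Snack</h3>
<p>May 27, 2009</p>
<!-- InstanceEndEditable -->
"""

if __name__ == "__main__":
    print(headline_within_story(SAMPLE))
```

A search for just "the first H3 on the page" would have picked up the
navigation heading here; limiting the search to the story avoids that.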

The Story settings from the Content Sampler are perfect so we will leave them
alone. On to Link.

Link's default setting is to use the location of the current page that we
are extracting from. That is exactly what you want to do here -- have the link
point back to the original file that RSS DreamFeeder is pulling content from.
So don't change that either. On to Date.

Date defaults to the Current Page's Modification Date, which is useful,
but not really what we're after. If a page is modified, even for something
as simple as fixing a typo, the modification date will no longer reflect when
the release was issued. So the right answer is to extract the date from the
dateline text in the document. This is where a Dreamweaver datestamp would have
proven useful (it is an option in the Match Type popup menu), but the authors
of the page didn't provide one. That leaves one final option: an advanced text
search called a Regular Expression. There may be multiple dates on a page, so
to be sure to find the right one, look for the first one after the Headline.
You want to match the dateline string that looks like
May 27, 2009
WORD-SPACE-NUMBER-COMMA-SPACE-TWOTHOUSANDSOMETHING
In regular expressions there are special strings that stand for a kind of
character: a word character (\w), meaning a letter, digit, or underscore; a
whitespace character (\s); a digit character (\d). These strings are usually
then modified to indicate how many characters to include: zero or more (*);
one or more (+); zero or one [maybe there, maybe not] (?). Any characters that
are not these characters (plus some others) are what they are: a means an a;
R means an R; 2 means a 2. Of course this is just the tip of the regular
expression iceberg and you can learn lots more about it in the documentation
or by searching online. Regular expressions are one of the most powerful text
manipulation tools you can use, and they are part of RSS DreamFeeder.
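If you want to poke at these building blocks one at a time, a few lines of
Python will do. This is just a sketch using Python's re module, whose syntax for
these particular tokens matches what is described here; I am assuming, not
asserting, that RSS DreamFeeder's engine treats them the same way. The helper
name matches_whole is mine.

```python
import re


def matches_whole(pattern, text):
    """True when the pattern accounts for the entire text."""
    return re.fullmatch(pattern, text) is not None


if __name__ == "__main__":
    checks = [
        (r"\w+", "May"),  # one or more word characters
        (r"\d+", "27"),   # one or more digits
        (r"\d+", "May"),  # letters are not digits
        (r"\s*", ""),     # zero or more also accepts nothing at all
        (r",?", ","),     # optional comma, present
        (r",?", ""),      # optional comma, absent
        (r"R2", "R2"),    # ordinary characters match themselves
    ]
    for pattern, text in checks:
        print(pattern, repr(text), matches_whole(pattern, text))
```

Only the third check fails; every other pair is a complete match.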
So back to the match. To match something that looks like this
May 27, 2009
WORD-SPACE-NUMBER-COMMA(maybe)-SPACE-TWOTHOUSANDSOMETHING
\w+ \s+ \d+ ,? \s+ 20\d+
\w+\s+\d+,?\s+20\d+

The last line above is the final regular expression you want to use. If for
any reason it doesn't match, the modification date of the file (the original
default) will be used instead. With something like this, where a small typo
can easily get you into a world of trouble, you have got to test it out. Point
the Test tool at one of your press releases (I used 090528TastyRelease.html
from pr/content) and give it a shot.
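You can also sanity-check the expression outside RSS DreamFeeder. The Python
sketch below imitates the behavior described here: take the first dateline
after the headline and, failing that, fall back to the file's modification
date. The function feed_date and the choice of the closing H3 tag as the "after
the Headline" marker are my assumptions for the example, not the product's
implementation.

```python
import os
import re
from datetime import datetime, timezone

DATELINE = re.compile(r"\w+\s+\d+,?\s+20\d+")
END_OF_HEADLINE = re.compile(r"</h3\s*>", re.IGNORECASE)


def feed_date(html, path):
    """First dateline after the headline, else the file's modification date."""
    headline_end = END_OF_HEADLINE.search(html)
    # Start searching after the headline; with no headline, search everything.
    start = headline_end.end() if headline_end else 0
    match = DATELINE.search(html, start)
    if match:
        return match.group(0)
    # Fallback: the modification date, formatted like a dateline.
    modified = datetime.fromtimestamp(os.path.getmtime(path), tz=timezone.utc)
    return modified.strftime("%B %d, %Y")


if __name__ == "__main__":
    for text in ["May 27, 2009", "May 27 2009", "May 27, 1999", "27 May 2009"]:
        found = DATELINE.search(text)
        print(repr(text), "->", found.group(0) if found else None)
```

The first two strings match, with and without the comma. The last two do not:
1999 fails the 20 prefix, and a day-first date does not fit the
WORD-SPACE-NUMBER order. Keep in mind that \w also accepts digits, so the
expression is looser than "month name" suggests.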

Now on to the Author element. In this scenario the Author should always be
the same thing (a Fixed Value): the text "ACME F&N".
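One detail worth knowing about that value: an ampersand cannot appear raw in
XML, so in the feed file it has to be written as &amp;. The sketch below builds
a single RSS 2.0 item with Python's standard library to show the escaping. The
function build_item, the sample headline, and the example.com link are invented
for illustration; I am not claiming this is the exact markup RSS DreamFeeder
writes.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime


def build_item(title, link, description, published, author="ACME F&N"):
    """Serialise one RSS 2.0 <item> as a string."""
    item = ET.Element("item")
    fields = [
        ("title", title),
        ("link", link),
        ("description", description),
        # RSS 2.0 dates use the RFC 822 style, e.g. "Wed, 27 May 2009 ...".
        ("pubDate", format_datetime(published, usegmt=True)),
        ("author", author),
    ]
    for tag, text in fields:
        # ElementTree escapes &, < and > in text when serialising.
        ET.SubElement(item, tag).text = text
    return ET.tostring(item, encoding="unicode")


if __name__ == "__main__":
    print(build_item(
        "Example headline",  # placeholder values throughout
        "http://www.example.com/pr/content/090528TastyRelease.html",
        "Example story text.",
        datetime(2009, 5, 27, tzinfo=timezone.utc),
    ))
```

A news reader turns &amp; back into & when it displays the author, so
subscribers still see "ACME F&N".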

The configuration is now complete so press Save and save it as releases.rss
in the root directory.

You can see that there are 5 files to check -- more than in either directory
alone. That means that it is finding content to collect in both directories.

Process the feed and try it in your news reader.


Congratulations -- you have created a very advanced RSS feed.