Generating a Site Map for OnSmalltalk

OK, so any website that wants to be indexed well by Google (and those other guys) should be generating an XML sitemap for the search engines to index. A sitemap is nothing fancy, though it can get more complex if you choose to take advantage of more of its features; I prefer a simple version with everything marked as updated weekly.

I also prefer to invoke the generation of the sitemap manually and to generate it as a static file that Apache can serve up rather than having Seaside build one dynamically. My blog has an admin panel with a menu option to generate site map which invokes...

generateSiteMap
    | siteMap |
    siteMap := SBSiteMapGenerator blogRoot: 'http://onsmalltalk.com/'.
    siteMap generateFromItems: {  (SBPost new)  } , 
        (SBPost findAll: [ :e | e isPublished ]) , SBTag findAll.
    (siteMap pingGoogleWithMap: 'http://onsmalltalk.com/sitemap.xml') 
        ifTrue: [ self message: 'Map generated and Google notified successfully.' ]
        ifFalse: [ self message: 'Map generated but Google notification failed.' ]

The first item in the list, the empty new post, creates an item without a slug, which represents the root of the site. I don't bother pinging the other search engines; the vast majority of my traffic comes from Google, and the rest will find me eventually. So let's run through the generation of this sitemap; it's only a few methods. The class declaration...

Object subclass: #SBSiteMapGenerator
    instanceVariableNames: 'document root blogRoot'
    classVariableNames: ''
    poolDictionaries: ''
    category: 'OnSmalltalkBlog-Config'

A couple of accessors for the blog root...

blogRoot
    ^ blogRoot

blogRoot: aRoot
    blogRoot := aRoot

And a constructor that uses it...

blogRoot: aRootUrl 
    ^ self new
        blogRoot: aRootUrl;
        yourself

Since I'm going to write the sitemap to disk, I'll need to know where to put it, and I'll want it configurable...

siteMapPath
    ^ (FileDirectory
        on: (SSConfig at: #blogWebRoot default: FileDirectory default fullName))
        fullNameFor: 'sitemap.xml'

Now a method to generate the document, add the items to it, and write the file to disk...

generateFromItems: someItems
    document := XMLDocument new
        version: '1.0';
        encoding: 'UTF-8';
        yourself.
    root := (XMLElement named: 'urlset').
    root attributeAt: 'xmlns' put: 'http://www.sitemaps.org/schemas/sitemap/0.9'.
    root attributeAt: 'xmlns:xsi' put: 'http://www.w3.org/2001/XMLSchema-instance'.
    root attributeAt: 'xsi:schemaLocation' put: 'http://www.sitemaps.org/schemas/sitemap/0.9 
     http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd'.
    document addElement: root.
    someItems do: [ :e | self addItem: e ].
    FileStream fileNamed: self siteMapPath
        do: [ :f | f nextPutAll: document asString ]

For each item, I'll want to generate an entry. The item is expected to respond to two methods, #updatedOn and #slug. All of my posts and tags respond to these, so I can just toss them into a single list of items...

addItem: anItem
    | url location lastModification isoString changeFreq |
    url := root addElement: (XMLElement named: 'url').
    location := url addElement: (XMLElement named: 'loc').
    location addContent: (XMLStringNode string: self blogRoot , anItem slug).
    changeFreq := url addElement: (XMLElement named: 'changefreq').
    changeFreq addContent: (XMLStringNode string: 'weekly').
    lastModification := url addElement: (XMLElement named: 'lastmod').
    isoString := String streamContents: 
        [ :stream | anItem updatedOn printOn: stream withLeadingSpace: false ].
    lastModification addContent: (XMLStringNode string: isoString).
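
For reference, here is a minimal stand-alone sketch of the shape each entry ends up with, in the same element order #addItem: uses. It builds the string by hand with no XML library; the slug and date are placeholders I made up for illustration, not real posts.

```smalltalk
"Stand-alone sketch: one sitemap entry, elements in the order addItem: adds them.
 The slug and date are hypothetical placeholders."
| blogRoot slug lastmod entry |
blogRoot := 'http://onsmalltalk.com/'.
slug := 'example-post'.
lastmod := '2008-01-15'.
entry := String streamContents: [ :s |
    s nextPutAll: '<url>'.
    s nextPutAll: '<loc>'; nextPutAll: blogRoot , slug; nextPutAll: '</loc>'.
    s nextPutAll: '<changefreq>weekly</changefreq>'.
    s nextPutAll: '<lastmod>'; nextPutAll: lastmod; nextPutAll: '</lastmod>'.
    s nextPutAll: '</url>' ].
Transcript show: entry; cr.
```

The root item is the same thing with an empty slug, so its loc is just the blog root.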

With the file generated, we're ready to let Google know we've updated it...

pingGoogleWithMap: aMap 
    ^ (WAUrl new
        hostname: 'www.google.com';
        addToPath: 'webmasters/tools/ping';
        addParameter: 'sitemap' value: aMap;
        yourself) asString asUrl retrieveContents content 
        includesSubString: 'Sitemap Notification Received'

And that's it, Google knows the site's been changed and all of its valid URLs, and most of the time, is crawling the site within minutes, if not instantly.

I've got to say, I'm not missing Wordpress at all; it's a lot more fun just building your own blog.

Implementing Related Posts for OnSmalltalk

I found a few minutes to sit down and implement a simple related posts feature for the blog. Thought I'd take the simple method of just counting the number of tags the posts have in common, sorting them, dropping those with nothing in common, and grabbing the top x posts...

SBPost>>relatedPosts
    ^ (((((self class publicPosts copyWithout: self) 
        collect: [ :post | post -> (self tags count: [ :t | post tags includes: t ]) ])
          reject: [ :e | e value = 0 ])
            sortBy: [ :a :b | a value > b value ])
              collect: [ :e | e key ])
                 pageSize: self relatedPostCount page: 1

I was looking at it afterwards and thought something looked familiar about that algorithm. After stripping out two lines it became a bit more obvious...

SBPost>>relatedPosts
    ^ (((self class publicPosts copyWithout: self) 
        collect: [ :post | post -> (self tags count: [ :t | post tags includes: t ]) ])
          sortBy: [ :a :b | a value > b value ])
            collect: [ :e | e key ]

It's a Schwartzian Transform, named after our own Randal Schwartz (Squeak board member / one man Seaside evangelist). If you ever run into him, ask him about it, it's a funny story.

You could get the same result with just the sort block; a more naive implementation...

SBPost>>relatedPosts
    ^ ((self class publicPosts copyWithout: self) 
        sortBy: [:p1 :p2 | 
            (self tags count: [ :t | p1 tags includes: t ]) 
              > (self tags count: [ :t | p2 tags includes: t ]) ])

But you'd end up doing the tag count computation on each comparison, much more expensive than calculating once up front and then sorting.
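
To see the "calculate once up front" part in isolation, here is a small workspace sketch on plain collections. The post names and tags are invented, and I use #asSortedCollection: as a portable stand-in for #sortBy:; treat it as an illustration of the pattern rather than the blog's actual code.

```smalltalk
"Decorate-sort-undecorate on plain collections. All data here is made up."
| myTags posts calls score decorated related |
myTags := #(#seaside #smalltalk #apache).
posts := {
    'Sitemaps' -> #(#seaside).
    'Load balancing' -> #(#seaside #apache #smalltalk).
    'Gardening' -> #(#tomatoes).
    'Clean URLs' -> #(#seaside #smalltalk) }.
calls := 0.
"The expensive key: number of tags a post shares with myTags."
score := [ :post |
    calls := calls + 1.
    myTags count: [ :t | post value includes: t ] ].

"Decorate: pair each post name with its score, computed exactly once."
decorated := posts collect: [ :post | post key -> (score value: post) ].

"Drop the unrelated, sort on the cached score, undecorate back to names."
related := ((decorated reject: [ :e | e value = 0 ])
    asSortedCollection: [ :a :b | a value >= b value ]) asArray
        collect: [ :e | e key ].

Transcript show: related printString; cr.
Transcript show: calls printString; cr.
```

The score block runs once per post (four times here) no matter how many comparisons the sort makes; with the sort-block-only version it would run twice per comparison.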

By the way, a nice justification for re-inventing the wheel and writing your own blogging software, is to have a side project just for fun that lets you come up with excuses for features like this that don't feel like work and give you something to write about as well.

Clean URLs in Seaside

Seaside is known as a heretic framework when it comes to URLs; by default, they aren't very pretty. This is both a blessing and a curse. It speeds up development tremendously but confuses the crap out of your users who don't understand why they can't copy URLs and instant message or email them to you.

These URLs come from callbacks, but you don't want to get rid of all callbacks since they're a major part of what makes programming in Seaside so enjoyable by removing the need to manually marshal state in URLs. Once you get to the point where your app is working well enough that you are concerned about the URLs, you can identify those parts of your application that are mostly just navigation from one component to the next and start replacing callbacks with clean URLs encoding the necessary state in the URL like every other framework does. This works well for those more web page parts of your site where you don't really need complex callbacks anyway.

Doing clean URLs in Seaside isn't very difficult, but unlike using callbacks, how you pass state with them is rather application specific. Seaside doesn't have simplistic controllers that receive requests and dispatch to views, but an actual control tree that maintains state between requests in the session. Since the control tree is application specific and varies depending on the developer's personal style, the URL routing, which has to build or change parts of the control tree, is also necessarily application specific.

Basically there are two things you have to do, get rid of the _s and get rid of the _k. I picked up these ideas from the squeak-dev list when Adrian from cmsbox explained what they were doing. He didn't post any code, just a quick description of the method, but it was more than enough to get me going.

Getting rid of the _s is trivial, it's a configuration option on the application config page. Using that method however will cost you an initial redirect where Seaside sets a cookie and then redirects so it can detect the cookie on the following request. This leaves you without the _s but immediately leaves you sitting on a page with a _k and c in the URL when you haven't even done anything but hit the site root; it's ugly.

If you want clean URLs, at least for something as simple as page to page navigation where nothing fancy is going on you probably don't want this initial redirect, bots don't like it either. The fix is to not enable cookie sessions via the config, but to do it manually by tagging the response with the cookie on the way out if it isn't already there.

On your WASession subclass just override #returnResponse: with something like this:

returnResponse: aResponse 
    (self currentRequest cookieAt: self application handlerCookieName) 
        ifNil: [ aResponse addCookie: self sessionCookie ].
    ^ super returnResponse: aResponse

This adds the same cookie the config screen would without the redirect and thus without the initial ugly URL when a new session is instantiated.

We also have to remove the _s from generated callback URLs. Add another override to extend the behavior of #actionUrlForKey: to strip the _s when a session cookie is found:

actionUrlForKey: aString 
    | url |
    url := super actionUrlForKey: aString.
    (self currentRequest cookieAt: self application handlerCookieName) 
            ifNotNil: [ url parameters removeKey: self application handlerField ].
    ^ url

This takes care of the _s, you'll never see it again.

The _k is a little more interesting, so I'll use this blog as my example.

I tend to use a root component which acts as an outer frame and has an instance variable for the current body, header, and footer components. Sometimes some of this stuff in the root component might be expensive to get, so I don't want to have to do it more than once per session, or I just want it to persist between requests.

Normally when a request comes in without a _k, the current session will be invoked to create a new render loop main, which will be invoked to create a new instance of your root component and render it.

I want to avoid this--though this part isn't strictly necessary if you're OK with each request creating a new instance of your root--and keep the existing instance of the root component as well as parse the URL to decide what component should be loaded as the current body. This requires a custom #WARenderLoopMain subclass installed as the main class in the configuration.

This all starts at, go figure, #start: on the session class. So we'll override the default implementation with this:

start: aRequest 
    ^ self mainClass new
        blog: blog;
        start: aRequest

Here we see blog, which is an instance of the root component that I want to reuse. I'm simply passing on the root component instance to the custom #WARenderLoopMain subclass.

This means my session class needs to keep track of the root component, easy enough to do in the #initialize of the root component.

initialize
    super initialize.
    self session blog: self.
    currentBody := SBPostsView new.

Now the custom WARenderLoopMain subclass has the root component. A simple override of the #createRoot factory method allows me to return the same instance each time instead of creating a new one:

createRoot
    ^ blog ifNil: [ self rootClass new ]

At this point, I could override #start: and do all my URL parsing now, in the render loop, but I won't because I prefer to let each component parse the URL for itself taking its relevant state and loading itself up however it wishes. The default behavior of #start: already allows this by invoking #initialRequest: on each visible component.

Now, this won't be an initial request, but a subsequent request on an already initialized component; however, about the only thing I ever use #initialRequest: for is parsing URLs, so I'm happy to just treat each request without a _k as an #initialRequest: to a component.

initialRequest: aRequest
    "parse aRequest url however you like"

Of course parsing your URL is quite naturally application specific so I'll leave this as an exercise to the reader.

At this point I just grab the path from the URL and do a quick search for blog posts with a matching URL slug. If one is found I load up that page as the current page, if not I check for any tags that match the slug and load up the posts in that tag. If nothing is found I issue a 404 status and render the home page.
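
The lookup order described above can be sketched without any Seaside at all. This is only a model of the decision, assuming the path arrives as a plain string; the slugs, the tag names, and the idea of answering a (page, status) pair are all hypothetical stand-ins, not the blog's real code.

```smalltalk
"Framework-free sketch of the slug lookup: post first, then tag, else 404.
 Slugs and tags below are invented for illustration."
| posts tags resolve |
posts := #('clean-urls-in-seaside' 'generating-a-site-map').
tags := #('seaside' 'apache').
resolve := [ :path | | segments slug |
    segments := path subStrings: '/'.
    slug := segments isEmpty ifTrue: [ '' ] ifFalse: [ segments last ].
    slug isEmpty
        ifTrue: [ #(#home 200) ]          "assumed: empty path is the site root"
        ifFalse: [
            (posts includes: slug)
                ifTrue: [ #(#post 200) ]
                ifFalse: [
                    (tags includes: slug)
                        ifTrue: [ #(#tag 200) ]
                        ifFalse: [ #(#home 404) ] ] ] ].

Transcript show: (resolve value: '/clean-urls-in-seaside') printString; cr.
Transcript show: (resolve value: '/seaside') printString; cr.
Transcript show: (resolve value: '/no-such-page') printString; cr.
```

In a real component the same branches would swap in the current body and set the response status instead of answering an array.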

The only thing left to do now is render the URLs cleanly instead of with callbacks in your render methods. Something like:

html anchor
    url: (self baseUrl addToPath: eachPost slug) asString;
    with: eachPost name

Note here that #baseUrl is not actually a method on a component but on the session. After some profiling I found #baseUrl to be a very expensive method to call and since it never really changes it pays off very well to cache it in my component base class:

baseUrl
    ^ (baseUrl ifNil: [ baseUrl := self session baseUrl ]) copy

Note that when I'm rendering the anchor I'm actually modifying the #baseUrl's path so the cached copy needs to return a copy of itself whenever it's used.
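
Here is a tiny sketch of why the copy matters, using an OrderedCollection as a stand-in for the URL's path (an assumption made purely so the snippet runs in a bare image). Each caller appends to its own copy and the cached original stays untouched.

```smalltalk
"Cache a mutable value once, hand out copies. The OrderedCollection is a
 hypothetical stand-in for the session's base URL and its path."
| cached baseUrl first second |
cached := nil.
baseUrl := [ (cached ifNil: [ cached := OrderedCollection with: 'blog' ]) copy ].

first := baseUrl value.
first add: 'some-post'.        "like addToPath: on one anchor"
second := baseUrl value.
second add: 'other-post'.      "a second anchor starts from a clean base"

Transcript show: first asArray printString; cr.
Transcript show: second asArray printString; cr.
Transcript show: cached asArray printString; cr.
```

Without the copy, the second anchor would inherit the first anchor's path segment.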

Anchors with callbacks, when clicked in Seaside result in two http requests to the server, the initial one which looks up the callback and invokes it, and a quick 302 redirect to the final URL to render that page. By not using callbacks and rendering ordinary URLs I'm bypassing the callback phase completely and loading up my state in #initialRequest: which eliminates one of the http requests. Combined with the caching of the #baseUrl, this is what makes the page navigation feel so snappy.

I'm sure some of this will likely change in 2.9 but at the moment I have no idea. In any case you get clean URLs with no parameters that are bookmarkable and won't confuse users trying to pass URLs around among themselves.

OnSmalltalk is Now Written In Smalltalk

OK, this blog is finally written in Seaside; no more chasing the latest Wordpress version. Just makes more sense for a blog mostly about Smalltalk and Seaside to be written in Smalltalk and Seaside. I haven't felt like writing for a long time so I'm hoping this change will make the blog more fun and get me paying attention to it again; she's been neglected for a while.

The code comes in at just under 800 lines so far for the Seaside, database and data migration, RSS and Atom feed, and Google sitemap code. It could have been less if I didn't have to support the Atom feed or deal with all the existing links or data from the old Wordpress site without breaking them, but that wasn't an option.

Hopefully, it's not full of bugs, but it supports clean URLs and a nicer threaded Ajax comment system. It'll be interesting to see how managing it compares to managing Wordpress. No more huge library of plugins or community to turn to, just my own simple Smalltalk code with the features I need and only the features I need. Just hope the spam doesn't kill me, guess I'll find out soon enough.

Scaling Seaside: More Advanced Load Balancing And Publishing

Seaside is a stateful application server and a Squeak VM is only capable of taking advantage of a single processor. When hosting a website on a multi processor server, or several servers, you need to load balance your requests across several instances of Squeak to fully utilize the hardware and handle more load than a single Squeak VM is capable of dealing with.

All serious Seaside applications that I'm aware of are front ended by Apache to offload the serving of static content like images, CSS files, JavaScript files, and static HTML while proxying only dynamic content to Seaside. Apache is an awesome platform and absolutely should be one of the tools in your toolbox; however, it can be quite challenging to set up in such a way that it makes load balancing and deploying new code seamless to your users.

I want to thank Avi Bryant for some tips about how DabbleDB handles this issue last year which gave me the basic approach and Lukas Renggli for some recent discussions that spurred me to finally spend some time on this issue and settle it for myself.

Previously I've used HAProxy, and when it became available moved to Apache's mod_proxy_balancer module to load balance multiple Squeak VM's. I found both approaches lacking when it came to rolling out new code without being forced to blow away all existing user sessions when restarting Squeak with newly updated application code. Because of this I'd been rolling out code only during non-peak hours when few users would be affected; however, I need to be able to roll out new code anytime and dynamically shift new sessions to the new servers while allowing the old VM's to remain up and running until their sessions expire and are no longer needed.

I've now abandoned mod_proxy_balancer in favor of mod_rewrite and a few scripts, which let me tailor a custom solution that gives me much more control: seamless deployment of new code, with Apache launching Squeak VM's as necessary and on the fly to handle load. The basic approach is as follows...

When a request comes in I use a rewrite condition to check for a cookie set by Seaside telling me what server this request should be handled by.

If the cookie exists, I look up the server name in a rewrite map to find out what port that server is running on. That port feeds directly into another rewrite map that runs a bash script which uses netcat to see if the port is open or closed and returns the result.

verifyPort
#!/bin/bash
while read ARG; do
  nc -z localhost $ARG
  if [ $? = 0 ]; then
    echo open$ARG
  else
    echo closed
  fi
done

If the port is open, I proxy the request directly to the server with a rewrite rule when it sees the word "open" in the result.

If the port is closed or if no cookie is found, I proxy the request on a random port chosen from the rewrite map that contains the server mappings. That port is passed through another script, which also checks to see if the port is open and returns the port number immediately if it is; however, if the port isn't open it launches a new Squeak VM on the specified port and waits until it finds the new port open before returning the port number and allowing Apache to proxy the request.

imageLauncher
#!/bin/bash
while read ARG; do
  nc -z localhost $ARG
  if (( $? != 0 )); then
    squeakvm -mmap 300m -headless -vm-sound-null -vm-display-null \
      /var/squeak/app.image /var/squeak/startScript port $ARG &
    sleep 1
    nc -z localhost $ARG
    while (( $? != 0 )); do
      sleep 1
      nc -z localhost $ARG
    done
  fi
  echo $ARG
done

When a VM starts up, it's fed a script...

startScript
[[[ 60 seconds asDelay wait.
    WARegistry allSubInstances do: [ :e | e unregisterExpiredHandlers ].
    (WASession allSubInstances allSatisfy: [ :e | e expired ])
        ifTrue:
          [ SmalltalkImage current snapshot: false andQuit: true ] ]
        on: Error
        do: [ :error | error asDebugEmail ] ] repeat ]
        forkAt: Processor systemBackgroundPriority.

Project uiProcess suspend.

The script kicks off a background process that runs every 60 seconds and expires any sessions that are ready for expiration and shuts the image down when it finds all sessions have expired. It also kills the UI process which isn't needed on a headless server and just wastes CPU cycles if left running.

Note that this process starts off with an immediate 60 second delay; this allows ample time for the image to start up, receive its first request, and establish a session. If you don't start with the delay, you'd see the image start up and immediately shut back down when it found it contained no active sessions (yes, I did this a few times before figuring out what the hell was happening).

Also, this being a low priority background process, I'm not concerned with using #allSubInstances as speed isn't relevant.

When a request hits Seaside for the first time, the root component of the Seaside app checks to see if the cookie is set and has the correct value. If the cookie is missing (new request) or the cookie has the wrong value (expired request for a VM that's no longer running and timed itself out) Seaside resets the cookie to the correct value so all future requests are served by that VM.

initialRequest: aRequest 
    self setServerCookie

setServerCookie
    | newVal cookieVal |
    cookieVal := self session currentRequest cookies at: #server ifAbsent: [ nil ].
    newVal := 'app' , (HttpService allInstances detect: [ :each | each isRunning ]) portNumber asString.
    cookieVal ~= newVal ifTrue: 
        [ self session redirectWithCookie: (WACookie key: #server value: newVal) ]

I keep three versions of the server mappings, the active version and an A version and a B version which I can simply copy over the active version to shift traffic to a new set of VM's which Apache will launch dynamically. A simple "cp balance.confA balance.conf" instantly shifts new traffic to the ports specified in the "ALL" mapping.

balance.conf
app3001 3001
app3002 3002
app3003 3003
app3004 3004
app3005 3005
app3006 3006
app3007 3007
app3008 3008
app3009 3009
app3010 3010
ALL 3001|3002|3003|3004|3005

balance.confA
app3001 3001
app3002 3002
app3003 3003
app3004 3004
app3005 3005
app3006 3006
app3007 3007
app3008 3008
app3009 3009
app3010 3010
ALL 3001|3002|3003|3004|3005

balance.confB
app3001 3001
app3002 3002
app3003 3003
app3004 3004
app3005 3005
app3006 3006
app3007 3007
app3008 3008
app3009 3009
app3010 3010
ALL 3006|3007|3008|3009|3010

Though the port number is part of the cookie value, I don't use it directly; it's simply a convenient way to name the app images dynamically. A user could fake a cookie value, so looking it up in the mapping to find the port ensures I maintain full control over which ports images are launched on.
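
The routing decision itself is easy to model in a workspace. The sketch below is not how Apache evaluates the rules, just the same logic on plain collections: the set of open ports is invented, and where the real setup would launch a VM on the chosen port, this model only picks the port.

```smalltalk
"Model of the routing decision: cookie -> mapped port if that port is open,
 otherwise a random port from the ALL pool. Open ports are made up."
| servers pool openPorts route |
servers := Dictionary new.
3001 to: 3010 do: [ :p | servers at: 'app' , p printString put: p ].
pool := #(3001 3002 3003 3004 3005).        "the ALL line of balance.conf"
openPorts := Set withAll: #(3002 3007).     "hypothetical running VMs"

route := [ :cookie | | port |
    port := servers at: cookie ifAbsent: [ nil ].
    (port notNil and: [ openPorts includes: port ])
        ifTrue: [ port ]
        ifFalse: [ pool atRandom ] ].

Transcript show: (route value: 'app3007') printString; cr.
Transcript show: (pool includes: (route value: 'app3003')) printString; cr.
Transcript show: (pool includes: (route value: 'app9999')) printString; cr.
```

A session pinned to 3007 keeps going there even though 3007 is no longer in the pool, which is what lets old VM's drain; a closed port or a forged name like app9999 falls through to the pool.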

And finally here's a full production Apache configuration I use to invoke this setup with all the bells and whistles I currently use in my production sites...

Apache Config
<VirtualHost *:80>
    ServerName www.yoursite.com
    DocumentRoot /var/www/www.yoursite.com
    RewriteEngine On
    ProxyRequests Off
    ProxyPreserveHost On
    UseCanonicalName Off

    # tag cachable items with an expiry date for browsers
    ExpiresActive on
    ExpiresByType text/css A864000
    ExpiresByType text/javascript A864000
    ExpiresByType application/x-javascript A864000
    ExpiresByType image/gif A864000
    FileETag none

    # rewrite maps for managing Seaside application pool
    RewriteMap SQUEAK prg:/var/squeak/imageLauncher
    RewriteMap PORTS prg:/var/squeak/verifyPort
    RewriteMap SERVERS rnd:/var/squeak/balance.conf

    # http compression
    DeflateCompressionLevel 9
    SetOutputFilter DEFLATE
    AddOutputFilterByType DEFLATE text/html text/plain text/xml application/xml \
        application/xhtml+xml text/javascript text/css
    BrowserMatch ^Mozilla/4 gzip-only-text/html
    BrowserMatch ^Mozilla/4.0[678] no-gzip
    BrowserMatch \bMSIE !no-gzip !gzip-only-text/html



    # Logfiles
    CustomLog /var/log/apache2/www.yoursite.com.access.log combined
    ErrorLog  /var/log/apache2/www.yoursite.com.error.log

    # Let apache serve any static files NOW
    RewriteCond %{DOCUMENT_ROOT}/%{REQUEST_FILENAME} -f
    RewriteRule (.*) $1 [L]

    #if the cookie has the server in it and the port is open, proxy to it
    RewriteCond %{HTTP_COOKIE} "server=(app[0-9]*)"
    RewriteCond ${PORTS:${SERVERS:%1}} open(.*)
    RewriteRule ^/v6(.*)$ http://localhost:%1/seaside/app$1 [P,L]

    #otherwise proxy to a random Seaside server launching one if necessary
    RewriteRule ^/v6(.*)$ http://localhost:${SQUEAK:${SERVERS:ALL}}/seaside/app$1 [P,L]
</VirtualHost>

And in my httpd.conf I specify a rewrite lock to ensure Apache doesn't have concurrency issues with any of the rewrite map scripts.

RewriteLock /var/squeak/squeak.lock

And that's it. If anyone sees any problems with this approach (Avi, Lukas, or random Apache guru) I'd appreciate a heads up. I ran it by some folks in the #apache channel in IRC and they seemed to think it was fine. I've had this in production for a little bit now and so far it's kicking ass and I'm now free to increase or decrease my pool size dynamically or shift to a whole new set of images when publishing code.
