Archive: 2014

  1. Halcyon days again

    Further to my post about the halcyon days of Sinclair design, I was rather excited to see that Rick Dickinson (designer of many of the original Sinclair computers) has been reimagining the Sinclair QL.

    From the drawing board of Rick Dickinson (DICKINSON ASSOCIATES)

    The design harks back to the original QL concept with a built-in screen – although it’s rather an upgrade on the Sinclair TV that the early QL designs envisaged.

    Original QL sketch, incorporating flat tube TV and printer, by Rick Dickinson (from his Flickr set)

    Unfortunately, like the wafer design I was coveting, there’s no suggestion of actual development anytime soon – even though after a bit of research I realised that people are still working on a variation of the original Sinclair QDOS (the operating system of the QL): SMSQ/E stands ready.

  2. A web chat system for sale (one previous owner)

    It is at times when IT companies are bought for $19bn (WhatsApp, by Facebook) that I think back to systems we once developed that had a similar application. In this case, my mind has fixed on a web chat system we developed for a UK Government Department back in 2006.

    The primary purpose of this system was to allow the Department’s Permanent Secretary to hold question and answer sessions with his staff; but the system was developed so that it could be used by anyone within the organisation for whatever purpose they saw fit.

    A screenshot from a test of the web chat system in May 2007

    Like much of the development work we were doing at that time, this was ground-breaking work for the environment in which it was to be hosted; and it felt like we were pushing the boundaries of what was possible, probably slightly beyond what was possible.

    The web chat system itself was written using a combination of .NET, ASP and JavaScript (in Internet Explorer 6, or possibly 5.5 at the time), and was all managed via a custom-built XML interface on top of SQL Server. It utilised one of the hot technologies of the time – SOAP (Simple Object Access Protocol) – to pass messages between the client browser and server.

    Polling for new data was one of the most challenging aspects of the web chat application. We solved it at the time by running JavaScript timers every few seconds to check the server for any messages posted since the last one the current client had received. New messages were returned via SOAP as an XML document, which was then processed and the results inserted into the appropriate DOM element.

    From what I can remember there were a number of performance issues when too many messages were loaded onto the page at once, so we had to introduce a paging system that kept only a limited number of messages on the screen at any one time.

    An extra layer of complexity was added at a relatively late date when the intranet team running the system decided they wanted to add some moderation, so that every message posted within a particular chat session would need to be approved before it could be displayed. I think this was a result of general nervousness as the date for the first live chat approached!

    The system was actually fairly flexible. Chat rooms could have specified start and end dates and times and, as well as the configurable moderation, could be set up to allow anonymous postings and to notify users when new messages were posted. Those notifications went out by email, because we were 7 years too early for browser-based push notifications.

    I think the most popular chats were associated with some of the various clubs and associations run within the Department.

    OK, it wasn’t quite WhatsApp (which after all is tied very closely to phones and the modern telecoms infrastructure), but I’m sure it could have been developed further and been ready for the smartphone revolution when it came. In any case, I’d be happy with only a fraction of that $19bn.

  3. Automating Sphinx

    I’ve been working on a project to use the open source Sphinx search engine to index some websites. The intention is that the user will be able to type in the URL of the website they want to index, and the system will then (a) download the content of the site in question using a bespoke crawler and (b) automatically pipe the relevant parts of that content into Sphinx.

    This is a development version of the indexing system in operation

    The system will provide a front-end search engine for users to query the indexed content. The results will give a summary of each page of content – including generated keywords – as well as a report on the status of each page, including any relevant metadata. There will also be an API so that, if required, the content can be utilised in other systems.

    Generating the content to index

    Crawling any web content is something of an art form. Many sites do not conform to good HTTP practices – it amazes me, for example, that so many sites don’t use the Last-Modified HTTP header to ease the burden on their servers. However, for the purposes of this article we’ll assume that the content we want can be fetched easily and that every page fetched can be processed to generate a nice XML fragment with all of the information we want for each page.
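
    As an aside, the Last-Modified point is easy to act on from the client side. The sketch below is purely illustrative – it is not the bespoke crawler described here, and the URL and file path are made up – but it shows how LWP::UserAgent’s mirror() method sends an If-Modified-Since header based on the local copy’s timestamp, so unchanged pages are not downloaded again:

    #!/usr/bin/env perl
    # Illustrative sketch of a polite, conditional fetch using LWP::UserAgent.
    # The URL and local path are hypothetical.
    use strict;
    use warnings;
    use LWP::UserAgent;

    my $url        = 'http://www.example.com/page.html';
    my $local_file = 'crawler/www.example.com/page.html';

    my $ua = LWP::UserAgent->new(agent => 'my-crawler/0.1', timeout => 30);

    # mirror() adds If-Modified-Since when $local_file already exists,
    # and only saves a new copy if the server reports a change
    my $res = $ua->mirror($url, $local_file);

    if ($res->code == 304) {
        print "Not modified since the last crawl - nothing to do\n";
    }
    elsif ($res->is_success) {
        print "Fetched and stored a new copy\n";
    }
    else {
        warn "Fetch failed: " . $res->status_line . "\n";
    }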

    So, the web crawler fetches the content and each URL encountered generates an XML fragment stored with the following structure:

    <document>
        <subject></subject>
        <url></url>
        <published></published>
        <description></description>
        <keywords></keywords>
        <status></status>
        <words>
            <word></word>
            <word></word>
            …
        </words>
        <content></content>
    </document>
    

    What we need to do then is to:

    1. turn the XML documents into a format suitable for importing into a Sphinx index;
    2. create an index for the site in Sphinx; and
    3. feed the information stored across the multiple XML files into the relevant index.

    Processing the XML files for Sphinx

    To get the content in the static XML files that have been generated for each URL into Sphinx, we need to generate an XML document that is compatible with Sphinx’s xmlpipe2 document format.

    The first step is to concatenate all of the individual XML files into one large XML file (making sure to top and tail the concatenated files with some extra XML so that the final file is valid XML). This is handled by a simple ‘cat’ command.
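
    To make that concrete, if a site produced three pages the concatenated file would end up looking something like this – the <documents> root element is just an illustrative name for the ‘top and tail’ wrapper, the only requirement being that the result is well-formed XML:

    <documents>
        <document>
            <subject></subject>
            <url></url>
            …
        </document>
        <document>
            …
        </document>
        <document>
            …
        </document>
    </documents>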

    Then a separate XML file is generated in memory using a fairly straightforward XSLT transformation – using:

    <sphinx:document id="{1 + count(preceding-sibling::document)}">

    …to give each entry a unique sequential id.
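
    For reference, the xmlpipe2 format that the transformation has to produce is essentially a sphinx:docset containing a sphinx:schema (which declares the full-text fields and attributes) followed by one sphinx:document per page. The result is something along these lines – although exactly which of the elements above become fields and which become attributes is a guess on my part, and depends entirely on the XSLT:

    <?xml version="1.0" encoding="utf-8"?>
    <sphinx:docset>
        <sphinx:schema>
            <sphinx:field name="subject"/>
            <sphinx:field name="content"/>
            <sphinx:attr name="url" type="string"/>
            <sphinx:attr name="published" type="timestamp"/>
        </sphinx:schema>
        <sphinx:document id="1">
            <subject>…</subject>
            <content>…</content>
            <url>…</url>
            <published>…</published>
        </sphinx:document>
        <sphinx:document id="2">
            …
        </sphinx:document>
    </sphinx:docset>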

    This XML file is only generated when it is called by the Sphinx configuration file. However, before that we need to create a configuration file that knows about each new site as it is added for indexing.

    Creating a Sphinx configuration file on the fly

    Fortunately, Sphinx configuration files can be written in more or less any scriptable language. The solution here is written in perl, partly because that’s what much of the other code for the system I am developing is written in, and partly because perl has the rather useful Sphinx::Config module which can parse, update and output Sphinx configuration files.

    Using Sphinx::Config the script reads in a standard Sphinx configuration file which I’ve set up with all of the correct connection details for my instance of Sphinx, but which doesn’t include any source or index information. [For more on configuring Sphinx, see the official documentation]

    my $filename = "sphinx.conf";
    my $c = Sphinx::Config->new();
    $c->parse($filename);

    I then iterate through each of the file-based folders the crawling process has created ($dir) to get the names of the indexes I need to create ($index), and then set the appropriate source and index entries using Sphinx::Config->set():

    $c->set('source',$index,'type','xmlpipe2');
    $c->set('source',$index,'xmlpipe_command','perl sphinx.xslt.pl xslt_file crawler/'.$dir.'.sphinx.xml');
    $c->set('index',$index,'source',$index);
    $c->set('index',$index,'path','/path/to/sphinx/'.$index);
    $c->set('index',$index,'morphology','stem_enru');
    $c->set('index',$index,'docinfo','extern');
    $c->set('index',$index,'charset_type','utf-8');
    $c->set('index',$index,'min_word_len','1');

    It is the xmlpipe_command that generates the transformed XML document that is to be fed into Sphinx. (The perl file sphinx.xslt.pl just carries out a standard XSLT transformation and outputs the result as a string.)
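
    I haven’t reproduced the real sphinx.xslt.pl here, but as a rough sketch the core of such a script only needs a few lines with XML::LibXSLT – assuming, as the xmlpipe_command above suggests, that the first argument is the path to the XSLT stylesheet and the second is the concatenated XML file:

    #!/usr/bin/env perl
    # Rough sketch of an xmlpipe_command script: apply an XSLT stylesheet
    # to an XML file and print the result to STDOUT for Sphinx to read.
    # Assumes the arguments are: stylesheet file, then source XML file.
    use strict;
    use warnings;
    use XML::LibXML;
    use XML::LibXSLT;

    my ($xslt_file, $xml_file) = @ARGV;
    die "Usage: $0 stylesheet.xslt source.xml\n" unless $xslt_file && $xml_file;

    my $xslt       = XML::LibXSLT->new();
    my $stylesheet = $xslt->parse_stylesheet(XML::LibXML->load_xml(location => $xslt_file));
    my $result     = $stylesheet->transform(XML::LibXML->load_xml(location => $xml_file));

    # The indexer runs the xmlpipe_command itself and reads the transformed
    # xmlpipe2 document from its standard output
    print $stylesheet->output_as_bytes($result);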

    Finally I output the updated configuration as a string so that it can be read by Sphinx:

    print $c->as_string();
    

    The complete configuration script looks like this:

    #!/usr/bin/env perl
    
    use strict;
    use warnings;

    use Sphinx::Config;
    use Cwd;
    
    # The name of the base configuration file
    # - this needs to include all of the connection
    # information for indexer and searchd
    my $filename = "sphinx.conf";
    
    # Load in the default configuration file
    my $c = Sphinx::Config->new();
    $c->parse($filename);
    
    # Set the location of all the crawled content
    my $root = cwd()."/crawler/";
    
    # Open the directory
    opendir my $dh, $root
      or die "$0: opendir: $!";
    
    # And read in each of the folders (excluding the . and .. entries)
    my @dirs = grep {-d "$root/$_" && ! /^\.{1,2}$/} readdir($dh);
    
    # Iterate through the folders and create a source 
    # and index entry for each
    foreach my $dir (@dirs) {
            my $index = $dir;
            $index =~ s/\.//g;
            $c->set('source',$index,'type','xmlpipe2');
            $c->set('source',$index,'xmlpipe_command','perl sphinx.xslt.pl xslt_file crawler/'.$dir.'.sphinx.xml');
            $c->set('index',$index,'source',$index);
            $c->set('index',$index,'path','/path/to/sphinx/'.$index);
            $c->set('index',$index,'morphology','stem_enru');
            $c->set('index',$index,'docinfo','extern');
            $c->set('index',$index,'charset_type','utf-8');
            $c->set('index',$index,'min_word_len','1');
    }
    
    # Output the updated configuration so that 
    # it can be read by Sphinx
    print $c->as_string();

    So now I have a configuration file that reflects the content that has been crawled for indexing. However, as yet I haven’t provided a mechanism for letting Sphinx know that it needs to re-read this configuration file when a new site has been added.

    That’s not quite as straightforward as it might be.

    Updating Sphinx

    To step back slightly, I should say that the whole crawler/indexer process is being managed by a bash script. This is the easiest way to tie together the different processes and languages involved in getting everything to work. So it is a bash script which starts the crawling process and which, when it has finished, creates the standalone XML document (using the cat command).

    The next step in the bash process is to tell Sphinx to index. This is seemingly just a question of calling the indexer using the dynamic configuration file devised above along with the name of the new index and asking it to rotate:

    indexer --config sphinx.config.pl "$INDEX" --rotate

    If you’ve set your permissions correctly this will report a successful rotation AND that it has sent a SIGHUP to the searchd process that handles web-based queries so that it too can re-read the configuration file and learn about the new index.

    Except this is a lie. The index is rotated properly but searchd ends up knowing nothing about it. In fact, you have to send that signal independently:

    kill -SIGHUP `cat /path/to/your/searchd.pid`

    Except actually that doesn’t work either.

    That’s because although the Sphinx configuration file is scriptable and hence dynamic in terms of its output, the process that SIGHUP triggers to re-read the configuration file first checks the file’s modification date, and if that hasn’t changed it won’t bother re-reading the file at all. So before sending the SIGHUP to searchd you need to touch the configuration file:

    touch /path/to/your/sphinx.config.pl

    Of course, it probably still won’t work, because the process running your bash script won’t have permission to execute the kill command on the searchd process. To get that to work you need to add a sudoers entry (via visudo). That’s mostly outside the scope of this article, but you’ll need something that looks a bit like this:

    daemon  ALL=NOPASSWD: /bin/kill -SIGHUP ${`cat /path/to/your/searchd.pid`}
    

    After that has been configured, everything should just work. On a good day, anyway.