A workflow for periodically importing data from a remote host

Using data hosted on remote sources has become very common due to the rise of APIs and web services. Everything is interconnected, and it's fair to say that almost every application, at some level, utilises data located on a remote source. APIs are usually available when there is a web service running, but there are times when you want to use a remote computer strictly for storage.

I was recently working on a project in which I had to read and write data on a remote computer, and there was no API available for doing so. In this article, I will explain the solution I came up with.

As shown below, I had two servers. I was running a web application on the Application Server, and had to read and write some XML files on the Storage Server. Both servers were running Linux.

[Figure: network setup with the Application Server and the Storage Server]

To enable communication between the two servers, I could run a service on the storage server and let the application server talk to it over some protocol. I could even run another web application (an API) on the storage server and let the application server communicate with it via HTTP. But running a web application requires installing web server software (nginx or Apache), which becomes a maintenance headache, and other issues like authentication would also have to be considered.

I was looking for an approach in which I could use SSH for authentication. While searching for options I came across sshfs, which lets you mount a remote computer's files on the local one. This provides a simple and secure solution; sshfs had everything I needed.

Setup process

I didn't plan to use passwords for SSH authentication, so the first step was to set up key-based authentication for SSH.

Key-based SSH Authentication

The first step for key-based authentication is to generate a private/public key pair (if you don't have one already). We do this on the application server.

ssh-keygen
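If you prefer to be explicit about the key type, something like the following also works (Ed25519 is just one option; the interactive default above is fine too). One thing to keep in mind: for the automatic mount at boot described later, the key needs to be usable without a passphrase prompt.

ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519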

The second step is to copy the public key to the storage server. There are a couple of ways you can copy an SSH public key to a remote server. My personal favorite is the ssh-copy-id command.

ssh-copy-id username@storage-server.local
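If ssh-copy-id is not available on your system, the same result can be achieved by appending the key manually (adjust the filename to whichever key you generated):

cat ~/.ssh/id_rsa.pub | ssh username@storage-server.local 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys'

Either way, running ssh username@storage-server.local should now log you in without a password prompt.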

Now we are good to go. Once authentication is set up, the next step is to install sshfs.

Setup sshfs

The first step is to install sshfs on the application server. Since I am using Ubuntu, I will be using the built-in package manager, apt-get.

sudo apt-get install sshfs

The second step is to create a folder on the application server to mirror the data on the storage server. I will be creating a folder called “shared-xml” in my home directory.

mkdir ~/shared-xml

It's also important to add the currently logged-in user to the fuse group to avoid permission problems.

sudo addgroup haris fuse

The next step is to perform the sshfs mount. I am mounting the /xml/ folder on the storage server to /home/haris/shared-xml/ on the application server.

sshfs username@storage-server.local:/xml/ /home/haris/shared-xml -C -o allow_other

The allow_other option gives users other than the one who performed the mount read and write access to this folder. It's disabled by default, so if you do not need it you can simply drop the option. (On Ubuntu you may also need to uncomment user_allow_other in /etc/fuse.conf for allow_other to work.)
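To check that the mount is active, or to detach it manually later, the usual tools work (the paths below match the mount point above):

df -h /home/haris/shared-xml          # shows the remote filesystem if the mount succeeded
fusermount -u /home/haris/shared-xml  # unmounts the folder again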

We are now done with setting up sshfs. The next concern is what happens if we restart the application server: the folder would obviously be unmounted. To mount the folder automatically on boot, we need to add the following line to the /etc/fstab file.

username@storage-server.local:/xml/ /home/haris/shared-xml fuse.sshfs defaults,_netdev 0 0
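To verify the entry without rebooting, the fstab can be applied straight away. Note that this mount is performed by root, so root must be able to use the SSH key as well, for example by adding an IdentityFile= option to the fstab line; treat this as a quick sanity check rather than a full recipe.

sudo mount -a    # mounts every fstab entry that is not already mounted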

Now we are all done with the setup. When we reboot, the system will automatically mount the folder for us, and the application can read and write to the remote server simply by reading and writing to the local mount point: /home/haris/shared-xml

Periodic workflow

That solved one part of my problem. The next part is to efficiently utilise that data in the web application running on the application server.

For performance reasons, I am storing the contents of the XML files in a relational database, and the goal is to update the database when new data is available. To do this, I run a cron job every hour to check the XML files for new data.

[Figure: the hourly cron process]

The process is pretty straightforward. A script is executed every hour, and the script checks whether new data is available.
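The scheduling itself is a single crontab entry. The script name and path below are hypothetical placeholders for wherever you keep the import script (a sketch of which is shown at the end of this section):

# crontab -e on the application server; run the import at the top of every hour
0 * * * * /usr/bin/python3 /home/haris/scripts/sync_xml.py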

The script first looks for files that have been modified in the past hour, using each file's modified time. We could simply update the corresponding database entries for all of those files, but we can make the script more efficient.

The problem with the modified time is that it is updated regardless of whether the content actually changed. So even if you have a list of files that have been modified in the past hour, those files might not contain any new data.

To solve this issue I used an MD5 hash of each file. The hash changes only if the content changes. So the pseudo-code for the script looks something like the following.

# Get a list of modified files in the past 1 hour using file modified time
modified_files = files that have been modified in the past 1 hour

# Iterate the modified files
foreach file in modified_files
    # Retrieve the relevant record from database
    record = db.getRecord(file)
    
    # Check if the md5 hash of the file is different from the md5 hash stored in the database
    if md5_hash(file) != record.md5_hash
        # If the hash is different, update the record
        record.update(file)
    else
        # If the hash is same, move to the next file
        continue
endforeach
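For anyone who wants something more concrete, here is a minimal runnable sketch of that script in Python. The file paths, the SQLite database and the xml_files table are assumptions made purely for illustration; the real script would talk to whatever database the web application already uses.

# sync_xml.py -- hourly import of changed XML files from the sshfs mount.
# Assumed layout (illustrative only): XML files under /home/haris/shared-xml,
# records in an SQLite table xml_files(path TEXT PRIMARY KEY, md5_hash TEXT, content TEXT).
import hashlib
import sqlite3
import time
from pathlib import Path

MOUNT_DIR = Path("/home/haris/shared-xml")
DB_PATH = "/home/haris/xml_records.db"   # hypothetical database location
ONE_HOUR = 60 * 60


def md5_of(path: Path) -> str:
    # Hash the file contents, so files with unchanged content can be skipped.
    return hashlib.md5(path.read_bytes()).hexdigest()


def main() -> None:
    cutoff = time.time() - ONE_HOUR
    conn = sqlite3.connect(DB_PATH)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS xml_files "
        "(path TEXT PRIMARY KEY, md5_hash TEXT, content TEXT)"
    )

    for path in MOUNT_DIR.glob("*.xml"):
        # Only consider files whose modified time falls in the past hour.
        if path.stat().st_mtime < cutoff:
            continue

        new_hash = md5_of(path)
        row = conn.execute(
            "SELECT md5_hash FROM xml_files WHERE path = ?", (str(path),)
        ).fetchone()

        # Skip files whose content hash has not changed.
        if row is not None and row[0] == new_hash:
            continue

        # Insert a new record, or update the existing one with fresh content.
        conn.execute(
            "INSERT INTO xml_files (path, md5_hash, content) VALUES (?, ?, ?) "
            "ON CONFLICT(path) DO UPDATE SET md5_hash = excluded.md5_hash, "
            "content = excluded.content",
            (str(path), new_hash, path.read_text()),
        )

    conn.commit()
    conn.close()


if __name__ == "__main__":
    main()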

So this is pretty much it. If you have any ideas to improve the workflow, please leave a comment below. Thank you for reading.


