Saturday, December 10, 2016

How to create a mirror of the entire npm index including attachments using npm-fullfat-registry fullfat.js

In a previous post I attempted to create a local npm mirror by simply caching every package I installed with npm install pkgname. Unfortunately, this approach is very slow: it downloads only about 100 MB per hour. Since all the packages in npm take up more than 1.2 TB, that speed won't do. The recommended method for creating a local npm mirror is to use the NoSQL database couchdb. With this method, packages from npm are stored directly in couchdb.
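To put those numbers in perspective, here is a rough back-of-the-envelope calculation in Bash. It assumes 1.2 TB means 1,200,000 MB (decimal units) and that the download rate stays constant. The function name days_to_mirror is just something I made up for this illustration.

```shell
#!/bin/bash
# Estimate how many whole days a full mirror would take.
# Arguments: total size in MB, download rate in MB per hour.
# (days_to_mirror is a hypothetical helper, not part of any tool.)
days_to_mirror() {
  local total_mb=$1 rate_mb_per_hour=$2
  # integer division: hours needed, then hours -> days
  echo $(( total_mb / rate_mb_per_hour / 24 ))
}

# 1.2 TB taken as 1,200,000 MB, at about 100 MB per hour
days_to_mirror 1200000 100   # prints 500
```

So at the caching speed, a full mirror would take well over a year of continuous downloading.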

Step One

Install couchdb from your distro's package manager, or download it from the Apache CouchDB page and build it from source. As of Dec 10, 2016, version 1.6 is available from the official Fedora 24 repos, while version 2.0+ is available from the Arch Linux repos.

Once couchdb is installed, start the couchdb service. On Fedora 24+ and Arch Linux, you can do this with:

sudo systemctl start couchdb
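Before moving on, it may be worth checking that couchdb is actually answering on its default port. The sketch below assumes curl is installed and that couchdb listens on localhost:5984. The function name couchdb_is_up is my own, not part of couchdb.

```shell
#!/bin/bash
# Return success if something answers HTTP requests at the given URL.
# (couchdb_is_up is a hypothetical helper name; the default URL assumes
# couchdb is listening on its standard port on this machine.)
couchdb_is_up() {
  local url=${1:-http://localhost:5984/}
  # --fail makes curl return non-zero on HTTP errors,
  # --max-time keeps the check from hanging
  curl --silent --fail --max-time 5 "$url" > /dev/null
}

if couchdb_is_up; then
  echo "couchdb is responding"
else
  echo "couchdb is not reachable on localhost:5984" >&2
fi
```

If the check fails, sudo systemctl status couchdb should show whether the service started at all.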


Step Two

Access the web GUI for couchdb (called Futon in version 1.x and Fauxton in version 2.0+) by navigating to:

http://localhost:5984/_utils

Follow the prompts to create an admin user and password. If you don't create an admin user, anyone who can connect to localhost:5984 will be able to create and delete databases. Also create a new database by clicking on the gear icon at the top-left (the scripts below assume it is named registry). The screenshot below shows the Futon UI for couchdb 1.6:



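If you prefer the command line to the web GUI, the database can also be created through couchdb's HTTP API with a PUT request to the database URL. This is only a sketch: user and pw are placeholders for the admin account you created, registry is the database name used by the scripts later in this post, and curl is assumed to be installed.

```shell
#!/bin/bash
# Placeholder credentials: replace with your own couchdb admin account.
COUCH_USER=user
COUCH_PW=pw
# Database name assumed by the fullfat.sh script later in this post.
DB_NAME=registry
DB_URL="http://${COUCH_USER}:${COUCH_PW}@localhost:5984/${DB_NAME}"

# A PUT request to the database URL asks couchdb to create that database.
curl --silent --max-time 5 -X PUT "$DB_URL" \
  || echo "could not reach couchdb on localhost:5984" >&2
```

couchdb answers with a short JSON message saying whether the database was created or already existed.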

Step Three

Make sure that npm (the nodejs package manager) is installed, then create a new directory into which you will install the npm package npm-fullfat-registry. From that directory, run the following as a regular (non-root) user:

npm install npm-fullfat-registry

You will then find a sub-directory named node_modules and below that npm-fullfat-registry/bin.

cd node_modules/npm-fullfat-registry/bin

In this sub-directory you will find a single file named fullfat.js

This is the program you need to execute in order to create a local npm mirror, assuming you have already installed couchdb and have created a DB for this program to write to.

fullfat.js takes the following arguments:

-f or --fat : the url to the couchdb database for storing packages
-s or --skim : the url to the npm package index
--seq-file : file which keeps track of the current package being downloaded from npm
--missing-log : file which stores the names and sequence numbers of packages that cannot be found

To save myself the hassle of entering these parameters every time I want to invoke fullfat.js, I created a convenience script in Bash:

#!/bin/bash
# fullfat.sh
# Last Updated: 2016-11-08
# Jun Go
# Invokes fullfat.js for creating a local npm mirror containing
# npm index as well as attachments. This script is intended to
# be launched by 'npm-fullfat-helper.sh'
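# URL of the local couchdb database (edit the user and password)
LOCALDB=http://user:pw@localhost:5984/registry
# URL of the upstream npm package index
SKIMDB=https://skimdb.npmjs.com/registry
./fullfat.js -f "$LOCALDB" -s "$SKIMDB" --seq-file=registry.seq \
             --missing-log=missing.log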

Of course you will need to edit the username and password in LOCALDB to match your couchdb admin account. The script above works, but it is not sufficient. fullfat.js crashes every so often, so I created a monitor script that restarts my fullfat.sh wrapper script whenever fullfat.js crashes. My monitor script is named npm-fullfat-helper.sh:

#!/bin/bash
# npm-fullfat-helper.sh
# Last Updated: 2016-11-08
# Jun Go

# During the mirroring process for npm, binary file attachments
# are saved into a local couchdb DB named 'registry', but sometimes
# downloading some packages fails or times out, which stops the
# entire process. If you manually resume with fullfat.js, you can
# start again where you left off. This script removes the need to
# do this manually.

until ./fullfat.sh; do
  printf "%s\n" "fullfat.js crashed with exit code $?. Respawning" >&2
  sleep 1
done

With the script above, mirroring npm with fullfat.js becomes much more robust, because fullfat.sh is relaunched whenever it returns anything other than exit code 0. But this is still not sufficient, because sometimes fullfat.js gets stuck while trying to download certain packages. No matter how many times it is restarted, certain packages (especially those with dozens of versions) never finish downloading, so you are left with a lot of tar.gz files in temp directories but no final PUT request to couchdb. When this happens you have to manually edit the sequence file (which keeps track of which package is currently being downloaded). For example, if fullfat.js is stuck and registry.seq contains the number 864117, you must increment it by 1 to 864118. If you then relaunch the monitor script, fullfat.js should move on to the next package. If it is still stuck on the same package name, edit registry.seq once more and increment the new sequence number by one.
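Editing the sequence file by hand gets tedious, so the increment can be wrapped in a small helper. The sketch below assumes the sequence file contains nothing but a single integer, which is how my registry.seq looked. The function name bump_seq is my own invention. Stop the monitor script before running it so that fullfat.js does not overwrite the file at the same time.

```shell
#!/bin/bash
# bump_seq FILE: add 1 to the sequence number stored in FILE.
# Assumes FILE holds a single non-negative integer and nothing else.
# (bump_seq is a hypothetical helper, not part of npm-fullfat-registry.)
bump_seq() {
  local seq_file=$1 current
  current=$(tr -d '[:space:]' < "$seq_file")
  case $current in
    ''|*[!0-9]*)
      echo "unexpected content in $seq_file: '$current'" >&2
      return 1
      ;;
  esac
  # keep a backup in case you need to go back to the old sequence
  cp "$seq_file" "$seq_file.bak"
  # 10# forces base 10 so a leading zero is not read as octal
  echo $(( 10#$current + 1 )) > "$seq_file"
}

# Demonstration on a throwaway file instead of the real registry.seq
demo=$(mktemp)
echo 864117 > "$demo"
bump_seq "$demo"
cat "$demo"   # prints 864118
rm -f "$demo" "$demo.bak"
```

For the real thing you would run bump_seq registry.seq from the directory that holds the sequence file and then relaunch npm-fullfat-helper.sh.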


Conclusion

Mirroring the npm index with file attachments using couchdb is much faster than simply caching packages installed through npm install foo. I get speeds of about 1 GB/hr, roughly ten times faster than the caching approach. The problem is that manual intervention is required when fullfat.js gets stuck: you must manually change the sequence number stored in the sequence file (which I called registry.seq above) so that fullfat.js will skip the problematic package and go on to the next one. Another inconvenience is that as of Dec 12, 2016 the documentation for npm-fullfat-registry has not been updated. If you follow along with these old instructions, you will be told to invoke the following:

npm-fullfat-registry -f [url to local db] -s [url to npm package index]

But this won't work, because the program you actually need to execute is fullfat.js. Simply replace npm-fullfat-registry above with fullfat.js and you'll be good to go. Keep in mind that fullfat.js lives in node_modules/npm-fullfat-registry/bin, relative to the directory in which you ran npm install npm-fullfat-registry.