So you discovered ElasticSearch, great! A question that should come to mind once you start using it seriously is: can I create backups of my indexes? Luckily you can, although most of the options are not very user friendly.
First, you can make filesystem-level backups of the indexes, but with a big cluster this means you have to copy the data on each node.
You can also use a shared gateway and back up the data from the gateway. I would not advise this, because the whole point of ElasticSearch is not having a single point of failure. In fact, the shared gateway is now deprecated by ElasticSearch, so don’t even think about exploring that option!
The third option is to use the scan and scroll API calls that ElasticSearch offers. These two calls allow you to scan all (or a subset) of your data and walk over the result set by repeatedly calling scroll. I have tested this on a fair amount of data (200GB) and it works surprisingly well. That is why I decided to add a dump script and an import script to my open source project ESClient (Python), to save you the trouble of reinventing the wheel.
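To make the scan and scroll idea concrete, here is a minimal sketch of the loop using only the Python standard library. It is not the ESClient implementation. It assumes the older REST shape in which the first request uses search_type=scan and returns only a scroll id, and each follow-up request sends that id as the raw request body. The names http_fetch and scan_and_scroll, and the size and keep_alive defaults, are made up for illustration.

```python
import json
from urllib.request import Request, urlopen


def http_fetch(method, url, body=None):
    """Send one HTTP request and decode the JSON response."""
    data = body.encode("utf-8") if body is not None else None
    with urlopen(Request(url, data=data, method=method)) as resp:
        return json.loads(resp.read().decode("utf-8"))


def scan_and_scroll(base_url, index, query=None, fetch=http_fetch,
                    size=100, keep_alive="5m"):
    """Yield every hit matching `query` (default: all documents)."""
    query = query or {"query": {"match_all": {}}}
    base_url = base_url.rstrip("/")
    # Step 1: open the scan. Assumption: this call returns a scroll id
    # and no hits, as search_type=scan did in older ElasticSearch versions.
    resp = fetch(
        "POST",
        f"{base_url}/{index}/_search?search_type=scan"
        f"&scroll={keep_alive}&size={size}",
        json.dumps(query),
    )
    scroll_id = resp["_scroll_id"]
    # Step 2: keep calling scroll until a page comes back empty.
    while True:
        resp = fetch(
            "POST",
            f"{base_url}/_search/scroll?scroll={keep_alive}",
            scroll_id,
        )
        hits = resp["hits"]["hits"]
        if not hits:
            break
        yield from hits
        # Continue with the most recent scroll id if one was returned.
        scroll_id = resp.get("_scroll_id", scroll_id)
```

Because the function is a generator, a dump can stream hits straight to disk without holding the whole result set in memory. The fetch parameter exists only so the HTTP layer can be swapped out, for example in a test.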
If you install ESClient (with pip install esclient or easy_install esclient) you get these two scripts installed automatically. You can use them by simply entering esdump or esimport on the command line, and they will show you usage information.
As an example, suppose you have an index called ‘items’ and another called ‘customers’. You can back up both indexes to a bz2 file using:
esdump --url http://localhost:9200/ --indexes items customers --bzip2 --file items_customers.bz2
You can import this data using:
esimport --url http://localhost:9200 --file items_customers.bz2
Alternatively, you can import the data into another index, e.g. items_test, by using the --index option on esimport.
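The on-disk format that esdump writes is not described in this post. Purely as an illustration of how a bz2 dump can be written and read back in a streaming fashion, the sketch below assumes one JSON document per line. The functions write_dump and read_dump are hypothetical and are not part of ESClient.

```python
import bz2
import json


def write_dump(path, docs):
    """Write each document as one JSON line to a bz2-compressed file."""
    count = 0
    with bz2.open(path, "wt", encoding="utf-8") as out:
        for doc in docs:
            out.write(json.dumps(doc) + "\n")
            count += 1
    return count


def read_dump(path):
    """Yield the documents back one by one, skipping blank lines."""
    with bz2.open(path, "rt", encoding="utf-8") as src:
        for line in src:
            if line.strip():
                yield json.loads(line)
```

Both functions work on iterators, so neither the dump nor the import has to fit in memory at once.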
These two scripts currently support indexes that use the _parent and _routing fields. If you supplied a specific routing at index time, that will be restored too. The same holds true if you specified a parent/child relation.
Indexes in which you don’t store the _source field are not supported. Without _source there is no document body to dump, so you cannot back up such an index.
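As a rough sketch of what restoring routing and parent information involves, the function below turns one search hit into the two lines of a bulk index action. It is an illustration, not the code esimport uses. It assumes the hit carries _routing and _parent either at the top level or under a "fields" key (which one depends on the ElasticSearch version and on how the search was issued), and that the bulk action metadata accepts _routing and _parent, as it did in older versions. The name to_bulk_lines and the target_index parameter are hypothetical.

```python
import json


def to_bulk_lines(hit, target_index=None):
    """Turn one search hit into the two lines of a bulk 'index' action."""
    if "_source" not in hit:
        # Without _source there is nothing to re-index.
        raise ValueError("hit %r has no _source" % hit.get("_id"))
    meta = {
        "_index": target_index or hit["_index"],
        "_type": hit["_type"],
        "_id": hit["_id"],
    }
    # Assumption: routing/parent appear top-level or under "fields".
    fields = hit.get("fields", {})
    for key in ("_routing", "_parent"):
        value = hit.get(key, fields.get(key))
        if value is not None:
            meta[key] = value
    # Bulk format: one action line, then one source line.
    return json.dumps({"index": meta}) + "\n" + json.dumps(hit["_source"]) + "\n"
```

Passing target_index mirrors the idea of importing into a different index, such as items_test, while leaving the document id, routing and parent untouched.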
It is relatively simple to also back up the mapping of the data, so this is high on my priority list. I also want to check the cluster state before dumping the data, to ensure you are not backing up a cluster that is in a bad state (Yellow or Red).
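Such a pre-dump check could look roughly like the sketch below. It is not yet part of the scripts. It relies on the cluster health endpoint (GET /_cluster/health), whose JSON response contains a "status" field, and it treats anything other than green as unsafe. The function names cluster_status and safe_to_dump are made up for this example.

```python
import json
from urllib.request import urlopen


def cluster_status(base_url):
    """Return the status string reported by GET /_cluster/health."""
    with urlopen(base_url.rstrip("/") + "/_cluster/health") as resp:
        return json.loads(resp.read().decode("utf-8"))["status"]


def safe_to_dump(status):
    """Only a green cluster is considered safe to back up."""
    return status.strip().lower() == "green"
```

A dump script could call safe_to_dump(cluster_status("http://localhost:9200")) first and refuse to continue when it returns False.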
P.S.: from what I understand, ElasticSearch.org is working hard towards a 1.0 version which will offer backup and restore functionality out of the box!