
FailSafeWiki

A wiki where users don’t even notice when any one machine is unplugged.

FailSafeWiki – the idea that you might have redundancy between wikis: if one goes down, the next comes on-line.

Possibilities

master / slave architecture

One could implement it by having a “master” wiki that runs normally, and a “slave” wiki that no users talk to, but keeps its files synchronized with the master wiki.

When the master wiki goes offline for any reason, one of the “slave” wikis becomes the new master wiki, and users are redirected to that new master wiki.
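
A minimal sketch of the slave side of this idea (in Python, with hypothetical host names and a made-up /pages/<name> layout, not any particular wiki engine's API): the slave simply keeps pulling fresh copies of every page from the master so that it is ready to take over.

 # Minimal slave-side sketch: keep local copies of every page fresh.
 # The master URL and the /pages/<name> layout are made up for this example.
 import pathlib
 import urllib.request

 MASTER = "http://wiki-master.example.org"
 PAGE_DIR = pathlib.Path("slave-pages")

 def sync_page(name):
     """Copy one page's current text from the master to local storage."""
     text = urllib.request.urlopen(f"{MASTER}/pages/{name}").read()
     PAGE_DIR.mkdir(exist_ok=True)
     (PAGE_DIR / name).write_bytes(text)

 def sync_all(page_names):
     for name in page_names:
         sync_page(name)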

"master copy"

One could implement it this way:

For each *page* of the wiki, one particular machine keeps the “master copy” of that page, and all others keep a “slave copy” (is that the right term ?).

If the wiki was distributed over 10 machines, then any one machine holds the “master copy” of roughly 1/10 of all the pages in the wiki, and a “slave copy” of all the other pages.

If *any* machine goes offline, then about 1/10 of the pages become “locked” until the remaining machines come to a consensus about which machine(s) should become the new master for those pages.

Load is distributed as *readers* use whichever machine is geographically closest.

Load is distributed as *editors* are re-directed to whichever machine holds the master copy of the page that editor is interested in.
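
A rough sketch of how pages could be mapped to machines (the machine names are made up; a real system would likely use consistent hashing so that losing one machine only remaps that machine's share of the pages):

 # Rough sketch: pick each page's master machine by hashing its title.
 # The machine list is hypothetical.
 import hashlib

 MACHINES = ["wiki%d.example.org" % i for i in range(10)]

 def master_for(page_title, machines=MACHINES):
     """Machine holding the master copy of this page (editors are sent here)."""
     digest = hashlib.sha1(page_title.encode("utf-8")).digest()
     return machines[int.from_bytes(digest[:4], "big") % len(machines)]

 def slave_copies_for(page_title, machines=MACHINES, copies=2):
     """The next few machines in the list keep slave copies of the page."""
     start = machines.index(master_for(page_title, machines))
     return [machines[(start + k) % len(machines)] for k in range(1, copies + 1)]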

"like RAID 5"

Just like RAID 5, except instead of putting 4 (or more) hard drives in a single box, they’re in separate cities.

When a user edits a wiki page, both the “master” site for that page and the “parity” site are updated.

If the wiki were distributed over 10 machines, then any one machine holds the “master copy” of roughly 1/10 of all the pages on the wiki, plus “parity” blocks amounting to roughly another 1/90 of the pages’ worth of data, since each parity block covers a stripe of 9 pages. ( roughly 2/90 for RAID 6 ).
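
A toy illustration of the parity idea (plain XOR over equal-length page blocks; the stripe layout here is invented for the example, and real RAID works on fixed-size disk blocks rather than pages):

 # Toy RAID-5-style parity: XOR the page blocks in a stripe to get the
 # parity block; if one block is lost, XOR the survivors with the parity
 # block to get it back.  Blocks must be equal length.
 from functools import reduce

 def xor_blocks(blocks):
     """Byte-wise XOR of equal-length byte strings."""
     return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

 def recover(surviving_blocks, parity_block):
     """Rebuild the single missing block in a stripe."""
     return xor_blocks(surviving_blocks + [parity_block])

 pages = [b"aaaa", b"bbbb", b"cccc"]     # one stripe of page blocks
 parity = xor_blocks(pages)              # stored on the "parity" site
 assert recover([pages[0], pages[2]], parity) == pages[1]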

distributed-replication software

Use file-distribution software that has already been written (I assume they are “like RAID 5” in some ways).

Perhaps even take advantage of a pre-existing network -- then each node not only helps back up this wiki, but also other kinds of files … perhaps even a few other completely unrelated wikis.

order of succession

With any of the above implementations, rather than trying to come to a consensus on who the “new master” is after one master fails, choose the order of succession before the master fails. Let the master pick the next 2 or so successors arbitrarily, and inform its slaves.
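
A sketch of the succession idea (host names and the liveness check are placeholders): because the master publishes an ordered successor list ahead of time, the slaves never need a consensus round; they just promote the first successor that is still reachable.

 # Succession sketch: the order is agreed on *before* the master fails.
 # The hosts and the port-80 liveness probe are placeholders.
 import socket

 SUCCESSION = [
     "wiki-master.example.org",
     "wiki-slave-1.example.org",   # first in line
     "wiki-slave-2.example.org",   # second in line
 ]

 def is_alive(host, port=80, timeout=2.0):
     try:
         with socket.create_connection((host, port), timeout=timeout):
             return True
     except OSError:
         return False

 def acting_master(succession=SUCCESSION):
     """First host in the pre-agreed order that is still reachable."""
     for host in succession:
         if is_alive(host):
             return host
     return None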

Implementations

As of 2004-05-19, I don’t think this has been implemented on any wiki.

Wikipedia has implemented a big fraction of this feature.

So, does DoHyki qualify as a fail safe wiki ?

Does "Oddmuse page synchronization" qualify as a fail safe wiki?

status - wiki engines
Implemented -
Developing - "David's distributed wiki"
Intend to Develop -
Considering -
Rejected -

Activity

Visual:DavidCary wants to start a fail safe wiki. Help me!

MichaelSamuels is into this stuff. Thank you, very interesting. – DavidCary 2005-03-17

Terminology

“single-point failure” – when a single physical component fails.

For example, hard drives are *expected* to fail after 5 years of power-on time.

For example, the power cable *could* be unplugged.

(See Wiki:SinglePointOfFailure )

“fault-tolerant” – a system that keeps on working, even after a single-point failure.

“Cold rollover” – when something stops working after a single-point failure, but it resumes after a bit of repair work. For example, if I just finished backing up my hard drive onto tape, and then the hard drive fails, I haven’t lost any data (it’s all on tape). But that computer won’t do anything useful until I replace the hard drive and restore from backup.

“Hot rollover” – when something continues to work after a single-point failure. For example, some RAID disk arrays keep on working, never losing any files, even after a single hard drive fails. For example, if any of the 1000 (?) computers that Google uses fails/is unplugged, normal users never even notice.

“_???_ mode” – the system can tolerate any single-point failure.

“Limp mode” – the system continues to work after some single-point failures, but any further single-point failures could cause the whole system to fail.

“cold repair” – when a system that seems to be functioning normally (in Limp Mode) needs to be taken offline to restore it to _???_ mode. For example, many RAID systems continue to work (hot rollover) when a single hard drive fails. However, because they don’t use the more expensive hot-plug sockets, the system operator must shut everything down, turn off the power, plug in a fresh hard drive, then reboot the system.

“hot repair” – when a system in Limp Mode can be restored to _???_ mode, without users ever noticing.


Ideally, we want to design a system with both “hot rollover” *and* “hot repair”.

For example, if my computer had *2* power cables, I could move it from one room in my house to another, without ever turning it off, by (1) unplugging one cable (now it’s in Limp Mode), (2) using a long extension cable to connect that cable to a socket in an intermediate room (now it’s in _???_ mode), (3) unplugging the *other* cable (Limp Mode again), (4) wheeling the computer to the destination room, then (5) plugging it in at the destination room.

Problems

See Also

distributed-replication software

Perhaps a wiki could be built directly on top of one of these networks (taking advantage of one of these already-existing networks), or at least could re-use some of the software developed for these networks:

(moved from CommunityWiki:DistributedEditing )

(See also WikiPedia:Distributed_file_system )

NNTP (Usenet News)

Someone (who doesn’t know about wiki ?) (who doesn’t know about CVS ?) once said:

I think NNTP – the protocol used by Usenet newsgroups – is the way to go for discussion groups or any type of shared (non-personal) “mailbox.” Collaborative filtering works well in this distributed system. – http://deflexion.com/archiveb/2003_11_01_.php#106968751254380891

NNTP uses “cancel” messages to get rid of unwanted Usenet posts.

Could something that looks like Wiki be built on top of NNTP (rather than a version control system) ? Perhaps like this:
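
One very rough sketch, using Python's standard nntplib with a made-up server and newsgroup (this is only an illustration of the idea, not an existing wiki engine): every edit is posted as a new article whose Subject is the page title, and the newest article with that Subject is treated as the page's current text.

 # Rough sketch of a wiki page store on top of NNTP.  The server, the
 # newsgroup, and the article layout are all made up for the example.
 # (nntplib is deprecated in recent Python versions.)
 from nntplib import NNTP

 SERVER = "news.example.org"
 GROUP = "wiki.failsafe.pages"

 def save_page(title, text, author="anonymous"):
     """Post one revision of a page as a new article."""
     article = (
         "From: %s\r\nNewsgroups: %s\r\nSubject: %s\r\n\r\n%s\r\n"
         % (author, GROUP, title, text)
     ).encode("utf-8")
     with NNTP(SERVER) as server:
         server.post(article.splitlines(keepends=True))

 def load_page(title):
     """Return the newest posted revision of the page, or None."""
     with NNTP(SERVER) as server:
         _resp, _count, first, last, _name = server.group(GROUP)
         _resp, overviews = server.over((first, last))
         for number, fields in reversed(overviews):
             if fields.get("subject") == title:
                 _resp, info = server.article(number)
                 blank = info.lines.index(b"")          # headers end here
                 return b"\n".join(info.lines[blank + 1:]).decode("utf-8")
     return None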

For a single-machine wiki (non-fault-tolerant), this is more complicated than using Subversion or CVS.

But for a distributed wiki, this can take advantage of the NNTP network that already exists – it even allows people to add thread-mode comments using standard newsreader software.

BitTorrent

BitTorrent http://bitconjurer.org/BitTorrent/ has many similarities to NNTP; it is written in Python and designed to be scalable. Would it be crazy to build a wiki on top of BitTorrent ? (I hear rumors that there is a BitTorrent-on-Java implementation …)

Freenet

The “freenet” seems to have a lot of fault-tolerance, and there is a wiki *about* the freenet at FreenetWiki http://wikiuniverse.com/cgi-bin/freenet.pl

“Freenet is coded primarily in Java, with Freenet-using applications coded in various languages.”

Mnet

“Mnet is a distributed file store. Mnet … is formed by an emergent network of autonomous nodes which self-organize to make the network robust and efficient.”

“Mnet is coded primarily in Python, with modules in C and C++”

“Mnet uses something called erasure codes which allow us to break an encoded file into several parts, and then to recreate the file from a subset of those parts. (For example we might break an encoded file into 24 parts where you only need any 12 of those parts to recreate the original file.)”

– from the Mnet FAQ http://sourceforge.net/projects/mnet/ http://mnetproject.org/
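
The simplest possible illustration of an erasure code (a 2-of-3 scheme built from XOR; nothing like Mnet's actual codes, which are far more general): split the data into two halves and store the two halves plus their XOR, so that any two of the three parts recover the original.

 # Simplest erasure-code demo: 3 parts, any 2 of which rebuild the data.
 # This is only a toy; Mnet's real codes handle things like 12-of-24.
 def split_2_of_3(data):
     half = (len(data) + 1) // 2
     a = data[:half]
     b = data[half:].ljust(half, b"\0")          # pad to equal length
     parity = bytes(x ^ y for x, y in zip(a, b))
     return a, b, parity, len(data)

 def rebuild(a, b, parity, length):
     """Reconstruct the original when at most one of a, b, parity is None."""
     if a is None:
         a = bytes(x ^ y for x, y in zip(b, parity))
     elif b is None:
         b = bytes(x ^ y for x, y in zip(a, parity))
     return (a + b)[:length]

 a, b, p, n = split_2_of_3(b"hello, failsafe wiki")
 assert rebuild(None, b, p, n) == b"hello, failsafe wiki"
 assert rebuild(a, None, p, n) == b"hello, failsafe wiki"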

Mnet and Freenet

“Mnet and Freenet were originally conceived at about the same time, and without knowledge of one another. They have some similarities and many differences in their current implementation, in their architectures, and in their grand ambitions.”

“I encourage you to continue running both Mnet and Freenet”

– from the Mnet FAQ

MnesiaDatabase

See Wiki:MnesiaDatabase .

Global File System (GFS)

the Global File System http://sources.redhat.com/cluster/gfs/ , http://www.redhat.com/software/rha/gfs/

It creates what appears to be a single remote file system that many people can access (say, …/public_html/…), but (a) is fault tolerant – the system is distributed over many servers, with enough redundancy that any single server (any hard drive) can fail without any users noticing (no lost data). (b) distributes/balances the load over many servers.

It was developed for cluster computers.

Contributors

discussion

(EditHint: convert these random comments into DocumentMode.)

Would FaultTolerantWiki be a better name ?

DAV: I want a fault-tolerant wiki -- and I think the only way of doing that is to make it distributed (decentralized).

Creating remote backup copies of information, with hot fail-over, is difficult when we’re worried about (a) privacy – people reading the information, and (b) immutability – people changing the information. Since we hardly care about either of those on a wiki, that should make it much easier / simpler. A fault-tolerant wiki should be simpler to build than freenet.

“Smart designs are crafted to fail gracefully, to degrade rather than collapse, and to allow an emergency override when unanticipated stress threatens to drive things out of control. Good systems build in a tribunician veto, a panic button.” -- http://zhurnal.net/ww/zw?TheVeto

“With so many ‘textbook cases’ of single points of failure, you’d think that we’d stop building systems to demonstrate the concept.” – Matt Curtin

Visual:DavidCary


(AB told me that the battery in his UPS failed on his server. The server continued to run just fine, but he was planning on powering it down to replace the battery. A “hot-rollover, cold-repair” system.)

Dear AB,

Summary: David rambles on and on in a long rant about “fault tolerance” and “single-point failure”.

Details: I’ve been thinking recently about how to make computers fault-tolerant.

It appears that your system only loses power when *both* the battery goes bad *and* the power goes out. So it’s a lot better than most systems.

However, WIBNI (wouldn’t it be nice if…) you could replace that battery without powering-down anything ?

I’ve been thinking the RAID idea is brilliant, but it doesn’t go far enough. That array of cheap disks connects to a single computer, to a single motherboard, connected to wall power through a single UPS. Taking it one step further, many rack-mount systems have multiple hot-swappable power supplies connected via diodes to system power, so that if any one power supply (or any one diode) fails, the system continues to run while the failed item is replaced.

Instead of data replicated across multiple disks of one computer, I think data needs to be replicated across several as-independent-as-possible computers. Perhaps something like rsync through Ethernet. Data is distributed much like RAID, even though any one computer may only have a single hard drive.

I’m thinking about setting up a file-server made up of 2 or more sections, each with its own independent UPS, case, and Ethernet connection. If I unplug one section and stick it in a box, the rest of it continues to serve up files as if nothing had happened. If I ship that box anywhere in the world and plug it in, and it can re-connect through the Internet, it can re-sync its copy of the files. Then I can unplug any *other* section, and the rest continues to serve up files as if nothing had happened.

I really don’t even need to physically move the hardware – I can migrate by setting up fresh, blank white-boxes at the destination, with just enough software to connect to the original file server and sync files. As long as I set up the new boxes quickly enough to give them time to synchronize before each section of the old file server is unplugged, I never lose any files, no matter how many users are reading and writing files continuously during the migration.

If we can somehow automatically re-direct users to the proper IP addresses for each part of the file server (is that possible ?), the users may never even notice that the file server has migrated.

In fact, I don’t even want to put all the new boxes in the same building – I want them in different cities, so that when someone’s backhoe cuts the data lines coming out of this building, everyone is re-directed to a box in a different city.

Surely someone has already thought this all up before now. Why don’t I already have all my files automatically backed up to a remote location every time I hit “save” ?

http://savetz.com/cheappc/ http://CommunityWiki.org/DistributingWiki http://c2.com/cgi/wiki?WaitFreeSynchronization

-- David Cary

Visual:DavidCary


From: AB To: David Cary

There are systems with dual power supplies.

It’s all a matter of money. There are systems that work that way.

Google is 8000+ computers in a redundant array so that when one fails, they just replace it and everything keeps running…

– AB


In fact, I don’t even want to put all the new boxes in the same building – I want them in different cities, so that when someone’s backhoe cuts the data lines coming out of this building, everyone is re-directed to a box in a different city.

indeed. i know how to keep my wiki running on my home servers. but to really make wikis securely fault-tolerant (i like that terminology much better than “failsafe”) means they need to be hosted on different boxes, in different cities, in different countries, on different continents. (just in case of meteor strike, *grin*.)

Surely someone has already thought this all up before now. Why don’t I already have all my files automatically backed up to a remote location every time I hit “save” ?

i do. i use rsync for that, and it’s all automated. well, not every time i hit save, but i could actually do it that way if i wanted to. all the large software depositories have mirrors. the big difference is that changes on mirror systems all originate in one place, if i understand this correctly. for wikis, you might have to build a bi-directional mirror, unless you want to go with the master/slave idea. – CommunityWiki:piranha

I see. Now I need to add rsync to my cron. – DavidCary
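
A small sketch of the kind of mirror job cron could run (the host and paths are placeholders; a plain rsync line in a crontab entry would do just as well):

 # Mirror the wiki's data directory to a remote box; run this from cron.
 # "-a" preserves times and permissions, "-z" compresses, "--delete"
 # removes files on the mirror that no longer exist locally.
 import subprocess

 def mirror_wiki():
     subprocess.run(
         ["rsync", "-az", "--delete",
          "/var/www/wiki/",                         # local wiki data (placeholder)
          "backup@mirror.example.org:/srv/wiki/"],  # remote mirror (placeholder)
         check=True,
     )

 if __name__ == "__main__":
     mirror_wiki()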

Wikimedia/Wikipedia

Wikimedia’s structure is typically:

Work on database load balancing is going on and when complete that will allow the site to survive a query database server failure without people noticing (if we have sufficient capacity for the resulting load on the others!). With a little under a million edits in June 2004 and pretty frequent references to the site, people are starting to rely on us always being available and that plus the need to scale performance to handle the demand is driving several changes we’re making. In general, we’re trying to arrange to have enough of everything that people won’t notice a single failure anywhere.

The DNS server load balances the Squids with round-robin DNS. The Squids load balance the web servers (response-time based – they are asked how they are doing, and the fastest responses determine which server gets the request). Trying to keep up with load growth is always an issue, with load hotspots moving around.
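
A sketch of the response-time idea (made-up host names, not Wikimedia's actual code): probe each web server, and hand the next request to whichever answered fastest.

 # Response-time-based balancing sketch; the hosts are hypothetical.
 import time
 import urllib.request

 WEB_SERVERS = [
     "http://web1.example.org",
     "http://web2.example.org",
     "http://web3.example.org",
 ]

 def fastest_server(servers=WEB_SERVERS, timeout=2.0):
     """Return the server that currently answers a probe the fastest."""
     best, best_time = None, float("inf")
     for url in servers:
         start = time.monotonic()
         try:
             urllib.request.urlopen(url, timeout=timeout)
         except OSError:
             continue                     # skip servers that are down
         elapsed = time.monotonic() - start
         if elapsed < best_time:
             best, best_time = url, elapsed
     return best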

Longer term it’s likely that there will be at least one remote database slave for disaster protection. It’s also likely that there will be cache servers spread around the world. People outside the US are starting to get seriously interested in doing this.

Never forget the scope for human error. One way to protect against some forms is to have a slave which is regularly taken out of sync, backed up, then allowed to resync. If there’s a problem with human error, like deleting something critical, that gives you a backup to recover from. The slaves won’t help - they will replicate the deletion. Wikipedia is giving some thought to this possibility. Resources are likely to be what stops it, at least for a while.

other ideas

DVPC - Distributed Virtual Personal Computer by Jack Krupansky http://basetechnology.com/dvpc.htm extends this idea of “fault-tolerant distributed storage” to everything on a person’s hard drive. Data is stored in 3 or so other places on the network. The local hard drive is “merely” a cache. Then when someone’s laptop crashes or is stolen, he simply goes to another PC or buys a new laptop, and (after a bit of lag as things are restored from the network) everything works exactly the same as before.


//Kenosis and the World Free Web// by Eric Ries, Jan 8th 2005 “Kenosis is a fully-distributed peer-to-peer RPC system built on top of XMLRPC.” … “Our modified BitTorrent software is 100% backwards-compatible with existing BitTorrent clients via a DNS-based Kenosis bridge. People currently running BitTorrent trackers can safely upgrade to a Kenosis-enabled tracker, taking advantage of the built-in failover that Kenosis provides, without fear of breaking compatibility with any existing clients.”

http://kenosis.sourceforge.net/


“MUTE File Sharing is a peer-to-peer network that provides easy search-and-download functionality” http://mute-net.sourceforge.net/


“Planet Peer - The anonymous networking community” http://board.planetpeer.de/


“GNUnet is a framework for secure peer-to-peer networking that does not use any centralized or otherwise trusted services. A first service implemented on top of the networking layer allows anonymous censorship-resistant file-sharing.” http://gnunet.org/


Doesn’t Google do something like this for their cached copy of the internet? Triple redundancy or something?