
Addressing

How to name (address, find) objects in Nebulostore. The application layer cannot be aware of the physical location of a Nebulostore object, because the replicators of the object may change without the application layer's knowledge (Nebulostore can optimize the placement of replicas).

Thus, Nebulostore must use logical addresses to name objects; these addresses are translated to locations by a distributed directory service (a directory structure or a DHT).

NebuloKey - a logical address

A NebuloKey is similar to a complete path in a filesystem (e.g. /var/log/system.log):

( (appKey), (dirId1, entryId1), (dirId2, entryId2), ..., (dirIdN, entryIdN), [ objectId ] )

appKey maps an application-level key (a sequence of bytes) to the top-level Nebulostore directory dir1 (in fact, to a list of its replicas). entryId1 is an entry in dir1 that specifies the set of physical addresses of dir2 (a child directory). entryIdN specifies the set of physical addresses of the target object.

objectId can be given if known (caching), but is not compulsory.

dirIds are explicitly present in the key so that they can be cached by Nebulostore.
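
A minimal sketch of how a NebuloKey could be represented; the class and field names below are illustrative, not taken from the Nebulostore codebase:

{{{#!java
import java.util.List;
import java.util.Optional;

// Illustrative sketch only; names do not come from the Nebulostore codebase.
public final class NebuloKey {
  // Hypothetical entry type: one (dirId, entryId) step along the path.
  public record PathEntry(String dirId, String entryId) { }

  private final byte[] appKey;              // application-level key, resolved to the top-level directory
  private final List<PathEntry> path;       // (dirId1, entryId1), ..., (dirIdN, entryIdN)
  private final Optional<String> objectId;  // optional: known objectId (e.g. cached), not compulsory

  public NebuloKey(byte[] appKey, List<PathEntry> path, Optional<String> objectId) {
    this.appKey = appKey;
    this.path = List.copyOf(path);          // dirIds stay explicit so they can be cached
    this.objectId = objectId;
  }

  public byte[] appKey() { return appKey; }
  public List<PathEntry> path() { return path; }
  public Optional<String> objectId() { return objectId; }
}
}}}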

Physical addressing

An entry in the directory (describing a particular object) specifies:

  • the transport-level addresses of the object's replicators (IP and port, or a JXTA peer ID)
  • the object's objectId

An object can be downloaded by:

  1. connecting to one of the replicators using that replicator's transport-level address
  2. requesting the objectId from that replicator (a sketch follows below)
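
A rough sketch of this two-step fetch; the Transport and ReplicatorConnection interfaces are hypothetical stand-ins for the real transport layer:

{{{#!java
import java.util.List;

// Hypothetical interfaces; the real Nebulostore transport layer differs.
interface TransportAddress { }                      // IP:port or JXTA peer ID
interface ReplicatorConnection {
  byte[] fetchObject(String objectId);              // request the object by its objectId
}
interface Transport {
  ReplicatorConnection connect(TransportAddress address);
}

final class PhysicalFetch {
  // Directory entry for an object: replicator addresses + objectId.
  record DirectoryEntry(List<TransportAddress> replicators, String objectId) { }

  static byte[] download(Transport transport, DirectoryEntry entry) {
    for (TransportAddress address : entry.replicators()) {
      try {
        // 1. connect to one of the replicators
        ReplicatorConnection connection = transport.connect(address);
        // 2. request the objectId from that replicator
        return connection.fetchObject(entry.objectId());
      } catch (RuntimeException unavailable) {
        // try the next replicator if this one is down
      }
    }
    throw new IllegalStateException("no replicator available for " + entry.objectId());
  }
}
}}}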

Encryption problems with physical addressing

objectId can be encrypted (with broadcast encryption); this poses no problems.

Replicators of the object may change: a Nebulostore component (the Broker?) may move the object from one replicator to another (e.g. to optimize availability). Thus, if the list of replicators is encrypted, the Broker must be able to change it (decrypt it, create a new list, encrypt it again); hence the Broker must know the target users' IDs.

Alternative: the list of replicators is not encrypted. Is this a security issue? A well-behaved replicator should not list all the objects it is storing, and the objects themselves are encrypted.

Write bursts: when a replicator changes (replicas are moved from one physical location to another), many physical addresses must be changed as well. Thus, not only does each object need back-pointers to the directories that reference it; all those locations must also be updated.

Directories

A directory (NebuloDir) is an object holding a set of (encrypted) entries. Normally, an entry is a PhysicalAddress (HardLink) pointing to an object.

To avoid expensive fetches of very small files (e.g. comments on a wall), small data can be stored directly inside a directory entry (so-called InlineData).

An entry can also store a LogicalAddress (SoftLink); soft links are small and stored directly inside the directory entry.
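
A sketch of the three entry kinds as a small type hierarchy; HardLink, SoftLink and InlineData follow the names above, everything else is illustrative:

{{{#!java
import java.util.List;
import java.util.Map;

// Illustrative sketch of the NebuloDir entry kinds described above.
sealed interface DirEntry permits HardLink, SoftLink, InlineData { }

// HardLink: physical address of the object (replicator addresses + objectId).
record HardLink(List<String> replicatorAddresses, String objectId) implements DirEntry { }

// SoftLink: a logical address (NebuloKey); small, stored directly in the entry.
record SoftLink(String nebuloKey) implements DirEntry { }

// InlineData: very small payloads (e.g. wall comments) stored inside the entry itself.
record InlineData(byte[] data) implements DirEntry { }

// A NebuloDir is then just a (possibly encrypted) map from entryId to DirEntry.
record NebuloDir(Map<String, DirEntry> entries) { }
}}}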

CPT: an alternative to the physical addressing

CPT (Complicated Piotr's Tree): a multi-level structure similar to the (indirect) block addressing in an ext filesystem.

An entry in the directory (describing a particular object) specifies:

  • CPT address of the object (a sequence of bytes)
  • the object's objId

A CPT is a series of tables. Let's take a base-10 CPT address. The level-1 table maps the first digit of the CPT address to the replicators storing the level-2 table. Thus, for the CPT address 1234:

  1. we connect to a replicator of the root (level-1) table;
  2. we take the second entry (digit 1) of the level-1 table from the level-1 replicator; this entry specifies the transport-layer addresses of the replicators of the level-2 table;
  3. we connect to a level-2 replicator;
  4. we take the third entry (digit 2) of the level-2 table from the level-2 replicator;
  5. etc.;
  6. the last entry specifies the transport-layer addresses of the target object's replicators (see the sketch below).
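
A sketch of this digit-by-digit walk for a base-10 CPT; CptTable and Network are hypothetical helpers, not real Nebulostore types:

{{{#!java
import java.util.List;

// Illustrative walk over a base-10 CPT; all types here are hypothetical.
final class CptResolver {
  interface CptTable {
    // Entry i holds the transport addresses of the replicators of the next-level table
    // (or, at the last level, of the target object itself).
    List<String> replicatorsFor(int digit);
  }
  interface Network {
    CptTable fetchTable(List<String> replicatorAddresses);  // connect to any listed replicator
    byte[] fetchObject(List<String> replicatorAddresses, String objectId);
  }

  // Resolve e.g. cptAddress = "1234": root table -> entry 1 -> level-2 table -> entry 2 -> ...
  static byte[] resolve(Network net, List<String> rootReplicators, String cptAddress, String objectId) {
    List<String> current = rootReplicators;
    for (int i = 0; i < cptAddress.length(); i++) {
      CptTable table = net.fetchTable(current);
      int digit = Character.digit(cptAddress.charAt(i), 10);
      current = table.replicatorsFor(digit);
    }
    // after the last digit, "current" lists the replicators of the target object
    return net.fetchObject(current, objectId);
  }
}
}}}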

Pros & cons:

  • adds a number of lookups proportional to the key length (i.e., logarithmic in the size of the key space) to the object seek time
  • the CPT address can be encrypted; a Nebulostore Broker changes the entry in the last-level CPT table, not the directory
  • high-level tables are strongly replicated - it should be possible to locate a nearby replica
  • CPTs replace DHTs (with better replication!) - they're like hierarchical DHTs; there is no need for a separate DHT to map appKeys to Nebulostore keys
  • design complexity is high
  • the load on replicators holding high-level tables is high

CPT degrades into a DHT

A binary CPT: the root resolves the first bit of the key; the left child resolves keys 0x; the right child resolves keys 1x.

Assume the root has n replicas

load-balancing

let k be the total number of resolve requests

each child receives k/2 of the requests

thus each child must be replicated n/2 times so that its replicators have the same per-node load as the replicators of the root

by a similar argument, the 3rd level (the root's grandchildren) has n/4 replicators per node

there are 2 entries in the root; each entry is a list of n/2 replicators; in total n links are stored

but if the root stored direct links to the 3rd level, it would also store n links (4 entries of n/4 replicators each)
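
A small sketch of the arithmetic above: under the equal-load assumption, level i gets n / 2^(i-1) replicas and k / 2^(i-1) requests, so the per-replicator load stays constant at k/n (the numbers below are made up for illustration):

{{{#!java
// Sketch of the load-balancing argument above for a binary CPT:
// level i gets k / 2^(i-1) resolve requests, so equal per-replicator load
// requires n / 2^(i-1) replicas at level i (the per-replicator load stays k/n).
public final class BinaryCptBalance {
  public static void main(String[] args) {
    int n = 64;          // replicas of the root (made-up number for illustration)
    long k = 1_000_000;  // total resolve requests (made-up number)
    for (int level = 1; (n >> (level - 1)) >= 1; level++) {
      int replicas = n >> (level - 1);
      long requests = k >> (level - 1);
      System.out.printf("level %d: %d replicas, %d requests, %d requests per replica%n",
          level, replicas, requests, requests / replicas);
    }
  }
}
}}}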

availability

u - unavailability of a replicator (constant, the same for all replicators)

a resolve fails if:

  • the root resolve fails (no replicator of the root is available)
  • OR the root resolve succeeds but the second-level resolve fails
  • OR the root and second-level resolves succeed but the 3rd-level resolve fails
  • etc

let f_i be the probability of failure at the i-th level

optimization: minimize f_1 + (1-f_1) f_2 + (1-f_1)(1-f_2) f_3 + ... = 1 - (1-f_1)(1-f_2)(1-f_3)..., subject to a bound on the total number of replicas (with f_i = u^(n_i) when level i has n_i replicators)

no easy solution
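
For completeness, a sketch that evaluates the expression above numerically, assuming a level fails only when all of its n_i replicas are unavailable (so f_i = u^(n_i)); the replica counts and the value of u below are made-up examples:

{{{#!java
// Sketch: overall resolve failure probability for given per-level replica counts,
// assuming a level fails only when all of its replicas are unavailable (f_i = u^n_i).
public final class ResolveAvailability {
  static double failureProbability(double u, int[] replicasPerLevel) {
    double allLevelsUp = 1.0;
    for (int n : replicasPerLevel) {
      double levelFailure = Math.pow(u, n);      // f_i
      allLevelsUp *= (1.0 - levelFailure);       // (1-f_1)(1-f_2)...
    }
    // f_1 + (1-f_1)f_2 + (1-f_1)(1-f_2)f_3 + ... = 1 - prod(1 - f_i)
    return 1.0 - allLevelsUp;
  }

  public static void main(String[] args) {
    // Assumed numbers for illustration: u = 0.3, replica counts halving per level.
    System.out.println(failureProbability(0.3, new int[] {64, 32, 16, 8, 4}));
  }
}
}}}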

DHT2

LogicalAddress = (groupId, fileId)

groupId is mapped by the global DHT to a few replicators that take part in a second-level DHT. The second-level DHT maps fileId -> physicalAddress (or perhaps: fileId -> object directly?).

Resiliency / availability in the second-level DHT is managed by the Broker of the (real) owner of the file. The Broker can request some or all of its replicators to take part in the second-level DHT.

A node can take part in many second-level DHTs at the same time (because it has many replication agreements). Each such DHT is identified by its groupId.
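
A sketch of the two-level lookup described above; all interfaces here are hypothetical:

{{{#!java
import java.util.List;

// Hypothetical two-level lookup for LogicalAddress = (groupId, fileId).
final class Dht2Resolver {
  interface GlobalDht {
    List<String> lookupGroup(String groupId);        // -> replicators in the second-level DHT
  }
  interface GroupDht {
    String lookupPhysicalAddress(String fileId);     // fileId -> physicalAddress (indirect variant)
  }
  interface Network {
    GroupDht joinGroupDht(String groupId, List<String> replicators);
  }

  static String resolve(GlobalDht global, Network net, String groupId, String fileId) {
    List<String> groupReplicators = global.lookupGroup(groupId);   // first hop: global DHT
    GroupDht groupDht = net.joinGroupDht(groupId, groupReplicators);
    return groupDht.lookupPhysicalAddress(fileId);                 // second hop: group DHT
  }
}
}}}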

direct placement: second DHT: (fileId -> object)

  • smaller latency
  • less flexibility in managing replication space (but keys can be generated so that they're evenly distributed in the DHT)
  • if data grows, objects must be migrated between DHT nodes
    • at the beginning, the whole key space is managed by n nodes (each holding an exact copy)
    • after replication space is filled, should we start a new group? Or add new replicators to the existing group and migrate half of the data there?
  • probably, the second DHT degrades into just a list of replicas storing copies
  • so, there is no second DHT; the first DHT stores all replicas for the groupId

indirect placement: second DHT: (fileId -> physicalAddress)

  • higher latency (more redirections)
  • second DHT is independent of the data placement (assuming that files are bigger than metadata)

DHT + threshold hashing

LogicalAddress = (groupId, objectId)

The DHT maps groupId to a list of sets of replicators: {{{#!sh [ {R1, R2, R3}, {R1, R4, R5}, ... ] }}} (replicator R1 is in sets S1 and S2) and a list of thresholds for objectIds {{{#!sh [ t1, t2, ... ] }}} (the two lists have equal lengths). If t(i-1) < objectId <= t(i), the file is stored by replication set Si.

A replicator set corresponds to a set of replication agreements of equal size (e.g., we have 1GB agreements with R2, R3, R4, R5 and a 2GB agreement with R1; thus R1 is in two replicator sets).

The owner assigns objectIds in increasing order, but not necessarily sequentially. If the set of objectIds is sparse (e.g., t1 = 10000 and the assigned objectIds are {100, 200, 300, 400, ..., 10000}), we can later add new objects to a replication group.
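
A sketch of the lookup implied by the scheme above: find the first threshold t_i >= objectId and use replication set S_i (all names here are illustrative):

{{{#!java
import java.util.List;

// Sketch of the threshold lookup: objectId falls into set S_i when t_(i-1) < objectId <= t_i.
final class ThresholdHashing {
  record GroupMetadata(List<List<String>> replicatorSets, List<Long> thresholds) { }

  static List<String> replicatorSetFor(GroupMetadata meta, long objectId) {
    List<Long> t = meta.thresholds();
    // find the first threshold that is >= objectId (thresholds are sorted increasingly)
    int lo = 0, hi = t.size() - 1;
    while (lo < hi) {
      int mid = (lo + hi) / 2;
      if (t.get(mid) >= objectId) hi = mid; else lo = mid + 1;
    }
    if (t.get(lo) < objectId) {
      throw new IllegalArgumentException("objectId beyond the last threshold");
    }
    return meta.replicatorSets().get(lo);
  }
}
}}}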

An incrementing series of objectIds may leak a little metadata, but I don't think it's a big problem (it's quite possible that this information needs to leak for other reasons anyway, e.g., to allow conflict resolution on concurrent writes).

A replicator set becomes full if the data size becomes close to the size of the replication agreements.

replicator failures

If a replicator goes down, it is replaced by a new one that takes its content from the other replicators in the same replication set. The new replicator's address is put in the first DHT, so there are no costly updates.

modification of objects

Assume an object is stored by a replication set that is full (1GB agreement, 1GB data). If the object grows, we can:

  • create a new object and replace the old one with a symlink (groupId, objectId) pointing to the new one (see the sketch after this list). This results in internal fragmentation - unless we can put the new object in the old replication set, which is possible if there are free identifiers ( t(i-1) < objectId < t(i) ).
  • decrease the threshold a bit (i.e., move some items to later replication groups) to accommodate growth without changing the objectId. This might be costly, as changes cascade to later replication groups (S1 -> S2 -> S3 -> ...).
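
A sketch of the first option (a new object plus a soft link left in the old slot); the Store interface and the allocateInFreeSet helper are hypothetical:

{{{#!java
// Sketch of the first option: when a grown object no longer fits its (full) replication set,
// store a new object elsewhere and leave a soft link (groupId, objectId) in the old slot.
final class GrowingObjects {
  record LogicalAddress(String groupId, long objectId) { }

  sealed interface Stored permits Data, SymLink { }
  record Data(byte[] bytes) implements Stored { }
  record SymLink(LogicalAddress target) implements Stored { }

  interface Store {
    void put(LogicalAddress address, Stored value);
    LogicalAddress allocateInFreeSet(int sizeBytes);   // hypothetical allocator over non-full sets
  }

  static LogicalAddress replaceWithSymlink(Store store, LogicalAddress old, byte[] grownObject) {
    LogicalAddress fresh = store.allocateInFreeSet(grownObject.length);
    store.put(fresh, new Data(grownObject));
    store.put(old, new SymLink(fresh));                // readers of the old address get redirected
    return fresh;
  }
}
}}}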

ideas to keep in mind:

we're assuming/expecting that storage, not serving data, is the bottleneck

Serving data can be handled by BitTorrent-like swarms, with the original replicators acting as seeds & trackers.

Design issues:

  • how to apply BitTorrent so as to keep overhead at a reasonable level;
  • how to manage connections to swarms after a non-replicator completely downloads something (how long do you stay, and in which swarms);
  • how to extend the BitTorrent protocol to deal with a write during download (re-using already downloaded pieces that did not change); see the sketch after this list:
    • all the pieces could be versioned (versionId = a sequence of random numbers; a new version appends a number to the sequence).
    • The trackers keep versionIds along with each block, so each member of a swarm knows what the correct version should be.
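
A sketch of the piece-versioning idea from the last item; the types below are illustrative and not an actual BitTorrent protocol extension:

{{{#!java
import java.util.List;
import java.util.Map;

// Sketch of per-piece versioning; the exact BitTorrent extension is an open design question above.
final class VersionedPieces {
  // versionId = sequence of random numbers; each new write appends one number.
  record VersionId(List<Long> sequence) {
    VersionId bump(long random) {
      var next = new java.util.ArrayList<>(sequence);
      next.add(random);
      return new VersionId(List.copyOf(next));
    }
  }

  // Given the tracker's current versionIds and the versions we already downloaded,
  // return the indices of pieces that must be fetched again.
  static List<Integer> stalePieces(Map<Integer, VersionId> trackerView,
                                   Map<Integer, VersionId> downloaded) {
    return trackerView.entrySet().stream()
        .filter(e -> !e.getValue().equals(downloaded.get(e.getKey())))
        .map(Map.Entry::getKey)
        .toList();
  }
}
}}}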

files of different sizes are treated differently

classes of files based on log(size)

  • small files (<1MB)
  • mid-size files --- pictures ([1MB, 10MB])
  • big files (>10MB) --- videos

trade-off: hashing / fragmentation / flexibility in choosing a replica for a file
