StorageConfiguration - Cassandra Wiki

Cassandra storage configuration is described by the conf/storage-conf.xml file. As the syntax evolves with releases, this wiki page tries to document those changes using [New in X.Y: ....] lines.

AutoBootstrap

[New in 0.5:

Turn on to make new [non-seed] nodes automatically migrate the right data to themselves. (If no InitialToken is specified, they will pick one such that they will get half the range of the most-loaded node.) If a node starts up without bootstrapping, it will mark itself bootstrapped so that you can't subsequently accidentally bootstrap a node with data on it. (You can reset this by wiping your data and commitlog directories.)

Off by default so that new clusters and upgraders from 0.4 don't bootstrap immediately. You should turn this on when you start adding new nodes to a cluster that already has data on it. (If you are upgrading from 0.4, start your cluster with it off once before changing it to true. Otherwise, no data will be lost but you will incur a lot of unnecessary I/O before your cluster starts up.)

  false

]

Cluster Name

The name of this cluster. This is mainly used to prevent machines in one logical cluster from joining another.

Example:

Test Cluster

Authenticator

[New in 0.6:

Allows for pluggable authentication of users, which defines whether it is necessary to call the Thrift 'login' method, and which parameters are required to log in. The default 'AllowAllAuthenticator' does not require users to call 'login': any user can perform any operation. The other built-in option is 'SimpleAuthenticator', which requires users and passwords to be defined in property files, and for users to call login with a valid combination.

Example:

org.apache.cassandra.auth.SimpleAuthenticator

]
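
For orientation, the three settings described so far might sit at the top level of conf/storage-conf.xml roughly as sketched below. This is not copied from any shipped file: the Storage root element and the element names ClusterName, AutoBootstrap and Authenticator are assumed from the section headings and from 0.5/0.6-era configuration files, so check them against the file bundled with your release.

```xml
<!-- Sketch only: element names are assumptions, verify against your release. -->
<Storage>
  <!-- Prevents machines in one logical cluster from joining another. -->
  <ClusterName>Test Cluster</ClusterName>

  <!-- [0.5+] Leave off for a brand-new cluster; turn on when adding nodes
       to a cluster that already holds data. -->
  <AutoBootstrap>false</AutoBootstrap>

  <!-- [0.6+] AllowAllAuthenticator is the default; SimpleAuthenticator
       requires clients to call login with a valid user and password. -->
  <Authenticator>org.apache.cassandra.auth.SimpleAuthenticator</Authenticator>

  <!-- ... remaining settings ... -->
</Storage>
```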

Keyspaces and ColumnFamilies

Keyspaces and ColumnFamilies: A ColumnFamily is the Cassandra concept closest to a relational table. Keyspaces are separate groups of ColumnFamilies. Except in very unusual circumstances you will have one Keyspace per application.

There is an implicit keyspace named 'system' for Cassandra internals.

[New in 0.5:

The fraction of keys per sstable whose locations we keep in memory in "mostly LRU" order. (JUST the key locations, NOT any column values.) The amount of memory used by the default setting of 0.01 is comparable to the amount used by the internal per-sstable key index. Consider increasing this if you have fewer, wider rows. Set to 0 to disable entirely.

      0.01

]

[New in 0.6: EndPointSnitch, ReplicaPlacementStrategy and ReplicationFactor became configurable per keyspace. Prior to that they were global settings.]

EndPointSnitch

EndPointSnitch: Set this to a class that implements IEndPointSnitch, which determines whether two endpoints are in the same data center or on the same rack. Out of the box, Cassandra provides org.apache.cassandra.locator.EndPointSnitch.

org.apache.cassandra.locator.EndPointSnitch

Note: this class works on hosts' IP addresses only. There is no configuration parameter to tell Cassandra that a node is in rack R and in datacenter D. The current rules are based on two methods (see EndPointSnitch.java):

  • isOnSameRack: Look at the IP addresses of the two hosts and compare the 3rd octet. If they are the same, the hosts are in the same rack; otherwise they are in different racks.
  • isInSameDataCenter: Look at the IP addresses of the two hosts and compare the 2nd octet. If they are the same, the hosts are in the same datacenter; otherwise they are in different datacenters.

ReplicaPlacementStrategy and ReplicationFactor

Strategy: Setting this to the class that implements IReplicaPlacementStrategy will change the way the node picker works. Out of the box, Cassandra provides org.apache.cassandra.locator.RackUnawareStrategy and org.apache.cassandra.locator.RackAwareStrategy (place one replica in a different datacenter, and the others on different racks in the same one.)

org.apache.cassandra.locator.RackUnawareStrategy

Number of replicas of the data

1

Note that the replication factor (RF) is the total number of nodes onto which the data will be placed. So, a replication factor of 1 means that only 1 node will have the data. It does not mean that one other node will have the data.
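
The per-keyspace settings above could be combined as in the following sketch. The Keyspaces and Keyspace elements, the Name attribute and the KeysCachedFraction element name are assumptions based on 0.5/0.6-era files, and "Keyspace1" is a hypothetical keyspace. The IP addresses in the comment are invented purely to illustrate the octet rules.

```xml
<!-- Sketch only: element names are assumptions, "Keyspace1" is hypothetical. -->
<Keyspaces>
  <Keyspace Name="Keyspace1">
    <!-- [0.5+] Fraction of key locations per sstable kept in memory. -->
    <KeysCachedFraction>0.01</KeysCachedFraction>

    <!-- Default snitch, driven by IP octets. With hypothetical addresses:
         10.20.30.1 and 10.20.30.2: same datacenter (2nd octet 20), same rack (3rd octet 30)
         10.20.30.1 and 10.20.31.1: same datacenter, different racks
         10.20.30.1 and 10.21.30.1: different datacenters -->
    <EndPointSnitch>org.apache.cassandra.locator.EndPointSnitch</EndPointSnitch>

    <ReplicaPlacementStrategy>org.apache.cassandra.locator.RackUnawareStrategy</ReplicaPlacementStrategy>

    <!-- Total number of nodes holding each piece of data; 1 means no extra copy. -->
    <ReplicationFactor>1</ReplicationFactor>
  </Keyspace>
</Keyspaces>
```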

ColumnFamilies

The CompareWith attribute tells Cassandra how to sort the columns for slicing operations. The default is BytesType, which is a straightforward lexical comparison of the bytes in each column name. Other options are AsciiType, UTF8Type, LexicalUUIDType, TimeUUIDType, and LongType. You can also specify the fully-qualified class name of a class of your choice extending org.apache.cassandra.db.marshal.AbstractType.

  • SuperColumns have a similar CompareSubcolumnsWith attribute.

  • BytesType: Simple sort by byte value. No validation is performed.

  • AsciiType: Like BytesType, but validates that the input can be parsed as US-ASCII.

  • UTF8Type: A string encoded as UTF8

  • LongType: A 64bit long

  • LexicalUUIDType: A 128bit UUID, compared lexically (by byte value)

  • TimeUUIDType: a 128bit version 1 UUID, compared by timestamp

(To get the closest approximation to 0.3-style supercolumns, you would use CompareWith=UTF8Type CompareSubcolumnsWith=LongType.)

If FlushPeriodInMinutes is configured and positive, the column family will be flushed to disk with that period whether it is dirty or not. This is intended for lightly-used column families so that they do not prevent commitlog segments from being purged.

[New in 0.5: An optional Comment attribute may be used to attach additional human-readable information about the column family to its definition. ]
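
As an illustration of the attributes described in this section, a keyspace might declare its column families along these lines. The column family names are hypothetical, and the ColumnFamily element together with its Name and ColumnType attributes is assumed from 0.5/0.6-era files; FlushPeriodInMinutes is assumed to be a ColumnFamily attribute like Comment.

```xml
<!-- Sketch only: names are hypothetical, element and attribute names partly assumed. -->
<Keyspace Name="Keyspace1">
  <!-- Default ordering: plain byte comparison, no validation. -->
  <ColumnFamily Name="Standard1" CompareWith="BytesType"/>

  <!-- Columns ordered by the timestamp of their version 1 UUID names. -->
  <ColumnFamily Name="Events" CompareWith="TimeUUIDType"/>

  <!-- Lightly used column family: flushed every 60 minutes so that it does
       not keep commitlog segments from being purged. -->
  <ColumnFamily Name="Audit"
                CompareWith="UTF8Type"
                FlushPeriodInMinutes="60"
                Comment="Rarely written audit trail"/>

  <!-- Super column family: closest approximation to 0.3-style supercolumns. -->
  <ColumnFamily Name="Super1"
                ColumnType="Super"
                CompareWith="UTF8Type"
                CompareSubcolumnsWith="LongType"/>
</Keyspace>
```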

Partitioner

Partitioner: any IPartitioner may be used, including your own as long as it is on the classpath. Out of the box, Cassandra provides org.apache.cassandra.dht.RandomPartitioner, org.apache.cassandra.dht.OrderPreservingPartitioner, and org.apache.cassandra.dht.CollatingOrderPreservingPartitioner. (CollatingOPP collates according to EN,US rules, not naive byte ordering. Use this as an example if you need locale-aware collation.) Range queries require using an order-preserving partitioner.

Warning! Changing this parameter requires wiping your data directories, since the partitioner can modify the sstable on-disk format.

Example:

org.apache.cassandra.dht.RandomPartitioner

If you are using an order-preserving partitioner and you know your key distribution, you can specify the token for this node to use. (Keys are sent to the node with the "closest" token, so distributing your tokens equally along the key distribution space will spread keys evenly across your cluster.) This setting is only checked the first time a node is started.

This can also be useful with RandomPartitioner to force equal spacing of tokens around the hash space, especially for clusters with a small number of nodes.

Cassandra uses an MD5 hash internally to place keys on the ring with RandomPartitioner. So it makes sense to divide the hash space equally by the number of machines available using InitialToken (i.e., if there are 10 machines, each will handle 1/10th of the maximum hash value) and expect that the machines will get a reasonably equal load.

With OrderPreservingPartitioner the keys themselves are used to place on the ring. One of the potential drawbacks of this approach is that if rows are inserted with sequential keys, all the write load will go to the same node.
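
To make the equal-spacing idea concrete, here is a sketch for a hypothetical 4-node cluster using RandomPartitioner. It assumes the token space runs from 0 to 2^127, so that node i (counting from 0) of N nodes gets token i * 2^127 / N. The Partitioner element name is an assumption; InitialToken is the setting named above.

```xml
<!-- Sketch only: the Partitioner element name is an assumption. -->
<Partitioner>org.apache.cassandra.dht.RandomPartitioner</Partitioner>

<!-- Evenly spaced tokens for 4 nodes, assuming a token space of 0 .. 2^127:
       node 0: 0
       node 1: 42535295865117307932921825928971026432   (1 * 2^127 / 4)
       node 2: 85070591730234615865843651857942052864   (2 * 2^127 / 4)
       node 3: 127605887595351923798765477786913079296  (3 * 2^127 / 4)
     This file is for node 1. The value is only read the first time the node starts. -->
<InitialToken>42535295865117307932921825928971026432</InitialToken>
```

Each node gets its own copy of the file with its own token, which is one reason to keep node-specific settings in a separate fragment.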

Directories

Directories: Specify where Cassandra should store different data on disk. Keep the data disks and the CommitLog disks separate for best performance. See also "What kind of hardware should I use?"

/var/lib/cassandra/commitlog
/var/lib/cassandra/data
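
In the configuration file, the two locations would be declared along these lines. The element names CommitLogDirectory, DataFileDirectories and DataFileDirectory are assumptions recalled from 0.5/0.6-era files and should be verified against your release.

```xml
<!-- Sketch only: element names are assumptions. Ideally the commit log and
     the data files live on separate physical disks. -->
<CommitLogDirectory>/var/lib/cassandra/commitlog</CommitLogDirectory>
<DataFileDirectories>
  <DataFileDirectory>/var/lib/cassandra/data</DataFileDirectory>
</DataFileDirectories>
```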

Seeds

Addresses of hosts that are deemed contact points. Cassandra nodes use this list of hosts to find each other and learn the topology of the ring. You must change this if you are running multiple nodes!

127.0.0.1

Never use a node's own address as a seed if you are bootstrapping it by setting AutoBootstrap to true.
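
A sketch of what this means for a new node joining an existing cluster: the seed list names nodes that are already running, never the joining node itself. The addresses are hypothetical, and Seeds and Seed are assumed element names.

```xml
<!-- Sketch for a NEW node joining an existing cluster.
     Addresses are hypothetical; Seeds and Seed are assumed element names. -->
<AutoBootstrap>true</AutoBootstrap>
<Seeds>
  <!-- Existing nodes only: the new node's own address must not appear here. -->
  <Seed>192.168.1.10</Seed>
  <Seed>192.168.1.11</Seed>
</Seeds>
```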

Miscellaneous

Time to wait (in milliseconds) for a reply from other nodes before failing the command

5000

Size (in MB) to allow the commitlog to grow to before creating a new segment

128

Local hosts and ports

Address to bind to and tell other nodes to connect to. You _must_ change this if you want multiple nodes to be able to communicate!

Leaving it blank leaves it up to InetAddress.getLocalHost(). This will always do the Right Thing *if* the node is properly configured (hostname, name resolution, etc), and the Right Thing is to use the address associated with the hostname (it might not be). The ControlPort setting is deprecated in 0.6 and can be safely removed from configuration.

localhost
7000
7001

The address to bind the Thrift RPC service to. Unlike ListenAddress above, you *can* specify 0.0.0.0 here if you want Thrift to listen on all interfaces.

Leaving this blank has the same effect it does for ListenAddress (i.e. it will be based on the configured hostname of the node).

localhost
9160

Whether or not to use a framed transport for Thrift. If this option is set to true then you must also use a framed transport on the client side (framed and non-framed transports are not compatible).

false
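
Put together, the network settings for one node of a multi-node cluster might look like the sketch below. The address 192.168.1.12 is hypothetical. Only ListenAddress and ControlPort are named in the text; StoragePort, ThriftAddress, ThriftPort and ThriftFramedTransport are assumed element names, as is the pairing of 7000 with the storage port.

```xml
<!-- Sketch only: the address is hypothetical and most element names are assumptions. -->
<ListenAddress>192.168.1.12</ListenAddress>
<StoragePort>7000</StoragePort>
<!-- ControlPort (7001) is deprecated in 0.6 and can be left out. -->

<!-- Unlike ListenAddress, the Thrift address may be 0.0.0.0 (all interfaces). -->
<ThriftAddress>0.0.0.0</ThriftAddress>
<ThriftPort>9160</ThriftPort>

<!-- If true, every client must use a framed transport as well. -->
<ThriftFramedTransport>false</ThriftFramedTransport>
```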

Memory, Disk, and Performance

Access mode. mmapped i/o is substantially faster, but only practical on a 64bit machine (which notably does not include EC2 "small" instances) or relatively small datasets. "auto", the safe choice, will enable mmapping on a 64bit JVM. Other values are "mmap", "mmap_index_only" (which may allow you to get part of the benefits of mmap on a 32bit machine by mmapping only index files) and "standard". (The buffer size settings that follow only apply to standard, non-mmapped i/o.)

auto

Buffer size to use when performing contiguous column slices. Increase this to the size of the column slices you typically perform. (Name-based queries are performed with a buffer size of ColumnIndexSizeInKB.)

64

Buffer size to use when flushing memtables to disk. (Only one memtable is ever flushed at a time.) Increase (decrease) the index buffer size relative to the data buffer if you have few (many) columns per key. Bigger is only better _if_ your memtables get large enough to use the space. (Check in your data directory after your app has been running long enough.)

32
8

Add column indexes to a row after its contents reach this size. Increase if your column values are large, or if you have a very large number of columns. The competing concerns are that Cassandra has to deserialize this much of the row to read a single column, so you want it to be small (at least if you do many partial-row reads), but all the index data is read for each access, so you don't want to generate that wastefully either.

64

The maximum amount of data to store in memory per ColumnFamily before flushing to disk. Note: There is one memtable per column family, and this threshold is based solely on the amount of data stored, not actual heap memory usage (there is some overhead in indexing the columns). See also MemtableThresholds.

64

The maximum number of columns in millions to store in memory per ColumnFamily before flushing to disk. This is also a per-memtable setting. Use with MemtableSizeInMB to tune memory usage.

0.1

[New in 0.5:

The maximum time to leave a dirty memtable unflushed. (While any affected columnfamilies have unflushed data from a commit log segment, that segment cannot be deleted.) This needs to be large enough that it won't cause a flush storm of all your memtables flushing at once because none has hit the size or count thresholds yet. For production, a larger value such as 1440 is recommended.

  60

]

Unlike most systems, in Cassandra writes are faster than reads, so you can afford more of those in parallel. A good rule of thumb is 2 concurrent reads per processor core. Increase ConcurrentWrites to the number of clients writing at once if you enable CommitLogSync + CommitLogSyncDelay.

8
32
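
The memory, buffer and concurrency settings from this section could be written out as in the sketch below, using the default values listed above. Only ColumnIndexSizeInKB, MemtableSizeInMB and ConcurrentWrites are named in the text; the other element names, and the assignment of each value to an element, are assumptions based on 0.5-era files.

```xml
<!-- Sketch only: most element names and the value-to-element mapping are assumptions. -->
<DiskAccessMode>auto</DiskAccessMode>

<!-- The buffer sizes below only matter for standard (non-mmapped) i/o. -->
<SlicedBufferSizeInKB>64</SlicedBufferSizeInKB>
<FlushDataBufferSizeInMB>32</FlushDataBufferSizeInMB>
<FlushIndexBufferSizeInMB>8</FlushIndexBufferSizeInMB>

<ColumnIndexSizeInKB>64</ColumnIndexSizeInKB>

<!-- Per-memtable thresholds: whichever is reached first triggers a flush. -->
<MemtableSizeInMB>64</MemtableSizeInMB>
<MemtableObjectCountInMillions>0.1</MemtableObjectCountInMillions>

<!-- [0.5+] 60 is the default; 1440 is suggested for production. -->
<MemtableFlushAfterMinutes>60</MemtableFlushAfterMinutes>

<!-- Rule of thumb: 2 concurrent reads per core, so 8 suits a 4-core machine. -->
<ConcurrentReads>8</ConcurrentReads>
<ConcurrentWrites>32</ConcurrentWrites>
```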

CommitLogSync may be either "periodic" or "batch." When in batch mode, Cassandra won't ack writes until the commit log has been fsynced to disk. It will wait up to CommitLogSyncBatchWindowInMS milliseconds for other writes before performing the sync.

This is less necessary in Cassandra than in traditional databases since replication reduces the odds of losing data from a failure after writing the log entry but before it actually reaches the disk. So the other option is "periodic," where writes may be acked immediately and the CommitLog is simply synced every CommitLogSyncPeriodInMS milliseconds.

periodic

Interval at which to perform syncs of the CommitLog in periodic mode. Usually the default of 1000ms is fine; increase it only if the CommitLog PendingTasks backlog in jmx shows that you are frequently scheduling a second sync while the first has not yet been processed.

1000

Delay (in milliseconds) during which additional commit log entries may be written before fsync in batch mode. This will increase latency slightly, but can vastly improve throughput where there are many writers. Set to zero to disable (each entry will be synced individually). Reasonable values range from a minimal 0.1 to 10 or even more if throughput matters more than latency.
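
The two commit log modes can be contrasted in a short sketch. The element names are the setting names used in the text; the batch window of 1 ms is a hypothetical value picked from the 0.1 to 10 range mentioned above.

```xml
<!-- Option A: periodic. Writes are acked immediately and the commit log
     is synced every CommitLogSyncPeriodInMS milliseconds. -->
<CommitLogSync>periodic</CommitLogSync>
<CommitLogSyncPeriodInMS>1000</CommitLogSyncPeriodInMS>

<!-- Option B: batch. Writes are acked only after the fsync; entries arriving
     within the window share one sync. Swap in these lines to use it:
<CommitLogSync>batch</CommitLogSync>
<CommitLogSyncBatchWindowInMS>1</CommitLogSyncBatchWindowInMS>
-->
```

Only one of the two modes should be active at a time, which is why the batch variant is shown commented out.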

Time to wait before garbage-collecting deletion markers. Set this to a large enough value that you are confident that the deletion marker will be propagated to all replicas by the time this many seconds has elapsed, even in the face of hardware failures. The default value is ten days.

864000

Number of threads to run when flushing memtables to disk. Set this to the number of disks you physically have in your machine allocated for DataDirectory * 2. If you are planning to use the Binary Memtable, it's recommended to increase the max threads to maintain a higher quality of service while under load when normal memtables are flushing to disk.

1
1

The threshold size in megabytes the binary memtable must grow to, before it's submitted for flushing to disk.

256
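
A sketch of these last three settings follows. All element names here (GCGraceSeconds, FlushMinThreads, FlushMaxThreads, BinaryMemtableSizeInMB) are assumptions recalled from 0.5-era files, as is the reading of the two flush values as a minimum and a maximum thread count. The machine with two data disks is hypothetical.

```xml
<!-- Sketch only: all element names are assumptions, verify against your release. -->

<!-- 864000 seconds = 10 days * 24 hours * 3600 seconds. -->
<GCGraceSeconds>864000</GCGraceSeconds>

<!-- Hypothetical machine with 2 data disks: max threads = 2 disks * 2 = 4. -->
<FlushMinThreads>1</FlushMinThreads>
<FlushMaxThreads>4</FlushMaxThreads>

<!-- Size in MB at which a binary memtable is submitted for flushing. -->
<BinaryMemtableSizeInMB>256</BinaryMemtableSizeInMB>
```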

Including Configuration Fragments

It's common that a Cassandra configuration will be shared among many machines but needs to be slightly tuned on each one (directories are different, memory available is less, etc.). You can include an XML fragment with this syntax.

...]>
...
&seeds;
&directories;
&network;
&tuning;
...

And then the external files are simply what you'd specify inline, for example directories.xml. Note these fragments are not valid XML alone.

  /var/lib/cassandra/commitlog
  /var/lib/cassandra/data
  /var/lib/cassandra/staging
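
The mechanism behind this is the standard XML external entity declaration. The sketch below shows one plausible shape for the main file; apart from directories.xml, the fragment file names are hypothetical, and the Storage root element name is an assumption.

```xml
<?xml version="1.0"?>
<!-- Sketch only: fragment file names other than directories.xml are hypothetical. -->
<!DOCTYPE Storage [
  <!ENTITY seeds       SYSTEM "seeds.xml">
  <!ENTITY directories SYSTEM "directories.xml">
  <!ENTITY network     SYSTEM "network.xml">
  <!ENTITY tuning      SYSTEM "tuning.xml">
]>
<Storage>
  <!-- ... settings shared by every machine ... -->
  &seeds;
  &directories;
  &network;
  &tuning;
  <!-- ... settings shared by every machine ... -->
</Storage>
```

Each fragment file holds only the elements that would otherwise appear inline at the point of the entity reference, which is why a fragment has no single root element and is not a valid XML document by itself.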

StorageConfiguration (last edited 2010-07-19 15:34:33 by DaveViner)