It is well known that HBase is a NoSQL store with excellent write performance. In this post we dig into the source code to see how a put operation is handled once it reaches a region.
The overall flow of handling a put request inside a region is as follows (a minimal client-side example is given right after the list):
1) Check the region state and try to acquire the read lock of the region's read-write lock: 1. for read operations, check that the region is readable; 2. check whether the region is in RIT state, in which case read, split, and merge operations are not allowed; 3. acquire the read lock of the region's read-write lock; 4. if coprocessors are attached to the region, hand the operation to them.
2) Check the memstore size: if the total memstore size of this region exceeds blockingMemStoreSize (hbase.hregion.memstore.block.multiplier * hbase.hregion.memstore.flush.size), a memstore flush is requested and a RegionTooBusyException "Above memstore limit" is thrown. This check is skipped if the put targets the region holding meta.
3) Sanity-check the data in the put: 1. check that every column family in the put is defined in the table, otherwise a NoSuchColumnFamilyException is returned; 2. check that the put's timestamp is not greater than hbase.hregion.keyvalue.timestamp.slop.millisecs + current time, otherwise a FailedSanityCheckException "Timestamp for KV out of range" is thrown; 3. check that the put's rowkey falls within the region's [startRowKey, endRowKey) range.
4) Acquire row locks. Acquiring the first row lock of a batch blocks until hbase.rowlock.wait.duration times out; for subsequent rows the attempt returns immediately if the lock is unavailable, which shortens the hold time of the row locks already acquired.
5) Update the timestamps of the cells to be written to the current time.
6) Acquire the read lock of the region's updatesLock.
7) Build the WAL edit and append it to the HLog. HBase relies on the WAL mechanism for durability: the log is written before the cache, so even after a crash the original data can be restored by replaying the HLog. This step wraps the data into a WALEdit object and appends it to the HLog sequentially; no sync is performed at this point.
8) Write the data to the memstore. Each column family in HBase corresponds to a store that holds that family's data, and each store has a write cache, the memstore. HBase does not write data straight to disk; it first writes to this cache and flushes to disk once the cache reaches a certain size.
9) Release the read lock of the region's updatesLock and the row locks.
10) Sync the WAL, i.e. actually persist the HLog to HDFS. Doing this after releasing the locks shortens lock hold time and improves write performance. If the sync fails, a rollback removes the data already written to the memstore.
11) Advance the MVCC. Only after this step can read operations (get, scan) see the newly written data.
12) Check the memstore size: if it exceeds hbase.hregion.memstore.flush.size, a flush is requested.
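Before diving into the server-side source, here is a minimal client-side example that triggers the flow above. It uses the HBase 1.x client API; the table name test_table and column family cf are made-up examples:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PutExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("test_table"))) {
      Put put = new Put(Bytes.toBytes("row1"));
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("v1"));
      // SYNC_WAL is the default; SKIP_WAL would bypass step 7 entirely.
      put.setDurability(Durability.SYNC_WAL);
      table.put(put); // on the server this ends up in HRegion.batchMutate
    }
  }
}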
The main source code follows, with notes added inline.
OperationStatus[] batchMutate(BatchOperationInProgress<?> batchOp):
OperationStatus[] batchMutate(BatchOperationInProgress<?> batchOp) throws IOException {
boolean initialized = false;
Operation op = batchOp.isInReplay() ? Operation.REPLAY_BATCH_MUTATE : Operation.BATCH_MUTATE;
// 1. For read operations, check that the region is readable.
// 2. Check whether the region is in RIT state; if so, read, split and merge operations are not allowed.
// 3. Acquire the read lock of the region's read-write lock.
// 4. If coprocessors are attached to the region, hand the operation to them.
startRegionOperation(op);
try {
while (!batchOp.isDone()) {
if (!batchOp.isInReplay()) {
checkReadOnly();
}
//Check whether the memstore size exceeds blockingMemStoreSize; the region holding meta is not checked.
checkResources();
if (!initialized) {
this.writeRequestsCount.add(batchOp.operations.length);
//For non-replay operations, invoke the pre hooks of each operation's coprocessors first.
if (!batchOp.isInReplay()) {
doPreMutationHook(batchOp);
}
initialized = true;
}
// Process the batch; the main logic lives in this method.
doMiniBatchMutation(batchOp);
long newSize = this.getMemstoreSize();
//If the memstore size exceeds hbase.hregion.memstore.flush.size, request a flush.
if (isFlushSize(newSize)) {
requestFlush();
}
}
} finally {
//1. Release the read lock of the region's read-write lock.
//2. If coprocessors are attached to the region, hand the operation to them.
closeRegionOperation(op);
}
return batchOp.retCodeDetails;
}
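Note that batchMutate loops until batchOp.isDone(): each doMiniBatchMutation call only handles the rows whose locks it managed to acquire, and the loop comes back for the rest. A batch typically originates from a client-side multi-put; continuing the client example above (rows and values are made up; add java.util.ArrayList and java.util.List imports):

List<Put> puts = new ArrayList<>();
for (int i = 0; i < 100; i++) {
  Put p = new Put(Bytes.toBytes("row-" + i));
  p.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("v-" + i));
  puts.add(p);
}
// Puts destined for the same region are grouped together and arrive
// at that region's batchMutate as one batch.
table.put(puts);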
private long doMiniBatchMutation(BatchOperationInProgress<?> batchOp):
private long doMiniBatchMutation(BatchOperationInProgress<?> batchOp) throws IOException {
boolean isInReplay = batchOp.isInReplay();
// variable to note if all Put items are for the same CF -- metrics related
//Tracks whether all the batched Put operations touch the same cf; if so, the batch of puts can be reported as a multiput (metrics related).
boolean putsCfSetConsistent = true;
//The set of columnFamilies first seen for Put.
Set<byte[]> putsCfSet = null;
// variable to note if all Delete items are for the same CF -- metrics related
boolean deletesCfSetConsistent = true;
//The set of columnFamilies first seen for Delete.
Set<byte[]> deletesCfSet = null;
long currentNonceGroup = HConstants.NO_NONCE, currentNonce = HConstants.NO_NONCE;
WALEdit walEdit = new WALEdit(isInReplay);
MultiVersionConcurrencyControl.WriteEntry writeEntry = null;
long txid = 0;
boolean doRollBackMemstore = false;
boolean locked = false;
/** Keep track of the locks we hold so we can release them in finally clause */
//Records the acquired row locks so they can be released later.
List<RowLock> acquiredRowLocks = Lists.newArrayListWithCapacity(batchOp.operations.length);
// reference family maps directly so coprocessors can mutate them if desired
Map<byte[], List<Cell>>[] familyMaps = new Map[batchOp.operations.length];
// We try to set up a batch in the range [firstIndex,lastIndexExclusive)
int firstIndex = batchOp.nextIndexToProcess;
int lastIndexExclusive = firstIndex;
RowLock prevRowLock = null;
boolean success = false;
int noOfPuts = 0, noOfDeletes = 0;
WALKey walKey = null;
long mvccNum = 0;
long addedSize = 0;
try {
// ------------------------------------
// STEP 1. Try to acquire as many locks as we can, and ensure
// we acquire at least one.
// ----------------------------------
int numReadyToWrite = 0;
long now = EnvironmentEdgeManager.currentTime();
while (lastIndexExclusive < batchOp.operations.length) {
Mutation mutation = batchOp.getMutation(lastIndexExclusive);
boolean isPutMutation = mutation instanceof Put;
Map<byte[], List<Cell>> familyMap = mutation.getFamilyCellMap();
// store the family map reference to allow for mutations
familyMaps[lastIndexExclusive] = familyMap;
// skip anything that "ran" already
if (batchOp.retCodeDetails[lastIndexExclusive].getOperationStatusCode()
!= OperationStatusCode.NOT_RUN) {
lastIndexExclusive++;
continue;
}
try {
if (isPutMutation) {
// Check the families in the put. If bad, skip this one.
//For replay data, remove any cells whose cf does not exist in the htableDescriptor.
if (isInReplay) {
removeNonExistentColumnFamilyForReplay(familyMap);
} else { //For non-replay data, check that each cf to be written is defined in the htableDescriptor; otherwise NoSuchColumnFamilyException is returned.
checkFamilies(familyMap.keySet());
}
//Check whether any keyvalue timestamp exceeds hbase.hregion.keyvalue.timestamp.slop.millisecs + current time; if so, a FailedSanityCheckException "Timestamp for KV out of range" is thrown.
checkTimestamps(mutation.getFamilyCellMap(), now);
} else {
//Delete operation:
// for a whole-row delete, the cfs defined in the htableDescriptor are set on the Delete object so that the entire row is deleted;
// otherwise, check that the cfs in the Delete are defined in the htableDescriptor.
prepareDelete((Delete) mutation);
}
//Check that the rowkey of the operation falls within this region's [startRowKey, endRowKey) range.
checkRow(mutation.getRow(), "doMiniBatchMutation");
} catch (NoSuchColumnFamilyException nscf) {
LOG.warn("No such column family in batch mutation", nscf);
batchOp.retCodeDetails[lastIndexExclusive] = new OperationStatus(
OperationStatusCode.BAD_FAMILY, nscf.getMessage());
lastIndexExclusive++;
continue;
} catch (FailedSanityCheckException fsce) {
LOG.warn("Batch Mutation did not pass sanity check", fsce);
batchOp.retCodeDetails[lastIndexExclusive] = new OperationStatus(
OperationStatusCode.SANITY_CHECK_FAILURE, fsce.getMessage());
lastIndexExclusive++;
continue;
} catch (WrongRegionException we) {
LOG.warn("Batch mutation had a row that does not belong to this region", we);
batchOp.retCodeDetails[lastIndexExclusive] = new OperationStatus(
OperationStatusCode.SANITY_CHECK_FAILURE, we.getMessage());
OperationStatusCode.SANITY_CHECK_FAILURE, we.getMessage());
lastIndexExclusive++;
continue;
}
// HBASE-18233
// If we haven't got any rows in our batch, we should block to
// get the next one's read lock. We need at least one row to mutate.
// If we have got rows, do not block when lock is not available,
// so that we can fail fast and go on with the rows with locks in
// the batch. By doing this, we can reduce contention and prevent
// possible deadlocks.
// The unfinished rows in the batch will be detected in batchMutate,
// and it will try to finish them by calling doMiniBatchMutation again.
// Acquire the row lock.
// If no row lock has been acquired yet in this mini-batch, block on the lock until hbase.rowlock.wait.duration times out.
// If some row locks are already held, try this row's lock without blocking and return immediately on failure; this shortens the hold time of the locks already acquired and helps avoid deadlocks.
// On failure the while loop is exited: this mini-batch only processes the rows whose locks were acquired, and the remaining operations are handled in the next mini-batch.
boolean shouldBlock = numReadyToWrite == 0;
RowLock rowLock = null;
try {
//Acquire the row lock.
rowLock = getRowLockInternal(mutation.getRow(), true, shouldBlock, prevRowLock);
} catch (IOException ioe) {
LOG.warn("Failed getting lock in batch put, row="
+ Bytes.toStringBinary(mutation.getRow()), ioe);
}
//If the row lock could not be acquired, exit the while loop.
if (rowLock == null) {
// We failed to grab another lock. Stop acquiring more rows for this
// batch and go on with the gotten ones
break;
} else {
//If this lock differs from the previous row lock, add it to the set of acquired row locks so it can be released later.
if (rowLock != prevRowLock) {
// It is a different row now, add this to the acquiredRowLocks and
// set prevRowLock to the new returned rowLock
acquiredRowLocks.add(rowLock);
prevRowLock = rowLock;
}
}
lastIndexExclusive++;
numReadyToWrite++;
if (isPutMutation) {
// If Column Families stay consistent through out all of the
// individual puts then metrics can be reported as a multiput across
// column families in the first put.
if (putsCfSet == null) {
putsCfSet = mutation.getFamilyCellMap().keySet();
} else {
putsCfSetConsistent = putsCfSetConsistent
&& mutation.getFamilyCellMap().keySet().equals(putsCfSet);
}
} else {
if (deletesCfSet == null) {
deletesCfSet = mutation.getFamilyCellMap().keySet();
} else {
deletesCfSetConsistent = deletesCfSetConsistent
&& mutation.getFamilyCellMap().keySet().equals(deletesCfSet);
}
}
} //end of the while loop
// we should record the timestamp only after we have acquired the rowLock,
// otherwise, newer puts/deletes are not guaranteed to have a newer timestamp
// The current time is taken after the row locks are acquired and used as the timestamp of the cells.
now = EnvironmentEdgeManager.currentTime();
byte[] byteNow = Bytes.toBytes(now);
// Nothing to put/delete -- an exception in the above such as NoSuchColumnFamily?
//If no row lock was acquired at all, return 0.
//Possible optimization: this check could be moved before fetching the current time, saving one currentTime() call when no row lock was acquired; every little bit helps.
if (numReadyToWrite <= 0) return 0L;
// We've now grabbed as many mutations off the list as we can
// ------------------------------------
// STEP 2. Update any LATEST_TIMESTAMP timestamps
// ----------------------------------
for (int i = firstIndex; !isInReplay && i < lastIndexExclusive; i++) {
// skip invalid
if (batchOp.retCodeDetails[i].getOperationStatusCode()
!= OperationStatusCode.NOT_RUN) continue;
Mutation mutation = batchOp.getMutation(i);
if (mutation instanceof Put) {
//Update the timestamps of the Put's cells to byteNow.
updateCellTimestamps(familyMaps[i].values(), byteNow);
noOfPuts++;
} else {
prepareDeleteTimestamps(mutation, familyMaps[i], byteNow);
noOfDeletes++;
}
//Rewrite the cell tags (e.g. the TTL tag).
rewriteCellTags(familyMaps[i], mutation);
}
//Acquire the read lock of the region's updatesLock.
lock(this.updatesLock.readLock(), numReadyToWrite);
locked = true;
// calling the pre CP hook for batch mutation
if (!isInReplay && coprocessorHost != null) {
MiniBatchOperationInProgress<Mutation> miniBatchOp =
new MiniBatchOperationInProgress<Mutation>(batchOp.getMutationsForCoprocs(),
batchOp.retCodeDetails, batchOp.walEditsFromCoprocessors, firstIndex, lastIndexExclusive);
if (coprocessorHost.preBatchMutate(miniBatchOp)) return 0L;
}
// ------------------------------------
// STEP 3. Build WAL edit
// ----------------------------------
Durability durability = Durability.USE_DEFAULT;
for (int i = firstIndex; i < lastIndexExclusive; i++) {
// Skip puts that were determined to be invalid during preprocessing
if (batchOp.retCodeDetails[i].getOperationStatusCode() != OperationStatusCode.NOT_RUN) {
continue;
}
Mutation m = batchOp.getMutation(i);
//Get the effective WAL durability level: SKIP_WAL, ASYNC_WAL, SYNC_WAL, FSYNC_WAL, etc.
Durability tmpDur = getEffectiveDurability(m.getDurability());
if (tmpDur.ordinal() > durability.ordinal()) {
durability = tmpDur;
}
if (tmpDur == Durability.SKIP_WAL) {
recordMutationWithoutWal(m.getFamilyCellMap());
continue;
}
long nonceGroup = batchOp.getNonceGroup(i), nonce = batchOp.getNonce(i);
// In replay, the batch may contain multiple nonces. If so, write WALEdit for each.
// Given how nonces are originally written, these should be contiguous.
// They don't have to be, it will still work, just write more WALEdits than needed.
if (nonceGroup != currentNonceGroup || nonce != currentNonce) {
if (walEdit.size() > 0) {
assert isInReplay;
if (!isInReplay) {
throw new IOException("Multiple nonces per batch and not in replay");
}
// txid should always increase, so having the one from the last call is ok.
// we use HLogKey here instead of WALKey directly to support legacy coprocessors.
walKey = new ReplayHLogKey(this.getRegionInfo().getEncodedNameAsBytes(),
this.htableDescriptor.getTableName(), now, m.getClusterIds(),
currentNonceGroup, currentNonce, mvcc);
txid = this.wal.append(this.htableDescriptor, this.getRegionInfo(), walKey,
walEdit, true);
walEdit = new WALEdit(isInReplay);
walKey = null;
}
currentNonceGroup = nonceGroup;
currentNonce = nonce;
}
// Add WAL edits by CP
WALEdit fromCP = batchOp.walEditsFromCoprocessors[i];
if (fromCP != null) {
for (Cell cell : fromCP.getCells()) {
walEdit.add(cell);
}
}
addFamilyMapToWALEdit(familyMaps[i], walEdit);
}
// -------------------------
// STEP 4. Append the final edit to WAL. Do not sync wal.
// -------------------------
Mutation mutation = batchOp.getMutation(firstIndex);
if (isInReplay) {
// use wal key from the original
walKey = new ReplayHLogKey(this.getRegionInfo().getEncodedNameAsBytes(),
this.htableDescriptor.getTableName(), WALKey.NO_SEQUENCE_ID, now,
mutation.getClusterIds(), currentNonceGroup, currentNonce, mvcc);
long replaySeqId = batchOp.getReplaySequenceId();
walKey.setOrigLogSeqNum(replaySeqId);
}
if (walEdit.size() > 0) {
if (!isInReplay) {
// we use HLogKey here instead of WALKey directly to support legacy coprocessors.
walKey = new HLogKey(this.getRegionInfo().getEncodedNameAsBytes(),
this.htableDescriptor.getTableName(), WALKey.NO_SEQUENCE_ID, now,
mutation.getClusterIds(), currentNonceGroup, currentNonce, mvcc);
}
//Put the WALEdit into the RingBufferTruck buffer (no sync yet) and obtain the txid.
txid = this.wal.append(this.htableDescriptor, this.getRegionInfo(), walKey, walEdit, true);
}
// ------------------------------------
// Acquire the latest mvcc number
// ----------------------------------
// Obtain the latest mvcc write number.
if (walKey == null) {
// If this is a skip wal operation just get the read point from mvcc
walKey = this.appendEmptyEdit(this.wal);
}
if (!isInReplay) {
writeEntry = walKey.getWriteEntry();
mvccNum = writeEntry.getWriteNumber();
} else {
mvccNum = batchOp.getReplaySequenceId();
}
// ------------------------------------
// STEP 5. Write back to memstore
// Write to memstore. It is ok to write to memstore
// first without syncing the WAL because we do not roll
// forward the memstore MVCC. The MVCC will be moved up when
// the complete operation is done. These changes are not yet
// visible to scanners till we update the MVCC. The MVCC is
// moved only when the sync is complete.
// ----------------------------------
for (int i = firstIndex; i < lastIndexExclusive; i++) {
if (batchOp.retCodeDetails[i].getOperationStatusCode()
!= OperationStatusCode.NOT_RUN) {
continue;
}
doRollBackMemstore = true; // If we have a failure, we need to clean what we wrote
addedSize += applyFamilyMapToMemstore(familyMaps[i], mvccNum, isInReplay);
}
// -------------------------------
// STEP 6. Release row locks, etc.
// -------------------------------
//Release the locks.
if (locked) {
this.updatesLock.readLock().unlock();
locked = false;
}
releaseRowLocks(acquiredRowLocks);
// -------------------------
// STEP 7. Sync wal.
// -------------------------
// If the sync fails, an IOException is thrown and control jumps straight to the finally block, which rolls back the data already written to the memstore.
if (txid != 0) {
syncOrDefer(txid, durability);
}
doRollBackMemstore = false;
// calling the post CP hook for batch mutation
if (!isInReplay && coprocessorHost != null) {
MiniBatchOperationInProgress<Mutation> miniBatchOp =
new MiniBatchOperationInProgress<Mutation>(batchOp.getMutationsForCoprocs(),
batchOp.retCodeDetails, batchOp.walEditsFromCoprocessors, firstIndex, lastIndexExclusive);
coprocessorHost.postBatchMutate(miniBatchOp);
}
// ------------------------------------------------------------------
// STEP 8. Advance mvcc. This will make this put visible to scanners and getters.
// Only after this step succeeds can the written data be seen by get and scan operations.
// ------------------------------------------------------------------
if (writeEntry != null) {
mvcc.completeAndWait(writeEntry);
writeEntry = null;
} else if (isInReplay) {
// ensure that the sequence id of the region is at least as big as orig log seq id
mvcc.advanceTo(mvccNum);
}
for (int i = firstIndex; i < lastIndexExclusive; i++) {
if (batchOp.retCodeDetails[i] == OperationStatus.NOT_RUN) {
batchOp.retCodeDetails[i] = OperationStatus.SUCCESS;
}
}
// ------------------------------------
// STEP 9. Run coprocessor post hooks. This should be done after the wal is
// synced so that the coprocessor contract is adhered to.
// ------------------------------------
if (!isInReplay && coprocessorHost != null) {
for (int i = firstIndex; i < lastIndexExclusive; i++) {
// only for successful puts
if (batchOp.retCodeDetails[i].getOperationStatusCode()
!= OperationStatusCode.SUCCESS) {
continue;
}
Mutation m = batchOp.getMutation(i);
if (m instanceof Put) {
coprocessorHost.postPut((Put) m, walEdit, m.getDurability());
} else {
coprocessorHost.postDelete((Delete) m, walEdit, m.getDurability());
}
}
}
success = true;
return addedSize;
} finally {
// If the wal sync was unsuccessful, remove the keys already written to the memstore.
if (doRollBackMemstore) {
for (int j = 0; j < familyMaps.length; j++) {
for (List<Cell> cells : familyMaps[j].values()) {
rollbackMemstore(cells);
}
}
if (writeEntry != null) mvcc.complete(writeEntry);
} else {
this.addAndGetGlobalMemstoreSize(addedSize);
if (writeEntry != null) {
mvcc.completeAndWait(writeEntry);
}
}
if (locked) {
this.updatesLock.readLock().unlock();
}
releaseRowLocks(acquiredRowLocks);
// See if the column families were consistent through the whole thing.
// if they were then keep them. If they were not then pass a null.
// null will be treated as unknown.
// Total time taken might be involving Puts and Deletes.
// Split the time for puts and deletes based on the total number of Puts and Deletes.
if (noOfPuts > 0) {
// There were some Puts in the batch.
if (this.metricsRegion != null) {
this.metricsRegion.updatePut();
}
}
if (noOfDeletes > 0) {
// There were some Deletes in the batch.
if (this.metricsRegion != null) {
this.metricsRegion.updateDelete();
}
}
if (!success) {
for (int i = firstIndex; i < lastIndexExclusive; i++) {
if (batchOp.retCodeDetails[i].getOperationStatusCode() == OperationStatusCode.NOT_RUN) {
batchOp.retCodeDetails[i] = OperationStatus.FAILURE;
}
}
}
if (coprocessorHost != null && !batchOp.isInReplay()) {
// call the coprocessor hook to do any finalization steps
// after the put is done
MiniBatchOperationInProgress<Mutation> miniBatchOp =
new MiniBatchOperationInProgress<Mutation>(batchOp.getMutationsForCoprocs(),
batchOp.retCodeDetails, batchOp.walEditsFromCoprocessors, firstIndex,
lastIndexExclusive);
coprocessorHost.postBatchMutateIndispensably(miniBatchOp, success);
}
batchOp.nextIndexToProcess = lastIndexExclusive;
}
}
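Finally, the tunables that appeared throughout this walkthrough are ordinary configuration keys, normally set in hbase-site.xml on the region servers. Here is a sketch of the knobs with what I understand to be the HBase 1.x defaults (verify them against your version):

Configuration conf = HBaseConfiguration.create();
// A memstore is flushed once it reaches this size (default 128 MB).
conf.setLong("hbase.hregion.memstore.flush.size", 128L * 1024 * 1024);
// Writes are rejected with RegionTooBusyException once the region's memstores
// reach multiplier * flush.size (default multiplier 4).
conf.setInt("hbase.hregion.memstore.block.multiplier", 4);
// How long to wait for a row lock before giving up (default 30000 ms).
conf.setInt("hbase.rowlock.wait.duration", 30000);
// How far ahead of the server clock a cell timestamp may be
// (default Long.MAX_VALUE, i.e. the check is effectively disabled).
conf.setLong("hbase.hregion.keyvalue.timestamp.slop.millisecs", Long.MAX_VALUE);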
That's all for today. If anything is unclear, it is surely because I did not explain it well enough, so any questions and suggestions for improvement are welcome.