amazon s3 - How to speed up building a JSON tree using DynamoDB and S3 in Python?


I have a set of JSON messages. Each has a unique id field. Messages can be parents of other messages, e.g.

{"id":"idx", ..., "parents":["ida", "idb", "idc"]} 

Each message has other fields that are not relevant to the question.

Each message is stored in a file on AWS S3. I have a DynamoDB table that uses id as its key and contains info on the corresponding message stored on S3, including the path to the file and the byte offsets of the message, to enable a direct read.

Given id0, I need to generate the full tree by replacing the id of each parent with the entire contents of that message, recursively. The example message would be expanded to this:

{"id":"idx", ..., "parentcontents":[{"id":"ida", ...}, {"id":"idb", ...}, {"id":"idc", ...}]} 

The parents of id{abc} are expanded the same way, and so on. At the end I end up with a single tree-shaped JSON message starting with id0 as the root.

The current algorithm (in Python-like pseudocode):

def get_message(id):
    # 1. Get location info for id from DynamoDB
    # 2. Get the message for id (mjson) using the location info for a direct read from the S3 file
    # 3. Parse the JSON and find the list of parent ids (parentids)
    # 4. Recursively expand the parents:
    for pid in parentids:
        pjson = get_message(pid)
        # 4.1. Insert pjson into mjson
    # 5. Return the full tree for id
    return mjson

So I build the tree in depth-first fashion. This implementation is slow for large trees, and I need to speed it up.

I have not done detailed profiling of the code, but I suspect that one bottleneck is the single-item reads from DynamoDB at step #4. I played a bit with batch_get_item but noticed that, unlike get_item, it does not throw an exception when one of the keys in the list does not exist in the table. Is there a different batch-read method that would complain about not-found keys?
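One way to get the "complain about not-found keys" behaviour is to compare the requested ids with the ids that actually came back. The sketch below is only an illustration under assumptions: `batch_get_strict` and its parameters are hypothetical names, and `batch_get` stands for any callable that behaves like boto3's resource-level `batch_get_item` (taking `RequestItems` and returning `Responses` and `UnprocessedKeys`). The boto version used in the question may expose a different interface.

```python
def batch_get_strict(batch_get, table_name, ids, key_name="id", chunk_size=100):
    """Fetch items for all ids; raise KeyError if any id is absent.

    batch_get is assumed to behave like boto3's resource-level
    batch_get_item (hypothetical wiring, not the author's code).
    """
    found = {}
    ids = list(dict.fromkeys(ids))  # de-duplicate while keeping order
    # BatchGetItem accepts at most 100 keys per request, so work in chunks.
    for start in range(0, len(ids), chunk_size):
        chunk = ids[start:start + chunk_size]
        request = {table_name: {"Keys": [{key_name: i} for i in chunk]}}
        # Keys the service could not process are returned for a retry.
        # A real client would normally add a backoff between retries.
        while request:
            response = batch_get(RequestItems=request)
            for item in response.get("Responses", {}).get(table_name, []):
                found[item[key_name]] = item
            request = response.get("UnprocessedKeys") or {}
    # Missing keys are silently skipped by the service, so check explicitly.
    missing = [i for i in ids if i not in found]
    if missing:
        raise KeyError(f"ids not found in {table_name}: {missing}")
    return found
```

With boto3 this could presumably be called as `batch_get_strict(boto3.resource("dynamodb").batch_get_item, "messages", ids)`, where the table name `"messages"` is made up for the example.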

I am thinking of using batch_get_item or similar and refactoring the code to read all parent records for a given id from DynamoDB at once. Do you think this might speed things up a lot?
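A minimal sketch of that refactoring, assuming the tree is acyclic: fetch the messages one level at a time, so each level costs one batched lookup instead of one lookup per message, and assemble the tree afterwards. `build_tree` and `fetch_many` are hypothetical names; `fetch_many(ids)` is assumed to return a dict mapping each id to its parsed message, however the DynamoDB and S3 reads are done behind it.

```python
def build_tree(root_id, fetch_many):
    """Expand root_id into a full tree using one batched fetch per level."""
    messages = {}          # id -> parsed message, as returned by fetch_many
    seen = {root_id}       # ids already fetched or queued
    frontier = [root_id]
    while frontier:
        messages.update(fetch_many(frontier))
        next_ids = []
        for mid in frontier:
            for pid in messages[mid].get("parents", []):
                if pid not in seen:   # fetch shared parents only once
                    seen.add(pid)
                    next_ids.append(pid)
        frontier = next_ids

    def expand(mid):
        # Copy every field except "parents", then inline the parents.
        node = {k: v for k, v in messages[mid].items() if k != "parents"}
        node["parentcontents"] = [
            expand(pid) for pid in messages[mid].get("parents", [])
        ]
        return node

    return expand(root_id)
```

Whether this helps a lot depends on the shape of the tree: wide, shallow trees collapse many round trips into a few, while a long chain of single parents still needs one round trip per level.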

Is there a way to batch direct reads from multiple S3 files? Currently, I am using the following block for a single read:

key = s3bkt.lookup(infile)
line = key.get_contents_as_string(headers={"range":"bytes="+str(fromoffset)+"-"+str(tooffset)})
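As far as I know, S3 has no single request that reads byte ranges from several objects at once, so the usual substitute is to issue the ranged reads concurrently. The following is a sketch under that assumption, not a batch API: `read_many`, `range_header` and `read_range` are hypothetical names, and `read_range(path, header)` stands for whatever performs one ranged read, such as the two-line block above.

```python
from concurrent.futures import ThreadPoolExecutor


def range_header(from_offset, to_offset):
    """Build an HTTP Range header value; both offsets are inclusive."""
    return f"bytes={from_offset}-{to_offset}"


def read_many(read_range, locations, max_workers=16):
    """Run many ranged reads in parallel threads.

    locations maps a message id to (path, from_offset, to_offset).
    read_range(path, header) is a caller-supplied function (an assumption
    of this sketch) that performs one ranged read and returns the bytes.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            mid: pool.submit(read_range, path, range_header(start, end))
            for mid, (path, start, end) in locations.items()
        }
        # result() re-raises any exception from the worker thread.
        return {mid: fut.result() for mid, fut in futures.items()}
```

Threads suit this case because the work is dominated by network waits rather than CPU time; `max_workers=16` is an arbitrary starting point to tune.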

Also, do you suspect any other bottlenecks?
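Since no detailed profiling has been done yet, a quick way to find where the time actually goes is the standard-library profiler. This is a small sketch: `profile_call` is a hypothetical helper, and it would wrap a call such as `get_message(id0)`.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    # Sort by cumulative time so slow call chains appear at the top.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buf.getvalue()
```

If the report shows most of the cumulative time inside the network calls, batching or parallelising the reads is likely to pay off; if JSON parsing or tree assembly dominates, the effort belongs there instead.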

Thanks!

