I have a set of JSON messages. Each has a unique id field. Messages can be parents of other messages, e.g.

    {"id":"idx", ..., "parents":["ida", "idb", "idc"]}

Each message has other fields that are not relevant to the question.
Each message is stored in a file on AWS S3. I have a DynamoDB table that uses the id as its key and contains info on the corresponding message stored on S3, including the path of the file and the byte offsets of the message, to enable a direct read.
Given id0, I need to generate the full tree by replacing the id of each parent with the entire contents of that message, recursively. The example above would be expanded to this:

    {"id":"idx", ..., "parentcontents":[{"id":"ida", ...}, {"id":"idb", ...}, {"id":"idc", ...}]}

The parents of id{a,b,c} are expanded the same way, and so on. At the end I end up with a single tree-shaped JSON message with id0 as the root.
The current algorithm (in Python-like pseudocode):

    def get_message(id):
        # 1. Get the location info for id from DynamoDB
        # 2. Get the message for id (mjson) using the location info and a direct read of the S3 file
        # 3. Parse the JSON and find the list of parent ids (parentids)
        # 4. Recursively expand the parents:
        for pid in parentids:
            pjson = get_message(pid)
            # 4.1. Insert pjson into mjson
        # 5. Return the full tree for id
        return mjson

So I build the tree in a depth-first fashion. The implementation is slow for large trees, and I need to speed it up.
I have not done detailed profiling of the code, but I suspect that one bottleneck is the single-item reads from DynamoDB at step #4. I played a bit with batch_get_item and noticed that, unlike get_item, it does not throw an exception when one of the keys in the list does not exist in the table. Is there a different batch-read method that would complain about not-found keys?
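One possible workaround for the missing-key behaviour is to compare the requested ids against the ids that actually came back. This is only a minimal sketch, assuming the batch response has already been reduced to a list of dicts that each carry an "id" field; `check_all_found` and `MissingKeysError` are hypothetical names, not part of boto.

```python
class MissingKeysError(KeyError):
    """Raised when a batch read returns fewer items than were requested."""


def check_all_found(requested_ids, items, key_name="id"):
    """Index the returned items by id, raising if any requested id is absent.

    requested_ids: ids that were passed to the batch read.
    items: list of dicts returned by the batch read (assumed shape).
    """
    by_id = {item[key_name]: item for item in items}
    missing = [rid for rid in requested_ids if rid not in by_id]
    if missing:
        # Mimic the single-item behaviour: fail loudly on not-found keys.
        raise MissingKeysError(missing)
    return by_id
```

Note that a batch read can also hand back unprocessed keys when it is throttled, so those would need to be retried before an id is treated as genuinely missing.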
I am thinking of using batch_get_item or similar and refactoring the code to read all the parent records of a given id from DynamoDB at once. Do you think this might speed things up a lot?
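To make the refactoring idea concrete, here is a rough sketch that fetches the tree one level at a time instead of one message at a time. It assumes the message graph has no cycles, and `fetch_many` is a hypothetical callable (for example, one batched DynamoDB lookup followed by the S3 reads) that returns `{id: message_dict}` for a list of ids. It is breadth-first, so it is not the depth-first algorithm above, and leaf messages end up with an empty "parentcontents" list.

```python
def build_tree(root_id, fetch_many):
    """Expand the tree level by level, fetching each message only once."""
    messages = {}          # id -> parsed message
    seen = {root_id}       # ids already scheduled for fetching
    frontier = [root_id]

    while frontier:
        fetched = fetch_many(frontier)   # one batched call per tree level
        messages.update(fetched)
        next_ids = []
        for msg in fetched.values():
            for pid in msg.get("parents", []):
                if pid not in seen:
                    seen.add(pid)
                    next_ids.append(pid)
        frontier = next_ids

    def expand(mid):
        # Copy so the cached message is not mutated when a parent is shared.
        msg = dict(messages[mid])
        parent_ids = msg.pop("parents", [])
        msg["parentcontents"] = [expand(pid) for pid in parent_ids]
        return msg

    return expand(root_id)
```

With this shape the number of round trips grows with the depth of the tree rather than with the number of messages. BatchGetItem accepts at most 100 keys per call, so `fetch_many` would have to chunk larger levels.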
Is there a way to batch direct reads of multiple S3 files? Currently, I am using the following block for a single read:

    key = s3bkt.lookup(infile)
    line = key.get_contents_as_string(headers={"range":"bytes="+str(fromoffset)+"-"+str(tooffset)})

Also, do you suspect any other bottlenecks?
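I am not aware of an S3 request that returns byte ranges from several objects at once, so one alternative is to issue the single ranged reads concurrently. The following is a sketch under that assumption; `read_many`, `range_header` and the `read_one` callable are hypothetical names, with `read_one(path, headers)` standing in for the lookup and get_contents_as_string pair shown above.

```python
from concurrent.futures import ThreadPoolExecutor


def range_header(from_offset, to_offset):
    """Build an HTTP Range header; both offsets are inclusive."""
    return {"Range": "bytes=%d-%d" % (from_offset, to_offset)}


def read_many(locations, read_one, max_workers=16):
    """Run read_one(path, headers) concurrently for every location.

    locations: list of (path, from_offset, to_offset) tuples.
    read_one: hypothetical callable that performs one ranged read.
    Results are returned in the same order as locations.
    """
    def task(location):
        path, start, end = location
        return read_one(path, range_header(start, end))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(task, locations))
```

Threads fit here because the work is network-bound. Whether a single bucket or connection object can safely be shared between threads is something to verify for the client library in use.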
Thanks!