Java - Tracking unique identifiers over a large dataset


I have a standalone Java application that operates on a large number of elements read from an input file, each element being associated with an identifier. For each element, I do the following (among other things, of course):

  • Check that the element has not already been processed, using its identifier.
  • Map the element to a grid using a statistical method, each cell of the grid being responsible for tracking the unique elements assigned to it, along with some properties calculated on each element.

The number of elements might be quite large (several million), and so might the grid itself. Each cell is created on the fly once an element has been assigned to it, to avoid storing empty cells.
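To make the setup concrete, here is a minimal sketch of the kind of in-memory bookkeeping described above. It is not the actual application code: the class `GridTracker`, the methods `process` and `cellKey`, the single tracked property, and the fixed-size binning that stands in for the statistical mapping are all hypothetical and chosen only for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: in-memory duplicate check plus a sparse grid. */
class GridTracker {

    /** Per-cell state: identifiers assigned to the cell and a running sum of one property. */
    static final class Cell {
        final Set<Long> ids = new HashSet<>();
        double propertySum = 0.0;
    }

    // Assumption: fixed-size binning stands in for the real statistical method.
    private final double cellSize;
    // Identifiers of all elements processed so far.
    private final Set<Long> seen = new HashSet<>();
    // Sparse grid: only cells that received at least one element are stored.
    private final Map<Long, Cell> grid = new HashMap<>();

    GridTracker(double cellSize) {
        this.cellSize = cellSize;
    }

    /** Packs two int cell coordinates into one long map key. */
    static long cellKey(int col, int row) {
        return ((long) col << 32) | (row & 0xFFFFFFFFL);
    }

    /**
     * Processes one element. Returns false if the identifier was already seen,
     * true if the element was new and has been assigned to a cell.
     */
    boolean process(long id, double x, double y, double property) {
        // Set.add returns false when the identifier is already present.
        if (!seen.add(id)) {
            return false;
        }
        int col = (int) Math.floor(x / cellSize);
        int row = (int) Math.floor(y / cellSize);
        // The cell is created on the fly the first time an element lands in it.
        Cell cell = grid.computeIfAbsent(cellKey(col, row), k -> new Cell());
        cell.ids.add(id);
        cell.propertySum += property;
        return true;
    }

    int cellCount() {
        return grid.size();
    }

    int elementCount() {
        return seen.size();
    }
}
```

Calling `process(id, x, y, property)` once per input record covers both steps from the list. Note that `HashSet<Long>` and `HashMap<Long, Cell>` box every key and allocate an entry object per element, so with several million elements this structure keeps a large number of small objects on the heap.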

The question is: with such a large amount of data, memory issues naturally arise. What would be the best strategy to process a large amount of data while avoiding memory issues?

I have a couple of things in mind, but I'd like to know if anyone has had this kind of problem and, if so, could share their experience:

  • An embedded lightweight SQL database
  • Caching solutions such as Ehcache or Apache JCS
  • NoSQL key-value stores such as Cassandra

Any thoughts?

