algorithm - Efficient search in a corpus -


I have a few million words that I want to find in a billion-word corpus. What is an efficient way to do this?

I'm thinking of a trie, but is there an open-source implementation of a trie available?

Thank you

> Update

I should give some more information that is really necessary.

We have a system where we crawl news sources and collect the popular words based on frequency, in the format below:

  word1 <TAB> frequency1
  word2 <TAB> frequency2

(tab delimited, one word per line)

We have also obtained the most popular words (about 1 billion) from another source, which contains data in the same format.

This is what I would like to get as output:

  • Words common to both sources, with their frequencies
  • Words present only in our source but not in the reference source
  • Words present only in the reference source but not in our source

I can use comm (the Bash command) to get the above information, but only for the words alone. I do not know how to use comm to compare on just one column rather than on whole lines.

The system must be scalable, and we want to run it on a daily basis and compare the results. I would also like to get approximate matches.

So, I'm thinking of writing a MapReduce job. I am planning to write the map and reduce tasks as below, but I have a few questions.

  Map: for each word, output key = term and value = a structure {filename, frequency}.
  Reduce: for each key, iterate through all the values and check whether both file1 and file2 are present. If so, write the word to the appropriate output file; if it is only in file1, write it to the file1-only file; if it is only in file2, write it to the file2-only file.

I have two questions. In the map, I can give a directory containing the two files as input, but I do not know how to get the name of the file from which I am reading the word. How do I get this information? Also, how do I write to different output files, since the reduce phase automatically writes only to the default files named part-xxxxx?
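For reference, here is a minimal sketch of how these two questions are usually handled with the org.apache.hadoop.mapreduce API: the mapper can read the current file name from the FileSplit in setup(), and the reducer can use MultipleOutputs to write to named files in addition to the default part-xxxxx files. The class names and the file names ourSource.txt / referenceSource.txt below are made-up placeholders, not anything from the original setup.

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    // Mapper: emits key = word, value = "filename<TAB>frequency".
    public class WordSourceMapper extends Mapper<LongWritable, Text, Text, Text> {
        private String fileName;

        @Override
        protected void setup(Context context) {
            // Name of the file this map task is currently reading from.
            fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
        }

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split("\t");   // word <TAB> frequency
            if (parts.length == 2) {
                context.write(new Text(parts[0]), new Text(fileName + "\t" + parts[1]));
            }
        }
    }

    // Reducer: uses MultipleOutputs to write to named files instead of only part-xxxxx.
    class WordCompareReducer extends Reducer<Text, Text, Text, Text> {
        private MultipleOutputs<Text, Text> outputs;

        @Override
        protected void setup(Context context) {
            outputs = new MultipleOutputs<Text, Text>(context);
        }

        @Override
        protected void reduce(Text word, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            boolean inOurs = false, inReference = false;
            StringBuilder details = new StringBuilder();
            for (Text value : values) {
                String[] parts = value.toString().split("\t");
                if (parts[0].equals("ourSource.txt")) inOurs = true;             // placeholder file name
                if (parts[0].equals("referenceSource.txt")) inReference = true;  // placeholder file name
                details.append(value.toString()).append(' ');
            }
            // The named outputs must be registered in the driver, e.g.
            // MultipleOutputs.addNamedOutput(job, "common", TextOutputFormat.class, Text.class, Text.class).
            if (inOurs && inReference) {
                outputs.write("common", word, new Text(details.toString()));
            } else if (inOurs) {
                outputs.write("onlyOurs", word, new Text(details.toString()));
            } else {
                outputs.write("onlyReference", word, new Text(details.toString()));
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            outputs.close();
        }
    }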

Thanks for reading.

You should not try to do everything in one step or job. It looks like you should split this problem into several steps. Since you are generating data that is stored on HDFS, and you need to know which source each record came from, you should probably go for a format along the lines of:

{SOURCE}, {WORD}, {FREQUENCY}

Remember that you are talking about a distributed file system, so referring to your inputs as file1 and file2 is not technically correct. Both your reference data and your source data will be spread throughout the cluster, with pieces of each on every node.

Next, starting from your pseudo-code example, you will need to create a job that keys off a word together with its source and frequency. Your mapper will work fine, but the reduce will need to aggregate the words to their sources. You will need to create your own Writable object which contains a Map of <source, frequency>. This will be output to HDFS as intermediate data that your follow-on filter jobs can work with.
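A custom Writable of that shape could look roughly like the sketch below; the class name SourceFrequencyWritable and the String/long types are assumptions, and the only real requirement is implementing write() and readFields() consistently.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.io.Writable;

    // Holds, for one word, the frequency observed in each source.
    public class SourceFrequencyWritable implements Writable {
        private final Map<String, Long> frequenciesBySource = new HashMap<String, Long>();

        public void put(String source, long frequency) {
            frequenciesBySource.put(source, frequency);
        }

        public Map<String, Long> getFrequenciesBySource() {
            return frequenciesBySource;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeInt(frequenciesBySource.size());
            for (Map.Entry<String, Long> entry : frequenciesBySource.entrySet()) {
                out.writeUTF(entry.getKey());      // source name
                out.writeLong(entry.getValue());   // frequency in that source
            }
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            frequenciesBySource.clear();
            int size = in.readInt();
            for (int i = 0; i < size; i++) {
                frequenciesBySource.put(in.readUTF(), in.readLong());
            }
        }
    }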

You can then use the output from this step as the input to 3 different MapReduce jobs, where each one looks for a different combination of sources. These jobs will be very simple, since the Mapper just passes the data through, but the Reducer checks each value for the different combinations of sources.
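One of those filter jobs could have a reducer along these lines. This sketch assumes the intermediate records were written as text values such as "ours:123,reference:456", which is an invented format used purely for illustration; the other two jobs would differ only in the condition that is checked before writing.

    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Filter reducer for the "common to both sources" job; the "only in our source"
    // and "only in the reference source" jobs change only the if-condition below.
    public class CommonWordsReducer extends Reducer<Text, Text, Text, Text> {

        @Override
        protected void reduce(Text word, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text sourceList : values) {
                // Assumed intermediate value format: "ours:123,reference:456".
                String list = sourceList.toString();
                boolean inOurs = list.contains("ours:");
                boolean inReference = list.contains("reference:");
                if (inOurs && inReference) {
                    context.write(word, sourceList);   // keep only words seen in both sources
                }
            }
        }
    }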

So if you take this approach you will need 4 MapReduce jobs. You do not need to run each one by hand; you can have a single driver that runs them sequentially. Alternatively, since the last 3 jobs use the same input data, you could start those three at the same time once the first one has finished. Which option is better will probably depend on the amount of data and intermediate data your cluster is able to manage, and on the number of mappers/reducers each job requires.
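A driver that runs the jobs one after another can be as plain as the following sketch (assuming the Hadoop 2 Job API; the paths, job names, and the commented-out mapper/reducer settings are placeholders to be filled in with the classes from the sketches above):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordComparisonDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Job 1: aggregate each word's frequency per source into intermediate data.
            Job aggregate = Job.getInstance(conf, "aggregate-by-word");
            aggregate.setJarByClass(WordComparisonDriver.class);
            // aggregate.setMapperClass(...); aggregate.setReducerClass(...);
            FileInputFormat.addInputPath(aggregate, new Path(args[0]));   // directory holding both inputs
            FileOutputFormat.setOutputPath(aggregate, new Path("intermediate"));
            if (!aggregate.waitForCompletion(true)) {
                System.exit(1);                  // stop the chain if the first job fails
            }

            // Jobs 2-4: filter the intermediate data for the three source combinations.
            for (String filter : new String[] { "common", "only-ours", "only-reference" }) {
                Job job = Job.getInstance(conf, "filter-" + filter);
                job.setJarByClass(WordComparisonDriver.class);
                // job.setMapperClass(...); job.setReducerClass(...);     // one combination per job
                FileInputFormat.addInputPath(job, new Path("intermediate"));
                FileOutputFormat.setOutputPath(job, new Path(filter));
                if (!job.waitForCompletion(true)) {
                    System.exit(1);
                }
            }
        }
    }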

Hope this suggestion helps.

