Login

How would I write the CQL to get the most recent set of data from each row?

I'm investigating transitioning from MSSQL to Cassandra and am starting to grasp the concepts. Lots of research has help tremendously, but I haven't found answer to this (I know there must be a way):

CREATE TABLE WideData {
ID text,
Updated timestamp,
Title text,
ReportData text,
PRIMARY KEY (ID, Updated)
} WITH CLUSTERING ORDER (Updated DESC)

INSERT INTO WideData (ID, Updated, Title, ReportData) VALUES ('aaa', NOW, 'Title', 'Blah blah blah blah')
INSERT INTO WideData (ID, Updated, Title, ReportData) VALUES ('bbb', NOW, 'Title', 'Blah blah blah blah')

wait 1 minute:

INSERT INTO WideData (ID, Updated, Title, ReportData) VALUES ('bbb', NOW, 'Title 2', 'Blah blah blah blah')

wait 3 minutes:

INSERT INTO WideData (ID, Updated, Title, ReportData) VALUES ('aaa', NOW, 'Title 2', 'Blah blah blah blah')

wait 5 minutes:

INSERT INTO WideData (ID, Updated, Title, ReportData) VALUES ('aaa', NOW, 'Title 3', 'Blah blah blah blah')

How would I write the CQL to get the most recent set of data from each row?

SELECT ID, Title FROM WideRow - gives me 5 rows, as it pivots the data for me.

Essentially I want the results for (SELECT ID, Title FROM WideRow WHERE .....) to be:

ID Title
aaa, Title3
bbb, Title2

Also, is there a way to get a count of the number of data sets in a wide row?

Essentially the equivalent of TSQL: SELECT ID, Count(*) FROM Table GROUP BY ID

ID Count
aaa 3
bbb 2

Thanks

Also, any references to learn more about these types of queries would also be appreciated.

With your current data model, you can only query the most-recent row by partition key. In your case, that is `ID`.

SELECT ID, Title FROM WideData WHERE ID='aaa' LIMIT 1

Since you have indicated your clustering order on `Updated` in DESCending order, the row with the most-recent `Updated` timestamp will be returned first.

Given your desired results, I'll go ahead and assume that you do not want to query each partition key individually. Cassandra only maintains CQL result set order by partition key. Also Cassandra does not support aggregation. So there really is no way to get the "most recent" for all of your `ID`s together at once, nor is there a way to get a report of how many updates each `ID` has.

With Cassandra data modeling, you need to build your tables to suit your queries. Query "planning" is not really a strong point of Cassandra (as you are finding out). To get the most-recent updates by `ID`, you would need to build an additional query table designed to store only the most-recent update for each ID. Likewise, to get the count of updates for each `ID` you could create an additonal query table using [counter coulmns][1] to suit that query.

**tl;dr**

In Cassandra, denormalization and redundant data storage is the key. For some applications, you might have one table for each query you need to support...and that's ok.

[1]:

[To see links please register here]

lexicology536

yardstick676